A statistician is an accountant without the charisma.

Chapter 4

Sec 4.3

Some variables, such as sex, race, and occupation, are inherently categorical.
Other categorical variables are created by grouping values of a quantitative
variable into classes. Published data are often reported in grouped form to save
space. To analyze categorical data, we use the *counts* or *percents*
of individuals that fall into the various categories. Data on two categorical
variables are often presented in a two-way table: one is the row variable and
the other is the column variable (similar to a matrix layout). The
entries in the table are the counts.
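
As a rough illustration of how raw data become a two-way table, the sketch below tallies one (row value, column value) pair per individual. The records and the helper name `two_way_table` are made up for this example and do not come from the textbook.

```python
from collections import Counter

# Hypothetical raw data: one (row_value, column_value) pair per individual.
records = [
    ("female", "clerical"), ("female", "technical"), ("male", "clerical"),
    ("male", "technical"), ("female", "clerical"), ("male", "technical"),
]

def two_way_table(pairs):
    """Count how many individuals fall into each (row, column) cell."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur, so every cell is filled.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = two_way_table(records)
for row, cells in table.items():
    print(row, cells)
```

Each entry of `table` is a count, which matches the way the entries of a two-way table are described above.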

First, look at the distribution of each variable separately. The distribution of
a categorical variable just says how often each outcome occurred. The
"Total" column at the right of the table contains the totals for each
of the rows. These row totals give the distribution of the row
variable. The column totals, in the same way, give the distribution of the
column variable. If the row and column totals are missing, your first order of
business is to fill them in. The distributions in the total row and column are
called *marginal distributions* because they appear in the margins of the table.
Small discrepancies sometimes appear in the totals because of
round-off error. We can use a bar graph or pie chart to display marginal
distributions. A two-way table contains a great deal of information in compact
form, and making that information clear almost always requires finding
percents, so expect to calculate lots of them.
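
A minimal sketch of filling in missing totals and turning them into marginal distributions is given below. The counts and the names `marginals` and `as_percents` are invented for illustration only.

```python
# Hypothetical counts: rows are one categorical variable, columns the other.
table = {
    "row A": {"col X": 20, "col Y": 30},
    "row B": {"col X": 10, "col Y": 40},
}

def marginals(table):
    """Return (row totals, column totals, grand total) for a table of counts."""
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand = sum(row_totals.values())
    return row_totals, col_totals, grand

def as_percents(totals, grand):
    """Express each total as a percent of the grand total."""
    return {k: 100 * v / grand for k, v in totals.items()}

row_totals, col_totals, grand = marginals(table)
row_marginal = as_percents(row_totals, grand)   # distribution of the row variable
col_marginal = as_percents(col_totals, grand)   # distribution of the column variable
print(row_marginal)
print(col_marginal)
```

The row totals and the column totals should each sum to the same grand total; if they do not (beyond round-off in published percents), a count has probably been entered incorrectly.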

__To describe relationships among categorical variables, calculate appropriate
percents from the counts given.__ Although graphs are NOT as useful
for describing categorical variables as they are for quantitative variables, a
graph still helps an audience grasp the data quickly. Although bar graphs
look a bit like histograms, their details and uses are different (recall Chapter
1). A histogram shows the distribution of the values of a quantitative
variable with numerically scaled axes. A bar graph compares the sizes of different items. The
horizontal axis of a bar graph need not have any measurement scale but may
simply identify the items by name.

*Conditional distributions* are the percents for __each__ entry in a
row (if that row describes the condition) or in a column, computed as a percent
of that row's or column's total. The percents within each conditional
distribution add to 100%. Statistical software can speed the task of expressing
each entry in a two-way table as a percent of its column total. It can also
calculate row percents and totals. Each conditional distribution can be
displayed in side-by-side bar graphs.
Hint: Compare the conditional distributions of the response variable for
the separate values of the explanatory variable.
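
The sketch below shows one way column percents might be computed by hand rather than with statistical software. It assumes the columns hold the explanatory variable and the rows hold the response variable; the counts and the function name `column_percents` are hypothetical.

```python
# Hypothetical counts: columns = explanatory variable, rows = response variable.
table = {
    "improved":     {"treatment": 30, "control": 10},
    "not improved": {"treatment": 20, "control": 40},
}

def column_percents(table):
    """Express each cell as a percent of its column total."""
    col_totals = {}
    for cells in table.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n
    return {
        r: {c: 100 * n / col_totals[c] for c, n in cells.items()}
        for r, cells in table.items()
    }

conditional = column_percents(table)
for row, cells in conditional.items():
    print(row, cells)
```

Reading down a column of `conditional` gives the conditional distribution of the response for that value of the explanatory variable, and comparing the columns is the comparison the hint describes.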

As is the case with quantitative variables, the effects of lurking variables can change or even reverse relationships between two categorical variables. Surprises can await the unsuspecting user of data. See Ex. 4.23 (page 249). Simpson's Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group. It is an extreme form of how lurking variables can be misleading.
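
To make the reversal concrete, here is a small sketch with counts invented purely for illustration (they are not the data from Ex. 4.23). Option A has the higher success rate within each group, yet option B looks better once the groups are combined, because A was used mostly on the hard cases. The group labels play the role of the lurking variable.

```python
# Hypothetical (successes, trials) counts, invented to illustrate the reversal.
groups = {
    "easy cases": {"A": (9, 10), "B": (80, 100)},
    "hard cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a percent."""
    return 100 * successes / trials

def combined(groups):
    """Add up successes and trials for each option across all groups."""
    totals = {}
    for cells in groups.values():
        for option, (s, n) in cells.items():
            s0, n0 = totals.get(option, (0, 0))
            totals[option] = (s0 + s, n0 + n)
    return totals

# Success rates within each group, then after pooling the groups.
within = {g: {o: rate(*sn) for o, sn in cells.items()} for g, cells in groups.items()}
overall = {o: rate(*sn) for o, sn in combined(groups).items()}

print(within)
print(overall)
```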