A statistician is an accountant without the charisma.


Chapter 4
    Sec 4.3  

Some variables, such as sex, race, and occupation, are inherently categorical.  Other categorical variables are created by grouping values of a quantitative variable into classes.  Published data are often reported in grouped form to save space.  To analyze categorical data, we use the counts or percents of individuals that fall into various categories.  Data on two categorical variables are often presented in a two-way table: one variable is the row variable and the other is the column variable (similar to a matrix layout).  The entries in the table are the counts.
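
As a rough sketch of how the counts in a two-way table arise, the Python below groups a quantitative variable (age) into classes and tallies each (row, column) pair.  The records, the class boundaries, and the names age_class and cell_counts are all made up for illustration; they do not come from the textbook.

```python
from collections import Counter

# Hypothetical raw records (age in years, job type), invented for illustration.
records = [
    (23, "Service"), (31, "Office"), (45, "Office"), (52, "Service"),
    (38, "Trade"), (27, "Office"), (61, "Trade"), (44, "Service"),
]

def age_class(age):
    """Group a quantitative age into a categorical class (assumed cutoffs)."""
    if age < 35:
        return "Under 35"
    if age < 55:
        return "35-54"
    return "55 and over"

# Each (row category, column category) pair is one cell of the two-way table;
# the Counter holds the count for every cell that occurs at least once.
cell_counts = Counter((age_class(age), job) for age, job in records)

row_labels = ["Under 35", "35-54", "55 and over"]
col_labels = ["Office", "Service", "Trade"]

# Print the table of counts, one row per age class.
print("Age class".ljust(14) + "".join(c.rjust(9) for c in col_labels))
for r in row_labels:
    cells = "".join(str(cell_counts[(r, c)]).rjust(9) for c in col_labels)
    print(r.ljust(14) + cells)
```

A Counter returns 0 for a cell that never occurs, so empty cells print as 0 rather than raising an error.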

First look at the distribution of each variable separately.  The distribution of a categorical variable just says how often each outcome occurred.  The "Total" column at the right of the table contains the totals for each of the rows.  These row totals give the distribution of the row variable.  Likewise, the "Total" row at the bottom of the table contains the totals for each of the columns, and these give the distribution of the column variable.  If the row and column totals are missing, your first order of business is to fill them in.  The distributions in the total row and total column are called "marginal distributions" because they appear in the margins of the table.  Sometimes the totals show minor discrepancies due to round-off error.  We can use a bar graph or pie chart to display marginal distributions.  In working with two-way tables you must calculate lots of percents.  A two-way table contains a great deal of information in compact form.  Making the information clear almost always requires finding percents.
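
The bookkeeping described above (fill in the row and column totals, then turn them into percents) can be sketched in Python as follows.  The counts are hypothetical, invented only to show the arithmetic, and the function names are illustrative.

```python
# Hypothetical counts, invented for illustration only.
# Rows: age group; columns: education level.
table = {
    "25-34": {"HS or less": 40, "Some college": 35, "Bachelor's+": 25},
    "35-54": {"HS or less": 60, "Some college": 50, "Bachelor's+": 40},
    "55+":   {"HS or less": 50, "Some college": 30, "Bachelor's+": 20},
}

def row_totals(t):
    """Total of each row: the 'Total' column at the right of the table."""
    return {row: sum(cells.values()) for row, cells in t.items()}

def column_totals(t):
    """Total of each column: the 'Total' row at the bottom of the table."""
    totals = {}
    for cells in t.values():
        for col, count in cells.items():
            totals[col] = totals.get(col, 0) + count
    return totals

def as_percents(totals):
    """Express each total as a percent of the grand total."""
    grand = sum(totals.values())
    return {name: 100 * n / grand for name, n in totals.items()}

rows = row_totals(table)
cols = column_totals(table)
grand_total = sum(rows.values())

# Row totals and column totals must add to the same table total.
assert grand_total == sum(cols.values())

row_marginal = as_percents(rows)   # marginal distribution of the row variable
col_marginal = as_percents(cols)   # marginal distribution of the column variable

for name, pct in row_marginal.items():
    print(f"{name}: {pct:.1f}%")
```

Checking that the row totals and the column totals add to the same table total is a quick way to catch an arithmetic slip.  If the percents are rounded before being added, they may miss 100% by a small amount, which is the round-off effect mentioned above.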

To describe relationships among categorical variables, calculate appropriate percents from the counts given.  Although graphs are NOT as useful for describing categorical variables as they are for quantitative variables, a graph still helps an audience grasp the data quickly.  Although bar graphs look a bit like histograms, their details and uses are different (recall Chapter 1).  A histogram shows the distribution of the values of a quantitative variable with numerically scaled axes.  A bar graph compares the sizes of different items.  The horizontal axis of a bar graph need not have any measurement scale but may simply identify the items by name.

A conditional distribution gives each entry in a row as a percent of that row's total (if the row describes the condition), or each entry in a column as a percent of that column's total.  When entries are expressed as percents of their column total, the percents in each column add to 100%.  Statistical software can speed the task of expressing each entry in a two-way table as a percent of its column.  It can also calculate row percents and totals.  Each conditional distribution can be displayed in side-by-side bar graphs.  Hint:  Compare the conditional distributions of the response variable for the separate values of the explanatory variable.
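
A minimal sketch of column and row percents in Python, assuming the explanatory variable (a made-up "Treatment") is in the columns and the response ("Success" or "Failure") is in the rows.  The counts are hypothetical and the function names are illustrative.

```python
# Hypothetical counts, invented for illustration only.
# Rows: response variable; columns: explanatory variable.
counts = {
    "Success": {"Treatment A": 30, "Treatment B": 45},
    "Failure": {"Treatment A": 20, "Treatment B": 55},
}

def column_percents(t):
    """Each entry as a percent of its column total."""
    col_totals = {}
    for cells in t.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in t.items()
    }

def row_percents(t):
    """Each entry as a percent of its row total."""
    return {
        row: {col: 100 * n / sum(cells.values()) for col, n in cells.items()}
        for row, cells in t.items()
    }

col_pct = column_percents(counts)
row_pct = row_percents(counts)

# Conditional distribution of the response for each value of the
# explanatory variable: read down one column of col_pct.
for treatment in ["Treatment A", "Treatment B"]:
    dist = {outcome: col_pct[outcome][treatment] for outcome in col_pct}
    print(treatment, {k: f"{v:.1f}%" for k, v in dist.items()})
```

With these made-up counts, 60% of the Treatment A column and 45% of the Treatment B column are successes.  Comparing those two columns is the comparison the hint recommends.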

As is the case with quantitative variables, the effects of lurking variables can change or even reverse relationships between two categorical variables.  Surprises can await the unsuspecting user of data.  See Ex. 4.23 (page 249).  Simpson's Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group.  It is an extreme form of how lurking variables can be misleading.
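
To see how such a reversal can happen, here is a small Python sketch with made-up numbers (they are not the data from Ex. 4.23).  Two hypothetical methods are compared on "easy" and "hard" cases; the lurking variable is the difficulty of the cases each method was given.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
groups = {
    "Easy cases": {"Method A": (90, 100), "Method B": (800, 1000)},
    "Hard cases": {"Method A": (300, 1000), "Method B": (20, 100)},
}

def rate(successes, trials):
    """Success rate as a percent."""
    return 100 * successes / trials

# Success rate of each method within each group.
within = {
    group: {method: rate(s, n) for method, (s, n) in methods.items()}
    for group, methods in groups.items()
}

# Combine the groups by adding successes and trials for each method.
combined_counts = {}
for methods in groups.values():
    for method, (s, n) in methods.items():
        total_s, total_n = combined_counts.get(method, (0, 0))
        combined_counts[method] = (total_s + s, total_n + n)

combined = {m: rate(s, n) for m, (s, n) in combined_counts.items()}

for group, rates in within.items():
    print(group, {m: f"{r:.1f}%" for m, r in rates.items()})
print("Combined", {m: f"{r:.1f}%" for m, r in combined.items()})
```

In this invented data Method A has the higher success rate within each group (90% vs. 80% on easy cases, 30% vs. 20% on hard cases), yet Method B has the higher rate once the groups are combined.  Method A was used mostly on hard cases and Method B mostly on easy ones, so the combined comparison mainly reflects case difficulty.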