"A single death is a tragedy,
a million deaths is a statistic."
Joseph Stalin
Chapter 1
Sec. 1.2 Notes: Data display VARIABILITY. (Recall: "data" is a plural term... use it with the correct verb... or you will lose credibility.)
The "pattern" of the variability is the distribution. You may display data
1. visually through graphs
2. by summarizing them numerically, or
3. by describing them verbally.
Every math course strives to encourage mathematical communication through the same methods...words, pictures, and symbols. Statistics is no different.
It is a good idea to use more than one descriptive method, including
tabular (frequency distributions, and relative and cumulative frequency distributions),
graphical (showing overall pattern, center, spread, and shape: symmetric or left/right skewed), and
numerical.
Ponder this... "Who is baseball's greatest home run hitter?"
This section will give us the tools we need to analyze data on potential candidates and formulate a response.
To describe a distribution with numbers we consider shape, center, and spread. These three characteristics give a good description of the overall pattern. We already know about shape: symmetric or skewed. The most common measure of center is our ordinary arithmetic average (MEAN).
To find the MEAN of a set of observations, add all the values together and divide by the number of observations. If the n observations are named x_{1}, x_{2}, x_{3}, ..., x_{n}, then
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
OR, more compactly,
$$\bar{x} = \frac{1}{n} \sum x_i$$
The Σ (capital Greek sigma) is short for "add 'em all up." The subscripts on the x_{i} serve only to differentiate the observations. The bar over the x indicates the mean of all the x values (say "x-bar").
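As a minimal sketch of the "add 'em all up and divide by n" recipe, the snippet below computes x-bar for a small list. The numbers are made up for illustration and are not real home run records.

```python
# Hypothetical data: invented season totals, for illustration only.
observations = [13, 27, 26, 44, 30]

n = len(observations)            # number of observations
x_bar = sum(observations) / n    # (1/n) * (x_1 + x_2 + ... + x_n)

print(x_bar)  # prints 28.0
```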
The mean is sensitive to the influence of a few extreme observations, so it is best used with a roughly symmetric distribution. When the distribution is skewed, the mean will be pulled toward the long tail. Thus, the MEAN IS NOT A RESISTANT MEASURE OF CENTER. The MEAN uses the actual value of each observation and will "chase" a single large observation upward. Another measure is needed.
The median M is the midpoint of a distribution, the number such that half the observations are smaller and half are larger. You may find the median by hand for small data sets by arranging the values in order and finding the midpoint. If n is odd, the median M is the center observation in the ordered list. If n is even, the median is the average of the two center observations. Finding the median for large data sets should be left to the calculator or computer. The median is affected little, if at all, by outliers; therefore the MEDIAN IS A RESISTANT MEASURE OF CENTER.
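The odd/even rule for the median, and the difference in resistance between mean and median, can be sketched as follows. The function names and both data sets are invented for this illustration; the second list is the first with its largest value replaced by one extreme observation.

```python
def median(values):
    """Midpoint of the ordered list: the center value if n is odd,
    the average of the two center values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mean(values):
    return sum(values) / len(values)


# Hypothetical data, for illustration only.
typical = [20, 22, 25, 27, 30]
with_outlier = [20, 22, 25, 27, 73]   # 30 replaced by an extreme value

print(mean(typical), median(typical))
print(mean(with_outlier), median(with_outlier))
```

The mean is pulled upward by the single large value, while the median stays at 25 in both lists.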
For a symmetric distribution, the MEAN and MEDIAN are close together. In a skewed distribution, the mean is farther out in the long tail than the median. Reports about home prices, incomes, and other strongly skewed distributions usually give the median. The mean and median measure CENTER in different ways and both are useful. Both measures can be found using the TI-83 calculator.
Enter the data into a list using STAT (Edit).
Press 2nd STAT (LIST), then arrow over to MATH.
Select 3 for mean, or whichever measure you desire.
NOTE: Using STAT (Calc) 1-Var Stats, we can find all the important statistics for a specific data set.
Using only one measure of the center of a distribution can be misleading. We are also interested in finding the spread or variability. Spread can be found by calculating the RANGE (subtract the minimum data point from the maximum data point). The description of spread can also be improved by considering QUARTILES. Q_{1} is the median of the lower half, Q_{2} is the median itself, and Q_{3} is the median of the upper half. Q_{1} is larger than about 25% of the data, Q_{2} is larger than about 50% of the data, and Q_{3} is larger than about 75% of the data. The IQR (Interquartile Range) is the distance between the first and third quartiles: IQR = Q_{3} - Q_{1}. If an observation falls between Q_{1} and Q_{3}, then it is not unusually high or low. The IQR is the basis of a "rule of thumb" for identifying suspected OUTLIERS: call an observation an outlier IF it falls more than 1.5 × IQR above Q_{3} or below Q_{1}.
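A rough sketch of the quartile and 1.5 × IQR calculations is given below. It assumes one particular convention that these notes do not spell out: when n is odd, the median itself is left out of both halves. Software packages use several slightly different quartile rules, so results can differ a little from a calculator's. The function names and the data are made up for illustration.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.
    Assumption: for odd n the middle observation is excluded from both halves.
    Needs at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def suspected_outliers(values):
    """Observations more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


# Hypothetical data, for illustration only.
data = [16, 20, 22, 24, 25, 27, 28, 30, 33, 73]

q1, m, q3 = quartiles(data)
print(q1, m, q3, q3 - q1)        # quartiles and IQR
print(suspected_outliers(data))  # prints [73]
```

Here Q_{1} = 22 and Q_{3} = 30, so IQR = 8 and the cutoffs are 10 and 42; only 73 lies beyond them.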
The smallest and largest observations also tell about the distribution. Combining all five numbers we get a good summary of center and spread. The FIVE NUMBER SUMMARY consists of the minimum data point, Q_{1}, median, Q_{3}, and maximum data point. This summary leads to a new type of graph... the BOX PLOT. Showing less detail than histograms or stem plots, box plots are best used for side-by-side comparisons of more than one distribution. These plots can be horizontal or vertical. Box plots give an indication of the symmetry or skewness of a distribution... in a symmetric distribution, Q_{1} and Q_{3} are equidistant from the median. Since a regular box plot conceals outliers, we will use the modified box plot, which plots outliers as isolated points and shows more detail. A modified box plot is a graph of the five number summary with outliers plotted individually (see page 47).
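The numbers behind a modified box plot can be sketched without drawing anything. The example below makes two assumptions that are not stated in these notes: quartiles use the median-of-halves rule with the middle observation excluded when n is odd, and the whiskers stop at the most extreme observations that are not flagged as outliers. All names and data are invented for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max). Needs at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def modified_box_plot_parts(values):
    """Box edges, whisker ends, and individually plotted outliers."""
    _, q1, m, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    return {
        "box": (q1, m, q3),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# Hypothetical data, for illustration only.
data = [5, 18, 20, 21, 23, 24, 26, 27, 29]

print(five_number_summary(data))
print(modified_box_plot_parts(data))
```

For this list the low value 5 is plotted as an isolated point, and the lower whisker ends at 18 rather than at the minimum.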
****************************
The five number summary is NOT the most common numerical description of a distribution. Rather, a combination of the mean as a measure of center and the standard deviation as a measure of spread is commonly used. Variance (s^{2}) is the average of the squared deviations from the mean, where the "average" divides by n - 1 rather than n. Standard deviation (s) is the square root of the variance.
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
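As a small sketch of the variance formula with its n - 1 divisor, the functions below compute s^{2} and s directly from the definition. The function names and the data are made up for illustration; Python's own statistics.variance and statistics.stdev use the same n - 1 divisor and can serve as a check.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical data, for illustration only. The mean is 5.
data = [2, 4, 4, 6, 9]

print(sample_variance(data))  # prints 7.0
print(sample_std_dev(data))   # square root of 7
```

The squared deviations here are 9, 1, 1, 1, and 16, which sum to 28; dividing by n - 1 = 4 gives 7.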
The deviations (each observation minus the mean) show how spread out the data are about their mean. Some deviations will be positive and some will be negative. Curiously, the SUM of the deviations is ALWAYS = 0.
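The reason the sum is always zero follows from the definition of the mean. A short derivation, writing \bar{x} for x-bar and using the fact that the sum of the observations equals n times the mean:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

This is why the deviations are squared before they are averaged: without squaring, the positive and negative deviations would cancel.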
Properties of s (standard deviation):
s measures spread about the mean and should only be used when the mean is chosen as the measure of center.
s = 0 only when there is NO SPREAD... all observations have the same value.
s is NOT resistant to outliers.
Soon we will learn that the standard deviation is the natural measure of spread for an important class of symmetric distributions, the NORMAL distributions. Logically, then, an essential point is that the usefulness of a statistical procedure is tied to the shape of the distribution.
What and how to choose....
The five number summary is better for skewed distributions or those containing outliers.
Use the mean and standard deviation for relatively symmetric distributions.
A GRAPH gives the best overall "picture" of a distribution. Numerical measures of center and spread give specific facts about the distribution but don't describe its entire shape.
ALWAYS... PLOT THE DATA.
Sometimes a situation requires comparison of two or more distributions. The best methods for comparison are back-to-back stem plots or side-by-side box plots.
Sometimes it is necessary to convert units of measure... we use a linear transformation to accomplish this. The rule is: to change from x to a new value x_{new}, we add a constant a (which shifts every observation by the same amount) and/or multiply by a positive number b, so that x_{new} = a + bx. Adding a constant to each observation DOES NOT change the spread. It does change the measures of center and the quartiles by that same amount. Multiplying each observation by b multiplies the measures of center (mean and median) by b. The measures of spread (standard deviation and IQR) are also multiplied by b. Linear transformations DO NOT change the shape of a distribution.
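A quick numerical check of these rules, using the familiar Celsius-to-Fahrenheit conversion (add 32 after multiplying by 1.8) as the linear transformation. The temperature readings are made up for illustration, and the names a and b are simply labels for the added constant and the multiplier.

```python
import math
from statistics import mean, median, stdev

a, b = 32, 1.8   # added constant and positive multiplier

# Hypothetical data, for illustration only.
celsius = [10, 15, 20, 25, 40]

shifted = [x + a for x in celsius]         # add a constant only
fahrenheit = [a + b * x for x in celsius]  # full linear transformation

# Adding a constant moves the center but leaves the spread alone.
print(mean(celsius), mean(shifted))
print(stdev(celsius), stdev(shifted))

# Multiplying by b scales the spread by b; the center becomes a + b * center.
print(math.isclose(mean(fahrenheit), a + b * mean(celsius)))
print(math.isclose(median(fahrenheit), a + b * median(celsius)))
print(math.isclose(stdev(fahrenheit), b * stdev(celsius)))
```

Each of the last three lines prints True, which matches the rules stated above for center and spread.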