The latest survey shows that 3 out of 4 people make up
75% of the world's population.

Chapter 5
    Sec 5.1

Now that we are experts on evaluating someone else's data, it is time to learn how to produce valid data that will lead to probable predictions and reliable conclusions.

Since we cannot put a question to the entire population of the U.S., we can put the question to a sample that represents the opinion of the entire population.  How this sample is chosen is extremely important.  A sample is a picture (snapshot) of the entire population, taken with little interference from the act of gathering the information.  This is called an observational study.  Other times we can gather data from an experiment, where we don't just observe or ask questions; rather, we actively impose some treatment in order to observe the response.  So an observational study observes individuals and measures variables of interest but does NOT attempt to influence the responses, while an experiment deliberately imposes some treatment on individuals in order to observe their responses.  Caution:  an observational study is a poor way to gauge the effect of an intervention.  In order to see the response change, we must actually impose the change.  When the goal is to understand cause and effect, experiments are the only source of fully convincing data.

Observational studies of the effect of one variable on another often fail because the explanatory variable is confounded with lurking variables, as discussed earlier.  Well-designed experiments take steps to defeat confounding.  In some situations, it may not be possible to observe individuals directly or to perform an experiment.  In other cases, it may be logistically difficult or inconvenient to obtain a sample or to impose a treatment.  Simulation provides an alternative method in these circumstances.  After producing data, the next logical step is to use formal statistical inference, which answers specific questions with a known degree of confidence.  Careful design of data production is the most important prerequisite for trustworthy inference.

The entire group of individuals that we want information about is called the population.  A sample is a part of the population that we actually examine in order to gather information.  Note:  Sampling and conducting a census are two distinct ways of collecting data.  Sampling involves studying a part in order to gain information about the whole.  A census attempts to contact every individual in the entire population.  A carefully constructed sample is often more accurate than a census.  Accountants sample a firm's inventory to verify the accuracy of the records because attempting to count every last item in the warehouse would be expensive and time-consuming.  The DESIGN of a sample refers to the method used to choose the sample from the population.  Poor sample designs can produce misleading conclusions.

Several poor sample designs compromise the results of a study conducted on a sample of the population.  A voluntary response sample consists of people who choose themselves by responding to a general appeal, like a telephone call-in poll.  This is one of the most common bad sample designs.  Another is convenience sampling, which chooses the individuals easiest to reach.  Both voluntary response samples and convenience samples are almost guaranteed not to represent the entire population.  These sampling methods display BIAS, or systematic error, by favoring some parts of the population over others.
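A small simulation can make the idea of bias concrete.  The sketch below is only an illustration under made-up assumptions:  the population size, the 30% share in favor, and the call-in rates (0.5 for supporters, 0.1 for everyone else) are all hypothetical numbers, not taken from any real poll.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: True means the person favors a proposal (30% do).
population = [True] * 3000 + [False] * 7000

def responds(favors):
    """Assumed call-in behavior: supporters respond far more often."""
    return rng.random() < (0.5 if favors else 0.1)

# Voluntary response sample: people select themselves.
callers = [p for p in population if responds(p)]
voluntary_estimate = sum(callers) / len(callers)

# Sample chosen by chance, for comparison.
srs = rng.sample(population, 500)
srs_estimate = sum(srs) / len(srs)

true_share = sum(population) / len(population)
print(f"true {true_share:.2f}, call-in {voluntary_estimate:.2f}, SRS {srs_estimate:.2f}")
```

Under these assumptions the call-in estimate lands far above the true share no matter how many people call, because the error is systematic rather than a matter of luck.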

The statistical remedy for the bias introduced by personal choice is to allow impersonal CHANCE to choose the sample.  A sample chosen by chance allows neither favoritism by the sampler nor self-selection by respondents.  Choosing a sample by chance attacks bias by giving all individuals an equal chance to be chosen.  The simplest way to use chance to select a sample is to place names in a hat (from the entire population) and draw out a handful (the sample).  This is the idea of SIMPLE RANDOM SAMPLING (SRS).  A simple random sample of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.  An SRS not only gives each individual an equal chance to be chosen but also gives every possible sample an equal chance to be chosen.  In practice, computer software can choose an SRS from a list of individuals in the population by using a random number generator, or we can choose one by consulting a table of random digits.
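To see what "every set of n individuals has an equal chance" means, it can help to list every possible sample for a tiny case.  The sketch below assumes a hypothetical population of four people (the names are invented) and samples of size 2.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of four individuals (names are made up).
population = ["Ana", "Ben", "Cy", "Dee"]
n = 2

# Every possible sample of size n; an SRS gives each one the same chance.
all_samples = list(combinations(population, n))
chance_per_sample = Fraction(1, len(all_samples))

# Chance that a given individual is in the sample:
# add up the chances of the samples that contain that individual.
chance_per_individual = {
    person: sum(chance_per_sample for s in all_samples if person in s)
    for person in population
}

print(len(all_samples), chance_per_sample)
print(chance_per_individual)
```

There are 6 possible samples, each with chance 1/6, and each person appears in 3 of them, so each individual has chance 1/2 of being chosen.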

A table of random digits is a long string of the digits 0, 1, 2, 3, ..., 9 with two properties:  1. each entry in the table is equally likely to be any of the 10 digits 0 through 9, and 2. the entries are independent of each other.

See Ex. 5.4 (page 276) for explanation of random number usage to create a simple random sample.  The steps are:
1) Assign a numerical label to every individual in the population.
2) Use a table or random number generator to select labels at random.
The use of CHANCE to select the sample is the essential principle of statistical sampling.  A probability sample is a sample chosen by chance.  We must know what samples are possible and what chance, or probability, each possible sample has.
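The two steps above can be sketched in code.  This is a minimal illustration, not the textbook's example:  the function name srs_from_digits, the population of 30 labeled people, and the digit string are all invented here, and the string is not copied from any printed table.  The sketch reads the digits two at a time, skips groups that are not valid labels, and skips repeats.

```python
def srs_from_digits(digits, population_size, n):
    """Choose n distinct labels 01..population_size by reading two-digit groups."""
    assert 1 <= n <= population_size <= 99
    digits = digits.replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        # Keep the group only if it is a valid label not already chosen.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == n:
            return chosen
    raise ValueError("ran out of digits before the sample was complete")

# Step 1: assign labels 01..30 to a hypothetical population.
names = [f"person_{k:02d}" for k in range(1, 31)]

# Step 2: read labels from a (made-up) line of random digits.
line = "4517 0382 9921 0617 3005 1288"
labels = srs_from_digits(line, 30, 5)
sample = [names[label - 1] for label in labels]
print(labels)   # [17, 3, 21, 6, 30]
print(sample)
```

Reading the line in pairs gives 45, 17, 03, 82, 99, 21, 06, 17, 30, ...; the groups 45, 82, and 99 are not labels and the second 17 is a repeat, so the sample is the individuals labeled 17, 03, 21, 06, and 30.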

When sampling from large populations, it is common to sample important groups within the population separately and then combine these samples.  This is a stratified sample.  To select a stratified random sample, first divide the population into groups of similar individuals, called strata.  Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.
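A stratified random sample can be sketched the same way:  one SRS per stratum, then combine.  The strata names, their sizes, and the helper stratified_sample below are assumptions made up for illustration.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Take a separate SRS from each stratum and combine the results."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        # random.sample draws without replacement, i.e. an SRS of this stratum.
        sample.extend(rng.sample(members, sizes[name]))
    return sample

# Hypothetical strata: 40 freshmen and 20 seniors.
strata = {
    "freshmen": [f"F{k}" for k in range(1, 41)],
    "seniors": [f"S{k}" for k in range(1, 21)],
}
sizes = {"freshmen": 4, "seniors": 2}

print(stratified_sample(strata, sizes, seed=1))
```

Unlike a single SRS of 6 from all 60 students, this design guarantees that both groups are represented in the sample.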

Another common means of restricting random selection is to choose the sample in stages.  For example, the Current Population Survey uses a multistage sampling design, as do opinion polls and other national samples.

There are a few cautions about using sample surveys in particular.  When the population consists of human beings, accurate information from a sample requires much more than a good sampling design.  To begin, we need an accurate and complete list of the population.  Because such a list is rarely available, MOST SAMPLES SUFFER FROM SOME DEGREE OF UNDER-COVERAGE.  The results of national sample surveys have some BIAS if the people not covered, who often are poor, differ from the rest of the population.  A more serious source of bias in most sample surveys is NON-RESPONSE, which occurs when a selected individual cannot be contacted or refuses to cooperate.

Again:  under-coverage occurs when some groups in the population are left out of the process of choosing the sample.  Non-response occurs when an individual chosen for the sample can't be contacted or does not cooperate.  Under-coverage is always present in a national census, since the list of addresses is incomplete with respect to homeless people.

In addition, the behavior of the respondent or of the interviewer can cause response bias.  Respondents may lie, and an interviewer whose attitude suggests that some answers are more desirable than others will get these answers more often.  The race or sex of the interviewer can influence responses, as can the faulty memory of participants.  Good interviewing technique is another aspect of a well-done sample survey.  The wording of questions is the most important influence on the answers.  Never trust the results of a sample survey until you have read the exact questions posed.  The sampling design, the amount of non-response, and the date of the survey are also important.  Good statistical design is a part, but only a part, of a trustworthy survey.

Some final comments:  Using chance to choose a sample does eliminate bias in the actual selection of the sample, but it is unlikely that results from a sample are exactly the same as for the entire population.  Properly designed samples avoid systematic bias, but their results are rarely exactly correct and they vary from sample to sample.  Because we deliberately use chance, the results obey the laws of probability that govern chance behavior.  (We will study the laws of probability in Chapter 6.)  Results from a survey usually come with a margin of error, which we will learn about in Chapter 10.  Finally, larger random samples give more accurate results than smaller samples.
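The last two points (results vary from sample to sample, and larger samples vary less) can be checked with a quick simulation.  The sketch below assumes a hypothetical population of 10,000 in which 60% answer "yes" and compares many SRSs of size 25 with many of size 400; the numbers are illustrative choices only.

```python
import random
import statistics

def sample_proportions(population, n, repetitions, rng):
    """Draw many SRSs of size n and return the proportion of 1s in each."""
    return [sum(rng.sample(population, n)) / n for _ in range(repetitions)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = "yes", 0 = "no"; the true proportion is 0.6.
population = [1] * 6000 + [0] * 4000

for n in (25, 400):
    props = sample_proportions(population, n, 500, rng)
    center = statistics.mean(props)
    spread = statistics.pstdev(props)
    print(f"n={n}: center {center:.3f}, spread {spread:.3f}")
```

Both sample sizes center near the true value 0.6 (no bias), but the proportions from samples of 400 should cluster much more tightly around it than those from samples of 25.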
