Understanding Data

Section 1.1 Understanding Data

As our definition of statistics is centered around understanding data, we will begin with an examination of data collection and classification.

Objectives

After finishing this section you should be able to

describe the following terms:
- bias
- bivariate
- causation
- census
- cluster sample
- continuous variable
- convenience sample
- correlation
- data
- discrete variable
- experiment
- factor
- leading question
- lurking variable
- multivariate
- observational study
- population
- qualitative variable
- quantitative variable
- sample
- simple random sample
- stratified sample
- systematic sample
- treatment
- univariate
- variable
- voluntary response sample
accomplish the following tasks:
- Classify data by the number and type of variables it contains.
- Differentiate between populations and samples.
- Identify and describe valid sampling techniques.
- Differentiate between experiments and observational studies.
- Identify and correct common errors in data collection or analysis.

Subsection 1.1.1 What is Data?

Statistics is the study of data. So it is vitally important that we understand what data is. Throughout this text we need to be sure that we all agree on what terms mean. Whenever we introduce a new term, it will be defined as follows.

Definition 1.1.1.

Any collection of measurements or values is referred to collectively as data.

We can understand and classify data better if we first understand the source from which the measurements are taken. Each type of measurement such as height, length of time, or favorite pizza topping, comes from a certain characteristic or property called a variable.

Definition 1.1.2.

A variable is a single characteristic that is being measured or changed.

In a typical set of data, one or more variables are identified, and the measurements are collected for each of those variables.

Example 1.1.3. Identifying Variables.

A wildlife conservationist wishes to better understand the population of fish that live in a certain lake. To do this, she catches 100 fish and records the species, length, and weight of each fish. Identify the variables in this set of data.

Solution

The three characteristics measured are: fish species, fish length, and fish weight. These are the variables.

Subsubsection 1.1.1.1 Classification by Number of Variables

One way in which we can classify data is based on the number of variables that are being measured. In this class we will look at data sets composed of measurements from one, two, or more variables. These are given the following names.

Definition 1.1.4.

A data set can be named based on the number of variables being measured as follows.

Univariate data contains measurements from a single variable.
Bivariate data contains measurements from two variables. This is sometimes called paired data.
Multivariate data contains measurements from three or more variables.

Let's look at an example.

Example 1.1.5. Classifying Datasets.

Identify each of the following data sets based on the number of variables being measured.

A wildlife conservationist records the species, length, and weight of 100 fish caught in a certain lake.
To better understand how tall tomato plants grow, you record the height of 40 tomato plants at a local nursery.
The mayor wants to know if the local bond measure will pass in the next election. He asks 250 city residents how old they are, and if they plan to support the measure.
As a part of the admission process, a school requires students to submit their name, phone number, zip code, email address, and expected major.

Solution

Multivariate - there are three variables: species, length, and weight.
Univariate - since there is only one variable, height, this is univariate data.
Bivariate - collecting age and support for the bond means that there are two variables.
Multivariate - there are five variables in this data set, so it must be multivariate.
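The classification used in this example can also be sketched in a few lines of Python. This is only an illustration: the function name `classify_by_variable_count` and the dictionary of data sets are our own hypothetical choices, and each data set is described simply by listing its variables.

```python
def classify_by_variable_count(variables):
    """Name a data set by the number of variables it measures."""
    count = len(variables)
    if count == 0:
        raise ValueError("a data set must measure at least one variable")
    if count == 1:
        return "univariate"
    if count == 2:
        return "bivariate"
    return "multivariate"  # three or more variables


# The four data sets from the example, each described by its variables.
datasets = {
    "fish": ["species", "length", "weight"],
    "tomato plants": ["height"],
    "bond measure": ["age", "supports measure"],
    "admissions": ["name", "phone number", "zip code",
                   "email address", "expected major"],
}

for name, variables in datasets.items():
    print(name, "->", classify_by_variable_count(variables))
```

Notice that the number of individuals measured (100 fish, 40 plants, 250 residents) plays no role here; only the number of variables matters.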

Figure 1.1.6. Number of Variables I

Figure 1.1.7. Number of Variables II

Checkpoint 1.1.8. Number of Variables.

A high school math teacher asks her students to write down their favorite subject on a blank piece of paper. She then collects these anonymous papers and records the favorite subject of each student.

Question: What type of data did the teacher record?

Hint

The answer is either univariate, bivariate, or multivariate.

Answer

Univariate

Checkpoint 1.1.9. Number of Variables.

A country veterinarian is collecting data on the lifespan of various breeds of horses. To do this, she records the age and breed of each horse that dies in her region.

Question: What type of data is this?

Hint

How many pieces of information did the veterinarian collect for each horse?

Answer

Bivariate

Checkpoint 1.1.10. Number of Variables.

A state trooper, growing bored with her job, decides to pass the time recording data about the cars that pass her speed trap. She writes down the make, model, color, and speed of each of the cars that drive past.

Question: What type of data is she collecting?

Answer

Multivariate

Subsubsection 1.1.1.2 Types of Variable

Another way to classify data is based on the types of variables that are being measured. The type of a variable depends on what is being recorded. Consider the following example.

Example 1.1.11. Classifying Variables.

The post office, wishing to better understand its customers, decides to collect data from 1000 households within a certain state. They mail each household a questionnaire asking the residents to confirm their zip code, list the number of people who live in that household, and give the length of time that they have lived there. What are the variables in this study, and how do the measurements differ?

Solution

The three variables in this multivariate set of data are: zip code, number of residents, and length of time in residence. The characteristics of these variables are:

Zip Code.

While the values recorded are numbers (e.g., 99324, 49103, 92357) they actually represent categories and not numerical values. Notice that it would make no sense to “average” two zip codes!
Number of Residents.

These values are numbers (e.g., 2 residents, 5 residents, 1 resident) and they do actually represent numerical quantities. Notice that it does make sense to talk about the “average number of residents.” Also, note that the values that this variable can take on are all whole numbers. In particular, there are “gaps” between the possible values (1, 2, 3, etc.).
Length of Residence.

The values for this final variable are also numbers (e.g., 1 year, 2 months, and 12 days ≈ 1.2 years). These numbers do make sense in an average, and they differ from the number of residents values because there are no gaps between them. In other words, if I've lived at my house for 0.9 years, and you've lived at yours for 1.0 years, then there is still room “in between” those two values for somebody who has lived at their residence for 0.97 years.

This example helps point out the three types of variables that we will be studying. The first type, zip code, is an example of a variable that really yields non-numerical data. The values of such a variable represent categories into which the individuals can be grouped, or in other words a “quality” instead of a “quantity.”

Definition 1.1.12.

A qualitative variable (also called a categorical variable) is one that takes on values representing categories or descriptions as opposed to numerical values.

In this text we will spend the majority of our time working with variables whose values are numbers that can be added, divided, etc. These types of values represent a quantity as opposed to a quality, and they are named accordingly.

Definition 1.1.13.

A quantitative variable is one that takes on numerical values. Unlike a qualitative variable, the values of a quantitative variable can be used in computations.

In our example we saw that even though Number of Residents and Length of Residence are both quantitative variables, the measurements we take are still somewhat different. Because of these differences, we need two more definitions that apply only to quantitative variables.

Definition 1.1.14.

A discrete variable is a quantitative variable in which there are gaps between the possible values of the variable.

Definition 1.1.15.

A continuous variable is a quantitative variable for which there are an infinite number of possible values with no gaps between these values.

Before we move on, let's revisit the example above and give the formal types of each of the three variables.

Example 1.1.16. Classifying Variables, Revisited.

The post office, wishing to better understand its customers, decides to collect data from 1000 households within a certain state. They mail each household a questionnaire asking the residents to confirm their zip code, list the number of people who live in that household, and give the length of time that they have lived there. Identify and classify each of the variables in this survey.

Solution

The three variables in this multivariate set are:

Zip Code.

qualitative variable
Number of Residents.

quantitative and discrete variable
Length of Residence.

quantitative and continuous variable
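The reasoning behind these three answers can be written as a small decision procedure. The sketch below is hypothetical: the function `classify_variable` and its two yes/no inputs are our own way of encoding the questions "do the values represent numerical quantities?" and "are there gaps between the possible values?" It is not a tool used elsewhere in this text.

```python
def classify_variable(is_numerical_quantity, has_gaps=None):
    """Classify a variable from two yes/no questions about its values."""
    if not is_numerical_quantity:
        # Values are categories, even if they happen to be written as numbers.
        return "qualitative"
    if has_gaps is None:
        raise ValueError("for a quantitative variable, say whether there are gaps")
    if has_gaps:
        return "quantitative and discrete"
    return "quantitative and continuous"


# The three variables from the post office survey.
print("Zip Code:", classify_variable(is_numerical_quantity=False))
print("Number of Residents:", classify_variable(True, has_gaps=True))
print("Length of Residence:", classify_variable(True, has_gaps=False))
```

The first question must be answered by thinking about what the values mean, not by checking whether they are written with digits. A zip code is stored as a number, but it still labels a category.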

Now for some more video examples.

Figure 1.1.17. Types of Variables I

Figure 1.1.18. Types of Variables II

Checkpoint 1.1.19. Types of Variables.

A biologist studying the nesting habits of seagulls decides that as a part of her study, she will take an inventory of all nests at a certain nesting ground each morning for a month. She will count the number of chicks in each nest, record whether a parent gull is present, and specify the distance from each nest to its closest neighboring nest.

Question: Identify the type of each variable in this study.

Hint

Your options for each variable are qualitative, quantitative and discrete, or quantitative and continuous.

Answer

Number of Chicks.

quantitative and discrete
Presence of Parents.

qualitative
Distance to Closest Neighboring Nest.

quantitative and continuous

Checkpoint 1.1.20. Types of Variables.

A nursing student wishes to measure the effect of sleep deprivation on exam performance. He surveys other students after the most recent nursing exam is returned and asks how much sleep they got on the night before the exam, their exam score, and whether they were tired during the exam.

Question: What type of variable is the amount of sleep?

Hint

Does sleep come in chunks of time?

Answer

Quantitative and Continuous

Checkpoint 1.1.21. Types of Variables.

Question: In the sleep study from the previous checkpoint, what type of variable records whether the students were tired?

Answer

Qualitative

Subsection 1.1.2 Collecting Data

Before we even have data to analyze or describe, we must first collect the data.

Subsubsection 1.1.2.1 Populations and Samples

The process we use to collect data is just as important, if not more important, than the tools we use to analyze that data. Consider the following example.

Example 1.1.22. Representative Data.

Sam the high school student, who has not had a statistics class, decides to do a research project on the height of an average student at his school. Since he works for the basketball coach, Sam decides to look up the heights of all students who have played for the basketball team over the last 10 years. How useful will this data be for Sam's research project?

Solution

This data will be of no use in researching the height of an average high school student. The data was collected from a small group of students who will, in general, be much taller than the typical high school student.

Our analysis of a set of data will do us no good if that data does not come from the objects that we are trying to study. To better understand this concept, we need to first distinguish between the subjects we are trying to study, and the individuals from whom we actually collect data.

Definition 1.1.23.

A population is the collection of all subjects to be studied.

Definition 1.1.24.

A sample is a sub-collection of subjects drawn from a given population.

Using these new terms, the problem with Sam's research is that the sample he selected is not representative of the population he is studying. One way to remedy this problem would be to collect the height of all students at Sam's school. That is, he could measure the entire population.

Definition 1.1.25.

In a census, data is collected from every member of the population being studied.

However, if Sam attends a large high school, taking a census may not be possible. If we think of other examples, we soon come to the realization that in many instances the population being studied is either too large or it is too expensive for us to collect data from every member of the population. In those cases, we must carefully select a sample that is representative of our population.

Figure 1.1.26. Populations and Samples I

Figure 1.1.27. Populations and Samples II

Checkpoint 1.1.28. Populations and Samples.

To test the administrative side of the site, our web programmer randomly selects 50 people who rate their Internet skill level as “average” and asks them to upload book information and view customer orders.

Question: Which of the following is an accurate statement?

The programmer has not defined the population that she wishes to study.
The programmer has used a census when she should be taking a sample.
The programmer is using a sample that does not represent the population.
The programmer is using a convenience sample.

Answer

The programmer is using a sample that does not represent the population.

Checkpoint 1.1.29. Populations and Samples.

An engineer wishes to determine if a particular type of bracket can withstand a given amount of weight. To test this, he randomly selects 100 brackets produced at plant A, one of four different bracket manufacturing plants, during the month of April, 2019.

Question: From what population is the engineer actually sampling?

The population of all brackets of this design.
The population of all brackets produced at plant A.
The population of all brackets produced at plant A during April 2019.
The population of 100 brackets selected from those produced at plant A during April 2019.

Answer

The population of all brackets produced at plant A during April 2019.

Checkpoint 1.1.30. Populations and Samples.

A web programmer has just finished writing an online book store website. She needs to determine if the site will be usable. There are two parts to this website. The public section will allow a general Internet user to search for and purchase books from the store. The second, administrative section enables the five book store managers for whom the website is being written to enter the books they wish to sell and view customer orders. To test the public side of the site, our web programmer randomly selects 50 people who claim their Internet skill level is “average” and asks them to try out the public site and rate its usability. To test the administrative side, she has all five book store managers test the site and rate its usability.

Question: Identify each of the following as a population, sample, or census.

The data collected from the five managers
Internet users for whom the public site is designed
The five managers for whom the administrative site is designed
The 50 users who tested the public site.

Answer

These groups are identified as follows:

census
population
population
sample

Subsubsection 1.1.2.2 Sampling Plans

Since it is important that we select a sample that represents the population we are studying, we need to be very careful when selecting a sample. We not only need to guard against accidentally designing a sample that does not represent the population, but also against unintentional bias in the way we collect our data. One of the best tools in guarding against such biased samples is randomness, and the simplest way to use randomness to collect a sample is to collect a simple random sample.

Definition 1.1.31.

A simple random sample of size \(n\) is a sample drawn from a population in such a way that every group of \(n\) subjects in the population is equally likely to be chosen.

To better understand the concept of a simple random sample, consider the following example.

Example 1.1.32. Simple Random Samples.

A professor wishes to randomly select a sample of 6 students from her class of 48. The students are sitting in eight rows of six students each. The professor comes up with two sampling methods.

Assign each student a number from 1 to 48, and then write these numbers on slips of paper, put them in a bag, and randomly select six of these slips to determine which students will be included in the sample.
Assign each row of students a random number from 1 to 8. Then randomly select one number and include all six of the students in that row in the sample.

Which of these two methods will result in a simple random sample?

Solution

The key to solving this problem is deciding whether a particular method gives each group of six students the same chance of being chosen.

This method is a simple random sample. Each group of six students has the same chance of being chosen (we'll see how to compute this when we study probability).
This method is not a simple random sample. To see this, suppose that Bob is sitting in row 1 and Mary in row 2. Then it is impossible for Bob and Mary to be chosen in the same sample. This means that any group including both Bob and Mary has no chance of being chosen, and therefore all groups of six are not equally likely.
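We can check this reasoning with a short simulation. The sketch below uses Python's standard `random` module; the numbering of the students, the seat assignments for Bob and Mary (students 1 and 7), the seed, and the number of trials are all assumptions made only for illustration.

```python
import random

# Students are numbered 1 to 48; row 1 holds students 1-6,
# row 2 holds students 7-12, and so on for eight rows.
students = list(range(1, 49))
rows = [students[i:i + 6] for i in range(0, 48, 6)]


def method_one(rng):
    """Draw six slips from the bag: every group of six is equally likely."""
    return rng.sample(students, 6)


def method_two(rng):
    """Pick one row at random and include all six students in that row."""
    return list(rng.choice(rows))


rng = random.Random(1)  # fixed seed so the run can be repeated
bob, mary = 1, 7        # Bob sits in row 1, Mary sits in row 2


def count_together(method, trials=10_000):
    """Count the samples that include both Bob and Mary."""
    return sum(1 for _ in range(trials) if {bob, mary} <= set(method(rng)))


together_one = count_together(method_one)
together_two = count_together(method_two)
print("Method 1, samples with both Bob and Mary:", together_one)
print("Method 2, samples with both Bob and Mary:", together_two)
```

Under the second method the count is always zero, because no row contains both students. Under the first method, Bob and Mary do appear together in some samples, as they must if every group of six is to have a chance of being chosen.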

Ensuring that we get a simple random sample can be difficult. Even if we manage to design a simple random sample, we still cannot be certain that we will get a sample that is representative of the population we are studying. There are several alternative sampling techniques that can help us select a representative sample, and at the same time make it easier to select our sample. Each of these methods employs randomness. However, none of them will result in a true simple random sample.

Definition 1.1.33.

A systematic sample is a random sample collected from an ordered population by randomly choosing one of the first \(k\) subjects, and then selecting every \(k\)th subject thereafter.

One of the first 5 dots was randomly selected, and then every 5th dot thereafter is selected for the sample. — Figure 1.1.34. Systematic Sample: Subject #2 Then Every 5th Thereafter

A systematic sample is particularly useful when the population from which we are sampling is already ordered. Examples of this include products coming off of an assembly line, or cars moving through a sobriety check point.
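As a rough sketch of how a systematic sample could be drawn, consider the Python function below. The name `systematic_sample`, the list of 100 numbered products, and the choice \(k = 5\) are hypothetical and serve only to illustrate the definition.

```python
import random


def systematic_sample(population, k, rng):
    """Randomly choose one of the first k subjects, then take every kth after it."""
    start = rng.randrange(k)     # an index from 0 to k-1
    return population[start::k]  # that subject and every kth subject thereafter


# Hypothetical ordered population: products numbered as they leave the line.
products = list(range(1, 101))
sample = systematic_sample(products, 5, random.Random(7))
print(sample)
```

The only random choice is the starting point. Once it is fixed, the rest of the sample is completely determined, which is why a systematic sample is not a simple random sample.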

Definition 1.1.35.

In a stratified sample, a population is broken into several groups (called strata) and then a simple random sample is collected from each of these groups.

The dots have been divided into six groups, and then two dots are randomly selected from each group to form the sample. — Figure 1.1.36. Stratified Sample: Three Sampled from Each Stratum

In a stratified sample we group the population by characteristics which we think may make a difference, and then select randomly from each group. For example, we may want to ensure that both men and women are included in a drug study, so we select some men and some women. Or we may wish to ensure that voters of all political leanings are included in an election poll.
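A stratified sample for the drug study mentioned above might be sketched as follows. Everything specific here is an assumption for illustration: the function name `stratified_sample`, the volunteer labels, the sizes of the two strata, and the decision to take 10 subjects from each.

```python
import random


def stratified_sample(strata, sizes, rng):
    """Take a separate simple random sample from each stratum."""
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}


# Hypothetical volunteers, already grouped into two strata.
strata = {
    "men": [f"M{i}" for i in range(1, 61)],    # 60 men
    "women": [f"W{i}" for i in range(1, 41)],  # 40 women
}
sizes = {"men": 10, "women": 10}

sample = stratified_sample(strata, sizes, random.Random(3))
for name, chosen in sample.items():
    print(name, chosen)
```

Because a separate random selection is made inside every stratum, each group is guaranteed to appear in the sample.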

Definition 1.1.37.

A cluster sample is taken by dividing the population into several groups (called clusters) and then picking a simple random sample of the clusters. We then sample each individual in the selected clusters.

The dots have been divided into six groups, and then three of those groups are randomly selected and every dot from the selected groups is added to the sample. — Figure 1.1.38. Cluster Sample: All Individuals in Three Clusters
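The same idea can be sketched in Python. The function name `cluster_sample`, the cluster labels, and the assumption of five dots per cluster are hypothetical; only the "three of six groups" structure is taken from the figure.

```python
import random


def cluster_sample(clusters, num_clusters, rng):
    """Randomly select whole clusters, then include every member of each one."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    members = [member for name in chosen for member in clusters[name]]
    return chosen, members


# Hypothetical population: six clusters of five dots each.
clusters = {f"group {g}": [f"dot {g}-{i}" for i in range(1, 6)]
            for g in "ABCDEF"}

chosen, sample = cluster_sample(clusters, 3, random.Random(5))
print("Selected clusters:", chosen)
print("Sample size:", len(sample))
```

Here the random selection is applied to the groups themselves, and no further selection takes place inside a chosen group.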

The difference between a cluster sample and a stratified sample is sometimes confusing. We can, however, tell the difference by looking at when the simple random sample is taken. In a stratified sample, we take a separate simple random sample from each group (or stratum). In a cluster sample, however, we take a simple random sample of the groups (or clusters) and then sample every individual in the selected groups. Consider the following example.

Example 1.1.39. Sampling Techniques.

Determine the type of sample being taken in each of the situations described below.

The pastor of a large church wishes to know what proportion of his congregation would be interested in a new outreach opportunity. Since the church is so large, he decides to collect a sample by randomly selecting 25 of the 100 active small group classes and asking all members of those 25 classes if they would support the new outreach program.
A county judge wishes to determine how many of the defendants that she sees have a high-school diploma. In order to collect a sample from her docket, she has her clerk use the computer to randomly select 40 case numbers and then checks the educational background of those defendants.
Workers for a consumer survey corporation collect data at a local mall by selecting a strategic location and then randomly selecting one of the first 50 people who walk by to ask to take the survey. They then ask every 50th person who walks by to complete the survey.
A golf pro shop wants to determine the average amount of money spent per customer. They suspect that men spend more in the shop than women. They therefore divide their customers into two groups: men and women. They then randomly select 40 of the male and 35 of the female customers and examine their most recent purchase.

Solution

The sampling techniques used are as follows.

The small group classes in this example are clusters. Since the pastor randomly selects several of these clusters and then questions every member of the class, this is a cluster sample.
In this case the judge is conducting a simple random sample. There is no special technique being used.
Since the worker in this situation is looking at every 50th person, this is a systematic sample.
The golf shop splits the population into two strata, men and women, and then randomly selects a few individuals from each of these strata. This is therefore a stratified sample.

Figure 1.1.40. Sampling Techniques I

Figure 1.1.41. Sampling Techniques II

Checkpoint 1.1.42. Types of Samples.

Five samples are described below.

Testing Cars.

Automakers are required by the government to do thorough safety inspections on a certain percent of the cars they produce. The thinking is that if there are manufacturing defects in any of the cars, they should appear in a representative sample. To meet these requirements, a certain automaker decides that they will randomly pick one of the first 25 cars that comes off of the assembly line and then check every 25th car thereafter.
Selecting Candy.

In order to check the claim that 20% of Skittles candy is red, you decide to buy a large number of bags of Skittles and check the proportion of candies that are red. To accomplish this you start by listing the convenience stores in your town in alphabetical order. Then you randomly select 15 of those stores, and purchase all of the bags of Skittles at each of those stores.
Polling Students.

A statistics student wishes to determine the average number of math classes taken by a university student in a given year. To do this, she decides that she will randomly select 50 Freshmen, 50 Sophomores, 50 Juniors, and 50 Seniors and ask them how many math classes they have taken in the last year.
Measuring Fish.

A game warden wishes to determine the proportion of fish in a certain lake that are of an acceptable length for fishermen to keep. The warden decides to row around the lake catching one fish, throwing it back, and moving on to catch another until he has caught and measured a total of 100 fish.
State Survey.

Wishing to determine the average number of people in a typical household, a state government randomly selects 10 counties in the state and mails a survey to every household in those counties.

Question: Determine the type of sampling technique used in each of the situations described above.

Answer

The sampling techniques used are as follows:

systematic sample
cluster sample
stratified sample
simple random sample
cluster sample

Checkpoint 1.1.43. Types of Samples.

Five samples are described below.

Finding Favorite Pizza Toppings.

A pizza parlor wishes to determine its customers' favorite pizza topping. To do this, they decide that on each day of the coming week they will randomly pick one of the first 10 people who order a pizza and ask them their favorite topping. They will then ask every 10th person after this randomly selected person for their favorite pizza topping.
Identifying Road Rage.

A sociologist wishes to measure the percent of traffic accidents that are at least partially caused by road rage. To collect his data, the researcher randomly selects 30 accidents involving motorcycles, 50 involving passenger trucks, and 100 involving passenger cars. He then questions witnesses to these accidents looking for evidence that road rage was involved.
Classifying UFOs.

Your crazy uncle believes that there is a conspiracy to cover up UFO sightings. To check his findings, he randomly selects 10 of the 50 states and attempts to get data of every UFO sighting in those 10 states. Unfortunately, men in black suburbans show up and haul him away before he can complete his study.
Religious Affiliation.

A religious organization wishes to determine what percent of Americans consider themselves Christian. To accomplish this, they divide the country into regions: northwest, southwest, midwest, south, and northeast. They randomly select 500 registered voters from each region and ask them if they consider themselves to be Christian.
Measuring Student Outcomes.

A high school principal needs to determine if her students are able to read at an appropriate grade level. To accomplish this, she assigns each student a random number, and has a computer randomly select 200 of the students. She then gives these 200 students a reading comprehension test to determine if their reading level is grade-appropriate.

Question: Determine the type of sampling technique used in each of the situations described above.

Answer

The sampling techniques used are as follows:

systematic sample
stratified sample
cluster sample
stratified sample
simple random sample

Checkpoint 1.1.44. Types of Samples.

Five samples are described below.

Determining Drug Effectiveness.

In order to determine the effectiveness of a drug, a doctor randomly selects 100 people who have taken the drug. The doctor does not differentiate based on gender, race, or condition, but simply checks to see how quickly the patient recovered after starting the drug therapy.
Measuring Ballot Measure Support.

A polling organization has been charged with the task of determining if a ballot measure will pass in the next election. To get a balanced sample, the organization decides that they will randomly select 450 Republicans, 450 Democrats, and 200 Independents and ask them if they plan to vote for the ballot measure.
Computing Computer Costs.

A popular computing magazine editor wishes to determine the cost of a typical computer. To do this, she asks the magazine's readers to write in if they have purchased a computer in the last six months, and give the price of the new computer. The editor takes the first response she receives and then takes every 5th response that comes in after that.
Identifying Climate Change.

In order to determine the average global temperature, a climate research organization assigns a number to each temperature measuring station in the United States. They then randomly select 250 of these stations and look at the average temperature measured.
Finding Insect Genders.

An entomologist decides to determine what proportion of insects in her back yard are male. To do this she divides her back yard into a 10 by 10 grid. She then randomly selects 20 of these grid squares and determines the sex of each insect in those squares.

Question: Determine the type of sampling technique used in each of the situations described above.

Answer

The sampling techniques used are as follows:

simple random sample
stratified sample
systematic sample
simple random sample
cluster sample

Subsubsection 1.1.2.3 Types of Studies

A sample may be collected from a population for one of two reasons. We may simply wish to know a characteristic of that population. On the other hand, we may want to try to change a characteristic and then measure the result. These result in two different types of studies.

Definition 1.1.45.

In an observational study specific characteristics are observed and measured, but no attempt is made to change these characteristics.

Definition 1.1.46.

In an experiment, some treatment is applied and its effects on the subject are observed.

Notice that the main difference is that in an experiment we attempt to change a characteristic whereas in an observational study, we do not. Because experiments seek to measure the effect of a treatment on the subject, they are by nature more complicated than observational studies. Below are some specific terms used when designing an experiment.

Definition 1.1.47.

When conducting an experiment, the following terms are often used.

A factor is an independent variable the values of which are varied by the experimenter.
A treatment is a specific combination of factor values.

To help see the difference between these two types of studies, let's look at a couple of examples.

Example 1.1.48. Types of Studies.

Sam wishes to determine the length of time that it takes a radish seed to germinate. He plants radishes in his garden, and continues his normal watering and weeding process. When the first radish sprouts emerge, Sam records the length of time that has elapsed. Is Sam conducting an observational study or an experiment?

Solution

In this case, Sam is conducting an observational study. He does not attempt to change the conditions under which the seeds are growing, he simply observes the length of time it took sprouts to emerge.

Example 1.1.49. Types of Studies.

Nancy is more ambitious than Sam. She wants to know if radish seeds can be made to germinate faster by altering the conditions in which they grow. To test this theory, Nancy decides that she will plant four groups of radish seeds. The first group will be watered and weeded as normal. The second group will be watered and weeded as well, but will in addition receive fertilizer every other day. The third group will be weeded as normal, but will receive twice as much water. Finally, the fourth group of radish seeds will be weeded as normal, receive fertilizer every other day, and get twice as much water. Nancy will record how long it takes each group of radish seeds to sprout. Is this an observational study or an experiment?

Solution

Nancy is conducting an experiment. By altering the environment of the radish seeds, she is varying the treatment that each group of seeds receives and then measuring the effects. The two factors that Nancy is varying are amount of water and fertilization. The treatments are represented by the four groups:

normal water, no fertilizer
normal water, fertilizer
double water, no fertilizer
double water, fertilizer.
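Since a treatment is a specific combination of factor values, the four treatments listed above can be generated by pairing every value of one factor with every value of the other. A minimal Python sketch, using the standard `itertools.product` function and a hypothetical `factors` dictionary of our own, is:

```python
from itertools import product

# Nancy's two factors and the values she uses for each.
factors = {
    "water": ["normal water", "double water"],
    "fertilizer": ["no fertilizer", "fertilizer"],
}

# Each treatment is one combination of factor values.
treatments = list(product(*factors.values()))

for treatment in treatments:
    print(", ".join(treatment))
```

With two factors that each take two values, there are \(2 \times 2 = 4\) treatments. Adding a third two-valued factor, such as weeding or not weeding, would double the number of groups Nancy needs to plant.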

To see more examples, check out the videos below.

Figure 1.1.50. Types of Studies I

Figure 1.1.51. Types of Studies II

Checkpoint 1.1.52. Types of Studies.

A food production company wishes to determine if there is a market for spicy peanut butter. They decide to randomly select 1000 individuals who have purchased their peanut butter products in the past and send them a questionnaire asking if they would be interested in trying spicy peanut butter.

Question: What type of study is this?

Answer

Observational Study

Checkpoint 1.1.53. Types of Studies.

A college student wishes to convince her fellow students that the cost of food in the cafeteria is too high. To measure the effectiveness of several methods, she shows a short video to one group of students, has a group discussion with another, and takes a third group out to lunch at the cafeteria. She then has each group respond to the question “Does the cafeteria charge too much?”

Question: What type of study is this?

Answer

Experiment

Checkpoint 1.1.54. Types of Studies.

You receive a phone call from the local campaign for candidate X. After reading you a list of all of the reasons why you should support their candidate, the caller asks if you are planning to vote for candidate Y.

Question: What type of study is this?

Answer

Experiment

Subsubsection 1.1.2.4 Things to Avoid

Now that we have looked at a lot of good tools for collecting data, it is time to think about some things that we need to avoid.

Bad Samples.

If we use a sampling technique that does not include some element of randomness, we are likely to produce a sample that does not represent our population. One of the easiest, but most dangerous, ways to collect data is by using a convenience sample.

Definition 1.1.55.

A convenience sample is a sample collected by simply selecting the most convenient members of a population.

The problem with this method of sampling is that it is likely that there will be some bias in the selection process.

Definition 1.1.56.

A sample has bias and is called biased if it over-represents one portion of a population and under-represents another.

Consider the following example.

Example 1.1.57. Sources of Bias.

You wish to conduct a study of dessert preferences in Americans. You believe that more people prefer cake to pie, as you do. To gather your data, you ask your immediate family members if they prefer cake or pie and find that all of them prefer cake! What is wrong with your sampling technique?

Solution

Your sample is not representative of the entire population of Americans. By selecting a convenience sample (your family) you have chosen individuals who are likely to have similar preferences--after all, you all grew up with more cake than pie. This is a source of bias in your sample.
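
To see how far a convenience sample can drift from the truth, here is a small simulation sketch. Everything in it is hypothetical: the population size, the assumption that everyone in a household shares one preference, and the exact 50/50 split are invented purely for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 2,500 households of 4 people each.
# Assumption: everyone in a household shares one preference,
# and exactly half of the households prefer cake.
households = [["cake"] * 4 if i % 2 == 0 else ["pie"] * 4 for i in range(2500)]
population = [person for home in households for person in home]

def share_cake(sample):
    """Return the fraction of a sample that prefers cake."""
    return sum(1 for choice in sample if choice == "cake") / len(sample)

convenience = households[0]                     # your own cake-loving family
random_sample = random.sample(population, 200)  # simple random sample

print("population: ", share_cake(population))
print("convenience:", share_cake(convenience))
print("random:     ", share_cake(random_sample))
```

In this made-up population exactly half prefer cake, yet the family sample reports 100% cake. The simple random sample should land much closer to one half.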

Leading Questions.

The way that a question is asked can also be a source of bias in a study. Simple things such as the order of options (“do you like cake or pie?” vs. “do you like pie or cake?”) can cause people to change their answers. Some questions, however, are intentionally worded to evoke a specific response.

Definition 1.1.58.

A question which is phrased in such a way as to elicit a specific answer is called a leading question.

Consider the following example, which actually happened to a Walla Walla University mathematics professor.

Example 1.1.59. Leading Question.

A National Rifle Association call center employee is told to find out if US voters support the most recent gun control legislation. To determine this, she randomly calls 200 registered voters and asks “do you support the most recent attempt by the gun-hating congress to trample your 2nd amendment rights?” Is there a problem with this questioning technique?

Solution

By characterizing Congress as “gun-hating” and the bill as “trampling rights,” the caller makes it clear what the “correct” answer to this question should be. People are much more likely to say they do not support the legislation after being asked this question.

Voluntary Response.

Another potential source of bias in a sample comes from having people “volunteer” to respond to a survey.

Definition 1.1.60.

A sample in which people choose themselves by responding to a general request to participate is called a voluntary response sample.

If you mail out a questionnaire asking people how they feel about a certain controversial celebrity, you are likely to get responses only from those who care enough to take the time to answer your questions. This leads to responses only from those who have strong opinions, and most often from those with strongly negative opinions. Consider the following actual example.

Example 1.1.61. Voluntary Response Sample.

Ann Landers, the famous advice columnist, once asked her readers to respond to the question “If you had it to do over again, would you have children?” Nearly 70% of the 10,000 parents who responded said that they would not have had children if they could do it over again. Does this accurately reflect the sentiment of parents in general?

Solution

No! This data is worthless as an indication of the opinion of parents in general. Only people who had strong feelings, probably due to bad experiences, were angry enough to take the time to write in. A few months later a well-designed opinion poll on the same topic found that 91% of parents would have children again.
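
A little arithmetic shows how self-selection alone could produce such a gap. The sketch below takes the 9% of parents who would not have children again from the follow-up poll quoted above; the two response rates are invented for illustration, and nothing is known about the real ones.

```python
def share_no_among_respondents(share_no, rate_no, rate_yes):
    """Expected fraction of 'no' answers among people who choose to respond."""
    no_responses = share_no * rate_no
    yes_responses = (1 - share_no) * rate_yes
    return no_responses / (no_responses + yes_responses)

# 9% "no" in the population matches the follow-up poll (91% "yes").
# The response rates below are hypothetical: unhappy parents are assumed
# to be 24 times as likely to write in as satisfied ones.
result = share_no_among_respondents(share_no=0.09, rate_no=0.024, rate_yes=0.001)
print(round(result, 2))
```

Under these made-up rates, roughly 70% of the letters say "no" even though only 9% of parents feel that way. The size of the sample (10,000 responses) does nothing to correct this.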

Lurking Variables.

One of the most common mistakes to avoid when using statistics is assuming that because there is a correlation between two things, one must cause the other. In many cases there can be a third (or fourth, or fifth) unknown variable that “lurks” in the background and affects both observed variables. Let's define these terms precisely and finish with an example.

Definition 1.1.62.

We say that there is correlation between two (or more) variables if certain values of one frequently appear together with certain values of the other.

For example, we would say that there is a correlation between people's height and weight because taller people tend to weigh more. But does being taller cause you to weigh more, or do they just go together?

Definition 1.1.63.

We say that there is causation between two variables if the value of one of the variables causes the value of the other to change.

To understand this term, suppose that we measured the variables “average driving speed” and “number of speeding tickets.” It would be reasonable to say that there is causation between these two variables. That is, driving faster causes one to get more speeding tickets. But it is possible that there is another variable affecting both your driving speed and number of tickets. Suppose, for example, that you drive a fancy red sports car. This is a potential example of a lurking variable.

Definition 1.1.64.

A lurking variable is an unmeasured variable that has an important effect on the relationship between variables in a study.

Consider the following examples.

Example 1.1.65. Lurking Variables.

In a study of fuel efficiency and its relationship to weight, the Buick Estate Wagon appeared to be an anomaly. Although it weighed much more than most cars in its class, the Buick Estate Wagon had better fuel economy than it should have had in relation to these other cars. Further research indicated that Buick recommended a higher tire inflation pressure than did the other car manufacturers. Identify the lurking variable in this study.

Solution

Tire pressure is the lurking variable. While it has an effect on gas mileage (higher pressure = better mileage), it was not included in the original study.

Example 1.1.66. Lurking Variables.

A study of nutrition and TV ownership in Africa found that there was a high correlation between a family suffering from malnutrition and not owning a television set. A statistics student concludes that there is causation as well. That is, owning a television set leads to better nutrition. Critique this conclusion.

Solution

The student is confusing causation and correlation. While there may be correlation between owning a television and having good nutrition, it is not logical to conclude that having a TV improves nutrition. It is much more reasonable to conclude that there is a lurking variable in this study. Perhaps the economic status of the families studied is causing both the differences in nutrition and television ownership.
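
One way to check for a lurking variable is to compare the groups again within each level of the suspected variable. The counts below are entirely hypothetical (they are not from the study described above) and are constructed so that, within each income group, malnutrition is equally common with or without a television.

```python
# Hypothetical counts, invented for illustration:
# (income, owns_tv, malnourished) -> number of families.
counts = {
    ("low", True, True): 60,    ("low", True, False): 40,
    ("low", False, True): 540,  ("low", False, False): 360,
    ("high", True, True): 90,   ("high", True, False): 810,
    ("high", False, True): 10,  ("high", False, False): 90,
}

def malnutrition_rate(owns_tv, income=None):
    """Malnutrition rate for TV owners or non-owners, optionally within one income group."""
    rows = [(key, n) for key, n in counts.items()
            if key[1] == owns_tv and (income is None or key[0] == income)]
    total = sum(n for _, n in rows)
    malnourished = sum(n for key, n in rows if key[2])
    return malnourished / total

print("all families:", malnutrition_rate(True), malnutrition_rate(False))
for income in ("low", "high"):
    print(income, "income:", malnutrition_rate(True, income), malnutrition_rate(False, income))
```

Across all families in this invented table, TV owners have a malnutrition rate of 15% compared with 55% for non-owners, which looks like a strong relationship. Within each income group the two rates are identical, so in this sketch income accounts for the entire difference.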

Review these things to avoid by watching one or both of the following video examples.

Figure 1.1.67. Things to Avoid I

Figure 1.1.68. Things to Avoid II

Checkpoint 1.1.69. Things to Avoid.

Jane works for an advertising firm and has been asked to determine if consumers still like advertising jingles. To make this determination, she asks 20 of her friends from other advertising firms “do you want to risk your job by doing away with advertising jingles?” She finds that, not surprisingly, everybody she polled likes advertising jingles.

Question: Which of the “things to avoid” mentioned in this section has Jane done in her study?

Answer

Jane has both taken a convenience sample and asked a leading question.

Checkpoint 1.1.70. Things to Avoid.

A major department store wishes to determine if people with store credit cards enjoy their shopping experience more than those who do not have store credit cards. To research this question, they send surveys to everybody who has shopped at their store in the last year and ask them if they enjoyed their shopping experience, and if they have a store card. They find that the majority of returned surveys are from people who either have a card and enjoyed their shopping experience, or from people who do not have a card and didn't enjoy their shopping experience. The store concludes that having a store card makes shopping with the store more enjoyable.

Question: Which of the “things to avoid” mentioned in this section did the department store do in its study?

Answer

They have both taken a voluntary response sample and confused correlation with causation.
