Section 5.1 Introduction to Hypothesis Testing
Why Do We Need Hypothesis Tests?
While we now have a good set of tools for estimating the value of a population mean, a proportion, or the difference between two means or two proportions, there are times when we wish to make a decision rather than estimate a value. Consider the following situations.
In order to be profitable, a bus route must have an average of at least 25 paying customers. We wish to decide if a certain route is profitable by collecting ridership information on 15 different occasions during a given month.
In order to pass, a new school bond measure needs to get at least 60% support among voters. You collect a sample of 500 likely voters and find that 61% of them plan to vote for the bond. Is this sufficient evidence to claim that the bond will pass on election day?
A researcher claims that men and women do not get the same number of hours of sleep a night. To test this claim, you take a sample of 250 men and 200 women and determine the number of hours of sleep that each individual in these samples gets.
In each case we make a claim about a population and then gather a sample from that population to determine if our claim is reasonable. As we shall see, we do this in a very methodical way. For example, the 61% of voters who support the school bond in our sample will probably not be enough for us to claim with any level of statistical significance that the bond will pass. A random sample of only 500 voters is likely to have a proportion supporting the bond that is one percent or more different from the proportion of support in the entire population.
Definition 5.1.1. Statistical Test of Hypothesis.
A statistical test of hypothesis, or just a hypothesis test, is a four-step process used to make a decision about a population based on a sample. These steps are:
1. Identify the hypothesis being tested and the underlying assumptions that go with that hypothesis.
2. From a given sample, compute a test statistic based on the sample statistics and the underlying assumptions about the population.
3. Determine the likelihood of observing a test statistic at least as extreme as the one computed if our assumptions are correct.
4. Come to a conclusion about the hypothesis based on the likelihood found above.
We begin Chapter 5 by examining these steps in detail.
Objectives
After finishing this section you should be able to
- describe the following terms:
Left-Tailed Hypothesis Test
Null and Alternative Hypothesis
P-Value
Right-Tailed Hypothesis Test
Significance Level
Statistical Test of Hypothesis
Test Statistic
Two-Tailed Hypothesis Test
Type I Error
Type II Error
- accomplish the following tasks:
Formulate null and alternative hypotheses
Understand what a test statistic tells us
Be able to perform a traditional hypothesis test using a test statistic and critical regions
Be able to perform a p-value hypothesis test using a normal distribution
Understand and identify type I and type II errors
Subsection 5.1.1 Formulating Hypotheses
The first step in conducting a hypothesis test is to figure out what we are hypothesizing. While this may seem obvious, it is often overlooked or assumed to be “understood.” However, we must be very careful to explicitly state two hypotheses, called the null and alternative hypotheses, before we begin a hypothesis test. This is done not only to ensure our own understanding, but also because the numbers used in these hypotheses will be needed in the next step of the hypothesis test.
Definition 5.1.2.
In a statistical test of hypothesis, we compare the alternative hypothesis against the null hypothesis. These hypotheses are:
- Alternative Hypothesis: \(H_A\).
This is generally the hypothesis that the researcher wishes to support. The alternative hypothesis claims that a population parameter is “greater than,” “less than,” or “not equal to” a certain value.
- Null Hypothesis: \(H_0\).
The null hypothesis is a contradiction of the alternative hypothesis. It is the commonly held belief about the population parameter, and it must involve equality. The null hypothesis is the claim that the researcher is trying to disprove, reject, or nullify.
As a general rule, the best way to identify these hypotheses is to start by looking for the null hypothesis. It is usually easiest to find \(H_0\) because it must involve equality. Then, you can find \(H_A\) by contradicting \(H_0\text{.}\) Consider the following examples.
Example 5.1.3. Stating Hypotheses About a Mean.
In order to be profitable, a bus route must have an average of at least 25 paying customers. We wish to decide if a certain route is profitable by collecting ridership information on 15 different occasions during a given month. State the null and alternative hypotheses for this test.
We wish to determine if the average number of customers on the bus is “at least 25” in order to continue the bus route. Turning this around, we will cancel the bus route if the average number of customers is “less than 25.” This, then, should be our alternative hypothesis. Note also that “less than 25” does not involve equality, while “at least 25,” meaning greater-than-or-equal-to 25, does involve equality, making it the null hypothesis. Since these are average ridership numbers, our hypotheses are:
\(H_0:\ \mu \geq 25\)
\(H_A:\ \mu \lt 25\)
Example 5.1.4. Stating Hypotheses About a Difference Between Means.
A researcher claims that men and women do not get the same number of hours of sleep a night. To test this claim, you take a sample of 250 men and 200 women and determine the number of hours of sleep that each individual in these samples gets.
In this example, the researcher claims that two means—hours of sleep for men and hours of sleep for women—are not equal. The opposite of that claim is that the means are equal. If the means are equal, then their difference is zero, and if the means are not equal, their difference is not zero.
These are stated as the null and alternative hypotheses below, where \(\mu_1\) is the mean hours of sleep for men and \(\mu_2\) is the mean hours of sleep for women.
\(H_0:\ \mu_1 - \mu_2 = 0\)
\(H_A:\ \mu_1 - \mu_2 \not= 0\)
While we will spend more time on this later in the section, it should be noted here that the outcome of a hypothesis test will be one of two conclusions:
- “We reject the null hypothesis”.
This means that the sample we found would be very unlikely if the null hypothesis were indeed true, so we conclude that the null hypothesis cannot be true. Note that because \(H_A\) is the opposite of \(H_0\text{,}\) we must therefore support the alternative hypothesis.
- “We fail to reject the null hypothesis”.
Notice the wording. We do not say that we have proven the null hypothesis, or even that we support it. We are simply stating that the sample we found is consistent with the assumptions put forth in the null hypothesis. That means that we have found no support for the alternative hypothesis.
Checkpoint 5.1.7.
You wish to test the claim that the average weight of a black bear is more than 300 pounds.
Question: which of the following should be your null hypothesis?
(a) \(\mu \lt 300\)
(b) \(\mu \leq 300\)
(c) \(\mu \not= 300\)
(d) \(\mu \geq 300\)
(e) \(\mu \gt 300\)
Answer: (b)
Checkpoint 5.1.8.
A restaurant claims that more than 80% of their customers call them their favorite eating establishment. You wish to test this claim.
Question: what is the alternative hypothesis, \(H_A\text{,}\) in this situation?
(a) \(\mu \lt 80\)
(b) \(\mu \leq 80\)
(c) \(p \leq 0.8\)
(d) \(p \gt 0.80\)
(e) \(p \leq 80\)
Answer: (d)
Subsection 5.1.2 Test Statistics and Significance Levels
Once we have stated our hypotheses, it is time to see if our sample supports those claims. To do this we must translate the information from our sample—which will be a mean, a proportion, a difference between means, etc.—into a standardized random variable. That is, we need to find the z-score or t-score that goes with the sample we observed, under the assumption that the null hypothesis is true.
Definition 5.1.9.
The test statistic for a sample is a numerical summary that reduces the sample data to a single value based on the assumptions of the null hypothesis.
Notice that the test statistic is computed based on two pieces of information: the observed sample, and the assumptions made in the null hypothesis. In later sections we will see how to compute test statistics for single means, proportions, and differences between means and proportions.
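As a small preview of the computations in later sections, the sketch below (in Python) shows one common way a test statistic is formed for a single mean when the population standard deviation is known; every number in it is made up purely for illustration.

```python
# A minimal sketch of a z test statistic for a single mean,
# assuming the population standard deviation is known.
import math

mu_0 = 25       # mean claimed by the null hypothesis (hypothetical value)
x_bar = 23.2    # observed sample mean (made-up for illustration)
sigma = 4.0     # population standard deviation (assumed known here)
n = 15          # sample size

# Standardize the observed sample mean under the assumption that H0 is true.
z_test = (x_bar - mu_0) / (sigma / math.sqrt(n))
print(round(z_test, 2))  # about -1.74 for these made-up numbers
```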
This test statistic is the first step in determining if the sample we found is unusual. We must also have some guidance as to what the word “unusual” should mean. In this context, the term “unusual” refers to how likely we would be to see this test statistic for a sample if the null hypothesis is indeed true. We have seen this notion of how likely a sample is before. Recall that to construct a confidence interval, we need to pick a confidence level. If, for example, we decide to use the 95% confidence level, we are willing for our confidence interval to be “wrong” 5% of the time. In general, in a \((1-\alpha)100\%\) confidence interval, \(\alpha\) is the likelihood that our sample is unlikely enough that the confidence interval won't contain the true population parameter. When conducting hypothesis tests, this same \(\alpha\) has a special role.
Definition 5.1.10.
The significance level \(\alpha\) in a statistical test of hypothesis is the probability of incorrectly rejecting a true null hypothesis that we are willing to live with.
The following chart shows commonly used significance levels, the corresponding confidence level, and the term applied to those values of \(\alpha\text{.}\)
| \(\alpha\) | Confidence Level | Terminology |
| --- | --- | --- |
| 0.10 | 90% | Tends Towards Significance |
| 0.05 | 95% | Statistically Significant |
| 0.01 | 99% | Highly Significant |
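Since the confidence level is just \((1-\alpha)100\%\text{,}\) the conversion is simple arithmetic; the short Python sketch below reproduces the first two columns of the chart.

```python
# Convert each significance level from the chart to its confidence level.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  ->  {(1 - alpha) * 100:.0f}% confidence level")
```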
Checkpoint 5.1.13.
Recall that significance level and confidence level are related. Suppose that you wish to use the \(\alpha = 0.03\) significance level.
Question: what confidence level goes with this?
97%
Checkpoint 5.1.14.
We have seen that certain significance levels have special names.
Question: which significance level is called “highly significant?”
\(\alpha = 0.01\)
Subsection 5.1.3 Critical Regions and the Traditional Test
In order to reach a conclusion in our hypothesis test, we need to compare the test statistic with the significance level. However, the test statistic will typically be a z-score or t-score, while the significance level is a probability. So we need to convert one or the other of them before we can compare them. This can be done in two ways.
We can convert our significance level to critical values and then compare the test statistic to those critical values. This is called a traditional hypothesis test. This is the first method we will use.
We can convert our test statistic to a probability by finding the probability of getting a test statistic at least as extreme as the one observed, and then compare this probability to \(\alpha\text{.}\) This is called a p-value hypothesis test.
To conduct a traditional hypothesis test, we must divide the test statistic's distribution up into two regions. The rejection region will be those values of the test statistic which are more unusual than the significance level, and will therefore lead to rejecting \(H_0\text{.}\) The acceptance region is the rest of the distribution, containing those values of the test statistic which are not unusual. How we draw these regions depends on the alternative hypothesis being tested. In the pictures that follow, the red areas are the rejection regions and the blue areas are the acceptance regions. Note that \(\theta\) could be a mean or proportion.
Definition 5.1.15.
In a left-tailed hypothesis test, we test the alternative that the population parameter is less-than a given value against the null hypothesis that it is greater-than-or-equal-to that value. We therefore put the full significance level \(\alpha\) in the left tail as shown.
Definition 5.1.17.
In a right-tailed hypothesis test, we test the alternative that the population parameter is greater-than a given value against the null hypothesis that it is less-than-or-equal-to that value. We therefore put the full significance level \(\alpha\) in the right tail as shown.
Definition 5.1.19.
In a two-tailed hypothesis test, we test the alternative that the population parameter is not equal to a given value against the null hypothesis that it is equal to that value. We therefore split the significance level \(\alpha\) equally between the left and right tails.
To determine if a test statistic lies in a rejection region, we find the critical value from the appropriate distribution which separates the rejection region from the acceptance region. In a normal distribution, these would be \(-z_\alpha\text{,}\) \(z_\alpha\text{,}\) or \(\pm z_{\alpha/2}\text{.}\) If our test statistic is more extreme—that is, further into the tail—than this critical value, we reject the null hypothesis. If it is less extreme—closer to the middle—we fail to reject the null hypothesis. Consider the following example.
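If software such as Python with the SciPy library is available, these critical values can be found from the inverse of the normal distribution rather than from a table. The sketch below uses \(\alpha = 0.05\) only as an example.

```python
# Critical values of the standard normal distribution at significance level alpha.
from scipy.stats import norm

alpha = 0.05

left_crit = norm.ppf(alpha)            # left-tailed test:  reject if z_test < left_crit
right_crit = norm.ppf(1 - alpha)       # right-tailed test: reject if z_test > right_crit
two_crit = norm.ppf(1 - alpha / 2)     # two-tailed test:   reject if |z_test| > two_crit

print(left_crit, right_crit, two_crit)  # about -1.645, 1.645, and 1.96
```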
Example 5.1.21. Conducting a Right-Tailed Traditional Test.
You wish to test the null hypothesis \(H_0:\ \mu \leq 0.5\) against the alternative \(H_A:\ \mu > 0.5\) at the \(\alpha = 0.05\) significance level. If your test statistic of \(z_\text{test} = 2.04\) comes from a standard normal distribution, what conclusion do you make?
We first note that this is a right-tailed test since the alternative hypothesis involves “\(\gt\)”. At the \(\alpha = 0.05\) significance level, our critical value is \(z_{0.05} = 1.645\text{.}\) Since the test statistic \(z_\text{test} = 2.04\) is further into the tail (and thus more unusual) than the critical value, we reject the null hypothesis.
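The same comparison can be checked with software; the Python sketch below simply reproduces the reasoning of this example.

```python
from scipy.stats import norm

alpha, z_test = 0.05, 2.04
critical_value = norm.ppf(1 - alpha)   # about 1.645 for a right-tailed test

if z_test > critical_value:
    print("Reject the null hypothesis")          # this branch runs: 2.04 > 1.645
else:
    print("Fail to reject the null hypothesis")
```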
Example 5.1.22. Conducting a Two-Tailed Traditional Test.
You wish to test the null hypothesis \(H_0:\ p = 0.4\) against the alternative \(H_A:\ p \not= 0.4\) at the \(\alpha = 0.01\) significance level. Your test statistic of \(z_\text{test} = 2.40\) comes from a standard normal distribution. What conclusion do you make?
Because the alternative hypothesis involves “not equal to,” this is a two-tailed test. So we split the significance level \(\alpha = 0.01\) evenly between the two tails, giving us critical values \(z_{0.005} = \pm 2.575\text{.}\) To be more extreme, our test statistic would need to be either bigger than 2.575 or less than -2.575. Because 2.40 is neither, the test statistic lies in the acceptance region. We must therefore fail to reject the null hypothesis.
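A quick software check of this two-tailed comparison, again only as a sketch:

```python
from scipy.stats import norm

alpha, z_test = 0.01, 2.40
critical_value = norm.ppf(1 - alpha / 2)   # about 2.576; the table value rounds to 2.575

if abs(z_test) > critical_value:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")   # this branch runs: 2.40 < 2.576
```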
Checkpoint 5.1.26.
In testing a left-tailed pair of hypotheses, you find a test statistic of \(z_\text{test} = -1.88\text{.}\)
Question: if this test statistic comes from a standard normal distribution, what decision do you make at the \(\alpha = 0.05\) significance level?
Reject The Null Hypothesis
Checkpoint 5.1.27.
In testing a two-tailed pair of hypotheses, you find a test statistic of \(z_\text{test} = 2.41\text{.}\)
Question: if this test statistic comes from a standard normal distribution, what decision should you make at the \(\alpha = 0.01\) significance level?
Fail to Reject the Null Hypothesis
Subsection 5.1.4 p-Value Tests
The second method of deciding if a test statistic is unusual enough to warrant rejecting the null hypothesis is called a p-value test. In the traditional test seen in the previous subsection, we looked up critical values based on the probability \(\alpha\) and then compared our test statistic to those critical values. In a p-value test, we look up a probability for the test statistic, and then compare it directly with the significance level \(\alpha\text{.}\)
Definition 5.1.28.
The p-value of a test statistic is the probability, assuming the null hypothesis is true, of observing a sample that is at least as unusual as the sample on which the test statistic is based.
When computing p-values, we must still pay attention to the type of test being conducted.
- Left-Tailed Test.
The p-value of the test statistic is the probability to the left of that test statistic in the distribution. This is the probability of finding a value less than the test statistic.
- Right-Tailed Test.
The p-value of the test statistic is the probability to the right of the test statistic in the distribution. This is the probability of finding a value greater than the test statistic.
- Two-Tailed Test.
The p-value of the test statistic is the probability further into either the left or right tail than the test statistic. To determine this we find the probability of being further to the left than the left-tailed “version” of the test statistic, and add the probability of being further to the right than the right-tailed “version” of the test statistic.
Once we have computed the p-value of a test statistic, we compare this probability to the significance level \(\alpha\text{.}\) If the p-value is less than \(\alpha\text{,}\) then our sample is less likely than the level of unusualness we are willing to tolerate, so we reject the null hypothesis: we have seen a sample that is too unusual for us to accept that the null hypothesis is true. If the p-value is greater than \(\alpha\text{,}\) then our sample is not that unusual, so there is no reason to suspect that \(H_0\) is incorrect, and we fail to reject the null hypothesis.
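Putting these three cases together with the decision rule, the Python sketch below computes a p-value from a standard normal test statistic and compares it to \(\alpha\text{.}\) The helper function is our own illustration, not a standard library routine.

```python
# A sketch of the p-value decision rule for a standard normal test statistic.
from scipy.stats import norm

def p_value(z_test, tail):
    """Return the p-value of z_test for a 'left', 'right', or 'two' tailed test."""
    if tail == "left":
        return norm.cdf(z_test)            # area to the left of the test statistic
    if tail == "right":
        return norm.sf(z_test)             # area to the right of the test statistic
    return 2 * norm.sf(abs(z_test))        # two-tailed: area in both tails beyond |z_test|

alpha = 0.05
p = p_value(2.04, "right")
print(p, "-> reject H0" if p < alpha else "-> fail to reject H0")
```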
Compare the example below with Example 5.1.21 to see how the same test is conducted using a p-value.
Example 5.1.29. Finding the p-Value for a Right-Tailed Hypothesis Test.
You wish to test the null hypothesis \(H_0:\ \mu \leq 0.5\) against the alternative \(H_A:\ \mu > 0.5\) at the \(\alpha = 0.05\) significance level. If your test statistic of \(z_\text{test} = 2.04\) comes from a standard normal distribution, find its p-value and decide what conclusion should be made.
This is a right-tailed test. The p-value is therefore the probability to the right of \(2.04\) in the standard normal distribution. Using the table, this is
\(P(Z \gt 2.04) = 1 - 0.9793 = 0.0207\text{.}\)
Because the p-value is less than the significance level \(0.05\text{,}\) our sample is less likely than we are willing to tolerate. We must therefore reject the null hypothesis.
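Software gives essentially the same number as the table; for instance, in Python with SciPy:

```python
from scipy.stats import norm

p_value = norm.sf(2.04)   # area to the right of 2.04, about 0.0207
print(p_value < 0.05)     # True, so we reject the null hypothesis
```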
And compare the following example with Example 5.1.22.
Example 5.1.30. Finding the p-Value for a Two-Tailed Hypothesis Test.
You wish to test the null hypothesis \(H_0:\ p = 0.4\) against the alternative \(H_A:\ p \not= 0.4\) at the \(\alpha = 0.01\) significance level. Your test statistic of \(z_\text{test} = 2.40\) comes from a standard normal distribution. Find the associated p-value and state the conclusion you should reach.
This test is a two-tailed test. The p-value is therefore the area above 2.40 plus the area below -2.40 in the standard normal distribution. This can be found by taking twice the area above 2.40, which is
\(2 \cdot P(Z \gt 2.40) = 2(1 - 0.9918) = 2(0.0082) = 0.0164\text{.}\)
Because this p-value is greater than the significance level \(\alpha = 0.01\text{,}\) we fail to reject the null hypothesis.
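The corresponding software check for this two-tailed example:

```python
from scipy.stats import norm

p_value = 2 * norm.sf(2.40)   # area in both tails beyond 2.40, about 0.0164
print(p_value < 0.01)         # False, so we fail to reject the null hypothesis
```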
Notice that in the two examples above the results we found, either rejecting or failing to reject \(H_0\text{,}\) were the same as when we made our decision using a critical region and a traditional test. That is not a coincidence! A p-value test and a traditional test will always give you the same results when conducted at the same significance level. The p-value test can, however, provide more information since the reader can see not only the final conclusion, but the actual probability of getting the sample that was collected.
Checkpoint 5.1.34.
In conducting a right-tailed test, you find a test statistic of \(z_\text{test} = 1.70\) from a standard normal distribution.
Question: what is the p-value of this test statistic?
0.0446
Checkpoint 5.1.35.
In conducting a two-tailed test, you find a test statistic \(z_\text{test} = 2.44\) from a standard normal distribution.
Question: what is the p-value of this test statistic?
0.0146
Subsection 5.1.5 Type I and Type II Errors
Whenever we conduct a hypothesis test, there is a chance that we will make an error. When we talk about an error in a hypothesis test, we are not talking about a mistake in the computation or a missed step in the process. The errors we could make are statistical errors. A statistical error happens when we get an unusual sample and wind up “correctly” making the wrong decision. Consider the following table.

| | \(H_0\) is True | \(H_0\) is False |
| --- | --- | --- |
| Reject \(H_0\) | Type I Error | Correct Decision |
| Fail to Reject \(H_0\) | Correct Decision | Type II Error |

The “Correct Decision” cells represent the two correct decisions that we can make. The error cells represent the two ways in which we can make an error: we can mistakenly reject a true null hypothesis, or we can fail to reject a null hypothesis that is false. These errors have specific names as defined below.
Definition 5.1.37.
A type I error in a hypothesis test is the error of mistakenly rejecting a true null hypothesis. The probability of making a type I error is \(\alpha\text{,}\) the significance level.
Definition 5.1.38.
A type II error in a hypothesis test is the error of failing to reject a false null hypothesis. The probability of making a type II error is called \(\beta\) and depends on the true value of the population parameter.
In the following examples, we will identify type I and type II errors and compute their probabilities.
Example 5.1.39. Computing the Probability of a Type I Error.
In order to be profitable, a bus route must have an average of at least 25 paying customers. We wish to decide if a certain route is profitable by collecting ridership information on 15 different occasions during a given month and conducting a hypothesis test at the \(\alpha = 0.05\) significance level.
After collecting your sample, you decide that the bus route is not profitable, even though the true average number of riders is 25. What type of error did you make, and what was its probability?
Recall that the null and alternative hypotheses were:
\(H_0:\ \mu \geq 25\)
\(H_A:\ \mu \lt 25\)
If the true average is 25, then the null hypothesis is correct. But when we said the bus route was not profitable, we rejected \(H_0\text{.}\) This is therefore a type I error—rejecting a true null hypothesis.
The probability of this error is the significance level, \(\alpha = 0.05\text{.}\)
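The fact that the probability of a type I error equals \(\alpha\) can also be seen by simulation. The Python sketch below repeatedly samples from a population in which the null hypothesis is exactly true (a mean of 25) and counts how often a left-tailed test at \(\alpha = 0.05\) rejects; the standard deviation and sample size are made-up values used only for illustration.

```python
# Simulate the type I error rate of a left-tailed z-test when H0 is exactly true.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu_0, sigma, n = 25, 4.0, 15        # hypothetical population and sample size
alpha = 0.05
critical_value = norm.ppf(alpha)    # about -1.645

trials, rejections = 100_000, 0
for _ in range(trials):
    sample = rng.normal(mu_0, sigma, n)     # H0 is true: the mean really is 25
    z_test = (sample.mean() - mu_0) / (sigma / np.sqrt(n))
    if z_test < critical_value:             # left-tailed rejection region
        rejections += 1

print(rejections / trials)   # close to alpha = 0.05
```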
Example 5.1.40. Identifying a Type II Error.
In order to pass, a new school bond measure needs to get at least 60% support among voters. You collect a sample of 500 likely voters and find that 61% of them plan to vote for the bond. You conduct a hypothesis test at the \(\alpha = 0.10\) significance level and conclude that there is at least 60% support for the bond, and it will pass. However, come election day only 59% of voters vote for the bond and it fails. What type of error did you make?
The null and alternative hypotheses for this test are:
\(H_0:\ p \geq 0.60\)
\(H_A:\ p \lt 0.60\)
The true population proportion was 0.59, meaning the null hypothesis was wrong. But in our test we decided the bond would pass, meaning we failed to reject the null hypothesis. This is therefore a type II error.
Note that finding the value of \(\beta\text{,}\) the probability of a type II error, is more difficult and will not be covered in this text.
Checkpoint 5.1.43.
The null hypothesis states that \(p = 0.4\text{.}\) Based on your test statistic, you reject this null hypothesis. But in actuality, \(p = 0.4\text{.}\)
Question: what type of error have you made?
Type I Error
Checkpoint 5.1.44.
The null hypothesis states that a mean is 20. You fail to reject this hypothesis based on your test statistic. However, in actuality the mean is 21.
Question: what type of error have you made?
Type II Error