Section 5.4 Hypothesis Tests for Differences Between Means and Proportions
Testing Claims about Differences.
Just as we constructed confidence intervals for the difference between two population means or proportions in Section 4.4, we can also conduct hypothesis tests for the difference between two means or proportions. The most common tests for differences are tests which seek to determine if two population parameters are equal to each other (so their difference is zero) or if one is greater than the other (so their difference is greater than or less than zero). Consider the following tests.
Early childhood education researchers wish to determine if babies whose parents spend time reading to them will have more success in school than babies who are not read to. To test this claim, they select a sample of 100 high school seniors who were read to as infants and 100 seniors who were not read to as infants. The mean G.P.A. for those who were read to was found to be 2.46 with a standard deviation of 0.77. The mean G.P.A. for the students who were not read to was found to be 2.33 with a standard deviation of 0.86.
An independent senator believes that she has equal support among members of both the Republican and Democratic parties. To test this belief, she commissions a study in which 340 Republicans and 418 Democrats are polled. 138 of the Republicans and 157 of the Democrats are found to support the senator.
In this section we will review how to state the null and alternative hypotheses for examples such as those above, present the test statistic formula for these differences, and finish by conducting both traditional and p-value tests.
Objectives
After finishing this section you should be able to
-
describe the following terms:
Hypotheses for a Difference Between Population Means
Hypotheses for a Difference Between Population Proportions
Pooled Estimate for a Proportion
Test Statistic for a Difference Between Sample Means
Test Statistic for a Difference Between Sample Proportions
-
accomplish the following tasks:
Formulate null and alternative hypotheses for tests of differences
Compute the test statistic for a difference between means
Compute the test statistic for a difference between proportions
Use this test statistic to conduct a traditional hypothesis test
Use this test statistic to conduct a p-value hypothesis test
Subsection 5.4.1 Formulating Hypotheses
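When formulating hypotheses for the comparison of two means or two proportions, we will rephrase these comparisons in terms of the difference. For example, if we claim that proportions from two populations are equal,
\begin{equation*} p_1 = p_2\text{,} \end{equation*}
we would state the null hypothesis as
\begin{equation*} H_0:\ p_1 - p_2 = 0\text{.} \end{equation*}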
The possible null and alternative hypotheses for comparing two population proportions are listed below.
Principle 5.4.1. Hypotheses for a Difference Between Population Proportions.
To test a claim comparing two population proportions, we use one of the following sets of hypotheses.
-
Left-Tailed.
\begin{align*} H_0\amp:\ p_1 - p_2 \geq 0\\ H_A\amp:\ p_1 - p_2 \lt 0 \end{align*}(tests \(p_1 \lt p_2\))
-
Two-Tailed.
\begin{align*} H_0\amp:\ p_1 - p_2 = 0\\ H_A\amp:\ p_1 - p_2 \not= 0 \end{align*}(tests \(p_1 \not= p_2\))
-
Right-Tailed.
\begin{align*} H_0\amp:\ p_1 - p_2 \leq 0\\ H_A\amp:\ p_1 - p_2 \gt 0 \end{align*}(tests \(p_1 \gt p_2\))
In each of these tests, the assumption from the null hypothesis is that \(p_1 = p_2\text{,}\) or in other words \(p_1 - p_2 = 0\text{.}\)
When dealing with means, however, we sometimes want a little more flexibility. Instead of saying that the mean of one population is larger than the mean of another, we may wish to say how much larger. For example, the statement “dogs live at least 5 years longer than cats” can be written as
\begin{equation*} \mu_{\text{dogs}} - \mu_{\text{cats}} \geq 5\text{.} \end{equation*}
To get this added flexibility, we state our null and alternative hypotheses in terms of some difference \(d_0\text{,}\) which would have been 5 in this example. If we are testing a claim that two means are equal to each other, we set \(d_0 = 0\text{.}\) In most of our tests, we will use \(d_0 = 0\text{.}\)
Principle 5.4.2. Hypotheses for a Difference Between Population Means.
To test a claim comparing two population means, we use one of the following sets of hypotheses, where \(d_0\) is the claimed difference.
-
Left-Tailed.
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 \geq d_0\\ H_A\amp:\ \mu_1 - \mu_2 \lt d_0 \end{align*}(tests \(\mu_1 \lt \mu_2 + d_0\))
-
Two-Tailed.
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 = d_0\\ H_A\amp:\ \mu_1 - \mu_2 \not= d_0 \end{align*}(tests \(\mu_1 \not= \mu_2 + d_0\))
-
Right-Tailed.
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 \leq d_0\\ H_A\amp:\ \mu_1 - \mu_2 \gt d_0 \end{align*}(tests \(\mu_1 \gt \mu_2 + d_0\))
Let's look at several examples involving these hypotheses.
Example 5.4.3. Stating Hypotheses for Differences Between Means.
Early childhood education researchers wish to determine if babies whose parents spend time reading to them will have more success in school than babies who are not read to. To test this claim, they select a sample of 100 high school seniors who were read to as infants and 100 seniors who were not read to as infants. The mean G.P.A. for those who were read to was found to be 2.46 with a standard deviation of 0.77. The mean G.P.A. for the students who were not read to was found to be 2.33 with a standard deviation of 0.86. Formulate hypotheses for this test.
This is a claim about two population means. The researchers believe that those in population 1, the students whose parents read to them, will have a higher mean G.P.A. The null hypothesis is that it makes no difference, so in other words the two means are equal. Therefore the difference is \(d_0 = 0\text{.}\) This gives the following hypotheses:
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 \leq 0\\ H_A\amp:\ \mu_1 - \mu_2 \gt 0 \end{align*}
Example 5.4.4. Stating Hypotheses for Differences Between Proportions.
An independent senator believes that she has equal support among members of both the Republican and Democratic parties. To test this belief, she commissions a study in which 340 Republicans and 418 Democrats are polled. 138 of the Republicans and 157 of the Democrats are found to support the senator. Formulate hypotheses for this test.
This is a claim about two population proportions. The senator believes that the proportions of Republicans (\(p_R\)) and Democrats (\(p_D\)) who support her are equal. Thus, the hypotheses are:
\begin{align*} H_0\amp:\ p_R - p_D = 0\\ H_A\amp:\ p_R - p_D \not= 0 \end{align*}
Checkpoint 5.4.7.
A veterinarian believes that dogs and cats have, on average, the same number of offspring in each birth. To test this claim, she observes that in 96 cat pregnancies, the average number of offspring was 4.9 with a standard deviation of 1.26 offspring. In 85 dog pregnancies, the vet observed an average of 3.7 offspring with a standard deviation of 0.84 offspring.
Question: what null hypothesis should the vet use to test her claim?
\(\mu_1 = \mu_2\)
Checkpoint 5.4.8.
An IRS agent believes that tax fraud is more prevalent on income tax returns where the adjusted gross income is more than $200,000. He takes a sample of 400 returns with income of less than $200,000 and finds that 12 of them are fraudulent. He also takes a sample of 300 returns with more than $200,000 reported income and finds that 15 of them are fraudulent.
Question: if those making under $200,000 are population 1, what should your alternative hypothesis be in this test?
\(p_1 \lt p_2\)
Subsection 5.4.2 Test Statistic for a Difference Between Means
The test statistic for a difference between means measures how unusual the difference between our two sample means would be if the assumed difference from the null hypothesis were true. This measure of “unusualness” is again a z-score from the normal distribution. To find that z-score, we take the difference between our observed difference and the assumed difference, and then divide that by the standard deviation for the difference between sample means. That mouthful is represented symbolically below.
Theorem 5.4.9. Test Statistic for a Difference Between Sample Means.
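The test statistic for a difference between sample means \((\overline{x}_1 - \overline{x}_2)\) used to test the assumption of the null hypothesis that \((\mu_1 - \mu_2) = d_0\) is:
\begin{equation*} z = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \end{equation*}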
Note that if \(n_1\) and \(n_2\) are both at least 30, we can use \(\sigma_1 \approx s_1\) and \(\sigma_2 \approx s_2\text{.}\)
The following examples show this computation.
Example 5.4.10. Computing the Test Statistic for a Difference Between Means.
Early childhood education researchers wish to determine if babies whose parents spend time reading to them will have more success in school than babies who are not read to. To test this claim, they select a sample of 100 high school seniors who were read to as infants and 100 seniors who were not read to as infants. The mean G.P.A. for those who were read to was found to be 2.46 with a standard deviation of 0.77. The mean G.P.A. for the students who were not read to was found to be 2.33 with a standard deviation of 0.86. Find the test statistic for the difference between these sample means.
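In Example 5.4.3, we formulated the following hypotheses.
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 \leq 0\\ H_A\amp:\ \mu_1 - \mu_2 \gt 0 \end{align*}
From this null hypothesis, we assume \(d_0 = 0\text{,}\) giving the following test statistic for these samples:
\begin{equation*} z = \frac{(2.46 - 2.33) - 0}{\sqrt{\dfrac{0.77^2}{100} + \dfrac{0.86^2}{100}}} \approx 1.13 \end{equation*}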
Example 5.4.11. Computing the Test Statistic for a Non-Zero Difference Between Means.
A pet lover believes that dogs live on average at least 5 years longer than cats. To test this claim, he collects data on 63 randomly selected dogs, and 55 randomly selected cats. The average lifespan of the dogs is found to be 18.7 years, with a standard deviation of 3.1 years. The average lifespan for the sample of cats is 12.3 years with a standard deviation of 1.9 years. Find the test statistic for the difference between these sample means.
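Because we are testing the claim that \(\mu_1\) (the mean lifespan of dogs) is at least 5 more than \(\mu_2\) (the mean lifespan of cats), our hypotheses will be:
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 \leq 5\\ H_A\amp:\ \mu_1 - \mu_2 \gt 5 \end{align*}
Under this null hypothesis, the test statistic for the above samples is:
\begin{equation*} z = \frac{(18.7 - 12.3) - 5}{\sqrt{\dfrac{3.1^2}{63} + \dfrac{1.9^2}{55}}} \approx 3.00 \end{equation*}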
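The computation in these two examples can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name `two_mean_z` and its parameter names are hypothetical, and it assumes both samples are large enough (at least 30) that the sample standard deviations can stand in for \(\sigma_1\) and \(\sigma_2\text{.}\) It divides the observed difference minus \(d_0\) by the square root of \(s_1^2/n_1 + s_2^2/n_2\text{.}\)

```python
from math import sqrt


def two_mean_z(xbar1, s1, n1, xbar2, s2, n2, d0=0.0):
    """z-score for (xbar1 - xbar2) under the assumption mu1 - mu2 = d0.

    Hypothetical helper: assumes n1 and n2 are both at least 30, so that
    s1 and s2 can be used in place of sigma1 and sigma2.
    """
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    return ((xbar1 - xbar2) - d0) / standard_error


# G.P.A. example: read-to students are population 1, d0 = 0.
gpa_z = two_mean_z(2.46, 0.77, 100, 2.33, 0.86, 100)

# Lifespan example: dogs are population 1, claimed difference d0 = 5.
lifespan_z = two_mean_z(18.7, 3.1, 63, 12.3, 1.9, 55, d0=5)

print(f"{gpa_z:.2f} {lifespan_z:.2f}")
```

Keeping \(d_0\) as an optional argument that defaults to zero mirrors the fact that most tests in this section use \(d_0 = 0\text{.}\)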
Checkpoint 5.4.14.
A veterinarian believes that dogs and cats have, on average, the same number of offspring in each birth. To test this claim, she observes that in 96 cat pregnancies, the average number of offspring was 4.9 with a standard deviation of 1.26 offspring. In 85 dog pregnancies, the vet observed an average of 3.7 offspring with a standard deviation of 0.84 offspring. The cat population is designated as population number one.
Question: what is the test statistic? Round your answer to two decimal places.
7.61
Checkpoint 5.4.15.
A widget manufacturer uses two assembly lines to build widgets. The quality control engineer believes that the average weight of a widget made by the first assembly line is greater than the average weight of a widget made by the second assembly line. To test this theory he takes a sample of widgets from each assembly line and finds the following information.
Assembly Line | Sample Size | Mean | Standard Dev. |
Assembly Line #1 | \(n_1 = 120\) | 12.2 ounces | 0.72 ounces |
Assembly Line #2 | \(n_2 = 120\) | 11.9 ounces | 0.81 ounces |
Question: what is the test statistic for this hypothesis test?
3.03
Subsection 5.4.3 Test Statistic for a Difference Between Proportions
When computing the test statistic for a difference between proportions, we again want to measure how unusual the observed difference is. However, because our null hypotheses will always use the assumption that the two proportions are equal, the test statistic formula is slightly simpler.
Theorem 5.4.17. Test Statistic for a Difference Between Sample Proportions.
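The test statistic for a difference between sample proportions \((\hat p_1 - \hat p_2)\) used to test the assumption of the null hypothesis that \(p_1 = p_2\) is:
\begin{equation*} z = \frac{\hat p_1 - \hat p_2}{\sqrt{\dfrac{p(1-p)}{n_1} + \dfrac{p(1-p)}{n_2}}} \end{equation*}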
where \(p\) is the proportion of successes in both populations.
The null hypothesis asserts that \(p_1 = p_2\text{,}\) but doesn't tell us what that proportion of successes is. We must approximate \(p\) using the two samples that were drawn from these populations. While it is unlikely that these two sample proportions will equal \(p\) exactly, or even each other, by pooling them into a single proportion \(\hat p_{\text{pooled}}\text{,}\) we can get an estimate for the populations' common proportion \(p\text{.}\)
Definition 5.4.18.
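The pooled estimate for a proportion based on the sample proportions \(\hat p_1\) and \(\hat p_2\) is:
\begin{equation*} \hat p_{\text{pooled}} = \frac{n_1 \hat p_1 + n_2 \hat p_2}{n_1 + n_2} \end{equation*}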
If your sample data is reported in terms of number of successes instead of proportion of successes, then you should use \(x_1\) in place of \(n_1 \hat p_1\) in the above formula, and similarly for \(x_2\text{.}\) Let's see how this pooling works in the examples below.
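As a quick numerical illustration of pooling, the sketch below computes the pooled estimate from success counts, that is, total successes divided by total sample size. The function name `pooled_proportion` is hypothetical and not part of the original text; the numbers are the senator's poll results from the start of this section.

```python
def pooled_proportion(x1, n1, x2, n2):
    """Pooled estimate of a common proportion from two samples.

    x1 and x2 are the numbers of successes (equal to n1 * p_hat1 and
    n2 * p_hat2), and n1 and n2 are the sample sizes.
    """
    return (x1 + x2) / (n1 + n2)


# Senator poll: 138 of 340 Republicans and 157 of 418 Democrats.
p_pooled = pooled_proportion(138, 340, 157, 418)
print(f"{p_pooled:.4f}")
```

Note that the pooled estimate is a weighted average of the two sample proportions, so it always lies between them and sits closer to the proportion from the larger sample.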
Example 5.4.19. Computing the Test Statistic for a Difference Between Proportions.
An independent senator believes that she has equal support among members of both the Republican and Democratic parties. To test this belief, she commissions a study in which 340 Republicans and 418 Democrats are polled. 138 of the Republicans and 157 of the Democrats are found to support the senator. Find the test statistic for this hypothesis test.
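We have already seen that the hypotheses are:
\begin{align*} H_0\amp:\ p_R - p_D = 0\\ H_A\amp:\ p_R - p_D \not= 0 \end{align*}
Under the assumption that \(p_R = p_D = p\) for some population proportion \(p\text{,}\) we must approximate \(p\) using a pooled estimate.
\begin{equation*} \hat p_{\text{pooled}} = \frac{138 + 157}{340 + 418} = \frac{295}{758} \approx 0.3892 \end{equation*}
Plugging this pooled estimate in for \(p\) in the test statistic formula above, with \(\hat p_R = 138/340 \approx 0.4059\) and \(\hat p_D = 157/418 \approx 0.3756\text{,}\) yields:
\begin{equation*} z = \frac{0.4059 - 0.3756}{\sqrt{\dfrac{0.3892(1 - 0.3892)}{340} + \dfrac{0.3892(1 - 0.3892)}{418}}} \approx 0.85 \end{equation*}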
Example 5.4.20. Computing the Test Statistic for Another Difference Between Proportions.
An educator believes that the proportion of females in the US who have completed college is greater than the proportion of males. To test this claim, a sample of 600 women is randomly selected and 227 of them are found to have completed college. A sample of 570 men is randomly selected and only 192 of them are found to have completed college. Find the test statistic for this hypothesis test.
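The claim in this test is that \(p_W\text{,}\) the proportion of women who finish college, is bigger than \(p_M\text{,}\) the proportion of men who finish college. This leads to the following hypotheses.
\begin{align*} H_0\amp:\ p_W - p_M \leq 0\\ H_A\amp:\ p_W - p_M \gt 0 \end{align*}
If we assume that \(p_W = p_M = p\text{,}\) we must approximate \(p\) using a pooled proportion from the samples.
\begin{equation*} \hat p_{\text{pooled}} = \frac{227 + 192}{600 + 570} = \frac{419}{1170} \approx 0.3581 \end{equation*}
Using this in our test statistic formula, with \(\hat p_W = 227/600 \approx 0.3783\) and \(\hat p_M = 192/570 \approx 0.3368\text{,}\) yields the following test statistic.
\begin{equation*} z = \frac{0.3783 - 0.3368}{\sqrt{\dfrac{0.3581(1 - 0.3581)}{600} + \dfrac{0.3581(1 - 0.3581)}{570}}} \approx 1.48 \end{equation*}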
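The two proportion examples above follow the same three steps: compute each sample proportion, pool them, and divide the observed difference by the standard deviation built from the pooled estimate. A minimal Python sketch of those steps is given below. The function name `two_proportion_z` is hypothetical, and the sketch assumes the null hypothesis \(p_1 = p_2\) and samples large enough for the normal approximation.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """z-score for (p_hat1 - p_hat2) under the assumption p1 = p2.

    Hypothetical helper: x1, x2 are success counts and n1, n2 are sample
    sizes. The unknown common proportion p is replaced by the pooled estimate.
    """
    p_hat1 = x1 / n1
    p_hat2 = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    # p(1 - p)/n1 + p(1 - p)/n2, with the common factor pulled out.
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p_hat1 - p_hat2) / standard_error


# Senator example: Republicans are population 1.
senator_z = two_proportion_z(138, 340, 157, 418)

# College completion example: women are population 1.
college_z = two_proportion_z(227, 600, 192, 570)

print(f"{senator_z:.2f} {college_z:.2f}")
```

Working from the success counts rather than from rounded proportions avoids carrying rounding error into the test statistic.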
Checkpoint 5.4.23.
An IRS agent believes that tax fraud is more prevalent on income tax returns where the adjusted gross income is more than $200,000. He takes a sample of 400 returns with income of less than $200,000 and finds that 12 of them are fraudulent. He also takes a sample of 300 returns with more than $200,000 reported income and finds that 15 of them are fraudulent. Suppose that tax returns for those making over $200,000 make up population one.
Question: what is the test statistic for this test?
1.36
Checkpoint 5.4.24.
A used car salesperson believes that a larger proportion of sports cars sold on her lot are red than the proportion of sedans that are red. To test this hypothesis, she collects the following samples.
Car Type | Sample Size | Number Red |
Sports Cars | \(n_1 = 73\) | \(x_1 = 21\) |
Sedans | \(n_2 = 129\) | \(x_2 = 33\) |
Question: what is the test statistic for this situation?
0.49
Subsection 5.4.4 The Traditional Test
Conducting a hypothesis test for a difference between means or proportions requires a different set of hypotheses and a different test statistic formula. However, once we have the test statistic, the rest of the hypothesis test works just as it did for single means or proportions. In this subsection, we will finish two of the previously seen examples using the traditional test method.
Example 5.4.26. Conducting a Traditional Hypothesis Test for a Difference Between Means.
Early childhood education researchers wish to determine if babies whose parents spend time reading to them will have more success in school than babies who are not read to. To test this claim, they select a sample of 100 high school seniors who were read to as infants and 100 seniors who were not read to as infants. The mean G.P.A. for those who were read to was found to be 2.46 with a standard deviation of 0.77. The mean G.P.A. for the students who were not read to was found to be 2.33 with a standard deviation of 0.86. Conduct a traditional hypothesis test at the \(\alpha = 0.05\) significance level.
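As seen before, the hypotheses are:
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 \leq 0\\ H_A\amp:\ \mu_1 - \mu_2 \gt 0 \end{align*}
From this null hypothesis, we computed the test statistic:
\begin{equation*} z = \frac{(2.46 - 2.33) - 0}{\sqrt{\dfrac{0.77^2}{100} + \dfrac{0.86^2}{100}}} \approx 1.13 \end{equation*}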
Because the alternative hypothesis involves “\(\gt\)”, this is a right-tailed test. Therefore, at the \(\alpha = 0.05\) significance level, our critical value is \(z_{0.05} = 1.645\) as shown below.
Since the test statistic is not larger than the critical value, it is not in that right-tailed rejection region. We must therefore fail to reject the null hypothesis. There is no statistically significant evidence that G.P.A.s are higher for those seniors who were read to as infants.
Example 5.4.28. Conducting a Traditional Hypothesis Test for a Difference Between Proportions.
An educator believes that the proportion of females in the US who have completed college is greater than the proportion of males. To test this claim, a sample of 600 women is randomly selected and 227 of them are found to have completed college. A sample of 570 men is randomly selected and only 192 of them are found to have completed college. Test this educator's claim using a traditional hypothesis test at the \(\alpha = 0.10\) significance level.
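From previous work, we have hypotheses:
\begin{align*} H_0\amp:\ p_W - p_M \leq 0\\ H_A\amp:\ p_W - p_M \gt 0 \end{align*}
The pooled proportion for the population is:
\begin{equation*} \hat p_{\text{pooled}} = \frac{227 + 192}{600 + 570} = \frac{419}{1170} \approx 0.3581 \end{equation*}
Using this in our test statistic formula yielded the following test statistic.
\begin{equation*} z = \frac{0.3783 - 0.3368}{\sqrt{\dfrac{0.3581(1 - 0.3581)}{600} + \dfrac{0.3581(1 - 0.3581)}{570}}} \approx 1.48 \end{equation*}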
Now because the alternative hypothesis involves “\(\gt\)”, this is a right-tailed test. At the \(\alpha = 0.10\) significance level, the critical value is \(z_{0.10} = 1.28\) as shown below.
Because the test statistic 1.48 is further into the right tail than 1.28, it is in the rejection region. We therefore reject the null hypothesis. There is evidence tending towards significance that a higher proportion of women have finished college than men.
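The decision step in these two traditional tests can also be sketched in Python. The sketch below is an illustration under stated assumptions rather than part of the original text: the helper names are hypothetical, it handles only right-tailed tests, and it uses the standard library's `statistics.NormalDist` in place of a printed normal table. The test statistics 1.13 (recomputed from the G.P.A. summary data) and 1.48 are entered already rounded.

```python
from statistics import NormalDist


def right_tail_critical_value(alpha):
    """z-value that cuts off an area of alpha in the right tail."""
    return NormalDist().inv_cdf(1 - alpha)


def reject_right_tailed(z, alpha):
    """True if z falls in the right-tailed rejection region."""
    return z > right_tail_critical_value(alpha)


# G.P.A. example at alpha = 0.05: 1.13 is below 1.645, so we fail to reject.
print(reject_right_tailed(1.13, 0.05))

# College completion example at alpha = 0.10: 1.48 is above 1.28, so we reject.
print(reject_right_tailed(1.48, 0.10))
```

A left-tailed version would compare the test statistic against the negative of this critical value, and a two-tailed version would split \(\alpha\) evenly between the two tails.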
Checkpoint 5.4.32.
A veterinarian believes that dogs and cats have, on average, the same number of offspring in each birth. To test this claim, she observes that in 96 cat pregnancies, the average number of offspring was 4.9 with a standard deviation of 1.26 offspring. In 85 dog pregnancies, the vet observed an average of 3.7 offspring with a standard deviation of 0.84 offspring. The cat population is designated as population number one.
Question: what conclusion do you reach using a traditional hypothesis test at the \(\alpha = 0.01\) significance level?
Reject the Null Hypothesis
Checkpoint 5.4.33.
A used car salesperson believes that a larger proportion of sports cars sold on her lot are red than the proportion of sedans that are red. To test this hypothesis, she collects the following samples.
Car Type | Sample Size | Number Red |
Sports Cars | \(n_1 = 73\) | \(x_1 = 21\) |
Sedans | \(n_2 = 129\) | \(x_2 = 33\) |
Question: what conclusion do you reach using a traditional hypothesis test at the \(\alpha = 0.05\) significance level?
Fail to Reject the Null Hypothesis
Subsection 5.4.5 The p-Value Test
As with the traditional hypothesis test, the p-value test is the same for testing claims about differences as it was for testing claims about individual population parameters. The following examples show how the p-value test works for tests of differences.
Example 5.4.35. Conducting a p-Value Hypothesis Test for a Difference Between Means.
A pet lover believes that dogs live on average at least 5 years longer than cats. To test this claim, he collects data on 63 randomly selected dogs, and 55 randomly selected cats. The average lifespan of the dogs is found to be 18.7 years, with a standard deviation of 3.1 years. The average lifespan for the sample of cats is 12.3 years with a standard deviation of 1.9 years. Conduct a p-value test to see if the pet lover's claim has merit.
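As seen earlier in this section, the hypotheses are:
\begin{align*} H_0\amp:\ \mu_1 - \mu_2 \leq 5\\ H_A\amp:\ \mu_1 - \mu_2 \gt 5 \end{align*}
Under this null hypothesis, the test statistic for the above samples was:
\begin{equation*} z = \frac{(18.7 - 12.3) - 5}{\sqrt{\dfrac{3.1^2}{63} + \dfrac{1.9^2}{55}}} \approx 3.00 \end{equation*}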
Now because the alternative hypothesis involves “\(\gt\)”, this is a right-tailed test. The p-value for the test statistic is the area of the region shown below.
This gives us
\begin{equation*} \text{p-value} = P(z \gt 3.00) = 0.0013\text{,} \end{equation*}
which is smaller than all of our standard significance levels of 0.10, 0.05, and 0.01. We therefore reject the null hypothesis at each of these significance levels. There is highly significant evidence that dogs live at least 5 years longer than cats.
You may have noticed that we did not give a significance level at which to conduct our test in this last example. When a p-value test is being conducted, we sometimes don't state a significance level as part of the problem statement. Instead, once we have the p-value for the test, we compare it with all three of the Common Significance Levels to see at which levels, if any, we can reject the null hypothesis.
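The comparison against all three common significance levels is easy to sketch in Python. The code below is an illustration only: the function name and the `COMMON_LEVELS` tuple are hypothetical, the test statistic is entered as the rounded value \(z = 3.00\) (recomputed from the lifespan summary data), and `statistics.NormalDist` stands in for the normal table.

```python
from statistics import NormalDist

# The three common significance levels used in this text.
COMMON_LEVELS = (0.10, 0.05, 0.01)


def right_tailed_p_value(z):
    """Area under the standard normal curve to the right of z."""
    return 1 - NormalDist().cdf(z)


# Lifespan example, using the rounded test statistic.
p_value = right_tailed_p_value(3.00)

# Levels at which the null hypothesis is rejected (p-value below alpha).
rejected_at = [alpha for alpha in COMMON_LEVELS if p_value < alpha]

print(f"{p_value:.4f}", rejected_at)
```

An empty list would mean that we fail to reject the null hypothesis at every common level.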
Example 5.4.37. Conducting a p-Value Hypothesis Test for a Difference Between Proportions.
An independent senator believes that she has equal support among members of both the Republican and Democratic parties. To test this belief, she commissions a study in which 340 Republicans and 418 Democrats are polled. 138 of the Republicans and 157 of the Democrats are found to support the senator. Conduct a p-value test to determine if the senator's support levels are different.
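We have already seen that the hypotheses are:
\begin{align*} H_0\amp:\ p_R - p_D = 0\\ H_A\amp:\ p_R - p_D \not= 0 \end{align*}
Our pooled estimate for the common proportion was:
\begin{equation*} \hat p_{\text{pooled}} = \frac{138 + 157}{340 + 418} = \frac{295}{758} \approx 0.3892 \end{equation*}
Finally, the test statistic was:
\begin{equation*} z = \frac{0.4059 - 0.3756}{\sqrt{\dfrac{0.3892(1 - 0.3892)}{340} + \dfrac{0.3892(1 - 0.3892)}{418}}} \approx 0.85 \end{equation*}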
Now as the alternative hypothesis involves “\(\not =\)”, this is a two-tailed test. The p-value is therefore the probability of being further into either tail than the test statistic of 0.85. This is twice the area in the right tail, as shown.
So from the standard normal distribution table, the p-value is:
\begin{equation*} \text{p-value} = 2 \cdot P(z \gt 0.85) = 2(0.1977) = 0.3954 \end{equation*}
Therefore, if the null hypothesis is true and support levels are equal in Republicans and Democrats, we could see samples like this 39.5% of the time. That is not unusual. The p-value of 0.3954 is larger than all common significance levels, 0.10, 0.05, and 0.01. We therefore fail to reject the null hypothesis. There is no evidence that support levels differ between Republicans and Democrats. The senator could well be correct.
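The doubling used for a two-tailed p-value can be checked with a short Python sketch. This is an illustration rather than part of the original text: the function name `two_tailed_p_value` is hypothetical, and `statistics.NormalDist` replaces the normal table. Because the table rounds the tail area to four decimal places before doubling, the computed value can differ from 0.3954 in the fourth decimal place.

```python
from statistics import NormalDist


def two_tailed_p_value(z):
    """Probability of landing further into either tail than z."""
    # abs() lets the same formula work for negative test statistics.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Senator example, using the rounded test statistic of 0.85.
p_value = two_tailed_p_value(0.85)

# Compare against the common significance levels.
for alpha in (0.10, 0.05, 0.01):
    decision = "reject" if p_value < alpha else "fail to reject"
    print(alpha, decision)
```

Since the p-value is well above 0.10, every line of output reports a failure to reject, matching the conclusion above.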
Checkpoint 5.4.41.
An IRS agent believes that tax fraud is more prevalent on income tax returns where the adjusted gross income is more than $200,000. He takes a sample of 400 returns with income of less than $200,000 and finds that 12 of them are fraudulent. He also takes a sample of 300 returns with more than $200,000 reported income and finds that 15 of them are fraudulent. Suppose that tax returns for those making over $200,000 make up population one.
Question: what is the p-value of the test statistic for these samples?
0.0869
Checkpoint 5.4.42.
A widget manufacturer uses two assembly lines to build widgets. The quality control engineer believes that the average weight of a widget made by the first assembly line is greater than the average weight of a widget made by the second assembly line. To test this theory he takes a sample of widgets from each assembly line and finds the following information.
Assembly Line | Sample Size | Mean | Standard Dev. |
Assembly Line #1 | \(n_1 = 120\) | 12.2 ounces | 0.72 ounces |
Assembly Line #2 | \(n_2 = 120\) | 11.9 ounces | 0.81 ounces |
Question: what is the p-value for this test statistic?
0.0012