Section 5.5 Hypothesis Tests for Means Using Small Samples
Small Sample Hypothesis Tests.
When constructing confidence intervals in Chapter 4, we saw that if the sample size was less than 30, we could not assume that the sampling distribution for \(\overline{x}\) was normal with a standard deviation of \(\frac{s}{\sqrt{n}}\text{.}\) This required us to introduce Student's t-distribution in order to work with smaller samples. The same is true when conducting hypothesis tests. Consider the following examples.
You believe that the average lifespan of a supernova is 74 days. Because supernovas are so rare (approximately one every 50 years in our galaxy), you can only find accurate records from 7 supernovas to use in your data. Their average lifespan was 63.8 days with a standard deviation of 12.6 days.
You believe that moon rocks contain less iron than do earth rocks. To test this hypothesis, you collect a sample of 300 rocks from all over the earth and find that they contain an average of 6.7 grams per cubic inch of iron with a standard deviation of 2.9 grams per cubic inch. You only have access to six moon rocks, however, and find that they have an average of 2.4 grams per cubic inch with a standard deviation of 0.35 grams per cubic inch.
In each of these cases the hypothesis testing procedures we've been using up to this point will no longer work, because we cannot assume that our test statistic follows a normal distribution. Under the appropriate assumptions about the population, however, we can use the t-distribution to conduct these tests.
Objectives
After finishing this section you should be able to
- describe the following terms:
Test Statistic for a Difference Between Small-Sample Means
Test Statistic for a Single Small-Sample Mean
- accomplish the following tasks:
Compute test statistics for small sample tests of a single mean
Compute test statistics for small sample tests of a difference between means
Use the t-distribution to conduct traditional hypothesis tests with small sample test statistics
Use the t-distribution to give ranges for the p-value of a small sample test statistic
Subsection 5.5.1 Test Statistic for a Single Population Mean
The test statistic for a single mean when using a small sample will be a t-score. This t-score measures how unusual the observed sample is under the assumption of the null hypothesis. The formula is very similar to the one used with large samples with the exception that it is a t-score and not a z-score, and that it explicitly uses the sample standard deviation \(s\) instead of the population standard deviation \(\sigma\text{.}\)
Theorem 5.5.1. Test Statistic for a Single Small-Sample Mean.
The test statistic for the sample mean \(\overline{x}\) from a sample of size less than 30 drawn from a normal distribution and used to test the null hypothesis that \(\mu = \mu_0\) is:
\[t_\text{test} = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}\]
Note that just as in the case of a confidence interval, the test statistic will belong to a certain member of the t-distribution family based on the degrees of freedom. This will become important later on when we actually conduct the hypothesis test. For now, consider the example below.
Example 5.5.2. Computing a Test Statistic for a Small Sample Mean.
You believe that the average lifespan of a supernova is 74 days. Because supernovas are so rare (approximately one every 50 years in our galaxy), you can only find accurate records from 7 supernovas to use in your data. Their average lifespan was 63.8 days with a standard deviation of 12.6 days. State the hypotheses for this test, and find the test statistic.
The claim we are testing is that the lifespan of a supernova is 74 days. This involves equality, and the alternative is that it is not 74 days, being either larger or smaller. This will give us a two-tailed test with hypotheses:
\[H_0: \mu = 74 \qquad H_a: \mu \neq 74\]
Our test statistic is the following t-score.
\[t_\text{test} = \frac{63.8 - 74}{12.6/\sqrt{7}} \approx -2.142\]
One final note. We must assume that the lifespan of an individual supernova is normally distributed for this test statistic to work.
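The computation in this example can be checked with a few lines of code. The following is a minimal sketch in Python, assuming the usual single-mean t-score \(\frac{\overline{x} - \mu_0}{s/\sqrt{n}}\text{;}\) the function name `t_statistic_single_mean` is chosen here for illustration and is not part of any library.

```python
import math

def t_statistic_single_mean(sample_mean, mu_0, sample_sd, n):
    """t-score for one sample mean: (x_bar - mu_0) / (s / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu_0) / standard_error

# Supernova example: n = 7, x_bar = 63.8, s = 12.6, null value mu_0 = 74
t_test = t_statistic_single_mean(63.8, 74, 12.6, 7)
print(round(t_test, 3))  # -2.142
```

The same function can be used to check the checkpoints that follow by substituting their sample values.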
Checkpoint 5.5.5.
You believe that the average loudness of a rock concert in decibels is at least 117 dB. To test this, you visit eleven rock concerts and find an average decibel level of 121 dB with a standard deviation of 7.8 dB.
Question: what is the test statistic for this hypothesis test? Round your answer to three decimals.
1.701
Checkpoint 5.5.6.
A dentist believes that professional boxers require more than the average 3 fillings found in the average person. To test this hypothesis, the dentist contacts a sample of 13 professional boxers and finds that they have an average of 4.2 fillings with a standard deviation of 1.83.
Question: what is the test statistic for this sample? Round your answer to three decimal places.
2.364
Subsection 5.5.2 Test Statistic for a Difference Between Means
We may also wish to conduct a hypothesis test for the difference between population means when the sample from one or both of those populations is small, meaning less than 30. To do this, we must make one additional assumption. Just as was done when building confidence intervals for differences using small sample means, we must assume that the variances of the two populations are equal. Recall that this assumption led us to use a pooled variance estimate, given by:
\[s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\]
where \(s_1\) is the standard deviation of the first sample of size \(n_1\) and \(s_2\) is the standard deviation of the second sample of size \(n_2\text{.}\) While we do not expect that \(s_1^2 = s_2^2\text{,}\) we do assume that they both will be close to the same common variance \(\sigma^2\text{,}\) and we therefore use \(\sigma^2 \approx s^2\) as our approximation in the following test statistic computation.
Theorem 5.5.7. Test Statistic for a Difference Between Small-Sample Means.
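The test statistic for a difference between sample means \((\overline{x}_1 - \overline{x}_2)\) used to test the assumption that \((\mu_1 - \mu_2) = d_0\) where at least one of the sample sizes is less than 30 is given by:
\[t_\text{test} = \frac{(\overline{x}_1 - \overline{x}_2) - d_0}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}\]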
where \(s^2\) is the pooled variance estimator.
Once again the test statistic we get is a member of the particular t-distribution having \(n_1 + n_2 - 2\) degrees of freedom. This will be important once we start actually conducting the hypothesis test. An example of this computation can be seen in the following problem.
Example 5.5.8. Computing the Test Statistic for a Small Sample Difference Between Means.
You believe that moon rocks contain less iron than do earth rocks. To test this hypothesis, you collect a sample of 300 rocks from all over the earth and find that they contain an average of 6.7 grams per cubic inch of iron with a standard deviation of 2.9 grams per cubic inch. You only have access to six moon rocks, however, and find that they have an average of 2.4 grams per cubic inch with a standard deviation of 0.35 grams per cubic inch. State hypotheses for this test and determine the test statistic.
The claim we are testing is that moon rocks contain less iron than do earth rocks. This is a claim involving “\(\lt\)” and should therefore be the alternative hypothesis. This leads to the following hypotheses, where the moon rocks are population one and earth rocks population two.
\[H_0: \mu_1 - \mu_2 = 0 \qquad H_a: \mu_1 - \mu_2 \lt 0\]
We must assume that the standard deviations are the same in both populations, even though in our samples they are fairly different. We therefore combine the standard deviations to get the following pooled estimate:
\[s^2 = \frac{(6 - 1)(0.35)^2 + (300 - 1)(2.9)^2}{6 + 300 - 2} \approx 8.274\]
Using this in the computation of the test statistic, along with the null hypothesis assumption that the two means are equal, we get the value shown below.
\[t_\text{test} = \frac{(2.4 - 6.7) - 0}{\sqrt{8.274\left(\frac{1}{6} + \frac{1}{300}\right)}} \approx -3.626\]
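The two-step computation in this example (pool the variances, then form the t-score) can be sketched in Python as follows. This is only an illustration under the equal-variance assumption stated above; the function names `pooled_variance` and `t_statistic_difference` are hypothetical and introduced here for convenience.

```python
import math

def pooled_variance(s1, n1, s2, n2):
    """Pooled variance estimate from two sample standard deviations."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

def t_statistic_difference(mean1, s1, n1, mean2, s2, n2, d_0=0.0):
    """t-score for a difference between means, using the pooled variance."""
    s_squared = pooled_variance(s1, n1, s2, n2)
    standard_error = math.sqrt(s_squared * (1 / n1 + 1 / n2))
    return ((mean1 - mean2) - d_0) / standard_error

# Moon rocks are population one, earth rocks are population two.
print(round(pooled_variance(0.35, 6, 2.9, 300), 3))                   # 8.274
print(round(t_statistic_difference(2.4, 0.35, 6, 6.7, 2.9, 300), 3))  # -3.626
```

The optional argument `d_0` holds the hypothesized difference, so a claim such as "at least seven points more" would be checked with `d_0=7`.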
Checkpoint 5.5.11.
A sports statistician believes that the winner of the Super Bowl, on average, scores at least one touch-down (seven points) more than the loser. To test this theory, the scores of 12 winning teams are averaged to get 26.4 with a standard deviation of 4.6 points. The scores of 11 losing teams are also averaged to get 17.5 with a standard deviation of 5.2 points.
Question: what is the test statistic for this test? Round your answer to three decimal places.
0.9295
Checkpoint 5.5.12.
The air force is evaluating two possible jet fighters based on their top speed. They wish to decide if the top speeds of fighter A and fighter B are the same. In a sample of flights from fighters A and B the following average top speeds are observed.
Fighter | Sample Size | Sample Mean | Sample St.Dev. |
Fighter A | \(n_1 = 8\) | mach 2.89 | mach 0.19 |
Fighter B | \(n_2 = 6\) | mach 2.71 | mach 0.14 |
Question: what is the test statistic for this hypothesis test?
1.95
Subsection 5.5.3 The Traditional Test
A traditional hypothesis test using test statistics and critical values drawn from a t-distribution works in much the same way as tests using the normal distribution. Recall that the four steps are:
State the null and alternative hypotheses (done).
Compute the test statistic (done).
Find the rejection region and the critical values that separate it from the acceptance region.
Determine if the test statistic is in the rejection region, in which case we reject the null hypothesis, or in the acceptance region, in which case we fail to reject the null hypothesis.
The one catch, as mentioned previously, is that we must choose our critical values from the t-distribution with the appropriate number of degrees of freedom. The value of df will vary with the sample size or sizes with which we are working, but the formulas are the same as those found in Section 4.5 and restated below.
- Single Sample Mean. Degrees of freedom is one less than the sample size: \(df = n - 1\text{.}\)
- Difference Between Means. Degrees of freedom is two less than the sum of the sample sizes: \(df = n_1 + n_2 - 2\text{.}\)
Let's revisit the two examples we've seen in this lesson so far, finishing them off with a traditional hypothesis test.
Example 5.5.14. Conducting a Traditional Hypothesis Test for a Small Sample Mean.
You believe that the average lifespan of a supernova is 74 days. Because supernovas are so rare (approximately one every 50 years in our galaxy), you can only find accurate records from 7 supernovas to use in your data. Their average lifespan was 63.8 days with a standard deviation of 12.6 days. Conduct a traditional hypothesis test at the \(\alpha = 0.10\) significance level.
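We have previously seen that the hypotheses are:
\[H_0: \mu = 74 \qquad H_a: \mu \neq 74\]
And the test statistic is:
\[t_\text{test} = \frac{63.8 - 74}{12.6/\sqrt{7}} \approx -2.142\]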
Since this is a two-tailed hypothesis test, the rejection region will split the significance level of \(\alpha = 0.10\) into two tails with area \(0.05\) each as shown below.
The number of degrees of freedom is \(7 - 1 = 6\text{.}\) Therefore, from the t-distribution table, the critical values are \(1.943\) for the right tail, and \(-1.943\) for the left tail. Because our test statistic of \(t_\text{test} = -2.142\) is further into the left tail than the critical value -1.943, it is in the rejection region. We therefore must reject the null hypothesis. There is evidence tending towards significance that supernovas do not have an average lifespan of 74 days.
Example 5.5.16. Conducting a Traditional Hypothesis Test for a Difference Between Small Sample Means.
You believe that moon rocks contain less iron than do earth rocks. To test this hypothesis, you collect a sample of 300 rocks from all over the earth and find that they contain an average of 6.7 grams per cubic inch of iron with a standard deviation of 2.9 grams per cubic inch. You only have access to six moon rocks, however, and find that they have an average of 2.4 grams per cubic inch with a standard deviation of 0.35 grams per cubic inch. Conduct a traditional hypothesis test at the \(\alpha = 0.05\) significance level.
From previous work, the hypotheses are:
\[H_0: \mu_1 - \mu_2 = 0 \qquad H_a: \mu_1 - \mu_2 \lt 0\]
Our test statistic is:
\[t_\text{test} = \frac{(2.4 - 6.7) - 0}{\sqrt{8.274\left(\frac{1}{6} + \frac{1}{300}\right)}} \approx -3.626\]
This is a left-tailed test (note that the alternative hypothesis involves “\(\lt\)”) and therefore our critical value will be negative. To find it, we look up the positive critical value in the t-distribution table that leaves an area of \(0.05\) (the significance level) in the right tail and has \(df = n_1 + n_2 - 2 = 6 + 300 - 2 = 304\) degrees of freedom. This is well above 30, so we use the value \(1.645\) from the bottom row of the t-distribution table. To get the left-tail critical value, we change the sign to get \(-1.645\) as shown below.
Since our test statistic of \(t_\text{test} = -3.626\) is much more extreme than the critical value \(-1.645\text{,}\) we reject the null hypothesis. There is statistically significant evidence that the iron content of moon rocks is less than that of rocks on earth.
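The final step of the traditional test, deciding whether the test statistic lies in the rejection region, can be written as a short rule. The sketch below is in Python and assumes the critical value has already been read from the t-distribution table as a positive number; the function name `in_rejection_region` and the tail labels are illustrative choices, not a standard interface.

```python
def in_rejection_region(t_test, t_critical, tail):
    """Return True when t_test falls in the rejection region.

    t_critical is the positive table value; tail is "left", "right", or "two".
    For a two-tailed test, pass the value whose one-tail area is alpha / 2.
    """
    if tail == "left":
        return t_test < -t_critical
    if tail == "right":
        return t_test > t_critical
    if tail == "two":
        return abs(t_test) > t_critical
    raise ValueError("tail must be 'left', 'right', or 'two'")

# Supernova example: two-tailed, alpha = 0.10, df = 6, table value 1.943
print(in_rejection_region(-2.142, 1.943, "two"))   # True, so reject
# Moon rock example: left-tailed, alpha = 0.05, table value 1.645
print(in_rejection_region(-3.626, 1.645, "left"))  # True, so reject
```

A result of `False` corresponds to failing to reject the null hypothesis.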
Checkpoint 5.5.20.
A sports statistician believes that the winner of the Super Bowl, on average, scores at least one touch-down (seven points) more than the loser. To test this theory, the scores of 12 winning teams are averaged to get 26.4 with a standard deviation of 4.6 points. The scores of 11 losing teams are also averaged to get 17.5 with a standard deviation of 5.2 points.
Question: what conclusion should be made about the statistician's claim? Use a traditional hypothesis test at the \(\alpha = 0.10\) significance level.
No evidence the winners score at least one touch-down more than the losers.
Checkpoint 5.5.21.
You believe that the average loudness of a rock concert in decibels is at least 117 dB. To test this, you visit eleven rock concerts and find an average decibel level of 121 dB with a standard deviation of 7.8 dB.
Question: what conclusion should you make? Use a traditional test at the \(\alpha = 0.10\) significance level.
Reject the null hypothesis
Subsection 5.5.4 Difficulties of the P-Value Test
What about p-value tests? Up to this point, while we've done both traditional and p-value hypothesis tests, p-value tests have been a little more useful since we find the actual probability of finding a sample at least as extreme as the one we got. This allows those who are reading our work to make up their own minds on what significance level should be used. However, when dealing with t-distributions, we run into a problem. Consider the portion of the t-distribution table shown below.
df | \(t_{0.100}\) | \(t_{0.050}\) | \(t_{0.025}\) | \(t_{0.010}\) | \(t_{0.005}\) |
1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 |
2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 |
3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 |
4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 |
5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 |
Suppose we wanted to find the p-value for a right-tailed test statistic \(t_\text{test} = 2.316\) with four degrees of freedom. In the normal distribution table, we would look up the z-score for our test statistic and find the probability. But the t-distribution table is much more limited. The t-scores are in the body of the table, and the probabilities for this small selection of t-scores are listed in the column headings.
So, for this example, we would have to look in the \(df = 4\) row and observe that the test statistic of \(2.316\) is between \(2.132\) and \(2.776\text{.}\) Therefore its p-value is between \(0.05\) and \(0.025\text{,}\) the probabilities that are at the top of those two columns. Notice that the order changed because a larger test statistic will have a smaller p-value. So, we'll reorder this and state that the p-value is between 0.025 and 0.05.
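The table lookup just described can be mimicked in code. The following Python sketch stores only the five rows of the table excerpt above and returns the pair of tail areas that bracket the p-value. It assumes the test statistic lies in the direction of the alternative hypothesis, and the names `TAIL_AREAS`, `T_TABLE`, and `p_value_range` are introduced here purely for illustration.

```python
# Tail areas from the column headings of the table excerpt above.
TAIL_AREAS = [0.100, 0.050, 0.025, 0.010, 0.005]

# Rows of the same excerpt, keyed by degrees of freedom.
T_TABLE = {
    1: [3.078, 6.314, 12.706, 31.821, 63.657],
    2: [1.886, 2.920, 4.303, 6.965, 9.925],
    3: [1.638, 2.353, 3.182, 4.541, 5.841],
    4: [1.533, 2.132, 2.776, 3.747, 4.604],
    5: [1.476, 2.015, 2.571, 3.365, 4.032],
}

def p_value_range(t_test, df, two_tailed=False):
    """Return (lower, upper) bounds on the p-value read from the table row."""
    t_abs = abs(t_test)
    lower, upper = 0.0, 0.5  # bounds when the statistic is off the table
    for area, t_crit in zip(TAIL_AREAS, T_TABLE[df]):
        if t_abs >= t_crit:
            upper = area   # tail area is at most this column's area
        else:
            lower = area   # tail area is more than this column's area
            break
    factor = 2 if two_tailed else 1  # double both bounds for two tails
    return (lower * factor, upper * factor)

print(p_value_range(2.316, 4))  # (0.025, 0.05)
```

Setting `two_tailed=True` doubles both bounds, which is the adjustment needed for a two-tailed test.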
While this is the best we can do with test statistics from t-distributions using tables, we could use a computer to find p-values for any \(t_\text{test}\) statistic. That process depends on the specific computer program being used and is beyond the scope of this text.
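To give a sense of what such a computation involves, here is one possible sketch in Python that uses only the standard library. It integrates the t-distribution density numerically with Simpson's rule; this is an illustrative approach chosen for this example, not the method used by any particular statistics package, and the function names `t_pdf` and `right_tail_area` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coeff = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_coeff - (df + 1) / 2 * math.log(1 + x * x / df))

def right_tail_area(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    if t < 0:
        raise ValueError("pass a non-negative t; use symmetry for negative values")
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half of the distribution lies to the right of 0; subtract the area on [0, t].
    return 0.5 - total * h / 3

p = right_tail_area(2.316, 4)
print(0.025 < p < 0.05)  # True, in agreement with the range found from the table
```

For a left-tailed statistic, the symmetry of the t-distribution lets us pass its absolute value, and for a two-tailed test the result would be doubled.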
Example 5.5.23. Estimating p-Values for a Small Sample Mean.
You believe that the average lifespan of a supernova is 74 days. Because supernovas are so rare (approximately one every 50 years in our galaxy), you can only find accurate records from 7 supernovas to use in your data. Their average lifespan was 63.8 days with a standard deviation of 12.6 days. Conduct a p-value test for this claim, giving a range into which the p-value must fall.
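We have previously seen that the hypotheses are:
\[H_0: \mu = 74 \qquad H_a: \mu \neq 74\]
The test statistic is:
\[t_\text{test} = \frac{63.8 - 74}{12.6/\sqrt{7}} \approx -2.142\]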
This test statistic has an absolute value between the entries \(1.943\) and \(2.447\) in the \(df = 7 - 1 = 6\) degrees of freedom row of the t-distribution table. Therefore, it lies between the critical values \(-2.447\) and \(-1.943\) on the negative side of the distribution. The probability of being further into the left-tail must therefore be somewhere between 0.025 and 0.050. Doubling these because this is a two-tailed test, the p-value for this test statistic is somewhere between 0.05 and 0.10.
We therefore will reject the null hypothesis at the \(\alpha = 0.10\) significance level, but fail to reject it at the 0.05 and 0.01 significance levels. There is evidence tending towards significance that the mean lifespan of a supernova is different from 74 days, but there is not statistically significant or highly significant evidence that the mean is different from 74 days.
Example 5.5.24. Estimating p-Values for a Small Sample Difference Between Means.
You believe that moon rocks contain less iron than do earth rocks. To test this hypothesis, you collect a sample of 300 rocks from all over the earth and find that they contain an average of 6.7 grams per cubic inch of iron with a standard deviation of 2.9 grams per cubic inch. You only have access to six moon rocks, however, and find that they have an average of 2.4 grams per cubic inch with a standard deviation of 0.35 grams per cubic inch. Conduct a p-value test for this claim, giving a range into which the p-value must fall.
From previous work, the hypotheses are:
\[H_0: \mu_1 - \mu_2 = 0 \qquad H_a: \mu_1 - \mu_2 \lt 0\]
The test statistic is:
\[t_\text{test} = \frac{(2.4 - 6.7) - 0}{\sqrt{8.274\left(\frac{1}{6} + \frac{1}{300}\right)}} \approx -3.626\]
Because the number of degrees of freedom, \(6 + 300 - 2 = 304\text{,}\) is 30 or more, we use the bottom row of the t-distribution table. The test statistic of \(-3.626\) has an absolute value greater than the largest entry in that row, \(2.576\text{,}\) which corresponds to a probability of 0.005 in the tail. Therefore, the p-value for this test statistic is smaller than 0.005, and we would reject the null hypothesis at any of the significance levels in the table. The evidence that moon rocks contain less iron than do rocks found on earth is extremely strong.
Checkpoint 5.5.27.
A dentist believes that professional boxers require more than the average 3 fillings found in the average person. To test this hypothesis, the dentist contacts a sample of 13 professional boxers and finds that they have an average of 4.2 fillings with a standard deviation of 1.83.
Question: the P-value for this test is no more than what value from the t-distribution table?
0.025
Checkpoint 5.5.28.
The air force is evaluating two possible jet fighters based on their top speed. They wish to decide if the top speeds of fighter A and fighter B are the same. In a sample of flights from fighters A and B the following average top speeds are observed.
Fighter | Sample Size | Sample Mean | Sample St.Dev. |
Fighter A | \(n_1 = 8\) | mach 2.89 | mach 0.19 |
Fighter B | \(n_2 = 6\) | mach 2.71 | mach 0.14 |
Question: what is the smallest upper bound on the p-value for this test from the t-distribution table?
0.100