Section 6.3 Tests of Independence and Homogeneity
Tests for Contingency Tables.
We've made it to the last section in the text! In this section, we will look at one more type of hypothesis test that uses the \(\chi^2\)-distribution. This test is used to compare characteristics between two or more populations. Consider the following example.
Example 6.3.1. Explaining Independence in a Contingency Table.
A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.
 | Freshman | Sophomore | Junior | Senior |
Dorm | 114 | 99 | 90 | 78 |
Off-Campus College-Owned Housing | 42 | 83 | 104 | 124 |
Off-Campus Other Housing | 16 | 19 | 17 | 21 |
If class standing does not affect housing arrangements (they are independent), what would we expect to see in this table?
We would expect to see the same proportion of all four classes to be staying in the dorm, staying in off-campus college owned housing, and off-campus other housing. Note that this does not mean the numbers will all be equal, but the proportions of a row entry to the column total should be the same for every column.
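To make the idea of comparing proportions concrete, here is a minimal Python sketch (not part of the original text) that divides each observed count by its column total. The variable names are illustrative assumptions; the counts are those from the table above.

```python
# Observed counts from the housing table (rows: housing type, columns: class standing).
observed = {
    "Dorm": [114, 99, 90, 78],
    "Off-Campus College-Owned Housing": [42, 83, 104, 124],
    "Off-Campus Other Housing": [16, 19, 17, 21],
}
classes = ["Freshman", "Sophomore", "Junior", "Senior"]

# Column totals: the number of students surveyed in each class.
column_totals = [
    sum(row[k] for row in observed.values()) for k in range(len(classes))
]

# Proportion of each class living in each housing type.
proportions = {
    housing: [count / total for count, total in zip(row, column_totals)]
    for housing, row in observed.items()
}

for housing, row in proportions.items():
    print(housing, [round(p, 3) for p in row])
```

In this sample the dorm proportion falls from about 0.66 for Freshmen to about 0.35 for Seniors. If class standing and housing were independent, we would expect those proportions to be roughly equal across the four columns.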
Our goal in this situation is to determine if the proportion of people who live in each housing type differs among the four populations. That is, is housing type independent of class standing, or do different classes have different distributions of housing types? Many of the same tools we used in Section 6.2 will reappear in this section. This includes the \(\chi^2\)-distribution, the notion of observed and expected counts, and the \(\chi^2\) test statistic.
Objectives
After finishing this section you should be able to
- describe the following terms:
  - Degrees of Freedom for a Test of Independence or Homogeneity
  - Expected Counts for a Contingency Table
  - Test of Homogeneity
  - Test of Independence
  - Test Statistic for a Test of Independence or Homogeneity
- accomplish the following tasks:
  - Formulate hypotheses for tests of independence and homogeneity
  - Compute the expected counts for a test of independence or homogeneity
  - Compute the \(\chi^2\) test statistic for a contingency table
  - Look up critical values in the \(\chi^2\)-distribution table
  - Conduct a traditional test of independence or homogeneity
Subsection 6.3.1 Two Types of Hypotheses
In this section, we will look at two types of hypothesis tests. The only difference between these two tests is the way in which the null hypothesis is stated. Because of this, the difference may seem like a technicality. However, it is important that we recognize the difference between these two types of tests.
Definition 6.3.3.
In a \(\chi^2\) test of independence, the null hypothesis is that two characteristics in a single population are independent of each other.
Definition 6.3.4.
In a \(\chi^2\) test of homogeneity, the null hypothesis is that a single characteristic is distributed the same way across two or more populations.
The difference between these two tests has to do with the number of populations. If we are looking at a single characteristic across different populations, then we are conducting a test of homogeneity. On the other hand, if we are looking at the relationship between two characteristics in the same population, we need to conduct a test of independence.
In both cases, the alternative hypothesis is simply the negation of the null hypothesis: the two characteristics are not independent, or the populations are not homogeneous. In the following examples, we will take a look at the hypotheses for each of these tests.
Example 6.3.5. Recognizing a Test of Homogeneity.
A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.
 | Freshman | Sophomore | Junior | Senior |
Dorm | 114 | 99 | 90 | 78 |
Off-Campus College-Owned Housing | 42 | 83 | 104 | 124 |
Off-Campus Other Housing | 16 | 19 | 17 | 21 |
State the null and alternative hypotheses for this test.
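We are looking at a single characteristic, housing type, among the four populations of Freshmen, Sophomores, Juniors, and Seniors. Therefore, this is a test of homogeneity. The hypotheses are therefore:
\begin{equation*} \begin{aligned} H_0 &: \text{Housing type is distributed the same way in all four class standings}\\ H_a &: \text{Housing type is not distributed the same way in all four class standings} \end{aligned} \end{equation*}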
Example 6.3.7. Recognizing a Test of Independence.
An efficiency expert at a certain large company believes that the number of years of experience that an employee has is independent of their efficiency rating. To test this claim, the following data is collected.
 | \(\lt\) 5 Years | 5-15 Years | 15+ Years |
Low Efficiency | 21 | 14 | 11 |
Moderate Efficiency | 45 | 28 | 24 |
High Efficiency | 62 | 38 | 32 |
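State the hypotheses for this test.
In this case, we are comparing two characteristics of the same population of employees. This is therefore a test of independence, and the hypotheses are:
\begin{equation*} \begin{aligned} H_0 &: \text{Efficiency rating and years of experience are independent}\\ H_a &: \text{Efficiency rating and years of experience are not independent} \end{aligned} \end{equation*}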
Checkpoint 6.3.11.
A fish and game expert wishes to determine if two species of fish that live together in a certain lake have the same proportion of adult fish, juvenile fish, and minnows, or baby fish.
Question: what hypotheses should he use?
Checkpoint 6.3.12.
A social worker believes that there is a relationship between marital status: married, widowed, divorced, or never married, and happiness. To test this, she randomly selects a group of 1000 individuals and records their marital status and happiness levels.
Question: what hypotheses should she use?
Subsection 6.3.2 Observed and Expected Counts
No matter which type of hypothesis test we are doing, a test of homogeneity or a test of independence, the underlying assumption of the null hypothesis is the same. That assumption is that the rows and columns of our contingency table are independent. Put another way, the probability of being in a certain row is the same no matter what column you are in. Let's explore that idea with the following table.
We will use the following notation.
\(O_{ij}\) represents the observed count in the \(i^\text{th}\) column and \(j^\text{th}\) row. So, for example, \(O_{32}\) is the number of individuals in the 3rd column, 2nd row.
\(R_j\) stands for the row total of the \(j^\text{th}\) row. That means \(R_1 = O_{11} + O_{21} + O_{31}\) in the table shown.
\(C_i\) stands for the column total of the \(i^\text{th}\) column. So \(C_1 = O_{11} + O_{12}\) in this table.
Finally, \(T\) stands for the grand total. In our table,
\begin{equation*} T = C_1 + C_2 + C_3 = R_1 + R_2\text{.} \end{equation*}
 | col 1 | col 2 | col 3 | total |
row 1 | \(O_{11}\) | \(O_{21}\) | \(O_{31}\) | \(R_1\) |
row 2 | \(O_{12}\) | \(O_{22}\) | \(O_{32}\) | \(R_2\) |
total | \(C_1\) | \(C_2\) | \(C_3\) | \(T\) |
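The notation above maps directly onto a few lines of code. The following is a minimal Python sketch, assuming the table is stored as a list of rows; the function name `marginals` is a hypothetical helper, not something defined in the text. Note that Python indexes from zero and row-first, so the text's \(O_{ij}\) (column \(i\text{,}\) row \(j\)) corresponds to `table[j - 1][i - 1]`.

```python
def marginals(table):
    """Return (row_totals, column_totals, grand_total) for a list-of-rows table."""
    row_totals = [sum(row) for row in table]                 # R_1, R_2, ...
    column_totals = [sum(column) for column in zip(*table)]  # C_1, C_2, ...
    grand_total = sum(row_totals)                            # T
    return row_totals, column_totals, grand_total


# Housing data from Example 6.3.1 (rows: Dorm, College-Owned, Other).
table = [
    [114, 99, 90, 78],
    [42, 83, 104, 124],
    [16, 19, 17, 21],
]
R, C, T = marginals(table)
print(R, C, T)
```

Adding up either the row totals or the column totals gives the same grand total, which is a quick way to catch arithmetic mistakes.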
Our goal is to find out what counts we would expect in each entry if we assume that the row and column probabilities are independent. Recall that we saw a test for independence in Section 2.4. It stated that two events, \(A\) and \(B\text{,}\) are independent if and only if
\begin{equation*} P(A \text{ and } B) = P(A) \cdot P(B)\text{.} \end{equation*}
Suppose that \(A\) is the event of being in column \(i\) and \(B\) is the event of being in row \(j\text{.}\) Using the table above, we can figure out what these probabilities are.
\begin{equation*} P(A) = \frac{C_i}{T} \qquad P(B) = \frac{R_j}{T} \end{equation*}
If these two events are independent, then the probability of being in the single cell in both column \(i\) and row \(j\) should be
\begin{equation*} P(A \text{ and } B) = \frac{C_i}{T} \cdot \frac{R_j}{T}\text{.} \end{equation*}
This gives us the probability of being in the cell in column \(i\) and row \(j\text{.}\) To get the expected number of counts, we multiply that probability by the total number of individuals, \(T\text{.}\) This gives the following formulas for computing the expected counts in a contingency table.
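Written out as a short derivation, and taking the column and row probabilities to be the relative frequencies \(C_i/T\) and \(R_j/T\) from the table, the multiplication by \(T\) simplifies as follows. The symbol \(E_{ij}\) for the expected count is used here to match the notation of the test statistic later in the section.
\begin{equation*} E_{ij} = T \cdot \frac{C_i}{T} \cdot \frac{R_j}{T} = \frac{C_i \cdot R_j}{T} \end{equation*}
One factor of \(T\) cancels, which is why only a single grand total appears in the denominator.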
Theorem 6.3.14. Expected Counts for a Contingency Table.
The expected count for a cell in column \(i\) and row \(j\) of a contingency table under the assumption that rows and columns are independent is given by:
\begin{equation*} E_{ij} = \frac{C_i \cdot R_j}{T} \end{equation*}
where \(C_i\) is the column total, \(R_j\) is the row total, and \(T\) is the grand total in the table.
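As an illustration, the expected counts for a whole table can be computed in one pass. This is a minimal Python sketch under the assumption that the observed counts are stored as a list of rows; `expected_counts` is a hypothetical name chosen for this example.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    return [
        [r * c / grand_total for c in column_totals]
        for r in row_totals
    ]


# Housing data (rows: Dorm, College-Owned, Other; columns: Freshman..Senior).
housing = [
    [114, 99, 90, 78],
    [42, 83, 104, 124],
    [16, 19, 17, 21],
]
expected = expected_counts(housing)

# Juniors (third column) in College-Owned housing (second row).
print(round(expected[1][2], 2))
```

Because every expected count is built from the same row and column totals, each row and column of the expected table sums to the same total as the observed table, up to rounding.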
Let's practice finding the expected counts using one of our previously seen examples.
Example 6.3.15. Finding Expected Counts for a Contingency Table.
A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.
 | Freshman | Sophomore | Junior | Senior |
Dorm | 114 | 99 | 90 | 78 |
Off-Campus College-Owned Housing | 42 | 83 | 104 | 124 |
Off-Campus Other Housing | 16 | 19 | 17 | 21 |
Find the expected counts for each entry in this table.
In order to compute the expected counts, we follow the process outlined below.
First compute the marginal distributions (row and column totals).
Next, use the formula for expected counts above for each cell, taking the product of its row and column totals divided by the grand total.
Finally, check our work by verifying that the row and column totals of the expected counts match the row and column totals of the observed counts, ignoring any rounding error.
While we won't go through the details of the computation for every cell, the expected number of Juniors in College Owned housing is found as follows:
\begin{equation*} E_{32} = \frac{C_3 \cdot R_2}{T} = \frac{211 \cdot 353}{807} \approx 92.30 \end{equation*}
The rest of the table is filled in as shown below, with each expected count in parentheses after the observed count.
 | Freshman | Sophomore | Junior | Senior | Total |
Dorm | 114 (81.20) | 99 (94.90) | 90 (99.62) | 78 (105.28) | 381 |
College Owned | 42 (75.24) | 83 (87.92) | 104 (92.30) | 124 (97.55) | 353 |
Other | 16 (15.56) | 19 (18.18) | 17 (19.09) | 21 (20.17) | 73 |
Total | 172 | 201 | 211 | 223 | 807 |
Pay special attention to the third step in the process outlined in the example above. The sum of the observed counts and the sum of the expected counts in any row or column should be the same. In fact, this can be used to help reduce the number of computations you have to perform. If you find the expected count in all but one of the cells in a row or column, you can subtract those expected counts from the row or column total, and what is left must be the expected count in the final cell.
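As a quick illustration of this shortcut, take the Senior column from Example 6.3.15. Once the expected counts for Dorm and College Owned housing are known, the expected count for Other housing follows by subtraction from the column total of 223:
\begin{equation*} 223 - 105.28 - 97.55 = 20.17 \end{equation*}
This agrees with the value obtained directly from the formula, \(\frac{223 \cdot 73}{807} \approx 20.17\text{.}\) In other columns the two methods may differ in the last decimal place because of rounding.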
Checkpoint 6.3.20.
In a test of independence, the following observed counts are recorded.
 | Characteristic 1 | Characteristic 2 |
Characteristic A | 24 | 49 |
Characteristic B | 46 | 85 |
Characteristic C | 31 | 62 |
Question: What is the expected count for the entry in characteristic B, characteristic 2?
Answer: 86.45
Checkpoint 6.3.22.
In a test of homogeneity, the following observed counts are recorded.
 | Population 1 | Population 2 |
Characteristic A | 24 | 49 |
Characteristic B | 46 | 85 |
Characteristic C | 31 | 62 |
Question: what is the expected count for the entry in characteristic C, population 1?
Answer: 31.63
Subsection 6.3.3 The Test Statistic
As in the goodness-of-fit test, a test of homogeneity or independence is based on measuring the differences between what we observed in our sample and what we would expect if the null hypothesis were true. We just saw how to use the observed counts in a contingency table to find the expected counts for each entry in that table. The test statistic is then computed as shown below.
Theorem 6.3.24. Test Statistic for a Test of Independence or Homogeneity.
The \(\chi^2\) test statistic for a test of independence or homogeneity using a contingency table in which \(O_{ij}\) and \(E_{ij}\) represent the observed and expected counts in the cell in column \(i\) and row \(j\) is given by:
\begin{equation*} \chi^2_\text{test} = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \end{equation*}
Don't let the double summation sign confuse you. All that this means is that we add together the observed count minus the expected count squared divided by the expected count for every cell in the contingency table. Below we perform this computation on the table we found in Example 6.3.15.
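The double summation corresponds to two nested loops, one over rows and one over the cells within each row. The following minimal Python sketch is an illustration only; the function name `chi_square_statistic` is hypothetical, and the observed and rounded expected counts are those from the table in Example 6.3.15.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of the contingency table."""
    total = 0.0
    for observed_row, expected_row in zip(observed, expected):  # one sum over rows
        for o, e in zip(observed_row, expected_row):            # one sum over columns
            total += (o - e) ** 2 / e
    return total


# Observed and (rounded) expected counts for the housing data.
observed = [
    [114, 99, 90, 78],
    [42, 83, 104, 124],
    [16, 19, 17, 21],
]
expected = [
    [81.20, 94.90, 99.62, 105.28],
    [75.24, 87.92, 92.30, 97.55],
    [15.56, 18.18, 19.09, 20.17],
]
statistic = chi_square_statistic(observed, expected)
print(round(statistic, 2))
```

With these rounded expected counts the sum comes out to roughly 45.35. The order of the two loops does not matter, since every cell is visited exactly once either way.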
Example 6.3.25. Computing the Test Statistic for a Test of Homogeneity.
A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.
 | Freshman | Sophomore | Junior | Senior |
Dorm | 114 | 99 | 90 | 78 |
Off-Campus College-Owned Housing | 42 | 83 | 104 | 124 |
Off-Campus Other Housing | 16 | 19 | 17 | 21 |
Compute the test statistic for this test.
Recall that the observed and expected counts were as follows, with each expected count in parentheses after the observed count.
 | Freshman | Sophomore | Junior | Senior | Total |
Dorm | 114 (81.20) | 99 (94.90) | 90 (99.62) | 78 (105.28) | 381 |
College Owned | 42 (75.24) | 83 (87.92) | 104 (92.30) | 124 (97.55) | 353 |
Other | 16 (15.56) | 19 (18.18) | 17 (19.09) | 21 (20.17) | 73 |
Total | 172 | 201 | 211 | 223 | 807 |
To compute the test statistic, we must step through each entry in the table, adding the appropriate values.
\begin{equation*} \chi^2_\text{test} = \frac{(114-81.20)^2}{81.20} + \frac{(99-94.90)^2}{94.90} + \cdots + \frac{(21-20.17)^2}{20.17} \approx 45.35 \end{equation*}
Checkpoint 6.3.30.
In a test of homogeneity, the following observed counts are measured, and the expected counts are shown in parentheses after each observed count.
 | Population 1 | Population 2 |
Characteristic A | 114 (98.31) | 99 (114.69) |
Characteristic B | 42 (57.69) | 83 (67.31) |
Question: what is the test statistic for this test? Round your answer to three decimal places.
Answer: 12.575
Checkpoint 6.3.32.
In a test of independence, the following observed counts are recorded, and the expected counts are shown in parentheses after each observed count.
 | Characteristic 1 | Characteristic 2 |
Characteristic A | 27 (28.35) | 37 (35.65) |
Characteristic B | 43 (41.65) | 51 (52.35) |
Question: what is the test statistic for this test? Round your answer to three decimal places.
Answer: 0.194
Subsection 6.3.4 Traditional Tests
The last step in conducting a test of homogeneity or independence is to find the appropriate critical value from the \(\chi^2\)-distribution table. To do that, we need the number of degrees of freedom. Because we are dealing with contingency tables, instead of the categories we saw in the goodness-of-fit test, the degrees of freedom must be computed differently.
Theorem 6.3.34. Degrees of Freedom for a Test of Independence or Homogeneity.
The degrees of freedom for a test of independence or homogeneity using a contingency table with \(c\) columns and \(r\) rows is \(df = (c-1)(r-1)\text{.}\)
Thus, if we take one less than the number of columns times one less than the number of rows, we have the degrees of freedom to use when looking up our critical value. This process is shown in the following examples.
Example 6.3.35. Conducting a Test of Homogeneity.
A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.
 | Freshman | Sophomore | Junior | Senior |
Dorm | 114 | 99 | 90 | 78 |
Off-Campus College-Owned Housing | 42 | 83 | 104 | 124 |
Off-Campus Other Housing | 16 | 19 | 17 | 21 |
Use a \(\chi^2\) test at the \(\alpha = 0.01\) significance level to make this determination.
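Recall that our null hypothesis is that housing type is distributed the same way in all four class standings, which means that the alternative is that they are not homogeneous, or in other words, that students in different class standings have different types of housing.
Our test statistic from Example 6.3.25 was:
\begin{equation*} \chi^2_\text{test} \approx 45.35 \end{equation*}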
At the 0.01 significance level, with \((4-1)(3-1) = 6\) degrees of freedom, the critical value from the \(\chi^2\)-distribution table is \(\chi^2_{0.01} = 16.812\) as shown below.
Since our test statistic is further into the right tail than the critical value of 16.812, we must reject the null hypothesis. There is highly significant evidence that students with different class standing live in different types of housing.
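The whole traditional test can be sketched in a few lines of Python. This is an illustrative sketch rather than a definitive implementation: the function name `chi_square_test` is hypothetical, and the critical value is assumed to be looked up by hand in the \(\chi^2\)-distribution table (here 16.812, the value quoted above for \(\alpha = 0.01\) and 6 degrees of freedom).

```python
def chi_square_test(observed, critical_value):
    """Traditional chi-square test for a contingency table given as a list of rows.

    critical_value must be looked up in a chi-square table for the chosen
    significance level and the degrees of freedom returned by this function.
    """
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)

    # Add (O - E)^2 / E over every cell, with E = row total * column total / T.
    statistic = 0.0
    for row, r in zip(observed, row_totals):
        for o, c in zip(row, column_totals):
            e = r * c / grand_total
            statistic += (o - e) ** 2 / e

    # Degrees of freedom: (columns - 1) * (rows - 1).
    df = (len(column_totals) - 1) * (len(row_totals) - 1)

    # Reject the null hypothesis when the statistic lies beyond the critical value.
    reject_null = statistic > critical_value
    return statistic, df, reject_null


housing = [
    [114, 99, 90, 78],
    [42, 83, 104, 124],
    [16, 19, 17, 21],
]
statistic, df, reject_null = chi_square_test(housing, critical_value=16.812)
print(round(statistic, 2), df, reject_null)
```

For the housing data this reports 6 degrees of freedom and a statistic of about 45.35, which is well beyond 16.812, so the sketch agrees with the decision to reject the null hypothesis.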
Example 6.3.38. Conducting a Test of Independence.
An efficiency expert at a certain large company believes that the number of years of experience that an employee has is independent of their efficiency rating. To test this claim, the following data is collected.
 | \(\lt\) 5 Years | 5-15 Years | 15+ Years |
Low Efficiency | 21 | 14 | 11 |
Moderate Efficiency | 45 | 28 | 24 |
High Efficiency | 62 | 38 | 32 |
Conduct a test of independence at the \(\alpha = 0.05\) significance level.
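You may recall that the hypotheses for this test are:
\begin{equation*} \begin{aligned} H_0 &: \text{Efficiency rating and years of experience are independent}\\ H_a &: \text{Efficiency rating and years of experience are not independent} \end{aligned} \end{equation*}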
Next, we compute the expected counts, shown in parentheses after each observed count.
 | \(\lt\) 5 Years | 5-15 Years | 15+ Years | Total |
Low Efficiency | 20 (21.41) | 15 (13.38) | 11 (11.20) | 46 |
Moderate Efficiency | 43 (45.15) | 27 (28.22) | 27 (23.63) | 97 |
High Efficiency | 65 (61.44) | 38 (38.40) | 29 (32.16) | 132 |
Total | 128 | 80 | 67 | 275 |
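And then we compute the \(\chi^2\) test statistic using the observed and expected counts in this table.
\begin{equation*} \chi^2_\text{test} = \frac{(20-21.41)^2}{21.41} + \frac{(15-13.38)^2}{13.38} + \cdots + \frac{(29-32.16)^2}{32.16} \approx 1.45 \end{equation*}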
At the \(\alpha = 0.05\) significance level with \(df = (3-1)(3-1) = 4\text{,}\) the critical value from the \(\chi^2\)-distribution is 9.488. The distribution and test statistic are shown below.
Since our test statistic is not further into the tail than the critical value of \(\chi^2_{0.05} = 9.488\text{,}\) we must fail to reject the null hypothesis. There is no statistically significant evidence that efficiency and years worked are related. These two variables appear to be independent.
Checkpoint 6.3.44.
In a test of homogeneity, the following observed counts are recorded, and the expected counts are shown in parentheses after each observed count.
 | Population 1 | Population 2 |
Characteristic A | 51 (43.22) | 42 (49.78) |
Characteristic B | 61 (68.78) | 87 (79.22) |
This results in a test statistic \(\chi^2_\text{test} = 4.260\text{.}\)
Question: what is your conclusion at the \(\alpha = 0.01\) significance level?
Answer: The populations appear to be homogeneous.
Checkpoint 6.3.46.
In a test of independence, the following observed counts are recorded, and the expected counts are shown in parentheses after each observed count.
 | Characteristic 1 | Characteristic 2 |
Characteristic A | 51 (43.22) | 42 (49.78) |
Characteristic B | 61 (68.78) | 87 (79.22) |
This results in a test statistic \(\chi^2_{\text{test}} = 4.260\text{.}\)
Question: what is your conclusion at the \(\alpha = 0.05\) significance level?
Answer: There is statistically significant evidence the characteristics are dependent.