Tests of Independence and Homogeneity

Section 6.3 Tests of Independence and Homogeneity

Tests for Contingency Tables.

We've made it to the last section in the text! In this section, we will look at one more type of hypothesis test that uses the \(\chi^2\)-distribution. This test is used to compare characteristics between two or more populations. Consider the following example.

Example 6.3.1. Explaining Independence in a Contingency Table.

A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.

	Freshman	Sophomore	Junior	Senior
Dorm	114	99	90	78
Off-Campus College-Owned Housing	42	83	104	124
Off-Campus Other Housing	16	19	17	21

Table 6.3.2. Student Housing Status by Class Standing

If class standing does not affect housing arrangements (they are independent), what would we expect to see in this table?

Solution

We would expect each of the four classes to have the same proportion of students living in the dorm, in off-campus college-owned housing, and in other off-campus housing. Note that this does not mean the counts will all be equal. Rather, within each row, the ratio of an entry to its column total should be the same for every column.
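One way to make this concrete is to compute, for each class, the fraction of students in each housing type. The short Python sketch below does this for Table 6.3.2. The variable names are illustrative choices and are not part of the text.

```python
# Observed counts from Table 6.3.2: rows are housing types, columns are class standings.
observed = {
    "Dorm": [114, 99, 90, 78],
    "Off-Campus College-Owned Housing": [42, 83, 104, 124],
    "Off-Campus Other Housing": [16, 19, 17, 21],
}
classes = ["Freshman", "Sophomore", "Junior", "Senior"]

# Column totals: the number of students surveyed in each class.
column_totals = [sum(row[k] for row in observed.values()) for k in range(len(classes))]

# Proportion of each class living in each housing type (entry divided by column total).
column_proportions = {
    housing: [count / total for count, total in zip(row, column_totals)]
    for housing, row in observed.items()
}

for housing, proportions in column_proportions.items():
    formatted = ", ".join(f"{name}: {p:.3f}" for name, p in zip(classes, proportions))
    print(f"{housing}: {formatted}")
```

If housing were independent of class standing, the four proportions in each row would be roughly equal. Here the dorm proportion falls from about 0.66 for freshmen to about 0.35 for seniors, which is the kind of pattern the test in this section is designed to evaluate.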

Our goal in this situation is to determine whether the proportion of people who live in each housing type differs among the four populations. That is, is housing type independent of class standing, or do different classes have different distributions of housing types? Many of the same tools we used in Section 6.2 will reappear in this section, including the \(\chi^2\)-distribution, the notion of observed and expected counts, and the \(\chi^2\) test statistic.

Objectives

After finishing this section you should be able to

describe the following terms:
- Degrees of Freedom for a Test of Independence or Homogeneity
- Expected Counts for a Contingency Table
- Test of Homogeneity
- Test of Independence
- Test Statistic for a Test of Independence or Homogeneity
accomplish the following tasks:
- Formulate hypotheses for tests of independence and homogeneity
- Compute the expected counts for a test of independence or homogeneity
- Compute the \(\chi^2\) test statistic for a contingency table
- Look up critical values in the \(\chi^2\)-distribution table
- Conduct a traditional test of independence or homogeneity

Subsection 6.3.1 Two Types of Hypotheses

In this lesson, we will actually look at two types of hypothesis tests. The only difference between these two tests is the way in which the null hypothesis is stated. Because of this, the difference may seem like a technicality. However, it is important that we recognize the difference between these two types of tests.

Definition 6.3.3.

In a \(\chi^2\) test of independence, the null hypothesis is that two characteristics in a single population are independent of each other.

Definition 6.3.4.

In a \(\chi^2\) test of homogeneity, the null hypothesis is that a single characteristic is distributed the same way across two or more populations.

The difference between these two tests has to do with the number of populations. If we are looking at a single characteristic across different populations, then we are conducting a test of homogeneity. On the other hand, if we are looking at the relationship between two characteristics in the same population, we need to conduct a test of independence.

In both cases, the alternative hypothesis is simply the negation of the null hypothesis: the characteristics are not independent, or the populations are not homogeneous. In the following examples, we will take a look at the hypotheses for each of these tests.

Example 6.3.5. Recognizing a Test of Homogeneity.

A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.

	Freshman	Sophomore	Junior	Senior
Dorm	114	99	90	78
Off-Campus College-Owned Housing	42	83	104	124
Off-Campus Other Housing	16	19	17	21

Table 6.3.6. Student Housing Status by Class Standing

State the null and alternative hypotheses for this test.

Solution

We are looking at a single characteristic, housing type, among the four populations of Freshmen, Sophomores, Juniors, and Seniors. Therefore, this is a test of homogeneity. The hypotheses are therefore:

\begin{align*} H_0\amp:\ \text{Housing distributions in the four populations are homogeneous}\\ H_A\amp:\ \text{Housing distributions are not homogeneous across the four populations} \end{align*}

Example 6.3.7. Recognizing a Test of Independence.

An efficiency expert at a certain large company believes that the number of years of experience an employee has is independent of their efficiency rating. To test this claim, the following data is collected.

	\(\lt\) 5 Years	5-15 Years	15+ Years
Low Efficiency	20	15	11
Moderate Efficiency	43	27	27
High Efficiency	65	38	29

Table 6.3.8. Efficiency Rating vs. Years of Experience

State the null and alternative hypotheses for this test.

Solution

In this case, we are comparing two characteristics of the same population of employees. This is therefore a test of independence, and the hypotheses are:

\begin{align*} H_0\amp:\ \text{Efficiency and years of experience are independent characteristics}\\ H_A\amp:\ \text{These characteristics are dependent} \end{align*}

Figure 6.3.9. Two Types of Hypotheses I

Figure 6.3.10. Two Types of Hypotheses II

Checkpoint 6.3.11.

A fish and game expert wishes to determine if two species of fish that live together in a certain lake have the same proportion of adult fish, juvenile fish, and minnows, or baby fish.

Question: what hypotheses should he use?

Answer

\begin{align*} H_0\amp:\ \text{The two populations are homogeneous}\\ H_A\amp:\ \text{The populations are not homogeneous} \end{align*}

Checkpoint 6.3.12.

A social worker believes that there is a relationship between marital status (married, widowed, divorced, or never married) and happiness. To test this, she randomly selects a group of 1000 individuals and records their marital status and happiness levels.

Question: what hypotheses should she use?

Answer

\begin{align*} H_0\amp:\ \text{The characteristics are independent}\\ H_A\amp:\ \text{Marital status and happiness are dependent} \end{align*}

Subsection 6.3.2 Observed and Expected Counts

No matter which type of hypothesis test we are doing, a test of homogeneity or a test of independence, the underlying assumption of the null hypothesis is the same. That assumption is that the rows and columns of our contingency table are independent. Put another way, the probability of being in a certain row is the same no matter what column you are in. Let's explore that idea with the following table.

We will use the following notation.

\(O_{ij}\) represents the observed count in the \(i^\text{th}\) column and \(j^\text{th}\) row. So, for example, \(O_{32}\) is the number of individuals in the 3rd column, 2nd row.
\(R_j\) stands for the row total of the \(j^\text{th}\) row. That means, \(R_1 = O_{11} + O_{21} + O_{31}\) in the table shown.
\(C_i\) stands for the column total of the \(i^\text{th}\) column. So \(C_1 = O_{11} + O_{12}\) in this table.
Finally, \(T\) stands for the grand total. In our table,

\begin{equation*} T = C_1 + C_2 + C_3 = R_1 + R_2\text{.} \end{equation*}

	col 1	col 2	col 3	total
row 1	\(O_{11}\)	\(O_{21}\)	\(O_{31}\)	\(R_1\)
row 2	\(O_{12}\)	\(O_{22}\)	\(O_{32}\)	\(R_2\)
total	\(C_1\)	\(C_2\)	\(C_3\)	\(T\)

Table 6.3.13.

Our goal is to find out what counts we would expect in each entry if we assume that the row and column probabilities are independent. Recall that we saw a test for independence in Section 2.4. It stated that two events, \(A\) and \(B\text{,}\) are independent if and only if

\begin{equation*} P(A\cap B) = P(A)P(B)\text{.} \end{equation*}

Suppose that \(A\) is the event of being in column \(i\) and \(B\) is the event of being in row \(j\text{.}\) Using the table above, we can figure out what these probabilities are.

\begin{align*} P(A) \amp = P(\text{column } i) = \frac{\text{number in column } i}{\text{total in table}} = \frac{C_i}{T}\\ P(B) \amp = P(\text{row } j) = \frac{\text{number in row } j}{\text{total in table}} = \frac{R_j}{T} \end{align*}

If these two events are independent, then the probability of being in the single cell in both column \(i\) and row \(j\) should be

\begin{equation*} P(\text{column } i \cap \text{row } j) = P(A \cap B) = P(A)P(B) = \left(\frac{C_i}{T}\right)\left(\frac{R_j}{T}\right) = \frac{C_iR_j}{T^2}\text{.} \end{equation*}

This gives us the probability of being in the cell in column \(i\) and row \(j\text{.}\) To get the expected count, we multiply that probability by the total number of individuals, \(T\text{,}\) which cancels one factor of \(T\) in the denominator. This gives the following formula for computing the expected counts in a contingency table.

Theorem 6.3.14. Expected Counts for a Contingency Table.

The expected count for a cell in column \(i\) and row \(j\) of a contingency table under the assumption that rows and columns are independent is given by:

\begin{equation*} E_{ij} = \frac{C_iR_j}{T} \end{equation*}

where \(C_i\) is the column total, \(R_j\) is the row total, and \(T\) is the grand total in the table.
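The formula can be sketched in a few lines of Python. The function name `expected_counts` and the small table of counts below are hypothetical, chosen only so that the arithmetic is easy to follow. Note that the code indexes a cell as `observed[row][column]`, whereas the text writes the column index first.

```python
def expected_counts(observed):
    """Return the expected count (row total * column total / grand total) for each cell.

    `observed` is a list of rows; each row is a list of counts, one per column.
    """
    row_totals = [sum(row) for row in observed]            # R_j
    column_totals = [sum(col) for col in zip(*observed)]   # C_i
    grand_total = sum(row_totals)                          # T
    return [
        [r * c / grand_total for c in column_totals]
        for r in row_totals
    ]


# Hypothetical counts (2 rows, 3 columns), not taken from the text.
observed = [
    [12, 18, 30],
    [8, 22, 10],
]

expected = expected_counts(observed)
print(expected)  # [[12.0, 24.0, 24.0], [8.0, 16.0, 16.0]]
```

In this made-up table the row totals are 60 and 40, the column totals are 20, 40, and 40, and the grand total is 100, so the first cell has expected count \(60 \times 20 / 100 = 12\text{.}\)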

Let's practice finding the expected counts using one of our previously seen examples.

Example 6.3.15. Finding Expected Counts for a Contingency Table.

A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.

	Freshman	Sophomore	Junior	Senior
Dorm	114	99	90	78
Off-Campus College-Owned Housing	42	83	104	124
Off-Campus Other Housing	16	19	17	21

Table 6.3.16. Student Housing Status by Class Standing

Find the expected counts for each entry in this table.

Solution

In order to compute the expected counts, we follow the process outlined below.

First compute the marginal distributions (row and column totals).
Next, use the formula for expected counts above for each cell, taking the product of its row and column totals divided by the grand total.
Finally, check our work by verifying that the row and column totals of the expected counts match the row and column totals of the observed counts, ignoring any rounding error.

While we won't go through the details of the computation for every cell, the expected number of Juniors (column 3) in College Owned housing (row 2) is found as follows:

\begin{equation*} E_{32} = \frac{211 \times 353}{807} \approx 92.30\text{.} \end{equation*}

The rest of the table is filled in as shown below.

	Freshman		Sophomore		Junior		Senior		Total
Dorm	114		99		90		78		381
		81.20		94.90		99.62		105.28
College Owned	42		83		104		124		353
		75.24		87.92		92.30		97.55
Other	16		19		17		21		73
		15.56		18.18		19.09		20.17
Total	172		201		211		223		807

Table 6.3.17. Observed and Expected Counts
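The three-step process used in this example can also be carried out in code. The following Python sketch is one possible way to do it, using the observed counts from Table 6.3.16. The variable names are our own.

```python
# Observed counts from Table 6.3.16
# (rows: Dorm, College Owned, Other; columns: Freshman, Sophomore, Junior, Senior).
observed = [
    [114, 99, 90, 78],
    [42, 83, 104, 124],
    [16, 19, 17, 21],
]

# Step 1: marginal totals.
row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 2: expected count for each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

# Step 3: the expected counts should reproduce the observed marginal totals
# (up to floating-point rounding).
expected_row_totals = [sum(row) for row in expected]
expected_column_totals = [sum(col) for col in zip(*expected)]

for row in expected:
    print([round(value, 2) for value in row])
```

The printed values should agree with the expected counts in Table 6.3.17 to two decimal places.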

Pay special attention to the third step in the process outlined in the example above. In any row or column, the expected counts should sum to the same total as the observed counts. In fact, this can be used to reduce the number of computations you have to perform. If you find the expected counts in all but one of the cells in a row or column, you can subtract those expected counts from the row or column total, and what is left must be the expected count in the final cell.
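As a quick check of this shortcut, the Python sketch below finds the expected number of Seniors in the dorm in two ways, using the totals from Table 6.3.17. The variable names are illustrative.

```python
# Dorm row of Table 6.3.17: row total 381; column totals 172, 201, 211, 223; grand total 807.
row_total = 381
column_totals = [172, 201, 211, 223]
grand_total = 807

# Expected counts for the first three cells in the row, from the formula.
first_three = [row_total * c / grand_total for c in column_totals[:3]]

# Expected count for the last cell, found by subtracting from the row total.
last_by_subtraction = row_total - sum(first_three)

# Expected count for the last cell, found directly from the formula.
last_by_formula = row_total * column_totals[3] / grand_total

print(round(last_by_subtraction, 2), round(last_by_formula, 2))
```

Both approaches give about 105.28, the value listed for Seniors in the dorm.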

Figure 6.3.18. Computing Expected Counts I

Figure 6.3.19. Computing Expected Counts II

Checkpoint 6.3.20.

In a test of independence, the following observed counts are recorded.

	Characteristic 1	Characteristic 2
Characteristic A	24	49
Characteristic B	46	85
Characteristic C	31	62

Table 6.3.21. Observed Counts

Question: What is the expected count for the entry in characteristic B, characteristic 2?

Answer

86.45

Checkpoint 6.3.22.

In a test of homogeneity, the following observed counts are recorded.

	Population 1	Population 2
Characteristic A	24	49
Characteristic B	46	85
Characteristic C	31	62

Table 6.3.23. Observed Counts

Question: what is the expected count for the entry in characteristic C, population 1?

Answer

31.63

Subsection 6.3.3 The Test Statistic

As in the goodness-of-fit test, a test of homogeneity or independence is based on measuring the differences between what we observed in our sample and what we would expect if the null hypothesis were true. We just saw how to use the observed counts in a contingency table to find the expected counts for each entry in that table. The test statistic is then computed as shown below.

Theorem 6.3.24. Test Statistic for a Test of Independence or Homogeneity.

The \(\chi^2\) test statistic for a test of independence or homogeneity using a contingency table in which \(O_{ij}\) and \(E_{ij}\) represent the observed and expected counts in the cell in column \(i\) and row \(j\) is given by:

\begin{equation*} \chi^2_\text{test} = \sum_j\sum_i \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\text{.} \end{equation*}

Don't let the double summation sign confuse you. All it means is that, for every cell in the contingency table, we square the difference between the observed and expected counts, divide by the expected count, and then add up the results. Below we perform this computation on the table we found in Example 6.3.15.
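To see what a single term of this sum looks like, take the Freshman and Dorm cell of Table 6.3.17 (column 1, row 1), where the observed count is 114 and the expected count is 81.20. Its contribution to the test statistic is

\begin{equation*} \frac{(O_{11} - E_{11})^2}{E_{11}} = \frac{(114 - 81.20)^2}{81.20} = \frac{1075.84}{81.20} \approx 13.25\text{.} \end{equation*}

The test statistic is the total of one such term for each cell in the table.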

Example 6.3.25. Computing the Test Statistic for a Test of Homogeneity.

A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.

	Freshman	Sophomore	Junior	Senior
Dorm	114	99	90	78
Off-Campus College-Owned Housing	42	83	104	124
Off-Campus Other Housing	16	19	17	21

Table 6.3.26. Student Housing Status by Class Standing

Compute the test statistic for this test.

Solution

Recall that the observed and expected counts were as follows.

	Freshman		Sophomore		Junior		Senior		Total
Dorm	114		99		90		78		381
		81.20		94.90		99.62		105.28
College Owned	42		83		104		124		353
		75.24		87.92		92.30		97.55
Other	16		19		17		21		73
		15.56		18.18		19.09		20.17
Total	172		201		211		223		807

Table 6.3.27. Observed and Expected Counts

To compute the test statistic, we must step through each entry in the table, adding the appropriate values.

\begin{align*} \chi^2_{\text{test}} \amp= \frac{(114-81.20)^2}{81.20} + \frac{(99-94.90)^2}{94.90} + \frac{(90-99.62)^2}{99.62} + \frac{(78-105.28)^2}{105.28}\\ \amp+ \frac{(42-75.24)^2}{75.24} + \frac{(83-87.92)^2}{87.92} + \frac{(104-92.30)^2}{92.30} + \frac{(124-97.55)^2}{97.55}\\ \amp+ \frac{(16-15.56)^2}{15.56} + \frac{(19-18.18)^2}{18.18} + \frac{(17-19.09)^2}{19.09} + \frac{(21 - 20.17)^2}{20.17}\\ \amp\approx 45.35. \end{align*}
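A sum with this many terms is easy to mistype on a calculator, so it can help to check it with a short program. The Python sketch below is one way to do so. The function name `chi_square_statistic` is our own, and the code recomputes the expected counts from the observed table rather than using the rounded values.

```python
def chi_square_statistic(observed):
    """Sum (O - E)^2 / E over every cell, with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, r in zip(observed, row_totals):
        for o, c in zip(row, column_totals):
            e = r * c / grand_total          # expected count for this cell
            statistic += (o - e) ** 2 / e    # this cell's contribution
    return statistic


# Observed counts from Table 6.3.26.
housing = [
    [114, 99, 90, 78],
    [42, 83, 104, 124],
    [16, 19, 17, 21],
]

print(round(chi_square_statistic(housing), 2))
```

Because the program uses unrounded expected counts, its result may differ slightly in the last decimal place from a hand computation based on the rounded table.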

Figure 6.3.28. Test Statistic I

Figure 6.3.29. Test Statistic II

Checkpoint 6.3.30.

In a test of homogeneity, the following observed counts are measured and the expected counts are shown in the lower right-hand corners of each cell.

	Population 1		Population 2
Characteristic A	114		99
		98.31		114.69
Characteristic B	42		83
		57.69		67.31

Table 6.3.31. Observed and Expected Counts

Question: what is the test statistic for this test? Round your answer to three decimal places.

Answer

12.575

Checkpoint 6.3.32.

In a test of independence, the following counts are observed, and the expected counts are shown in the lower right-hand corners of each cell.

	Characteristic 1		Characteristic 2
Characteristic A	27		37
		28.35		35.65
Characteristic B	43		51
		41.65		52.35

Table 6.3.33. Observed and Expected Counts

Question: what is the test statistic for this test? Round your answer to three decimal places.

Answer

0.194

Subsection 6.3.4 Traditional Tests

The last step in conducting a test of homogeneity or independence is to find the appropriate critical value from the \(\chi^2\)-distribution table. To do that, we need the number of degrees of freedom. Because we are dealing with contingency tables, instead of the categories we saw in the goodness-of-fit test, the degrees of freedom must be computed differently.

Theorem 6.3.34. Degrees of Freedom for a Test of Independence or Homogeneity.

The degrees of freedom for a test of independence or homogeneity using a contingency table with \(c\) columns and \(r\) rows is \(df = (c-1)(r-1)\text{.}\)

Thus, if we take one less than the number of columns times one less than the number of rows, we have the degrees of freedom to use when looking up our critical value. This process is shown in the following examples.
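The degrees-of-freedom rule and the table lookup can be sketched in Python as follows. The function name and the dictionary `CRITICAL_VALUES` are hypothetical. The dictionary is only a small excerpt of a \(\chi^2\)-distribution table, containing right-tail critical values for a few combinations of degrees of freedom and significance level, and is not a substitute for the full table.

```python
def degrees_of_freedom(observed):
    """df = (number of columns - 1) * (number of rows - 1)."""
    rows = len(observed)
    columns = len(observed[0])
    return (columns - 1) * (rows - 1)


# Excerpt of right-tail critical values, keyed by (df, alpha).
CRITICAL_VALUES = {
    (1, 0.05): 3.841,
    (1, 0.01): 6.635,
    (4, 0.05): 9.488,
    (6, 0.01): 16.812,
}

# The student housing table has 3 rows and 4 columns.
housing = [
    [114, 99, 90, 78],
    [42, 83, 104, 124],
    [16, 19, 17, 21],
]

df = degrees_of_freedom(housing)
print(df, CRITICAL_VALUES[(df, 0.01)])  # 6 16.812
```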

Example 6.3.35. Conducting a Test of Homogeneity.

A student-body representative wishes to determine if a student's class standing affects their housing arrangements. To test this, the following data is collected.

	Freshman	Sophomore	Junior	Senior
Dorm	114	99	90	78
Off-Campus College-Owned Housing	42	83	104	124
Off-Campus Other Housing	16	19	17	21

Table 6.3.36. Student Housing Status by Class Standing

Use a \(\chi^2\) test at the \(\alpha = 0.01\) significance level to make this determination.

Solution

Recall that our hypotheses are:

\begin{align*} H_0\amp:\ \text{Housing distributions in the four populations are homogeneous}\\ H_A\amp:\ \text{Housing distributions are not homogeneous across the four populations} \end{align*}

In other words, the alternative hypothesis says that students in different class standings have different distributions of housing types.

Our test statistic from Example 6.3.25 was:

\begin{equation*} \chi^2_{\text{test}} = 45.35\text{.} \end{equation*}

At the 0.01 significance level, with \((4-1)(3-1) = 6\) degrees of freedom, the critical value from the \(\chi^2\)-distribution table is \(\chi^2_{0.01} = 16.812\) as shown below.

Figure 6.3.37. Rejection Region for \(\chi^2\) Test

Since our test statistic is further into the right tail than the critical value of 16.812, we must reject the null hypothesis. There is highly significant evidence that students with different class standing live in different types of housing.
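The critical value marks the point with area \(\alpha\) to its right under the \(\chi^2\) curve. When the degrees of freedom are even, that right-tail area has a simple closed form, which is a standard result not derived in this text. The Python sketch below uses it to double-check the table value for this example. The function name is our own, and the shortcut does not apply to odd degrees of freedom.

```python
import math


def right_tail_area_even_df(x, df):
    """Area to the right of x under the chi-square curve, for even df only.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only applies to positive even df")
    half = x / 2
    partial_sum = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * partial_sum


critical_value = 16.812   # chi-square table, df = 6, alpha = 0.01
test_statistic = 45.35    # from Example 6.3.25

# The area to the right of the critical value should be close to alpha = 0.01.
print(round(right_tail_area_even_df(critical_value, 6), 4))

# The test statistic lies in the rejection region.
print(test_statistic > critical_value)  # True
```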

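Example 6.3.38. Conducting a Test of Independence.

Recall the efficiency expert from Example 6.3.7, who believes that an employee's years of experience and efficiency rating are independent. The data collected is shown again below.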

	\(\lt\) 5 Years	5-15 Years	15+ Years
Low Efficiency	20	15	11
Moderate Efficiency	43	27	27
High Efficiency	65	38	29

Table 6.3.39. Efficiency Rating vs. Years of Experience

Conduct a test of independence at the \(\alpha = 0.05\) significance level.

Solution

You may recall that the hypotheses for this test are:

\begin{align*} H_0\amp:\ \text{Efficiency and years of experience are independent characteristics}\\ H_A\amp:\ \text{These characteristics are dependent} \end{align*}

Next, we compute the expected counts.

	\(\lt\) 5 Years		5-15 Years		15+ Years		Total
Low Efficiency	20		15		11		46
		21.41		13.38		11.20
Moderate Efficiency	43		27		27		97
		45.15		28.22		23.63
High Efficiency	65		38		29		132
		61.44		38.40		32.16
Total	128		80		67		275

Table 6.3.40. Observed and Expected Counts

And then we compute the \(\chi^2\) test statistic using this table.

\begin{align*} \chi^2_{\text{test}} \amp= \frac{(20-21.41)^2}{21.41} + \frac{(15-13.38)^2}{13.38} + \frac{(11-11.20)^2}{11.20}\\ \amp+ \frac{(43-45.15)^2}{45.15} + \frac{(27-28.22)^2}{28.22} + \frac{(27-23.63)^2}{23.63}\\ \amp+ \frac{(65-61.44)^2}{61.44} + \frac{(38-38.40)^2}{38.40} + \frac{(29-32.16)^2}{32.16}\\ \amp\approx 1.449\text{.} \end{align*}

At the \(\alpha = 0.05\) significance level with \(df = (3-1)(3-1) = 4\text{,}\) the critical value from the \(\chi^2\)-distribution is 9.488. The distribution and test statistic are shown below.

Figure 6.3.41. Rejection Region for \(\chi^2\) Test

Since our test statistic is not further into the tail than the critical value of \(\chi^2_{0.05} = 9.488\text{,}\) we must fail to reject the null hypothesis. There is no statistically significant evidence that efficiency and years worked are related. These two variables appear to be independent.
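The whole traditional test can be collected into a single function. The Python sketch below is one possible arrangement, applied to the observed counts in Table 6.3.40. The function name `chi_square_test` is hypothetical, and the critical value is still supplied by hand from the \(\chi^2\)-distribution table.

```python
def chi_square_test(observed, critical_value):
    """Return (test statistic, degrees of freedom, reject H0?) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Sum of (O - E)^2 / E over all cells, with E = row total * column total / grand total.
    statistic = sum(
        (o - r * c / grand_total) ** 2 / (r * c / grand_total)
        for row, r in zip(observed, row_totals)
        for o, c in zip(row, column_totals)
    )
    df = (len(column_totals) - 1) * (len(row_totals) - 1)
    return statistic, df, statistic > critical_value


# Observed counts from Table 6.3.40
# (rows: Low, Moderate, High efficiency; columns: < 5 years, 5-15 years, 15+ years).
efficiency = [
    [20, 15, 11],
    [43, 27, 27],
    [65, 38, 29],
]

statistic, df, reject = chi_square_test(efficiency, critical_value=9.488)
print(round(statistic, 3), df, reject)
```

The statistic should land close to the hand-computed 1.449, with 4 degrees of freedom and no rejection of the null hypothesis. Small differences in the third decimal place can arise because the code does not round the expected counts.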

Figure 6.3.42. Tests of Independence or Homogeneity I

Figure 6.3.43. Tests of Independence or Homogeneity II

Checkpoint 6.3.44.

In a test of homogeneity, the following counts are observed, and the expected counts are shown in the lower right-hand corners of each cell.

	Population 1		Population 2
Characteristic A	51		42
		43.22		49.78
Characteristic B	61		87
		68.78		79.22

Table 6.3.45. Observed and Expected Counts

This results in a test statistic \(\chi^2_\text{test} = 4.260\text{.}\)

Question: what is your conclusion at the \(\alpha = 0.01\) significance level?

Answer

The populations appear to be homogeneous.

Checkpoint 6.3.46.

In a test of independence, the following counts are observed, and the expected counts are shown in the lower right-hand corners of each cell.

	Characteristic 1		Characteristic 2
Characteristic A	51		42
		43.22		49.78
Characteristic B	61		87
		68.78		79.22

Table 6.3.47. Observed and Expected Counts

This results in a test statistic \(\chi^2_{\text{test}} = 4.260\text{.}\)

Question: what is your conclusion at the \(\alpha = 0.05\) significance level?

Answer

There is statistically significant evidence the characteristics are dependent.