Section 1.2 Describing Data Graphically

Once we have collected our data, we need to be able to understand it. Data in its raw form--a long list of values--is not very informative. In this section we will explore ways in which we can organize data in tables and describe the values using graphs. The graph we choose to use will depend on the data we are describing as well as other factors. Some of these factors are:

  • Does our data come from a quantitative variable or a qualitative variable?

  • Is our data univariate or bivariate?

  • Do we want a "quick-and-dirty" graph we can sketch by hand, or will a computer be drawing the graph for us?

  • Does the data represent the value of a variable over time, or the value of a variable from many different subjects?

By answering these questions we will learn to choose and construct an appropriate graph to describe many different data sets. We will also learn to interpret these graphs as a preliminary step in analyzing our data.

Subsection 1.2.1 Qualitative Data

Recall that a qualitative variable produces data that represents categories, and not numerical values. Consider the following example.

To better understand their customers, a pizza parlor decides to collect a list of favorite pizza toppings. They choose an appropriate sampling technique and gather the following data from 96 customers. What does this data tell you?

cheese green peppers olives cheese
olives cheese cheese cheese
green peppers cheese pepperoni olives
pineapple onions cheese sausage
onions anchovies olives cheese
green peppers cheese cheese cheese
pineapple mushrooms green peppers pineapple
cheese green peppers sausage cheese
sausage bacon bacon green peppers
cheese pepperoni pepperoni olives
pepperoni pepperoni green peppers pepperoni
olives olives olives green peppers
pepperoni pepperoni mushrooms onions
bacon sausage sausage olives
sausage green peppers olives bacon
sausage pepperoni pineapple olives
mushrooms bacon onions bacon
olives sausage mushrooms cheese
pineapple bacon green peppers pepperoni
olives pepperoni pepperoni pepperoni
sausage sausage pepperoni pepperoni
cheese bacon onions pepperoni
pineapple olives cheese bacon
green peppers bacon pepperoni olives
Table 1.2.2. Raw Pizza Topping Data
Solution

Without some organization, it is hard to glean anything from this data. We need a way to simplify things!

Subsubsection 1.2.1.1 Frequency Tables

Our first step in constructing a graph for this data is to summarize. With any qualitative variable, we can construct what is called a frequency table to summarize the values in our data set.

Definition 1.2.3.

A frequency table is a two column table used to summarize the values of a qualitative variable. The first column contains the various categories that the variable can take on. The second column contains the counts or frequency with which each category appears.

To see how this works, let's use a frequency table to summarize the pizza topping data set we saw in our first example.

Construct a frequency table for the pizza topping data seen in the example above. Does this table give you a better picture of the data set?

Solution

Steps:

  1. List all of the possible pizza toppings in the first column of our table.

  2. Count the frequency with which each topping appears in the original data.

  3. Check to ensure that the total of our counts adds up to the total number of values in the set of data.

This definitely gives us a clearer picture of the data set. We can now see the number of times each topping was selected as somebody's favorite.

Topping Frequency
Anchovies 1
Bacon 10
Cheese 17
Green Peppers 11
Mushrooms 4
Olives 15
Onions 5
Pepperoni 17
Pineapple 6
Sausage 10
Total: 96
Table 1.2.5. Pizza Toppings Frequency Table
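The three steps above can also be carried out by a computer. The following is a minimal Python sketch, not the method used to build the table above; the variable names `RAW`, `responses`, and `frequency` are our own, and the commas were added to the raw data so that two-word toppings such as "green peppers" stay together.

```python
from collections import Counter

# Raw responses from the pizza topping example, one table row per line.
RAW = """
cheese, green peppers, olives, cheese
olives, cheese, cheese, cheese
green peppers, cheese, pepperoni, olives
pineapple, onions, cheese, sausage
onions, anchovies, olives, cheese
green peppers, cheese, cheese, cheese
pineapple, mushrooms, green peppers, pineapple
cheese, green peppers, sausage, cheese
sausage, bacon, bacon, green peppers
cheese, pepperoni, pepperoni, olives
pepperoni, pepperoni, green peppers, pepperoni
olives, olives, olives, green peppers
pepperoni, pepperoni, mushrooms, onions
bacon, sausage, sausage, olives
sausage, green peppers, olives, bacon
sausage, pepperoni, pineapple, olives
mushrooms, bacon, onions, bacon
olives, sausage, mushrooms, cheese
pineapple, bacon, green peppers, pepperoni
olives, pepperoni, pepperoni, pepperoni
sausage, sausage, pepperoni, pepperoni
cheese, bacon, onions, pepperoni
pineapple, olives, cheese, bacon
green peppers, bacon, pepperoni, olives
"""

# Split every row on commas and strip the surrounding whitespace.
responses = [t.strip() for line in RAW.strip().splitlines() for t in line.split(",")]

# Step 2: count how often each topping appears.
frequency = Counter(responses)

# Step 1: list the categories in the first column (alphabetical order here).
for topping in sorted(frequency):
    print(f"{topping.title():<15}{frequency[topping]:>3}")

# Step 3: the counts must add up to the number of responses collected.
total = sum(frequency.values())
print(f"{'Total:':<15}{total:>3}")
```

Sorting the categories alphabetically is only a convenience; the rows of a frequency table may be listed in any order.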

While this is a great way to summarize data, it can still be hard to tell how popular categories are in relation to one another. In the previous example, there is a difference of two between the number of customers who favor cheese and the number who favor olives. Is that a large difference? To better analyze this, we can change from using counts to using percents.

Definition 1.2.6. Relative Frequency Table.

A relative frequency table is a table in which each category is listed together with the proportion of all values in the data set that fall into that category.

Let's revisit the pizza topping data once again, this time constructing a relative frequency table.

Construct a relative frequency table for the pizza topping data seen in Example 1.2.1.

Solution

Steps:

  1. Start with the frequency table from the previous example.

  2. Divide each category frequency by the grand total (in this case 96).

  3. Check that the relative frequencies sum to 1. In this case the total is 1.0002; the extra 0.0002 is a result of rounding each answer to four decimal places.

With this relative frequency table, we can more easily tell how each category relates to the others.

Topping Relative Frequency
Anchovies \(\frac{1}{96} \approx 0.0104\)
Bacon \(\frac{10}{96} \approx 0.1042\)
Cheese \(\frac{17}{96} \approx 0.1771\)
Green Peppers \(\frac{11}{96} \approx 0.1146\)
Mushrooms \(\frac{4}{96} \approx 0.0417\)
Olives \(\frac{15}{96} \approx 0.1563\)
Onions \(\frac{5}{96} \approx 0.0521\)
Pepperoni \(\frac{17}{96} \approx 0.1771\)
Pineapple \(\frac{6}{96} \approx 0.0625\)
Sausage \(\frac{10}{96} \approx 0.1042\)
Total: \(1.0002\)
Table 1.2.8. Pizza Toppings Relative Frequency Table
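The division and rounding in these steps can be checked with a short Python sketch. We assume the counts from the frequency table and round halves upward, as is usual when rounding by hand; the names `frequency`, `four_places`, and `relative` are our own. Python's `decimal` module is used because the built-in `round` function rounds ties differently.

```python
from decimal import Decimal, ROUND_HALF_UP

# Counts copied from the pizza topping frequency table.
frequency = {
    "Anchovies": 1, "Bacon": 10, "Cheese": 17, "Green Peppers": 11,
    "Mushrooms": 4, "Olives": 15, "Onions": 5, "Pepperoni": 17,
    "Pineapple": 6, "Sausage": 10,
}
total = sum(frequency.values())  # grand total of responses

# Divide each count by the grand total, then round to four decimal places.
# ROUND_HALF_UP mimics hand rounding, so 0.15625 becomes 0.1563.
four_places = Decimal("0.0001")
relative = {
    topping: (Decimal(count) / Decimal(total)).quantize(four_places, rounding=ROUND_HALF_UP)
    for topping, count in frequency.items()
}

for topping, value in relative.items():
    print(f"{topping:<15}{value}")

# The rounded values need not add up to exactly 1.
print(f"{'Total:':<15}{sum(relative.values())}")
```

Working with the unrounded fractions instead would give a total of exactly 1, which shows that the discrepancy comes only from rounding.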

While these tables are a good step in better displaying our data, we still need to translate this into a picture. To see how we do that, continue on to the next section.

Subsubsection 1.2.1.2 Graphs

One of the most basic ways to graphically represent data collected for a qualitative variable is using a bar graph.

Definition 1.2.9. Bar Graph.

A bar graph is a graph for qualitative data in which each possible category is represented by a bar, the height of which corresponds to the frequency of that category in the data set. These bars can be arranged in any order and are separated from each other by empty space.

Note that the first step in creating a bar graph from raw data is to construct a frequency table. Once that is done, a computer spreadsheet program such as OpenOffice or Microsoft Excel is a great way to quickly construct bar graphs. Consider the following example, based on our pizza topping data.

What happens to the bar graph if we use the relative frequency table instead of the frequency table? We do get a new type of graph, but as we shall see, the shape of the graph will not change, just the scale. First, the definition.

Definition 1.2.12.

A relative frequency bar graph is a graph for qualitative data in which each category is represented by a bar, the height of which corresponds to the percent of the data values that are in that category. These bars can be arranged in any order and are separated from each other by empty space.

Below we construct a relative frequency bar graph for the pizza topping example.

Construct a relative frequency bar graph for the pizza topping data seen in Example 1.2.1.

Solution

Based on the relative frequency table, we get the following bar graph.

A bar for each pizza topping with heights corresponding to the relative frequency of the topping in our relative frequency table.
Figure 1.2.14. Pizza Topping Relative Frequency Bar Graph

Notice that the bar graph is scaled in terms of percent instead of the relative frequencies from the table. This is a common quirk with using spreadsheet programs to produce such graphs.
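To see why only the scale changes, here is a rough text-only Python sketch (a stand-in for the spreadsheet graph, not a reproduction of it). It draws one row of `#` characters per topping, first from the counts and then from the percents. The helper `bar_rows` and the bar width of 34 characters are our own choices.

```python
# Counts copied from the pizza topping frequency table.
frequency = {
    "Anchovies": 1, "Bacon": 10, "Cheese": 17, "Green Peppers": 11,
    "Mushrooms": 4, "Olives": 15, "Onions": 5, "Pepperoni": 17,
    "Pineapple": 6, "Sausage": 10,
}
total = sum(frequency.values())


def bar_rows(heights, scale):
    """Return one text row per category; each bar is height * scale characters long."""
    return [f"{name:<14}|{'#' * round(h * scale)}" for name, h in heights.items()]


# Percent of responses in each category (relative frequency times 100).
percent = {name: 100 * count / total for name, count in frequency.items()}

# Pick each scale so that the tallest bar is 34 characters long in both graphs.
count_rows = bar_rows(frequency, 34 / max(frequency.values()))
percent_rows = bar_rows(percent, 34 / max(percent.values()))

print("\n".join(count_rows))

# Only the axis scale differs, so the two sets of bars are identical.
print(count_rows == percent_rows)
```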

Both the bar graph and relative frequency bar graph are particularly useful for comparing the number or percent of values that are in one category with the number or percent in another category. However, it is sometimes better to show how each category frequency or proportion relates to the whole. For this type of situation, a pie graph is more useful.

Definition 1.2.15.

A pie graph is a graph for qualitative data in which a circle is divided into sectors or wedges representing the various categories. The area of these sectors is determined by the relative frequency of the associated category.

With the same spreadsheet program we used to build our bar graphs, we can also construct pie graphs. To see how this works, we will revisit our pizza topping example one last time.

Construct a pie graph for the pizza topping data seen in Example 1.2.1.

Solution

Based on the relative frequency table previously constructed for this data, we get the following pie graph.

A circle divided into ten sectors with sizes proportional to the relative frequency of the pizza toppings they represent.
Figure 1.2.17. Pizza Topping Pie Graph

Again, spreadsheet programs often convert relative frequencies into percents when constructing a pie graph.
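If we wanted to draw the pie graph by hand, we would need the central angle of each sector. Since a sector's area is proportional to its angle, one way to get it is to multiply the relative frequency by 360 degrees. The Python sketch below does this for the pizza topping counts; the name `angles` is our own.

```python
# Counts copied from the pizza topping frequency table.
frequency = {
    "Anchovies": 1, "Bacon": 10, "Cheese": 17, "Green Peppers": 11,
    "Mushrooms": 4, "Olives": 15, "Onions": 5, "Pepperoni": 17,
    "Pineapple": 6, "Sausage": 10,
}
total = sum(frequency.values())

# Each sector's central angle is its relative frequency times 360 degrees.
angles = {name: 360 * count / total for name, count in frequency.items()}

for name, angle in angles.items():
    share = 100 * frequency[name] / total  # percent label, as a spreadsheet would show
    print(f"{name:<14}{share:6.2f}%  {angle:7.2f} degrees")
```

The angles add up to a full circle of 360 degrees, just as the relative frequencies add up to 1.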

There is one more situation involving qualitative variables which we must investigate.

Subsubsection 1.2.1.3 Bivariate Data

The frequency tables and graphs that we've looked at so far work well for univariate data. But they need to be modified before they will work with data involving more than one variable. Multivariate data can be very difficult to summarize and display graphically. Because of this, we usually just look at each variable separately. However when we have two variables (bivariate data) there are some special techniques that we can use to summarize and display the data.

Definition 1.2.18.

A contingency table, also called a two-way table, is a table used to summarize bivariate qualitative data. The columns in this table represent the various categories of one of the two variables and the rows represent the categories of the other variable. In each cell (row-column combination) we record the number of subjects with that row category and column category.

To see how a contingency table can be read, let's look at an example involving college students and summer vacation plans.

Four hundred randomly selected college students are asked to give their class standing (Freshman, Sophomore, Junior, or Senior) and their summer plans (work, play, or study). The following contingency table summarizes the results of this survey.

Determine the number of Freshmen who are not working this summer.

Freshman Sophomore Junior Senior
Work 23 35 30 51
Play 49 62 29 17
Study 28 17 46 13
Table 1.2.20. Class Standing and Summer Plans
Solution

The column labeled Freshman has three rows corresponding to work, play, and study. To get the number of Freshmen who are not working, we add together those who are playing and those who are studying, which gives us \(49+28 = 77\text{.}\)

When we create a contingency table, or if we are just given a contingency table, we sometimes will want to look at each variable separately. In the example above, we may want to know how many of these students intend to work this summer. To answer this question, we need to total up the rows and/or columns. These totals are typically displayed in the margins of the table, leading to the following definition.

Definition 1.2.21.

A marginal distribution of a contingency table is either the set of row totals or the set of column totals of that contingency table.

Let's find the marginal distributions for the student summer plans example above.

Construct the marginal distributions for the contingency table seen in Example 1.2.19. Use these distributions to determine both the number of students who plan to work this summer and the number of Juniors who took part in the study.

Solution

In order to find these marginal distributions, we must find the total in each row and column of the contingency table. Once that is done, we can answer these questions by reading off the appropriate totals.

Freshman Sophomore Junior Senior Total
Work 23 35 30 51 139
Play 49 62 29 17 157
Study 28 17 46 13 104
Total 100 114 105 81 400
Table 1.2.23. Marginal Distributions of Class Standing and Summer Plans

From the marginal distribution for summer activities (the row totals), we find that 139 students plan to work this summer. From the marginal distribution for class standing (the column totals), we see that 105 Juniors participated in the study.

Note that the row totals and the column totals should each add up to the grand total, which is 400 in this case.
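The row and column totals are easy to compute with a few lines of code. Here is a minimal Python sketch using the counts from the contingency table; the names `table`, `row_totals`, and `column_totals` are our own.

```python
# Rows are summer plans, columns are class standings (from the contingency table).
standings = ["Freshman", "Sophomore", "Junior", "Senior"]
table = {
    "Work":  [23, 35, 30, 51],
    "Play":  [49, 62, 29, 17],
    "Study": [28, 17, 46, 13],
}

# Marginal distribution of summer plans: total each row.
row_totals = {plan: sum(counts) for plan, counts in table.items()}

# Marginal distribution of class standing: total each column.
column_totals = {
    standing: sum(counts[i] for counts in table.values())
    for i, standing in enumerate(standings)
}

# Both marginal distributions must add up to the same grand total.
grand_total = sum(row_totals.values())
assert grand_total == sum(column_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```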

A contingency table is a great tool for summarizing bivariate qualitative data. But in order to describe this data with a graph, we will need to make a modification to our bar graph. We will use a special type of bar graph called a stacked bar graph.

Definition 1.2.24.

A stacked bar graph is used to display the data in a contingency table. One of the two variables is chosen as the primary variable, and a bar is created for each category of the primary variable. The height of the bar is equal to the marginal distribution entry for that category. The bar is broken up into “stacked” pieces, the heights of which represent the frequency of the secondary variable categories within that primary variable category.

This definition can be difficult to decipher. Let's look at stacked bar graphs for the summer plans data. Again, these graphs were created using a spreadsheet program.

Create two stacked bar graphs for the contingency table seen in Example 1.2.19: one graph with class standing as the primary variable, and the other with summer plans as the primary variable.

Solution

First we create a graph with class standing as the primary variable.

One bar for each class with height equal to the number in that class and broken into three segments, one for each summer activity, whose heights are equal to the number in that class planning that activity.
Figure 1.2.26. Summer Plans by Class Standing

And then the graph with summer plans as the primary variable.

One bar for each summer activity with height equal to the number planning that activity and broken into four segments, one for each class standing, whose heights are equal to the number in that class planning that activity.
Figure 1.2.27. Class Standing by Summer Plans
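To make the "stacked" pieces concrete, the Python sketch below computes where each piece starts and ends when class standing is the primary variable. It is only an illustration of the definition, not the way the spreadsheet draws the graph. The helper `stacked_segments` is our own, and stacking in the order work, play, study is an arbitrary choice.

```python
# Rows are summer plans, columns are class standings (from the contingency table).
standings = ["Freshman", "Sophomore", "Junior", "Senior"]
table = {
    "Work":  [23, 35, 30, 51],
    "Play":  [49, 62, 29, 17],
    "Study": [28, 17, 46, 13],
}


def stacked_segments(column):
    """Return (plan, bottom, top) for each piece of one class-standing bar."""
    segments, bottom = [], 0
    for plan, counts in table.items():
        top = bottom + counts[column]
        segments.append((plan, bottom, top))
        bottom = top  # the next piece starts where this one ends
    return segments


for i, standing in enumerate(standings):
    print(standing, stacked_segments(i))
```

The top of the last piece in each bar equals the marginal total for that class standing, which is why the full bar height matches the marginal distribution.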

To see more examples of summarizing and displaying graphs for qualitative data, see the following video examples.

Figure 1.2.28. Summarizing and Graphing Qualitative Data I
Figure 1.2.29. Summarizing and Graphing Qualitative Data II
Figure 1.2.30. Summarizing and Graphing Bivariate Qualitative Data

You have seen several definitions and methods for summarizing and displaying qualitative data.

Question: Which of the following are true statements?

  1. A relative frequency table uses percents instead of counts.

  2. We can never construct both a bar chart and a pie chart based on the same data set.

  3. Large sets of raw data are easily understood.

  4. The order in which categories in a bar graph are listed is important.

  5. The marginal distributions in a contingency table must add to the same totals.

Answer

The first and last statements are true. The others are false.

You have seen several definitions and methods for summarizing and displaying qualitative data.

Question: Which of the following statements are false?

  1. The bar chart and relative frequency bar chart for the same data set will have different shapes.

  2. Marginal distributions summarize a single variable.

  3. A pie graph is a good way to show the relationship of each category to the whole.

  4. A frequency table and relative frequency table for a set of data may have different numbers of rows.

  5. A stacked bar chart will not work to graph univariate data.

Answer

The first and fourth statements are false. The others are true.

You have seen several definitions and methods for summarizing and displaying qualitative data.

Question: Which of the following statements are false?

  1. To construct a relative frequency table, divide the frequency for each category by 100.

  2. A stacked bar chart is used to graph the contents of a frequency table.

  3. A contingency table is used to summarize bivariate data.

  4. There should be spaces between the bars in a bar graph.

  5. There is more than one way to graph qualitative data.

Answer

The first and second statements are false. The others are true.

Subsection 1.2.2 Quantitative Data

When we wish to describe quantitative variables, we run into many of the same problems that we had with qualitative variables. The raw data itself is usually not very useful and must be summarized before we can construct graphs.

Subsubsection 1.2.2.1 Summarizing Quantitative Data with Tables

Our first task is to come up with a way to summarize this data in a table, like we did with frequency tables for qualitative variables. To do this, we must create “categories” to use as the rows for our table.

Definition 1.2.34.

A frequency distribution is a table constructed to summarize quantitative data. It has the following components:

  • Class—the range of values taken on by the variable is broken up into sections called classes.

  • Upper Class Limit—the largest number that belongs to a class is called the upper class limit.

  • Class Width—the difference between the upper class limit of a given class and the upper class limit of the previous class (or the minimum value in the data set for the first class) is called the class width. This value should almost always be the same for each class.

Once the classes have been established, we treat them as categories and count the number of values that fall into each class. We then create a table with a column giving the classes and a second column containing their frequencies.

Note that although a frequency distribution will look very similar to a frequency table, there is a difference. Namely, we must break the range of values up into classes in order to construct our table. It is probably easiest to remember that a frequency table is used to summarize data from a qualitative variable, and a frequency distribution is used to summarize data from a quantitative variable.

The height of tomato plants is of particular interest to Daisy. She spends her summer gathering data on the heights of tomato plants in various randomly selected gardens in her home town. Summarize the data she collected below, showing these heights rounded to the nearest 10th of an inch, in a frequency distribution with five classes.

11.1 10.1 17.9 16.9 9.3 9.6 14.1 15.2
9.4 14.2 12.7 8.4 13.0 12.2 16.2 12.0
9.7 11.2 8.9 17.4 16.6 9.8 17.0 13.4
10.5 15.3 13.5 12.7 12.8 17.0 12.9 8.6
8.3 13.4 17.3 8.8 8.2 13.3 17.4 12.1
16.1 9.4 10.7 17.0 9.9 13.1 10.4 14.8
17.7 10.2 16.6 8.1 8.5 16.9 14.1 17.5
9.1 13.0 8.7 16.0 16.6 14.6 14.8 15.1
Table 1.2.36. Tomato Plant Heights
Solution

To construct our frequency distribution, we follow these steps:

  1. Determine the class width as follows:

    1. Find the difference between the highest number, 17.9, and the lowest, 8.1. This difference is 9.8, which we round up to a range of 10.

    2. Find the class width by dividing the range by the number of classes: \(\frac{10}{5} = 2\text{.}\)

  2. Starting with our minimum value (we'll round down to 8), repeatedly add the class width to get the upper class limits: 10, 12, 14, 16, and 18.

  3. List the classes in the first column of our table. Note that a class should not include the previous class's upper class limit; thus the second class is 10.1-12, not 10-12.

  4. Finally, count the number of values that fall into each class. For example, there are 8 values bigger than 10, but less than or equal to 12 (the value 12.0 belongs to this class).

Classes Frequency
8.1-10 17
10.1-12 8
12.1-14 13
14.1-16 10
16.1-18 16
Table 1.2.37. Frequency Distribution of Tomato Plant Heights
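The counting in the last step is tedious by hand, so it is worth checking with a computer. The following Python sketch assumes the class width of 2 and starting value of 8 chosen in the steps above; the names `upper_limits` and `counts` are our own. A value equal to an upper class limit, such as 12.0 or 16.0, is counted in the class that ends at that limit.

```python
from bisect import bisect_left

# Tomato plant heights in inches, from the table of raw data.
heights = [
    11.1, 10.1, 17.9, 16.9, 9.3, 9.6, 14.1, 15.2,
    9.4, 14.2, 12.7, 8.4, 13.0, 12.2, 16.2, 12.0,
    9.7, 11.2, 8.9, 17.4, 16.6, 9.8, 17.0, 13.4,
    10.5, 15.3, 13.5, 12.7, 12.8, 17.0, 12.9, 8.6,
    8.3, 13.4, 17.3, 8.8, 8.2, 13.3, 17.4, 12.1,
    16.1, 9.4, 10.7, 17.0, 9.9, 13.1, 10.4, 14.8,
    17.7, 10.2, 16.6, 8.1, 8.5, 16.9, 14.1, 17.5,
    9.1, 13.0, 8.7, 16.0, 16.6, 14.6, 14.8, 15.1,
]

num_classes = 5
class_width = 2  # range of about 10 divided by 5 classes
start = 8        # minimum value, rounded down

# Upper class limits: 10, 12, 14, 16, 18.
upper_limits = [start + class_width * k for k in range(1, num_classes + 1)]

# Each value belongs to the first class whose upper limit is at least that value.
counts = [0] * num_classes
for h in heights:
    counts[bisect_left(upper_limits, h)] += 1

for k, upper in enumerate(upper_limits):
    # Classes after the first begin one tenth above the previous upper limit.
    lower = min(heights) if k == 0 else upper_limits[k - 1] + 0.1
    print(f"{lower:g}-{upper:<4}{counts[k]:>3}")

# Every value should have been counted exactly once.
assert sum(counts) == len(heights)
```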

Once we have summarized our data in a frequency distribution, we can of course create a relative frequency distribution by dividing each class frequency by the grand total. We can also create a representative graph, which we will learn about in the next section.

Subsubsection 1.2.2.2 Histograms

The graph that we use to describe data summarized by a frequency distribution is very similar to the bar graphs we used to summarize data from a frequency table. There are, however, some important distinctions.

Definition 1.2.38.

A histogram is a graph for quantitative data in which each class in a frequency distribution for that data is represented by a bar, the height of which corresponds to the frequency of that class. These bars must be arranged in ascending order of the classes and must touch each other.

Note that in a histogram, the order is important because the classes of a frequency distribution are themselves ordered from smallest values to largest values. Also, note that the bars touch to show that the classes are contiguous--that is, there is no space between the classes. Let's try constructing a histogram for the tomato plant heights seen in the previous example. We use a spreadsheet to accomplish this, in much the same way that we constructed a bar graph.

Construct a histogram for the frequency distribution found in Example 1.2.35.

Solution

Based on the frequency distribution built in the previous example, we construct the following histogram.

One bar for each range of tomato plant heights arranged in order of increasing height with no gaps between the bars and the height of the bar equal to the number of plants in the range the bar represents.
Figure 1.2.40. Tomato Plant Heights

Notice that the bars in this graph touch each other, and that they are in order from lowest values (between 8.1 and 10) up to highest values (between 16.1 and 18). These are both important characteristics for histograms!

Figure 1.2.41. Frequency Distributions and Histograms I
Figure 1.2.42. Frequency Distributions and Histograms II

You have seen several definitions involving summarizing and displaying quantitative data.

Question: Which of the following are true statements?

  1. In a frequency distribution the class widths should be the same for all.

  2. The lower class limit is the highest number that can appear in a class.

  3. Frequency distributions and frequency tables are exactly the same thing.

  4. The order in which the bars in a histogram are arranged matters.

  5. It is okay to allow classes to overlap; for example, 5-10 and 10-15 are valid classes.

Answer

The first and fourth statements are true. The others are false.

You have seen several definitions involving summarizing and displaying quantitative data.

Question: Which of the following are false statements?

  1. The bars in a histogram should always touch each other.

  2. The upper class limit of the last class in a frequency distribution can be less than the largest number in the data set.

  3. A frequency distribution is used to summarize data from a quantitative variable.

  4. A bar graph may be used to graph data from a frequency distribution.

  5. To find an approximate class width, take the maximum value in the data set, subtract the minimum value, and divide by the desired number of classes.

Answer

The second and fourth statements are false. The others are true.

You have seen several definitions and methods for summarizing and displaying quantitative data.

Question: Which of the following are false statements?

  1. Histograms and frequency distributions go together.

  2. A given set of data will always produce the same frequency distribution, regardless of the class width chosen.

  3. Bar graphs and frequency tables go together.

  4. A frequency table can not be used to summarize a quantitative variable.

  5. The class widths can be different for different classes in a frequency distribution.

Answer

The second and the last statements are false. The others are true.

Subsubsection 1.2.2.3 Stem-and-Leaf Plots

One of the disadvantages of summarizing quantitative data with a frequency distribution and histogram is that you lose the specific values. That is, you can not tell what the individual data values were before the summary was created.

Using the frequency distribution created in Example 1.2.35, determine how many tomato plants are exactly 14.2 inches tall.

Solution

Looking just at the frequency distribution we created for this data, this is not possible. If all we have is the frequency distribution, we can only tell that there are 10 plants between 14.1 and 16 inches tall. Some of those may be 14.2 inches tall, but there is no way of knowing without going back to the raw data.

If we wish to create a pictorial representation of a set of quantitative data that also shows the raw data, we need a new type of graph.

Definition 1.2.47.

In a stem-and-leaf plot, data is represented by separating each value into two parts: the stem (all but the right-most digit) and the leaf (the right-most digit). A table is then constructed with a row for each stem, with that stem in the first column and a list of all leaves that belong to that stem in the second column, given in ascending order with no spaces or commas between them.

Stem-and-leaf plots are usually created by hand. Because of this, they are most useful for relatively small data sets for which we wish to get a quick picture of the shape of the data. To get an accurate picture, it is important that each “leaf” take up the same amount of space. So if you are typing a stem-and-leaf plot into a computer, use a monospaced font such as Courier so that the ones (1) don't take up less space than the eights (8). Consider the following example.

The following is a list of number of quarter credits taken by 30 different college students. Construct a stem-and-leaf plot for this data.

12 17 14 13 16 16 18 8 21 16
14 13 16 17 16 16 9 3 18 15
15 12 11 15 21 16 15 6 15 19
Table 1.2.49. Number of Credits Taken
Solution

We construct our stem-and-leaf plot as follows:

  1. Divide each value into the stem (10's digit) and leaf (1's digit) pieces. Note that this yields stems 0, 1, and 2.

  2. List the stems in order in the first column of our table.

  3. For each stem, list the associated leaves in ascending order.

This produces the following plot.

A table with one row for each stem (first digit)--0, 1, and 2--and in those rows a list of the leaves (second digit) in ascending order.
Figure 1.2.50. Stem-and-Leaf Plot
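Although stem-and-leaf plots are usually drawn by hand, the three steps above translate directly into code. Here is a minimal Python sketch for the credits data; the names `credit_loads`, `leaves_by_stem`, and `rows` are our own, and the vertical bar separating stem from leaves is just one common way to lay out the table.

```python
from collections import defaultdict

# Number of quarter credits taken by 30 students.
credit_loads = [
    12, 17, 14, 13, 16, 16, 18, 8, 21, 16,
    14, 13, 16, 17, 16, 16, 9, 3, 18, 15,
    15, 12, 11, 15, 21, 16, 15, 6, 15, 19,
]

# Step 1: split each value into a stem (tens digit) and a leaf (ones digit).
# Sorting first means the leaves arrive in ascending order (step 3).
leaves_by_stem = defaultdict(list)
for value in sorted(credit_loads):
    stem, leaf = divmod(value, 10)
    leaves_by_stem[stem].append(str(leaf))

# Step 2: list every stem in order, even one that has no leaves.
rows = [
    f"{stem} | {''.join(leaves_by_stem[stem])}"
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1)
]
print("\n".join(rows))
```

When printed in a monospaced font, the length of each row shows how many values share that stem.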

Note that the “shape” we see here is like a histogram tilted on its side. However, in this particular instance the middle stem seems to have a disproportionate number of leaves. To remedy this, we can split our stems up into smaller pieces.

Definition 1.2.51.

An expanded stem-and-leaf plot includes more than one row for each stem. The leaves are divided equally between the rows representing the associated stems.

Construct an expanded stem-and-leaf plot for the credits data from Example 1.2.48 using two rows per stem.

Solution

This is done in almost exactly the same way as our last example. The only difference is that each stem gets two rows. The leaves are then split up with 0-4 going in the first row and 5-9 in the second row for a given stem. The result is shown below.

A table with two rows for each stem (first digit)--0, 1, and 2--and in the first of each of those rows a list of the leaves (second digit) for that stem from 0 to 4 in ascending order and in the second of each of those rows a list of the leaves for that stem from 5 to 9 in ascending order.
Figure 1.2.53. Expanded Stem-and-Leaf Plot

This gives us a clearer picture of how the credit hour values are distributed!
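The same splitting rule can be written as a short Python sketch. As before, the variable names are our own, and the layout with a vertical bar is an assumption about how the table is displayed.

```python
# Number of quarter credits taken by 30 students.
credit_loads = [
    12, 17, 14, 13, 16, 16, 18, 8, 21, 16,
    14, 13, 16, 17, 16, 16, 9, 3, 18, 15,
    15, 12, 11, 15, 21, 16, 15, 6, 15, 19,
]

rows = []
for stem in range(min(credit_loads) // 10, max(credit_loads) // 10 + 1):
    # Leaves (ones digits) for this stem, in ascending order.
    leaves = sorted(v % 10 for v in credit_loads if v // 10 == stem)
    low = "".join(str(d) for d in leaves if d <= 4)   # first row: leaves 0-4
    high = "".join(str(d) for d in leaves if d >= 5)  # second row: leaves 5-9
    rows.append(f"{stem} | {low}")
    rows.append(f"{stem} | {high}")

print("\n".join(rows))
```

A row with no leaves is still printed, so every stem always occupies exactly two rows.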

Figure 1.2.54. Stem-and-Leaf Plots I
Figure 1.2.55. Stem-and-Leaf Plots II

A biologist researching seagulls counts the number of seagulls that enter a nesting ground during several 15 minute periods. She records the following results.

24 50 46 34 27
39 26 37 43 37
40 37 51 26 36
51 43 34 32 48
Table 1.2.57. Number of Seagulls

Construct a stem-and-leaf plot for this data.

Solution
First row: stem 2, leaves 4, 6, 6, and 7. Second row: stem 3, leaves 2, 4, 4, 6, 7, 7, 7, and 9. Third row: stem 4, leaves 0, 3, 3, 6, and 8. Last row: stem 5, leaves 0, 1, and 1.
Figure 1.2.58. Stem-and-Leaf Plot

A biologist researching seagulls counts the number of seagulls that enter a nesting ground during several 15 minute periods. She records the following results.

24 50 46 34 27
39 26 37 43 37
40 37 51 26 36
51 43 34 32 48
Table 1.2.60. Number of Seagulls

Construct an expanded stem-and-leaf plot for this data.

Solution
Row one: stem 2 and leaf 4. Row two: stem 2 and leaves 6, 6, and 7. Row three: stem 3 and leaves 2, 4, and 4. Row four: stem 3 and leaves 6, 7, 7, 7, and 9. Row five: stem 4 and leaves 0, 3, and 3. Row six: stem 4 and leaves 6 and 8. Row seven: stem 5 and leaves 0, 1, and 1. Row eight: stem 5 and no leaves.
Figure 1.2.61. Expanded Stem-and-Leaf Plot

Subsubsection 1.2.2.4 Other Graphs

There are several other graphs that can be useful to help display certain types of data. One of the most useful “other” types of graphs is meant to display the value of a single quantitative variable measured over time.

Definition 1.2.62.

A time series graph, also called a line chart, lists times along the horizontal axis, and values along the vertical axis. Points are placed at the appropriate height for the value of the variable at each time. These points are then connected with line segments.

To see how a time series can be used, consider the next example.

The value of a certain stock is recorded every January for five years. The following data is recorded. Construct a time series graph for this data.

Year 2005 2006 2007 2008 2009
Value $42 $47.5 $51 $46.3 $35.2
Table 1.2.64. Stock Price Over Time
Solution

Using a spreadsheet program, we can produce the following graph.

A marker for each year is placed at the height corresponding to the stock value during that year. Then the markers are connected by line segments to show how the value changed from year to year.
Figure 1.2.65. Time Series of Stock Price

We need another type of graph to represent bivariate quantitative variables. This sort of data comes in \((x,y)\) pairs where \(x\) represents the value of the first variable and \(y\) the value of the second. In order to describe the relationship between the two variables, we can plot these pairs on a graph, creating what's called a scatter plot.

Definition 1.2.66.

In a scatter plot we associate one variable from a set of bivariate data with the \(x\)-axis and the other with the \(y\)-axis. We then plot each pair of values as an \((x,y)\) point on this coordinate system.

To see how this works, let's look at an example.

The tail length and weight of a certain breed of mice are measured. Construct a scatter plot and comment on the relationship, if any, that this graph shows between the two variables.

Mouse Number Tail Length (cm) Weight (oz)
Mouse #1 7 3.3
Mouse #2 8.2 4.6
Mouse #3 12.1 8.6
Mouse #4 9.3 6.1
Mouse #5 11.5 7.3
Mouse #6 7.8 2.6
Mouse #7 10.4 7.0
Table 1.2.68. Bivariate Mice Data
Solution

Using a spreadsheet, we construct the following scatter plot.

A point is placed for each mouse with x-coordinate corresponding to the mouse's tail length and y-coordinate to its weight. This shows a fairly strong linear trend with smaller tail lengths seen in lighter mice and longer tails in heavier mice.
Figure 1.2.69. A Scatter Plot of Mouse Tail Length vs. Weight

There definitely appears to be a relationship between tail length and weight. Larger tail lengths seem to correspond to larger weights. While this is not always the case (mouse #6 is an exception) it does appear to be a fairly strong relationship.
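One rough way to check this observation without looking at the graph is to line the mice up from shortest tail to longest and see whether the weights also increase. The Python sketch below does this; it is only an informal ordering check of our own, not a standard measure of how strong a relationship is.

```python
# (mouse, tail length in cm, weight in oz) from the bivariate mice data.
mice = [
    ("Mouse #1", 7.0, 3.3),
    ("Mouse #2", 8.2, 4.6),
    ("Mouse #3", 12.1, 8.6),
    ("Mouse #4", 9.3, 6.1),
    ("Mouse #5", 11.5, 7.3),
    ("Mouse #6", 7.8, 2.6),
    ("Mouse #7", 10.4, 7.0),
]

# Order the mice from shortest tail to longest.
by_tail = sorted(mice, key=lambda m: m[1])

# Flag any mouse that is lighter than the mouse with the next-shorter tail.
exceptions = [
    current[0]
    for previous, current in zip(by_tail, by_tail[1:])
    if current[2] < previous[2]
]
print(exceptions)
```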

Figure 1.2.70. Other Graphs I
Figure 1.2.71. Other Graphs II

You have been asked to come up with a quick graph for a set of 15 values from a quantitative variable ranging between 10 and 40.

Question: What sort of graph would be most appropriate in this situation?

Answer

A stem-and-leaf plot

Data has been collected from 100 ten-year-old children. For each child the parents' combined income, and the child's reading level are recorded. You wish to construct a graph to determine if there is a relationship between these two variables.

Question: What type of graph is most appropriate for this data?

Answer

A scatter plot

You are asked to determine if the interest rate for a certificate of deposit has changed in the last two years. You are given a set of data consisting of interest rates for the certificate of deposit every month for the last two years.

Question: What type of graph is most appropriate for this data?

Answer

A time series graph

Subsubsection 1.2.2.5 Interpreting Graphs

The reason that we create graphs in the first place is to help us better understand a set of data. It is therefore important that we know how to interpret the graphs that we create. In this class, we will spend most of our time working with univariate quantitative variables. We thus need to be sure that we know how to interpret histograms. There are three major things to look for in a histogram.

Definition 1.2.75.

The center of a histogram is the class that most accurately describes the typical value in the data set being described.

Definition 1.2.76.

One way to describe the shape of a histogram is based on the number and location of its modes. A mode is a class that contains significantly more data than the classes immediately next to it. A histogram can be unimodal (one mode), bimodal (two modes), multimodal (more than two modes), or it may have no prominent modes in which case it is called uniform.
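To make the idea of a mode concrete, here is a minimal sketch that counts modes in a list of bar heights. The definition says "significantly more" without giving a number, so the threshold min_gap below is purely our assumption, as are the function names and the sample heights (they are invented, not read from any figure in this section). An end bar is compared only with its one neighbor.

```python
def count_modes(heights, min_gap=2):
    """Count bars that exceed every adjacent bar by at least min_gap.

    min_gap is an assumed stand-in for "significantly more data".
    """
    modes = 0
    for i, h in enumerate(heights):
        neighbors = []
        if i > 0:
            neighbors.append(heights[i - 1])
        if i < len(heights) - 1:
            neighbors.append(heights[i + 1])
        if neighbors and all(h - n >= min_gap for n in neighbors):
            modes += 1
    return modes

def describe(heights, min_gap=2):
    """Translate the mode count into the vocabulary of the definition."""
    names = {0: "uniform", 1: "unimodal", 2: "bimodal"}
    return names.get(count_modes(heights, min_gap), "multimodal")

# Hypothetical bar heights, chosen only to exercise each case.
print(describe([2, 5, 9, 5, 2]))    # unimodal
print(describe([3, 6, 9, 4, 9]))    # bimodal
print(describe([6, 7, 6, 5, 6]))    # uniform
print(describe([9, 2, 10, 1, 8]))   # multimodal
```

A different choice of min_gap can change the answer, which mirrors the judgment call we make when deciding by eye whether a bar is "significantly" taller than its neighbors.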

Definition 1.2.77.

Another way to describe the shape of a histogram is based on its symmetry. A histogram may be symmetric (meaning it has the same shape on the left and right of its center), skewed right (meaning that the right tail is longer than the left) or skewed left (meaning that the left tail is longer than the right).

We can tell a lot about the distribution of values of a single quantitative variable by describing the histogram in these terms. Consider the following example.

Locate the center of each histogram and then classify it by number of modes and symmetry.

Histogram bars increase from left to right until the next-to-last bar, which is the highest. The last bar is somewhat lower.
(a)
Histogram bars increase from left to right for the first three bars.  The fourth bar is significantly lower, and then the last bar is about the same height as the third.
(b)
Histogram bars increase from left to right until the third bar, which is the highest.  Then they decrease at about the same rate at which they increased.
(c)
Histogram bars are at about the same height all the way across, with the second bar a little higher and the fourth a little lower than the rest.
(d)
The leftmost histogram bar is short, the second is quite tall, the third is short again (but a little taller than the first), the fourth is tall again (and a little taller than the second) and the last one is between the second and third in height.
(e)
The first bar is quite tall, the second is very short, the third is tall again (a little taller than the first) and the fourth is short (a little shorter than the second).  The last bar is almost as tall as the first.
(f)
Figure 1.2.79. Various Histograms to Classify
Solution
  1. Histogram (a) has a single mode in the 15.1-20 class. The histogram is skewed to the left (the left tail is longer) and the center appears to be somewhere between the 10.1-15 class and the 15.1-20 class.

  2. Histogram (b) has two modes (bimodal). It is also skewed to the left, and the center again appears to be between the 10.1-15 and 15.1-20 classes.

  3. Histogram (c) is unimodal and symmetric. The center is in the 10.1-15 class.

  4. Histogram (d) does not appear to have any modes: while some of the bars are taller than their neighbors, the difference is very small. We will call this histogram uniform and symmetric with the center right in the middle class, 10.1-15.

  5. Histogram (e) is bimodal and appears to be slightly skewed to the left. The center appears to be in the middle class, 10.1-15.

  6. Histogram (f) is multimodal (three modes) and symmetric. The center is in the middle class, 10.1-15.

Graph (c) above is an example of a very common histogram shape. This shape is so common, in fact, that it has a distinctive name.

Definition 1.2.80.

A histogram is called mound-shaped if it is unimodal and symmetric, with the mode in the exact center of the histogram.

Figure 1.2.81. Histogram Shapes I
Figure 1.2.82. Histogram Shapes II

Consider the histogram shown below.

Figure 1.2.84.

Question: Which of the following terms correctly describes this histogram?

  • Unimodal

  • Bimodal

  • Uniform

  • Symmetric

  • Skewed Right

  • Skewed Left

Answer

The histogram is bimodal and skewed right.

Consider the histogram shown below.

Figure 1.2.86.

Question: Which of the following terms correctly describes this histogram?

  • Bimodal

  • Skewed Left

  • Skewed Right

  • Mound-Shaped

  • Multimodal

  • Uniform

Answer

This histogram is mound-shaped.

Consider the histogram shown below.

Figure 1.2.88.

Question: Which of the following terms correctly describes this histogram?

  • Mound-Shaped

  • Skewed Left

  • Unimodal

  • Bimodal

  • Uniform

  • Skewed Right

Answer

The histogram is uniform.

Subsection 1.2.3 Things to Avoid

There are a few common errors in creating and using graphs that should be mentioned. The first such error is commonly caused either by a “helpful” computer program with poorly chosen default settings, or by a person who is trying to manipulate the graph to “lie” with statistics.

The following frequency table gives the number of individuals in a recent study who said that they had full health insurance coverage, some health insurance coverage, and no health insurance coverage. Next to those numbers are two bar charts representing this data. Determine which bar chart is misleading, and explain why it is misleading.

Status Frequency
No Insurance 102
Some Insurance 95
Full Insurance 82
Table 1.2.90. Frequency Table
(a)
(b)
Figure 1.2.91. Bar Graphs with Different Scales
Solution

Both bar charts were constructed using the same data from the same frequency table. But the first graph is accurate while the second is not. Notice that the vertical scale in Figure 1.2.91.(a) starts at 0 while the vertical scale in Figure 1.2.91.(b) begins at 80. This exaggerates the difference between the number of individuals with full coverage and those with some or no coverage.
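A little arithmetic shows how large the exaggeration is. The sketch below uses the counts from Table 1.2.90 and compares the drawn height of the "Full Insurance" bar with that of the "No Insurance" bar under each vertical scale. The function name drawn_heights is our own, and we assume each bar is drawn from the bottom of the axis up to its frequency.

```python
# Frequencies from Table 1.2.90.
counts = {"No Insurance": 102, "Some Insurance": 95, "Full Insurance": 82}

def drawn_heights(counts, baseline):
    """Height of each bar as drawn when the vertical axis starts at baseline."""
    return {label: value - baseline for label, value in counts.items()}

for baseline in (0, 80):
    heights = drawn_heights(counts, baseline)
    ratio = heights["Full Insurance"] / heights["No Insurance"]
    print(f"axis starts at {baseline}: full/no height ratio = {ratio:.2f}")

# axis starts at 0: full/no height ratio = 0.80
# axis starts at 80: full/no height ratio = 0.09
```

With the axis starting at 0, the "Full Insurance" bar is about 80% as tall as the "No Insurance" bar, which matches the data (82 out of 102). With the axis starting at 80, it is drawn only about 9% as tall.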

This is one example of a problem caused by a violation of an important principle in graphing.

Definition 1.2.92.

The area principle states that the area of any bar or sector in a graph representing a category in a frequency table or class in a frequency distribution must be in the same proportion to the rest of the graph as the frequency of the category or class is to the rest of the data in the table.

By changing the starting point of the frequency axis in the second bar chart above, we have violated the area principle. This makes the “full insurance” bar look much shorter in relation to the other two bars than the numbers indicate. Another common error in creating graphs is to violate the area principle by creating three-dimensional graphs. Consider the following.

A store manager wishes to know the general age group into which most of his customers fall. He does a quick survey and comes up with the following frequency distribution. He then creates a pie graph to show the proportion of customers in each age group. Do you think his pie graph accurately represents the data from the frequency distribution?

Ages Frequency
0-20 14
21-40 16
41-60 10
61+ 7
Table 1.2.94. Customers
Figure 1.2.95. 3D Pie Graph
Solution

This is not an accurate representation of the store manager's frequency distribution. Notice that the 0-20 year old age group looks like it may be the largest group based on the amount of gray on the graph. It seems to be three times as large as the 61+ group in the back of the graph. However, the 21-40 year old group is in fact the largest, and the 0-20 year old group (14 customers) is only twice as large as the 61+ group (7 customers). By adding the 3D effects, the store manager has violated the area principle.
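We can check what an accurate flat pie graph should show by computing each group's share of the total from Table 1.2.94. This is only a sketch of the arithmetic; the variable names are our own.

```python
# Frequencies from Table 1.2.94.
ages = {"0-20": 14, "21-40": 16, "41-60": 10, "61+": 7}

total = sum(ages.values())  # 47 customers in all
shares = {group: count / total for group, count in ages.items()}

for group, share in shares.items():
    # Under the area principle, each sector's angle is its share of 360 degrees.
    print(f"{group:>5}: {share:6.1%} of customers, {360 * share:5.1f} degrees")

largest = max(ages, key=ages.get)
print("largest group:", largest)
print("0-20 count divided by 61+ count:", ages["0-20"] / ages["61+"])
```

The 21-40 group has the largest share, and the 0-20 sector should be exactly twice the size of the 61+ sector, not three times as large.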

The most important thing to remember from this section is that we must be sure not to sacrifice accuracy to create “pretty” looking graphs.

Figure 1.2.96. Things to Avoid I
Figure 1.2.97. Things to Avoid II

It is always best to use three-dimensional effects to make your graphs more appealing.

Question: Is this statement true or false?

Answer

False

It is best to change the scale of the \(y\)-axis in a histogram or bar graph to start at a number close to the minimum value in the frequency distribution or frequency table, but bigger than zero.

Question: Is this statement true or false?

Answer

False

Violating the area principle may create misleading graphs.

Question: Is this statement true or false?

Answer

True