Section 1.2 Describing Data Graphically
¶Once we have collected our data, we need to be able to understand it. Data in its raw form--a long list of values--is not very informative. In this section we will explore ways in which we can organize data in tables and describe the values using graphs. The graph we choose to use will depend on the data we are describing as well as other factors. Some of these factors are:
Does our data come from a quantitative variable or a qualitative variable?
Is our data univariate or bivariate?
Do we want a "quick-and-dirty" graph we can sketch by hand, or will a computer be drawing the graph for us?
Does the data represent the value of a variable over time, or the value of a variable from many different subjects?
By answering these questions we will learn to choose and construct an appropriate graph to describe many different data sets. We will also learn to interpret these graphs as a preliminary step in analyzing our data.
Objectives
Upon completion of this lesson, you should be able to
-
describe the following terms:
area principle
bar graph
bimodal
center
class
class width
contingency table
expanded stem-and-leaf plot
frequency distribution
frequency table
histogram
marginal distribution
mound-shaped
multimodal
pie graph
relative frequency bar graph
relative frequency table
scatter plot
skewed left
skewed right
stacked bar graph
stem-and-leaf plot
symmetric
time series
uniform
unimodal
upper class limit
-
accomplish the following tasks:
Organize data into frequency tables or frequency distributions
Create bar charts and histograms based on tables
Construct pie charts and charts for bivariate data
Construct stem-and-leaf plots
Select appropriate graphs to summarize given data
Interpret graphs by describing the center and shape
Subsection 1.2.1 Qualitative Data
¶Recall that a qualitative variable produces data that represents categories, and not numerical values. Consider the following example.
Example 1.2.1. Organizing Data.
To better understand their customers, a pizza parlor decides to collect a list of favorite pizza toppings. They choose an appropriate sampling technique and gather the following data from 96 customers. What does this data tell you?
cheese | green peppers | olives | cheese |
olives | cheese | cheese | cheese |
green peppers | cheese | pepperoni | olives |
pineapple | onions | cheese | sausage |
onions | anchovies | olives | cheese |
green peppers | cheese | cheese | cheese |
pineapple | mushrooms | green peppers | pineapple |
cheese | green peppers | sausage | cheese |
sausage | bacon | bacon | green peppers |
cheese | pepperoni | pepperoni | olives |
pepperoni | pepperoni | green peppers | pepperoni |
olives | olives | olives | green peppers |
pepperoni | pepperoni | mushrooms | onions |
bacon | sausage | sausage | olives |
sausage | green peppers | olives | bacon |
sausage | pepperoni | pineapple | olives |
mushrooms | bacon | onions | bacon |
olives | sausage | mushrooms | cheese |
pineapple | bacon | green peppers | pepperoni |
olives | pepperoni | pepperoni | pepperoni |
sausage | sausage | pepperoni | pepperoni |
cheese | bacon | onions | pepperoni |
pineapple | olives | cheese | bacon |
green peppers | bacon | pepperoni | olives |
Without some organization, it is hard to glean anything from this data. We need a way to simplify things!
Subsubsection 1.2.1.1 Frequency Tables
¶Our first step in constructing a graph for this data is to summarize. With any qualitative variable, we can construct what is called a frequency table to summarize the values in our data set.
Definition 1.2.3.
A frequency table is a two-column table used to summarize the values of a qualitative variable. The first column contains the various categories that the variable can take on. The second column contains the count, or frequency, with which each category appears.
To see how this works, let's use a frequency table to summarize the pizza topping data set we saw in our first example.
Example 1.2.4. A Frequency Table.
Construct a frequency table for the pizza topping data seen in the example above. Does this table give you a better picture of the data set?
Steps:
List all of the possible pizza toppings in the first column of our table.
Count the frequency with which each topping appears in the original data.
Check to ensure that the total of our counts adds up to the total number of values in the set of data.
This definitely gives us a clearer picture of the data set. We can now see the number of times each topping was selected as somebody's favorite.
Topping | Frequency |
Anchovies | 1 |
Bacon | 10 |
Cheese | 17 |
Green Peppers | 11 |
Mushrooms | 4 |
Olives | 15 |
Onions | 5 |
Pepperoni | 17 |
Pineapple | 6 |
Sausage | 10 |
Total: | 96 |
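The three steps above can also be carried out by a computer. The following is a minimal Python sketch, not part of the original example. It assumes the responses are stored in a list that we have named `responses`, and to keep it short it uses only the first eight of the 96 responses.

```python
from collections import Counter

# First eight of the 96 responses from Example 1.2.1 (an excerpt, to keep the sketch short).
responses = [
    "cheese", "green peppers", "olives", "cheese",
    "olives", "cheese", "cheese", "cheese",
]

# Steps 1 and 2: list each topping that appears and count how often it appears.
frequency_table = Counter(responses)

# Step 3: the counts must add up to the number of values in the data.
assert sum(frequency_table.values()) == len(responses)

for topping, count in sorted(frequency_table.items()):
    print(f"{topping:<15}{count:>3}")
print(f"{'Total:':<15}{sum(frequency_table.values()):>3}")
```

Running the same code on all 96 responses should reproduce the counts in the table above.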
While this is a great way to summarize data, it can still be hard to tell how popular categories are in relation to one another. In the previous example, there is a difference of two between the number of customers who favor cheese and the number who favor olives. Is that a large difference? To better analyze this, we can change from using counts to using proportions, which are often reported as percents.
Definition 1.2.6. Relative Frequency Table.
A relative frequency table is a table in which each category is listed together with its relative frequency: the proportion of all values in the data set that fall into that category.
Let's revisit the pizza topping data once again, this time constructing a relative frequency table.
Example 1.2.7. A Relative Frequency Table.
Construct a relative frequency table for the pizza topping data seen in Example 1.2.1.
Steps:
Start with the frequency table from the previous example.
Divide each category frequency by the grand total (in this case 96).
Check to ensure that the relative frequencies sum to 1. Note that in this case the total is 1.0002; the extra 0.0002 is a result of rounding each answer to four decimal places.
With this relative frequency table, we can more easily tell how each category relates to the others.
Topping | Relative Frequency |
Anchovies | \(\frac{1}{96} \approx 0.0104\) |
Bacon | \(\frac{10}{96} \approx 0.1042\) |
Cheese | \(\frac{17}{96} \approx 0.1771\) |
Green Peppers | \(\frac{11}{96} \approx 0.1146\) |
Mushrooms | \(\frac{4}{96} \approx 0.0417\) |
Olives | \(\frac{15}{96} \approx 0.1563\) |
Onions | \(\frac{5}{96} \approx 0.0521\) |
Pepperoni | \(\frac{17}{96} \approx 0.1771\) |
Pineapple | \(\frac{6}{96} \approx 0.0625\) |
Sausage | \(\frac{10}{96} \approx 0.1042\) |
Total: | \(1.0002\) |
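To see where the extra 0.0002 comes from, we can redo the division with exact fractions and compare. The sketch below is our own illustration, written in Python. The helper `round_half_up` is a hypothetical name, and it rounds halves upward, which we assume matches the hand rounding used in the table.

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction

# Frequencies copied from the frequency table in Example 1.2.4.
frequencies = {
    "Anchovies": 1, "Bacon": 10, "Cheese": 17, "Green Peppers": 11,
    "Mushrooms": 4, "Olives": 15, "Onions": 5, "Pepperoni": 17,
    "Pineapple": 6, "Sausage": 10,
}
total = sum(frequencies.values())  # grand total of 96

# Exact relative frequencies: each count divided by the grand total.
exact = {topping: Fraction(count, total) for topping, count in frequencies.items()}

def round_half_up(fraction, places="0.0001"):
    """Round a Fraction to four decimal places, rounding halves up."""
    value = Decimal(fraction.numerator) / Decimal(fraction.denominator)
    return value.quantize(Decimal(places), rounding=ROUND_HALF_UP)

rounded = {topping: round_half_up(f) for topping, f in exact.items()}

print(sum(exact.values()))    # the exact fractions sum to 1
print(sum(rounded.values()))  # the rounded values sum to 1.0002
```

The exact relative frequencies always sum to 1. Only the rounded values drift away from it.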
While these tables are a good step in better displaying our data, we still need to translate this into a picture. To see how we do that, continue on to the next section.
Subsubsection 1.2.1.2 Graphs
¶One of the most basic ways to graphically represent data collected for a qualitative variable is using a bar graph.
Definition 1.2.9. Bar Graph.
A bar graph is a graph for qualitative data in which each possible category is represented by a bar, the height of which corresponds to the frequency of that category in the data set. These bars can be arranged in any order and are separated from each other by empty space.
Note that the first step in creating a bar chart from raw data is to construct a frequency table. Once that is done, a computer spreadsheet program such as OpenOffice or Microsoft Excel is a great way to quickly construct bar graphs. Consider the following example, based on our pizza topping data.
Example 1.2.10. Constructing a Bar Graph.
Construct a bar graph for the pizza topping data seen in Example 1.2.1.
Based on the frequency table for this data, we get the following bar graph.
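A spreadsheet is not the only option. As a rough sketch (not the graph produced in the original example), the frequency table can also be drawn as a text-only bar graph in Python. The bars run horizontally here, so it is each bar's length, rather than its height, that matches the frequency. The function name `text_bar_graph` is our own.

```python
# Frequencies copied from the frequency table in Example 1.2.4.
frequencies = {
    "Anchovies": 1, "Bacon": 10, "Cheese": 17, "Green Peppers": 11,
    "Mushrooms": 4, "Olives": 15, "Onions": 5, "Pepperoni": 17,
    "Pineapple": 6, "Sausage": 10,
}

def text_bar_graph(table):
    """Return one line per category; the bar length equals the category's frequency."""
    width = max(len(name) for name in table)
    return [f"{name:<{width}} | {'#' * count} {count}" for name, count in table.items()]

lines = text_bar_graph(frequencies)
print("\n".join(lines))
```

Because the categories of a qualitative variable have no natural order, the lines could be printed in any order without changing what the graph says.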
What happens to the bar graph if we use the relative frequency table instead of the frequency table? We do get a new type of graph, but as we shall see, the shape of the graph will not change, just the scale. First, the definition.
Definition 1.2.12.
A relative frequency bar graph is a graph for qualitative data in which each category is represented by a bar, the height of which corresponds to the percent of the data values that are in that category. These bars can be arranged in any order and are separated from each other by empty space.
Below we construct a relative frequency bar graph for the pizza topping example.
Example 1.2.13. Constructing a Relative Frequency Bar Graph.
Construct a relative frequency bar graph for the pizza topping data seen in Example 1.2.1.
Based on the relative frequency table, we get the following bar graph.
Notice that the bar graph is scaled in terms of percent instead of the relative frequencies from the table. This is a common quirk with using spreadsheet programs to produce such graphs.
Both the bar graph and the relative frequency bar graph are particularly useful for comparing the number or percent of values in one category with the number or percent in another category. However, it is sometimes better to show how each category's frequency or proportion relates to the whole. For this type of situation, a pie graph is more useful.
Definition 1.2.15.
A pie graph is a graph for qualitative data in which a circle is divided into sectors or wedges representing the various categories. The area of these sectors is determined by the relative frequency of the associated category.
With the same spreadsheet program we used to build our bar graphs, we can also construct pie graphs. To see how this works, we will revisit our pizza topping example one last time.
Example 1.2.16. Constructing a Pie Graph.
Construct a pie graph for the pizza topping data seen in Example 1.2.1.
Based on the relative frequency table previously constructed for this data, we get the following pie graph.
Again, spreadsheet programs often convert relative frequencies into percents when constructing a pie graph.
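If we were drawing the pie graph by hand, we would need the central angle of each sector. Since a sector's share of the circle equals the category's relative frequency, its angle is the relative frequency times 360 degrees. The following Python sketch is our own illustration of that arithmetic, using the counts from the frequency table.

```python
from fractions import Fraction

# Frequencies copied from the frequency table in Example 1.2.4.
frequencies = {
    "Anchovies": 1, "Bacon": 10, "Cheese": 17, "Green Peppers": 11,
    "Mushrooms": 4, "Olives": 15, "Onions": 5, "Pepperoni": 17,
    "Pineapple": 6, "Sausage": 10,
}
total = sum(frequencies.values())  # grand total of 96

# Central angle of each sector, in degrees: relative frequency times 360.
angles = {topping: Fraction(count, total) * 360 for topping, count in frequencies.items()}

for topping, angle in angles.items():
    print(f"{topping:<15}{float(angle):7.2f} degrees")

# The sectors together must make up the whole circle.
assert sum(angles.values()) == 360
```

For instance, the cheese sector gets \(\frac{17}{96} \cdot 360 = 63.75\) degrees.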
There is one more situation involving qualitative variables which we must investigate.
Subsubsection 1.2.1.3 Bivariate Data
¶The frequency tables and graphs that we've looked at so far work well for univariate data, but they need to be modified before they will work with data involving more than one variable. Multivariate data can be very difficult to summarize and display graphically. Because of this, we usually just look at each variable separately. However, when we have two variables (bivariate data) there are some special techniques that we can use to summarize and display the data.
Definition 1.2.18.
A contingency table, also called a two-way table, is a table used to summarize bivariate qualitative data. The columns in this table represent the various categories of one of the two variables and the rows represent the categories of the other variable. In each cell (row-column combination) we record the number of subjects with that row category and column category.
To see how a contingency table can be read, let's look at an example involving college students and summer vacation plans.
Example 1.2.19. Using a Contingency Table.
Four hundred randomly selected college students are asked to give their class standing (Freshman, Sophomore, Junior, or Senior) and their summer plans (work, play, or study). The following contingency table summarizes the results of this survey.
Determine the number of Freshmen who are not working this summer.
Summer Plans | Freshman | Sophomore | Junior | Senior |
Work | 23 | 35 | 30 | 51 |
Play | 49 | 62 | 29 | 17 |
Study | 28 | 17 | 46 | 13 |
The column labeled Freshman has three rows corresponding to work, play, and study. To get the number of Freshmen who are not working, we add together those who are playing and those who are studying, which gives us \(49+28 = 77\text{.}\)
When we create a contingency table, or if we are just given a contingency table, we sometimes will want to look at each variable separately. In the example above, we may want to know how many of these students intend to work this summer. To answer this question, we need to total up the rows and/or columns. These totals are typically displayed in the margins of the table, leading to the following definition.
Definition 1.2.21.
A marginal distribution of a contingency table is either the set of row sums of that contingency table or the set of column sums of that contingency table.
Let's find the marginal distributions for the student summer plans example above.
Example 1.2.22. Computing Marginal Distributions.
Construct the marginal distributions for the contingency table seen in Example 1.2.19. Use these distributions to determine both the number of students who plan to work this summer and the number of Juniors who took part in the study.
In order to find these marginal distributions, we must find the total in each row and column of the contingency table. Once that is done, we can answer these questions by reading off the appropriate totals.
Summer Plans | Freshman | Sophomore | Junior | Senior | Total |
Work | 23 | 35 | 30 | 51 | 139 |
Play | 49 | 62 | 29 | 17 | 157 |
Study | 28 | 17 | 46 | 13 | 104 |
Total | 100 | 114 | 105 | 81 | 400 |
From the marginal distribution for summer activities (the row totals), we find that 139 students plan to work this summer. From the marginal distribution for class standing (the column totals), we see that 105 Juniors participated in the study.
Note that the row totals and the column totals should each add up to the grand total, which is 400 in this case.
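The row and column totals are easy to compute by machine as well. Here is a minimal Python sketch of the calculation in this example. The variable names (`table`, `row_totals`, `column_totals`) are our own choices.

```python
# Contingency table from Example 1.2.19: rows are summer plans, columns are class standing.
standings = ["Freshman", "Sophomore", "Junior", "Senior"]
table = {
    "Work":  [23, 35, 30, 51],
    "Play":  [49, 62, 29, 17],
    "Study": [28, 17, 46, 13],
}

# Marginal distribution for summer plans: one total per row.
row_totals = {plan: sum(counts) for plan, counts in table.items()}

# Marginal distribution for class standing: one total per column.
column_totals = {
    standing: sum(counts[i] for counts in table.values())
    for i, standing in enumerate(standings)
}

print(row_totals)
print(column_totals)

# Both marginal distributions must add up to the same grand total.
assert sum(row_totals.values()) == sum(column_totals.values()) == 400
```

The final check mirrors the note above: if the two sets of totals disagree, a count was entered or added incorrectly.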
A contingency table is a great tool for summarizing bivariate qualitative data. But in order to describe this data with a graph, we will need to make a modification to our bar graph. We will use a special type of bar graph called a stacked bar graph.
Definition 1.2.24.
A stacked bar graph is used to display the data in a contingency table. One of the two variables is chosen as the primary variable, and a bar is created for each category of the primary variable. The height of the bar is equal to the marginal distribution entry for that category. The bar is broken up into “stacked” pieces, the heights of which represent the frequency of the secondary variable categories within that primary variable category.
This definition can be difficult to decipher. Let's look at stacked bar graphs for the summer plans data. Again, these graphs were created using a spreadsheet program.
Example 1.2.25. Constructing Stacked Bar Graphs.
Create two stacked bar graphs for the contingency table seen in Example 1.2.19: one graph with class standing as the primary variable, and the other with summer plans as the primary variable.
First we create a graph with class standing as the primary variable.
And then the graph with summer plans as the primary variable.
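To make the definition of a stacked bar graph more concrete, the sketch below works out where each stacked piece starts and ends when class standing is the primary variable. It is our own Python illustration. The stacking order (work at the bottom, then play, then study) is an assumption, and a spreadsheet may order the pieces differently.

```python
from itertools import accumulate

# Counts from Example 1.2.19, regrouped by class standing (the primary variable).
plans = ["Work", "Play", "Study"]
by_standing = {
    "Freshman":  [23, 49, 28],
    "Sophomore": [35, 62, 17],
    "Junior":    [30, 29, 46],
    "Senior":    [51, 17, 13],
}

def stacked_pieces(counts):
    """Return the (bottom, top) heights of each stacked piece of one bar."""
    tops = list(accumulate(counts))
    bottoms = [0] + tops[:-1]
    return list(zip(bottoms, tops))

for standing, counts in by_standing.items():
    pieces = stacked_pieces(counts)
    described = ", ".join(
        f"{plan} {bottom}-{top}" for plan, (bottom, top) in zip(plans, pieces)
    )
    print(f"{standing}: {described} (bar height {pieces[-1][1]})")
```

The top of the last piece in each bar equals the marginal distribution entry for that class standing, just as the definition requires.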
Checkpoint 1.2.31. Qualitative Data—True/False.
You have seen several definitions and methods for summarizing and displaying qualitative data.
Question: Which of the following are true statements?
A relative frequency table uses percents instead of counts.
We can never construct both a bar chart and a pie chart based on the same data set.
Large sets of raw data are easily understood.
The order in which categories in a bar graph are listed is important.
The marginal distributions in a contingency table must add to the same totals.
The first and last statements are true. The others are false.
Checkpoint 1.2.32. Qualitative Data—True/False.
You have seen several definitions and methods for summarizing and displaying qualitative data.
Question: Which of the following statements are false?
The bar chart and relative frequency bar chart for the same data set will have different shapes.
Marginal distributions summarize a single variable.
A pie graph is a good way to show the relationship of each category to the whole.
A frequency table and relative frequency table for a set of data may have different numbers of rows.
A stacked bar chart will not work to graph univariate data.
The first and fourth statements are false. The others are true.
Checkpoint 1.2.33. Qualitative Data—True/False.
You have seen several definitions and methods for summarizing and displaying qualitative data.
Question: Which of the following statements are false?
To construct a relative frequency table, divide the frequency for each category by 100.
A stacked bar chart is used to graph the contents of a frequency table.
A contingency table is used to summarize bivariate data.
There should be spaces between the bars in a bar graph.
There is more than one way to graph qualitative data.
The first and second statements are false. The others are true.
Subsection 1.2.2 Quantitative Data
¶When we wish to describe quantitative variables, we run into many of the same problems that we had with qualitative variables. The raw data itself is usually not very useful and must be summarized before we can construct graphs.
Subsubsection 1.2.2.1 Summarizing Quantitative Data with Tables
¶Our first task is to come up with a way to summarize this data in a table, like we did with frequency tables for qualitative variables. To do this, we must create “categories” to use as the rows for our table.
Definition 1.2.34.
A frequency distribution is a table constructed to summarize quantitative data. It has the following components:
Class—the range of values taken on by the variable is broken up into sections called classes.
Upper Class Limit—the largest number that belongs to a class is called the upper class limit.
Class Width—the difference between the upper class limit of a given class and the upper class limit of the previous class (or the minimum value in the data set for the first class) is called the class width. This value should almost always be the same for each class.
Once the classes have been established, we treat them as categories and count the number of values that fall into each class. We then create a table with a column giving the classes and a second column containing their frequencies.
Note that although a frequency distribution will look very similar to a frequency table, there is a difference. Namely, we must break the range of values up into classes in order to construct our table. It is probably easiest to remember that a frequency table is used to summarize data from a qualitative variable, and a frequency distribution is used to summarize data from a quantitative variable.
Example 1.2.35. Constructing a Frequency Distribution.
The height of tomato plants is of particular interest to Daisy. She spends her summer gathering data on the heights of tomato plants in various randomly selected gardens in her home town. Summarize the data she collected below, showing these heights rounded to the nearest tenth of an inch, in a frequency distribution with five classes.
11.1 | 10.1 | 17.9 | 16.9 | 9.3 | 9.6 | 14.1 | 15.2 |
9.4 | 14.2 | 12.7 | 8.4 | 13.0 | 12.2 | 16.2 | 12.0 |
9.7 | 11.2 | 8.9 | 17.4 | 16.6 | 9.8 | 17.0 | 13.4 |
10.5 | 15.3 | 13.5 | 12.7 | 12.8 | 17.0 | 12.9 | 8.6 |
8.3 | 13.4 | 17.3 | 8.8 | 8.2 | 13.3 | 17.4 | 12.1 |
16.1 | 9.4 | 10.7 | 17.0 | 9.9 | 13.1 | 10.4 | 14.8 |
17.7 | 10.2 | 16.6 | 8.1 | 8.5 | 16.9 | 14.1 | 17.5 |
9.1 | 13.0 | 8.7 | 16.0 | 16.6 | 14.6 | 14.8 | 15.1 |
To construct our frequency distribution, we follow these steps:
-
Determine the class width as follows:
Find the difference between the highest number, 17.9, and the lowest, 8.1. This difference is 9.8, which we round up to a range of 10.
Find the class width by dividing the range by the number of classes: \(\frac{10}{5} = 2\text{.}\)
Starting with our minimum value (we'll round down to 8), add the class width to get the upper class limits: 10, 12, 14, 16, and 18.
List the classes in the first column of our table. Note that these should not include the previous class upper class limit, thus the second class is from 10.1-12, not 10-12.
Finally, count the number of values that fall into each class. For example, there are 8 values bigger than 10, but less than or equal to 12.
Classes | Frequency |
8.1-10 | 17 |
10.1-12 | 8 |
12.1-14 | 13 |
14.1-16 | 10 |
16.1-18 | 16 |
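The steps above translate fairly directly into code. The following Python sketch is our own illustration, not the method used to build the table. To keep it short, it classifies only the first row of Daisy's data, and it takes the minimum and maximum of the full data set from the example. It relies on `bisect_left`, which finds the first upper class limit that is greater than or equal to a value.

```python
import math
from bisect import bisect_left

# First row of Daisy's data from Example 1.2.35 (an excerpt, to keep the sketch short).
heights = [11.1, 10.1, 17.9, 16.9, 9.3, 9.6, 14.1, 15.2]

# Minimum and maximum of the full data set, as stated in the example.
lowest, highest = 8.1, 17.9
number_of_classes = 5

# Class width: round the range up to a whole number, then divide by the number of classes.
data_range = math.ceil(highest - lowest)        # 9.8 rounds up to 10
class_width = data_range // number_of_classes   # 10 / 5 = 2

# Upper class limits: start from the rounded-down minimum and keep adding the class width.
start = math.floor(lowest)                      # 8
upper_limits = [start + class_width * k for k in range(1, number_of_classes + 1)]

# A value belongs to the first class whose upper class limit is at least as large as the value.
frequencies = [0] * number_of_classes
for height in heights:
    frequencies[bisect_left(upper_limits, height)] += 1

print(upper_limits)
print(frequencies)
```

Under this rule a height of exactly 12.0 lands in the 10.1-12 class, since that class includes its upper class limit.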
Once we have summarized our data in a frequency distribution, we can of course create a relative frequency distribution by dividing each class frequency by the grand total. We can also create a representative graph, which we will learn about in the next section.
Subsubsection 1.2.2.2 Histograms
¶The graph that we use to describe data summarized by a frequency distribution is very similar to the bar graphs we used to summarize data from a frequency table. There are, however, some important distinctions.
Definition 1.2.38.
A histogram is a graph for quantitative data in which each class in a frequency distribution for that data is represented by a bar, the height of which corresponds to the frequency of that class. These bars must be arranged in ascending order of the classes and must touch each other.
Note that in a histogram, the order is important because the classes of a frequency distribution are themselves ordered from smallest values to largest values. Also, note that the bars touch to show that the classes are contiguous--that is, there is no space between the classes. Let's try constructing a histogram for the tomato plant heights seen in the previous example. We use a spreadsheet to accomplish this, in much the same way that we constructed a bar graph.
Example 1.2.39. Constructing a Histogram.
Construct a histogram for the frequency distribution found in Example 1.2.35.
Based on the frequency distribution built in the previous example, we construct the following histogram.
Notice that the bars in this graph touch each other, and that they are in order from lowest values (between 8.1 and 10) up to highest values (between 16.1 and 18). These are both important characteristics for histograms!
Checkpoint 1.2.43. Frequency Distributions and Histograms.
You have seen several definitions involving summarizing and displaying quantitative data.
Question: Which of the following are true statements?
In a frequency distribution the class widths should be the same for all.
The lower class limit is the highest number that can appear in a class.
Frequency distributions and frequency tables are exactly the same thing.
The order in which the bars in a histogram are arranged matters.
It is okay to allow classes to overlap: for example, 5-10 and 10-15 are valid.
The first and fourth statements are true. The others are false.
Checkpoint 1.2.44. Frequency Distributions and Histograms.
You have seen several definitions involving summarizing and displaying quantitative data.
Question: Which of the following are false statements?
The bars in a histogram should always touch each other.
The upper class limit of the last class in a frequency distribution can be less than the largest number in the data set.
A frequency distribution is used to summarize data from a quantitative variable.
A bar graph may be used to graph data from a frequency distribution.
To find an approximate class width, take the maximum value in the data set, subtract the minimum value, and divide by the desired number of classes.
The second and fourth statements are false. The others are true.
Checkpoint 1.2.45. Frequency Distributions and Histograms.
You have seen several definitions and methods for summarizing and displaying quantitative data.
Question: Which of the following are false statements?
Histograms and frequency distributions go together.
A given set of data will always produce the same frequency distribution, regardless of the class width chosen.
Bar graphs and frequency tables go together.
A frequency table cannot be used to summarize a quantitative variable.
The class widths can be different for different classes in a frequency distribution.
The second and the last statements are false. The others are true.
Subsubsection 1.2.2.3 Stem-and-Leaf Plots
¶One of the disadvantages of summarizing quantitative data with a frequency distribution and histogram is that you lose the specific values. That is, you cannot tell what the individual data values were before the summary was created.
Example 1.2.46. Recovering Data from a Frequency Distribution.
Using the frequency distribution created in Example 1.2.35, determine how many tomato plants are exactly 14.2 inches tall.
Looking just at the frequency distribution we created for this data, this is not possible. If all we have is the frequency distribution, we can only tell that there are 10 plants between 14.1 and 16 inches tall. Some of those may be 14.2 inches tall, but there is no way of knowing without going back to the raw data.
If we wish to create a pictorial representation of a set of quantitative data that also shows the raw data, we need a new type of graph.
Definition 1.2.47.
In a stem-and-leaf plot, data is represented by separating each value into two parts: the stem (all but the right-most digit) and the leaf (the right-most digit). A table is then constructed with a row for each stem, with that stem in the first column and a list of all leaves that belong to that stem in the second column, given in ascending order with no spaces or commas between them.
Stem-and-leaf plots are usually created by hand. Because of this, they are most useful for relatively small data sets for which we wish to get a quick picture of the shape of the data. To get an accurate picture, it is important that each “leaf” take up the same amount of space. So if you are typing a stem-and-leaf plot into a computer, use a monospaced font such as Courier so that the ones (1) don't take up less space than the eights (8). Consider the following example.
Example 1.2.48. Constructing a Stem-and-Leaf Plot.
The following is a list of the number of quarter credits taken by 30 different college students. Construct a stem-and-leaf plot for this data.
12 | 17 | 14 | 13 | 16 | 16 | 18 | 8 | 21 | 16 |
14 | 13 | 16 | 17 | 16 | 16 | 9 | 3 | 18 | 15 |
15 | 12 | 11 | 15 | 21 | 16 | 15 | 6 | 15 | 19 |
We construct our stem-and-leaf plot as follows:
Divide each value into the stem (10's digit) and leaf (1's digit) pieces. Note that this yields stems 0, 1, and 2.
List the stems in order in the first column of our table.
For each stem, list the associated leaves in ascending order.
This produces the following plot.
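Although stem-and-leaf plots are usually drawn by hand, the same three steps can be sketched in Python. The code below is our own illustration. The function name `stem_and_leaf` and the `stem | leaves` row layout are assumptions, and the code assumes non-negative whole-number data so that `divmod(value, 10)` splits each value into its tens and ones digits.

```python
# Quarter credits taken by the 30 students in Example 1.2.48.
credits = [
    12, 17, 14, 13, 16, 16, 18, 8, 21, 16,
    14, 13, 16, 17, 16, 16, 9, 3, 18, 15,
    15, 12, 11, 15, 21, 16, 15, 6, 15, 19,
]

def stem_and_leaf(values):
    """Return plot rows as 'stem | leaves' strings, with stems and leaves ascending."""
    leaves_by_stem = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # stem = tens digit(s), leaf = ones digit
        leaves_by_stem.setdefault(stem, []).append(leaf)
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return [
        f"{stem} | {''.join(map(str, leaves_by_stem.get(stem, [])))}".rstrip()
        for stem in stems
    ]

rows = stem_and_leaf(credits)
print("\n".join(rows))
```

For the credits data this prints three rows: `0 | 3689`, `1 | 122334455555666666677889`, and `2 | 11`.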
Note that the “shape” we see here is like a histogram tilted on its side. However, in this particular instance the middle stem seems to have a disproportionate number of leaves. To remedy this, we can split our stems up into smaller pieces.
Definition 1.2.51.
An expanded stem-and-leaf plot includes more than one row for each stem. The possible leaf digits are divided equally among the rows representing each stem.
Example 1.2.52. Constructing an Expanded Stem-and-Leaf Plot.
Construct an expanded stem-and-leaf plot for the credits data from Example 1.2.48 using two rows per stem.
This is done in almost exactly the same way as our last example. The only difference is that each stem gets two rows. The leaves are then split up with 0-4 going in the first row and 5-9 in the second row for a given stem. The result is shown to the right.
This gives us a clearer picture of how the credit hour values are distributed!
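The two-rows-per-stem rule can be sketched in Python as well. This is our own illustration, with the hypothetical function name `expanded_stem_and_leaf`. It keeps a row even when the row has no leaves, which is a layout choice rather than something the example specifies.

```python
# Quarter credits taken by the 30 students in Example 1.2.48.
credits = [
    12, 17, 14, 13, 16, 16, 18, 8, 21, 16,
    14, 13, 16, 17, 16, 16, 9, 3, 18, 15,
    15, 12, 11, 15, 21, 16, 15, 6, 15, 19,
]

def expanded_stem_and_leaf(values):
    """Two rows per stem: leaves 0-4 in the first row, leaves 5-9 in the second."""
    leaves = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        half = 0 if leaf <= 4 else 1  # 0 = first row of the stem, 1 = second row
        leaves.setdefault((stem, half), []).append(leaf)
    low_stem = min(stem for stem, _ in leaves)
    high_stem = max(stem for stem, _ in leaves)
    return [
        f"{stem} | {''.join(map(str, leaves.get((stem, half), [])))}".rstrip()
        for stem in range(low_stem, high_stem + 1)
        for half in (0, 1)
    ]

expanded_rows = expanded_stem_and_leaf(credits)
print("\n".join(expanded_rows))
```

Splitting the middle stem this way separates the seven values from 11 to 14 from the seventeen values from 15 to 19, which a single row for stem 1 hides.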
Checkpoint 1.2.56. Stem-and-Leaf Plots.
A biologist researching seagulls counts the number of seagulls that enter a nesting ground during several 15 minute periods. She records the following results.
24 | 50 | 46 | 34 | 27 |
39 | 26 | 37 | 43 | 37 |
40 | 37 | 51 | 26 | 36 |
51 | 43 | 34 | 32 | 48 |
Construct a stem-and-leaf plot for this data.
Checkpoint 1.2.59. Stem-and-Leaf Plots.
A biologist researching seagulls counts the number of seagulls that enter a nesting ground during several 15 minute periods. She records the following results.
24 | 50 | 46 | 34 | 27 |
39 | 26 | 37 | 43 | 37 |
40 | 37 | 51 | 26 | 36 |
51 | 43 | 34 | 32 | 48 |
Construct an expanded stem-and-leaf plot for this data.
Subsubsection 1.2.2.4 Other Graphs
¶There are several other graphs that can be useful to help display certain types of data. One of the most useful “other” types of graphs is meant to display the value of a single quantitative variable measured over time.
Definition 1.2.62.
A time series graph, also called a line chart, lists times along the horizontal axis, and values along the vertical axis. Points are placed at the appropriate height for the value of the variable at each time. These points are then connected with line segments.
To see how a time series can be used, consider the next example.
Example 1.2.63. Using a Time Series.
The value of a certain stock is recorded every January for five years. The following data is recorded. Construct a time series graph for this data.
Year | 2005 | 2006 | 2007 | 2008 | 2009 |
Value | $42 | $47.5 | $51 | $46.3 | $35.2 |
Using a spreadsheet program, we can produce the following graph.
We need another type of graph to represent bivariate quantitative variables. This sort of data comes in \((x,y)\) pairs where \(x\) represents the value of the first variable and \(y\) the value of the second. In order to describe the relationship between the two variables, we can plot these pairs on a graph, creating what's called a scatter plot.
Definition 1.2.66.
In a scatter plot we associate one variable from a set of bivariate data with the \(x\)-axis and the other with the \(y\)-axis. We then plot each pair of values as an \((x,y)\) point on this coordinate system.
To see how this works, let's look at an example.
Example 1.2.67. Constructing a Scatter Plot.
The tail length and weight of a certain breed of mice are measured. Construct a scatter plot and comment on the relationship, if any, that this graph shows between the two variables.
Mouse Number | Tail Length (cm) | Weight (oz) |
Mouse #1 | 7 | 3.3 |
Mouse #2 | 8.2 | 4.6 |
Mouse #3 | 12.1 | 8.6 |
Mouse #4 | 9.3 | 6.1 |
Mouse #5 | 11.5 | 7.3 |
Mouse #6 | 7.8 | 2.6 |
Mouse #7 | 10.4 | 7.0 |
Using a spreadsheet, we construct the following scatter plot.
There definitely appears to be a relationship between tail length and weight. Larger tail lengths seem to correspond to larger weights. While this is not always the case (mouse #6 is an exception) it does appear to be a fairly strong relationship.
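One simple way to check this observation without the picture is to line the mice up from shortest tail to longest and see whether the weights rise along with them. The Python sketch below is our own illustration of that check, not a formal measure of the strength of the relationship.

```python
# (tail length in cm, weight in oz) for mice #1 through #7, from the table above.
mice = {
    1: (7.0, 3.3), 2: (8.2, 4.6), 3: (12.1, 8.6), 4: (9.3, 6.1),
    5: (11.5, 7.3), 6: (7.8, 2.6), 7: (10.4, 7.0),
}

# Order the mice from shortest tail to longest tail.
by_tail = sorted(mice, key=lambda number: mice[number][0])

# Record every mouse that weighs less than the mouse with the next-shorter tail.
exceptions = [
    current
    for previous, current in zip(by_tail, by_tail[1:])
    if mice[current][1] < mice[previous][1]
]

print(by_tail)
print(exceptions)
```

Only mouse #6 breaks the pattern: its tail is longer than that of mouse #1, yet it weighs less.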
Checkpoint 1.2.72. Appropriate Graphs.
You have been asked to come up with a quick graph for a set of 15 values from a quantitative variable ranging between 10 and 40.
Question: What sort of graph would be most appropriate in this situation?
A stem-and-leaf plot
Checkpoint 1.2.73. Appropriate Graphs.
Data has been collected from 100 ten-year-old children. For each child the parents' combined income, and the child's reading level are recorded. You wish to construct a graph to determine if there is a relationship between these two variables.
Question: What type of graph is most appropriate for this data?
A scatter plot
Checkpoint 1.2.74. Appropriate Graphs.
You are asked to determine if the interest rate for a certificate of deposit has changed in the last two years. You are given a set of data consisting of interest rates for the certificate of deposit every month for the last two years.
Question: What type of graph is most appropriate for this data?
A time series graph
Subsubsection 1.2.2.5 Interpreting Graphs
¶The reason that we create graphs in the first place is to help us better understand a set of data. It is therefore important that we know how to interpret the graphs that we create. In this class, we will spend most of our time working with univariate quantitative variables. We thus need to be sure that we know how to interpret histograms. There are three major things to look for in a histogram.
Definition 1.2.75.
The center of a histogram is the class that most accurately describes the typical value in the data set being described.
Definition 1.2.76.
One way to describe the shape of a histogram is based on the number and location of its modes. A mode is a class that contains significantly more data than the classes immediately next to it. A histogram can be unimodal (one mode), bimodal (two modes), multimodal (more than two modes), or it may have no prominent modes in which case it is called uniform.
Definition 1.2.77.
Another way to describe the shape of a histogram is based on its symmetry. A histogram may be symmetric (meaning it has the same shape on the left and right of its center), skewed right (meaning that the right tail is longer than the left) or skewed left (meaning that the left tail is longer than the right).
We can tell a lot about the distribution of values of a single quantitative variable by describing the histogram in these terms. Consider the following example.
Example 1.2.78. Describing Histograms.
Locate the center of each histogram and then classify it by number of modes and symmetry.
Histogram (a) has a single mode in the 15.1-20 class. It is skewed to the left (its left tail is longer) and the center appears to be somewhere between the 10.1-15 class and the 15.1-20 class.
Histogram (b) has two modes (bimodal). It is also skewed to the left, and the center again appears to be between the 10.1-15 and 15.1-20 classes.
Histogram (c) is unimodal and symmetric. The center is in the 10.1-15 class.
Histogram (d) does not appear to have any modes--while some of the bars are taller than their neighbors, the difference is very small. We will call this histogram uniform and symmetric, with the center right in the middle class, 10.1-15.
Histogram (e) is bimodal and appears to be slightly skewed to the left. The center appears to be in the middle class.
Histogram (f) is multimodal (three modes) and symmetric. The center is in the middle class, 10.1-15.
Graph (c) above is an example of a very common histogram shape. This shape is so common, in fact, that it has a distinctive name.
Definition 1.2.80.
A histogram is called mound-shaped if it is unimodal and symmetric, with the mode in the exact center of the histogram.
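The shape vocabulary above can be sketched in code. The following Python sketch is only an illustration, not a standard procedure: the function names, the `min_gap` threshold standing in for "significantly more," the `tol` tolerance for symmetry, the peak-position rule used to guess the direction of skew, and the sample frequencies are all assumptions made here. They are not part of the definitions.

```python
def count_modes(freqs, min_gap=2):
    """Count classes that exceed both immediate neighbors by at least min_gap.

    min_gap is a hypothetical stand-in for "significantly more data".
    A missing neighbor (at either end) is treated as a frequency of 0.
    Flat-topped peaks (two equal tall neighbors) are not detected.
    """
    modes = 0
    for i, f in enumerate(freqs):
        left = freqs[i - 1] if i > 0 else 0
        right = freqs[i + 1] if i < len(freqs) - 1 else 0
        if f - left >= min_gap and f - right >= min_gap:
            modes += 1
    return modes


def is_symmetric(freqs, tol=1):
    """True if mirrored classes differ by at most tol (an assumed tolerance)."""
    return all(abs(a - b) <= tol for a, b in zip(freqs, reversed(freqs)))


def describe_shape(freqs, min_gap=2, tol=1):
    """Return (modality, symmetry) labels for a list of class frequencies."""
    n_modes = count_modes(freqs, min_gap)
    modality = {0: "uniform", 1: "unimodal", 2: "bimodal"}.get(n_modes, "multimodal")
    if is_symmetric(freqs, tol):
        symmetry = "symmetric"
    else:
        # Rough heuristic (an assumption): if the tallest class sits right of
        # the middle, the left tail is the longer one, and vice versa.
        peak = freqs.index(max(freqs))
        middle = (len(freqs) - 1) / 2
        if peak > middle:
            symmetry = "skewed left"
        elif peak < middle:
            symmetry = "skewed right"
        else:
            symmetry = "asymmetric"
    return modality, symmetry


def is_mound_shaped(freqs, min_gap=2, tol=1):
    """Unimodal and symmetric, with the tallest class in the exact center."""
    shape = describe_shape(freqs, min_gap, tol)
    centered = freqs.index(max(freqs)) == (len(freqs) - 1) / 2
    return shape == ("unimodal", "symmetric") and centered


# Made-up frequencies, one value per class, for illustration only.
print(describe_shape([1, 3, 6, 9, 6, 3, 1]))   # ('unimodal', 'symmetric')
print(describe_shape([1, 2, 4, 7, 10, 3]))     # ('unimodal', 'skewed left')
print(describe_shape([5, 6, 5, 6, 5]))         # ('uniform', 'symmetric')
print(is_mound_shaped([1, 3, 6, 9, 6, 3, 1]))  # True
```

In practice, deciding what counts as "significantly more" is a judgment call made by looking at the graph; the thresholds here only make that judgment explicit.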
Checkpoint 1.2.83. Histogram Shapes.
Consider the histogram shown below.
Question: Which of the following terms correctly describes this histogram?
Unimodal
Bimodal
Uniform
Symmetric
Skewed Right
Skewed Left
The histogram is bimodal and skewed right.
Checkpoint 1.2.85. Histogram Shapes.
Consider the histogram shown below.
Question: Which of the following terms correctly describes this histogram?
Bimodal
Skewed Left
Skewed Right
Mound-Shaped
Multimodal
Uniform
This histogram is mound-shaped.
Checkpoint 1.2.87. Histogram Shapes.
Consider the histogram shown below.
Question: Which of the following terms correctly describes this histogram?
Mound-Shaped
Skewed Left
Unimodal
Bimodal
Uniform
Skewed Right
The histogram is uniform.
Subsection 1.2.3 Things to Avoid
¶There are a few common errors in creating and using graphs that should be mentioned. The first such error is commonly caused either by a “helpful” computer program with poorly chosen default settings, or by a person who is trying to manipulate the graph to “lie” with statistics.
Example 1.2.89. Incorrect Vertical Axis.
The following frequency table gives the number of individuals in a recent study who said that they had full health insurance coverage, some health insurance coverage, and no health insurance coverage. Next to those numbers are two bar charts representing this data. Determine which bar chart is misleading, and explain why it is misleading.
Status | Frequency |
No Insurance | 102 |
Some Insurance | 95 |
Full Insurance | 82 |
Both bar charts were constructed using the same data from the same frequency table, but the first graph is accurate while the second is not. Notice that the vertical scale in Figure 1.2.91.(a) starts at 0 while the vertical scale in Figure 1.2.91.(b) begins at 80. Starting the scale at 80 exaggerates the difference between the number of individuals with full coverage and those with some or no coverage.
This is one example of a problem caused by a violation of an important principle in graphing.
Definition 1.2.92.
The area principle states that the area of any bar or sector in a graph representing a category in a frequency table or class in a frequency distribution must be in the same proportion to the rest of the graph as the frequency of the category or class is to the rest of the data in the table.
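To make the definition concrete, the short Python sketch below compares the drawn bar heights for the health insurance table when the vertical axis starts at 0 and when it starts at 80. The function names are hypothetical and chosen only for this illustration; the frequencies and the starting value of 80 come from the example above.

```python
# Frequencies from the health insurance frequency table.
freqs = {"No Insurance": 102, "Some Insurance": 95, "Full Insurance": 82}


def bar_heights(freqs, axis_start=0):
    """Drawn height of each bar when the vertical axis begins at axis_start."""
    return {status: count - axis_start for status, count in freqs.items()}


def height_ratio(freqs, a, b, axis_start=0):
    """How many times taller bar a is drawn than bar b."""
    heights = bar_heights(freqs, axis_start)
    return heights[a] / heights[b]


# Axis starting at 0: bar heights are proportional to the frequencies.
true_ratio = height_ratio(freqs, "No Insurance", "Full Insurance")
# Axis starting at 80: the drawn heights are 22 and 2.
drawn_ratio = height_ratio(freqs, "No Insurance", "Full Insurance", axis_start=80)

print(round(true_ratio, 2))  # 1.24
print(drawn_ratio)           # 11.0
```

With the axis starting at 0, the "No Insurance" bar is about 1.24 times as tall as the "Full Insurance" bar, matching the ratio 102 to 82. With the axis starting at 80, it is drawn 11 times as tall, even though the data have not changed.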
By changing the starting point of the frequency axis in the second bar chart above, we have violated the area principle. This makes the “full insurance” bar look much shorter in relation to the other two bars than the numbers indicate. Another common error in creating graphs is to violate the area principle by creating three-dimensional graphs. Consider the following.
Example 1.2.93. A Three-Dimensional Pie Graph.
A store manager wishes to know the general age group into which most of his customers fall. He does a quick survey and comes up with the following frequency distribution. He then creates a pie graph to show the proportion of customers in each age group. Do you think his pie graph accurately represents the data from the frequency distribution?
Ages | Frequency |
0-20 | 14 |
21-40 | 16 |
41-60 | 10 |
61+ | 7 |
This is not an accurate representation of the store manager's frequency distribution. Notice that the 0-20 year old age group looks like it may be the largest group based on the amount of gray on the graph. It seems to be three times as large as the 61+ group in the back of the graph. However, the 21-40 year old group is in fact the largest, and the 0-20 year old group is only twice as large as the 61+ group (14 customers compared to 7). By adding the 3D effects, the store manager has violated the area principle.
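As a check on these comparisons, the Python sketch below computes the share of the pie that each age group should occupy in a flat, undistorted pie graph. The variable names are hypothetical; the frequencies come from the store manager's table.

```python
# Frequencies from the store manager's frequency distribution.
ages = {"0-20": 14, "21-40": 16, "41-60": 10, "61+": 7}

total = sum(ages.values())  # 47 customers surveyed

# Relative frequency of each group, and the central angle of its sector.
relative = {group: count / total for group, count in ages.items()}
angles = {group: 360 * share for group, share in relative.items()}

for group in ages:
    print(f"{group}: {relative[group]:.1%} of customers, {angles[group]:.1f} degrees")

largest = max(ages, key=ages.get)   # group with the highest frequency
ratio = ages["0-20"] / ages["61+"]  # 0-20 group compared with 61+ group

print(largest)  # 21-40
print(ratio)    # 2.0
```

Under the area principle, the sector for each group should cover the same fraction of the circle as its relative frequency, so the 21-40 sector should be the largest and the 0-20 sector should be twice the size of the 61+ sector.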
The most important thing to remember from this section is that we must be sure not to sacrifice accuracy to create “pretty” looking graphs.
Checkpoint 1.2.98. Things to Avoid.
It is always best to use three-dimensional effects to make your graphs more appealing.
Question: Is this statement true or false?
False
Checkpoint 1.2.99. Things to Avoid.
It is best to change the scale of the \(y\)-axis in a histogram or bar graph to start at a number close to the minimum value in the frequency distribution or frequency table, but bigger than zero.
Question: Is this statement true or false?
False
Checkpoint 1.2.100. Things to Avoid.
Violating the area principle may create misleading graphs.
Question: Is this statement true or false?
True