Section 1.4 Describing Data Numerically: Measures of Relative Standing

Introduction.

It is sometimes useful to determine how one value in a data set compares to the rest of the data in the same set. To do this, we compute a measure of relative standing for the value within its data set.

For example, when ranking our favorite pizza toppings, we are specifying the relative standing of each topping. If “mushroom” is your favorite topping, then it is ranked number one out of all possible toppings. If “onion” is your second favorite topping, then it ranks below mushroom but above everything else.

This type of measure can also be useful for comparing data from two different data sets. A 70% on a math exam may be the top score, while 70% on a history exam is the bottom score. By giving the relative standing of these scores, we can more accurately compare them.

The measures of relative standing we define in this section will be based on or closely related to the measures of center and spread we saw in the previous section.

Subsection 1.4.1 Percentiles

In the previous section, we learned how to compute the five quartiles. While we used them to help us compute a measure of spread, the quartiles themselves are actually a measure of relative standing. They help us define the location of a value within the given set of data.

Obed's plumbing business charges $95 an hour for residential service. A study of residential service rates among all plumbers in the same region finds the following quartiles. How do Obed's rates compare to those of the other plumbers in the region?

  • \(Q_0 = \$65\)

  • \(Q_1 = \$74\)

  • \(Q_2 = \$86\)

  • \(Q_3 = \$97\)

  • \(Q_4 = \$109\)

Solution

Obed's rate of $95 an hour is between \(Q_2\text{,}\) the second quartile or median, and \(Q_3\text{,}\) the third quartile. So he charges more than at least 50% of the plumbers in his region, but less than at least 25% of them.

Another measure of relative standing that is closely related to quartiles is the percentile.

Definition 1.4.2.

The \(x^\text{th}\) percentile is a value that divides the bottom \(x\) percent of the data from the top \(100-x\) percent of the data in a given data set.

If you think about the relationship between percentiles and quartiles, you may recognize that quartiles are in fact specific percentiles. The first quartile, \(Q_1\text{,}\) is also the 25th percentile. The median, \(Q_2\text{,}\) is the 50th percentile. And the third quartile, \(Q_3\text{,}\) is the 75th percentile. Percentiles are seen most often in conjunction with standardized measurements, such as an infant's weight or standardized test results.

Interpret the following statements involving the word percentile.

  1. When little Johnny was born, his height was only in the 20th percentile. But as he grew, he was soon taller than all of his peers.

  2. Isabella really enjoys reading. She scored in the 95th percentile for reading on her most recent achievement test.

Solution

The following statements interpret the percentiles used above.

  1. Johnny's height was in the 20th percentile, meaning that he was as tall as or taller than 20% of newborns and as short as or shorter than 80%. So relative to other newborns, he was fairly short.

  2. Isabella scored in the 95th percentile, meaning that her score was as good as or better than the scores of 95% of the children who took this achievement test. Stated from the opposite perspective, only 5% of children scored as well as or better than Isabella.
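
The "at or below" reading used in these interpretations can be sketched in a few lines of Python. The function name `percentile_rank` and the list of scores are hypothetical, invented only for illustration, and other percentile conventions exist that give slightly different answers.

```python
def percentile_rank(data, value):
    """Percent of the values in data that are less than or equal to value."""
    at_or_below = sum(1 for v in data if v <= value)
    return 100 * at_or_below / len(data)


# Hypothetical test scores, invented for illustration only.
scores = [52, 61, 64, 70, 73, 75, 78, 81, 88, 94]

# A score of 88 is at or above 9 of the 10 scores in this list.
print(percentile_rank(scores, 88))
```

Under this convention, a value's percentile rank is simply the percent of the data set that it equals or exceeds.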

Subsection 1.4.2 Five Number Summaries and Box-Plots

The median and quartiles that we saw in Subsection 1.3.3 can be used to create a graphic that represents the distribution (shape) of the data set. This graphic is referred to as a box-plot. Before we see how to build a box-plot, however, let's look at the numbers that go into making the plot.

Definition 1.4.4.

The five-number summary for a set of data is the minimum, first quartile, median, third quartile, and maximum for that set of data, once any outliers have been removed.

You have hopefully noticed that the five numbers that make up this summary are the five quartiles, \(Q_0\text{,}\) \(Q_1\text{,}\) \(Q_2\text{,}\) \(Q_3\text{,}\) and \(Q_4\text{,}\) that we saw previously, assuming that there are no outliers in the data set. We will define how to identify outliers in the context of a five number summary in the following step-by-step definition of a box-plot.

Definition 1.4.5.

A box-plot, also called a box-and-whisker plot, is the graphical representation of the five number summary for a set of data. To construct a box-plot, follow these steps.

  1. Draw an axis with a scale covering at least the entire range of values in the data set.

  2. Draw a box (rectangle) with bottom at the first quartile, and top at the third quartile.

  3. Draw a line through the box, dividing it at the median.

  4. Sketch in temporary “fences” at 1.5 times the IQR above and below the edges of the box.

  5. Mark any values from the data set that lie outside these fences on the graph. They are outliers.

  6. Draw “whiskers” from the bottom of the box down to the minimum value in the data set that is not an outlier, and from the top of the box to the maximum value that is not an outlier.

Unlike a histogram, the box-plot focuses on showing the range of data. This range, excluding outliers, is divided up into quarters allowing one to see where the data is most “tightly packed” together. Let's see how this works in an example.

Construct a five-number summary and a box-plot for the data: \(\lbrace 14, 17, 10, 9, 16, 19, 12, 10, 15, 13, 42 \rbrace\text{.}\)

Solution

The first step is to compute the five number summary. To do that, we must arrange the data in order.

\begin{equation*} 9, 10, 10, 12, 13, 14, 15, 16, 17, 19, 42 \end{equation*}

Next we identify the following quartiles:

  • Since there are eleven numbers, the second quartile, or median, is right in the middle at position 6. So \(Q_2 = 14\text{.}\)

  • We will share this middle number between the upper and lower half of the data, so that there are six values in each half. The first quartile is therefore the average of the 3rd and 4th numbers, and \(Q_1 = \frac{10+12}{2} = 11\text{.}\)

  • The third quartile is the median of the top six numbers, so it is \(Q_3 = \frac{16+17}{2} = 16.5\text{.}\)

Next, we compute the interquartile range and calculate where the fences should be placed. We get \(\text{IQR} = Q_3 - Q_1 = 16.5 - 11 = 5.5\text{.}\) So the lower and upper fences are at \(Q_1 - 1.5(\text{IQR}) = 11 - 1.5(5.5) = 2.75\) and at \(Q_3 + 1.5(\text{IQR}) = 16.5 + 1.5(5.5) = 24.75\text{.}\)

Note that the maximum value in the data set, 42, is above the upper fence. So it is an outlier and will not be included in the five-number summary. The minimum and maximum values of the remaining data are 9 and 19 respectively. This gives us the following five-number summary.

\begin{equation*} \text{Min} = 9, \quad Q_1 = 11, \quad \text{Median} = 14, \quad Q_3 = 16.5, \quad \text{Max} = 19 \end{equation*}

Using these, we construct the box-plot shown below, remembering to identify the outlier with an asterisk.

A rectangle is drawn from 11 up to 16.5 with a line through the middle at 14.  Dashed lines are drawn at 2.75 and 24.75.  There is an asterisk at 42.  Finally whiskers extend from the top of the rectangle up to 19 and the bottom of the rectangle down to 9.
Figure 1.4.7. Box Plot
Figure 1.4.8. Five Number Summaries and Box-Plots I
Figure 1.4.9. Five Number Summaries and Box-Plots I
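
The computation in this example can be checked with a short Python sketch. The function name `box_plot_summary` is hypothetical, and the quartile rule it implements is the one used in the solution above, where the median is shared between the lower and upper halves when the number of values is odd. Statistical software often uses other quartile conventions, so library results may differ slightly.

```python
from statistics import median


def box_plot_summary(data):
    """Five-number summary, fences, and outliers (shared-median quartiles)."""
    values = sorted(data)
    n = len(values)
    half = (n + 1) // 2  # when n is odd, both halves include the median
    q1 = median(values[:half])
    q2 = median(values)
    q3 = median(values[n - half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values inside the fences determine the ends of the whiskers.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "min": inside[0], "q1": q1, "median": q2, "q3": q3, "max": inside[-1],
        "fences": (lower_fence, upper_fence), "outliers": outliers,
    }


summary = box_plot_summary([14, 17, 10, 9, 16, 19, 12, 10, 15, 13, 42])
print(summary)
```

The whisker endpoints are taken from the values that lie inside the fences, which is why 42 is reported as an outlier rather than as the maximum.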

Consider the following set of data:

\begin{equation*} \lbrace 15, 25, 7, 19, 400, 27, 51, 32, 19, 77, 52, 15 \rbrace. \end{equation*}

Question: True or False: When constructing a box-plot, one of these values would be an outlier (above or below the “fences”)?

Answer

True

Consider the following set of data:

\begin{equation*} \lbrace 4, 6, 3, 9, 7, 4, 2, 6, 9, 3, 8, 5, 1 \rbrace. \end{equation*}

Question: True or False: When constructing a box-plot, one of these values would be an outlier (above or below the “fences”)?

Answer

False

Subsection 1.4.3 z-Scores and the Empirical Rule

The mean and standard deviation can also be applied to measure the relative standing of a value within its data set. To do this, we think of a “standard” deviation as a measure of typical distance from the mean. It turns out that in a typical data set, most data is within three standard deviations of the mean. Anything that is more than three standard deviations from the mean can be considered an outlier.

Determine which, if any, of the values in the sample \(\lbrace 4, 10, 12, 13, 9, 10 \rbrace\) should be considered outliers.

Solution

There are two ways to approach this problem, depending on which types of measures we want to use.

  1. We could use the IQR to determine if a value is an outlier. In this method, we will find the first and third quartiles and then take anything that is more than \(1.5 \times \text{IQR}\) away from these values to be an outlier.

    Arranging the data in order gives:

    \begin{equation*} 4, 9, 10, 10, 12, 13. \end{equation*}

    So the quartiles are \(Q_1 = 9\) and \(Q_3 = 12\text{.}\) Thus, the IQR is \(12-9 = 3\) and our fences are at

    \begin{equation*} 9 - 1.5(3) = 4.5 \qquad \text{and} \qquad 12 + 1.5(3) = 16.5. \end{equation*}

    Therefore, 4 is an outlier on the low side.

  2. Using the mean and standard deviation gives us another way to check for outliers. Computing these gives:

    \begin{equation*} \overline{x} = \frac{58}{6} \approx 9.7 \end{equation*}
    \begin{equation*} \scriptstyle s = \sqrt{\frac{(4-9.7)^2 + (10-9.7)^2 + (12-9.7)^2 + (13-9.7)^2 + (9-9.7)^2 + (10-9.7)^2}{5}} \approx 3.1. \end{equation*}

    So outliers will be less than \(\overline{x} - 3s = 9.7 - 3(3.1) = 0.4\) or bigger than \(\overline{x} + 3s = 9.7 + 3(3.1) = 19\text{.}\) Therefore, there are no outliers using this method.
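
The second method can be reproduced with Python's standard `statistics` module, as in the sketch below. The variable names are illustrative. Because the code keeps the unrounded mean and standard deviation, its cutoffs differ slightly from the hand-computed 0.4 and 19, but the conclusion is the same.

```python
from statistics import mean, stdev

sample = [4, 10, 12, 13, 9, 10]
x_bar = mean(sample)  # sample mean
s = stdev(sample)     # sample standard deviation (divides by n - 1)

# Cutoffs three standard deviations below and above the mean.
lower = x_bar - 3 * s
upper = x_bar + 3 * s
outliers = [x for x in sample if x < lower or x > upper]

print(round(x_bar, 1), round(s, 1))
print(outliers)
```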

This idea of measuring how unusual a value is by looking at the number of standard deviations it lies away from the mean is an important one for us to understand. We will use this concept many times in the chapters to come. Because of its usefulness, there is a special name for this measure.

Definition 1.4.13.

The z-score of a value is the number of standard deviations that it lies away from the mean. To compute the z-score of a value \(x\text{,}\) use one of the following formulas:

\begin{equation*} \text{Population:}\quad z= \frac{x -\mu}{\sigma} \end{equation*}
\begin{equation*} \text{Sample:}\quad z = \frac{x-\overline{x}}{s} \end{equation*}

Note that in both cases the z-score is computed by subtracting the mean from the value \(x\) and dividing by the standard deviation.

Refining our second method in Example 1.4.12, we can check individual values to determine their distance away from the mean in terms of number of standard deviations. We do this by computing their z-scores.

In the last example, we saw that 4 was a potential outlier according to the IQR. The mean of the data set was \(\overline{x} = 9.7\) and the standard deviation was \(s = 3.1\text{.}\) Find and interpret the z-score for 4 in this data set.

Solution

The z-score for 4 is:

\begin{equation*} z = \frac{4 - 9.7}{3.1} \approx -1.84 \end{equation*}

This means that 4 lies approximately 1.84 standard deviations below the mean (below because the z-score is negative). Since it is not at least three standard deviations above or below the mean, we do not consider it an outlier.
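
As a minimal sketch, the z-score formula translates directly into a one-line Python function. The name `z_score` and its parameter names are illustrative, and the inputs are the rounded mean and standard deviation quoted in this example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Rounded mean and standard deviation taken from the example.
z = z_score(4, 9.7, 3.1)
print(round(z, 2))

# True would flag the value as an outlier under the three-standard-deviation rule.
print(abs(z) >= 3)
```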

Another useful application of z-scores is to compare values from two different data sets. A z-score gives us a measure of relative standing: it tells us how far above or below the mean a value lies compared to the rest of the data in its data set.

Using z-scores, determine which is better: a score of 70% on an exam with an average score of 65% and a standard deviation of 2.5%, or a score of 70% on an exam with an average score of 65% and a standard deviation of 15%.

Solution

The first exam score of 70%, or 0.70, has a z-score of

\begin{equation*} z_1 = \frac{0.70 - 0.65}{0.025} = 2.0. \end{equation*}

The second exam score of 70%, or 0.70, has a z-score of

\begin{equation*} z_2 = \frac{0.70-0.65}{0.15} \approx 0.333. \end{equation*}

Since \(z_1 > z_2\text{,}\) the score on the first exam is “higher” than the score on the second exam relative to the rest of the data.

Even though both scores are exactly the same value, 70%, a 70% on the first exam is actually a better score than a 70% on the second exam. This is because the first exam has a smaller standard deviation, meaning that the scores are more tightly grouped around the mean of 65%.
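
The same comparison can be sketched in Python by computing a z-score for each exam and picking the largest. The dictionary layout and the labels "first exam" and "second exam" are illustrative choices. The scores are entered in percentage points rather than as decimals, which does not change the z-scores.

```python
exams = {
    "first exam": {"score": 70, "mean": 65, "std_dev": 2.5},
    "second exam": {"score": 70, "mean": 65, "std_dev": 15},
}

# z-score of each score relative to its own exam.
z_scores = {
    name: (e["score"] - e["mean"]) / e["std_dev"] for name, e in exams.items()
}

# The exam with the largest z-score is the best result relative to the class.
best = max(z_scores, key=z_scores.get)
print(z_scores)
print(best)
```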

In the special case where we know that our data has a mound-shaped distribution, meaning that if we made a histogram it would be mound shaped with a mode right in the center and symmetric sides, we can use the z-scores to predict how much of the data will fall into certain ranges.

Definition 1.4.16.

The empirical rule states that in a mound-shaped distribution, approximately

  • 68% of the values will lie within one standard deviation of the mean (z-score between -1 and 1),

  • 95% of the values will lie within two standard deviations of the mean (z-score between -2 and 2), and

  • 99.7% of the values will lie within three standard deviations of the mean (z-score between -3 and 3).

You can see then why a z-score of less than -3 or more than 3 would make a value an outlier. Only 0.3% of values should be in this range. To help you visualize the empirical rule, take a look at the following picture.

Figure 1.4.17. Illustration of the Empirical Rule
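
One way to see where the three percentages come from is to model the mound shape with a normal distribution. That modeling choice is an assumption of this sketch, not something the definition above requires. Python's standard `statistics.NormalDist` class supplies the cumulative distribution function.

```python
from statistics import NormalDist

# Assumption: the mound-shaped distribution is modeled as a standard normal.
standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution with a z-score between -k and k.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

Under this assumption, the computed shares round to the 68%, 95%, and 99.7% quoted in the rule.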

Now let's apply the empirical rule to an example.

Suppose that the distribution of scores on a standardized exam is mound-shaped with a mean of \(\mu = 102.5\) and standard deviation of \(\sigma = 9.8\text{.}\) Answer the following questions.

  1. What percent of the students scored between 82.9 and 122.1?

  2. Between what two scores did 68% of students score?

  3. The top 0.15% of students made at least what score?

Solution

  1. Since \(82.9 = 102.5 - 2(9.8)\) and \(122.1 = 102.5+2(9.8)\text{,}\) the range extends two standard deviations above and below the mean. So the empirical rule says that 95% of students will score in this range.

  2. According to the empirical rule, 68% of students scored within one standard deviation of the mean. So this range is from \(102.5 - 1(9.8) = 92.7\) to \(102.5 + 1(9.8) = 112.3\text{.}\)

  3. Finally, the top 0.15%, which is half of 0.3%, scored at least 3 standard deviations above the mean. So their score is at least \(102.5 + 3(9.8) = 131.9\text{.}\)
Figure 1.4.19. z-Scores and the Empirical Rule I
Figure 1.4.20. z-Scores and the Empirical Rule II
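
The three ranges used in the standardized exam example can be tabulated with a short Python sketch. The helper name `interval` and the lookup table `EMPIRICAL_PERCENT` are illustrative, and the percentages are the approximate ones from the empirical rule.

```python
def interval(mu, sigma, k):
    """Endpoints of the range within k standard deviations of the mean."""
    return (mu - k * sigma, mu + k * sigma)


mu, sigma = 102.5, 9.8
EMPIRICAL_PERCENT = {1: 68, 2: 95, 3: 99.7}

for k, percent in EMPIRICAL_PERCENT.items():
    low, high = interval(mu, sigma, k)
    print(f"about {percent}% of scores lie between {low:.1f} and {high:.1f}")

# Top 0.15%: half of the 0.3% that falls outside three standard deviations.
top_cutoff = interval(mu, sigma, 3)[1]
print(f"top 0.15% scored at least {top_cutoff:.1f}")
```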

Sam had three midterm exams, and he scored as follows on these exams.

  • English.

    He scored a 43 on his English exam. The exam had a mean of 40 and a standard deviation of 7.

  • Math.

    On his math exam, Sam scored 78. The exam had a mean of 73 and a standard deviation of 9.5.

  • History.

    Finally, on the history exam Sam scored 65. This exam had a mean of 60 and standard deviation of 8.2.

Question: Using z-scores, determine on which exam Sam actually scored best, relative to the rest of the class.

Answer

History

A widget manufacturer produces widgets whose weights have a mound-shaped distribution with a mean of 14 oz. and standard deviation of 1.7 oz.

Question: Approximately what percent of the widgets produced will weigh between 8.9 and 19.1 oz?

Answer

99.7%

The length of time that a typical student spends studying per day has a mound-shaped distribution with mean \(\mu = 161\) minutes and standard deviation \(\sigma = 51\) minutes.

Question: Use z-scores to determine which, if any, of the following study times should be considered outliers: 2 minutes, 29 minutes, 105 minutes, 294 minutes, and 365 minutes

Answer

2 minutes and 365 minutes are outliers

Subsection 1.4.4 Chebyshev's Inequality

While the empirical rule gives us a good way to estimate how much of a data set will lie within one, two, or three standard deviations of the mean, it does have limitations. First, it only works if we know the distribution is mound-shaped. If the distribution is skewed, bimodal, or otherwise not mound-shaped, the empirical rule will not give us accurate results. The other drawback is that this rule only tells us how much of the data is within the three ranges given: one, two, or three standard deviations on each side of the mean. The next rule we will examine is much more general.

Chebyshev's inequality states that for any data set, whatever the shape of its distribution, and for any \(k > 1\text{,}\) the proportion \(P\) of the data that lies within \(k\) standard deviations of the mean satisfies

\begin{equation*} P \geq 1 - \frac{1}{k^2}. \end{equation*}

Now let's revisit Example 1.4.18 and see how this new rule can be applied.

Suppose that the mean score on a standardized exam is \(\mu = 102.5\) and standard deviation is \(\sigma = 9.8\text{.}\) Without assuming that the distribution of scores is mound shaped, answer the following questions.

  1. What percent of the students scored between 82.9 and 122.1?

  2. What percent of students will score between 87.8 and 117.2?

Solution

We can no longer use the empirical rule, since we do not know if the distribution is mound shaped. Instead, we use Chebyshev's Inequality.

  1. Our first step is to figure out how many standard deviations 82.9 and 122.1 are on either side of the mean. This is the value of \(k\) in Chebyshev's inequality. We compute that by finding the difference between the mean and the lower bound of our range and dividing by \(\sigma\text{.}\)

    \begin{equation*} k = \frac{\mu - \text{lower bound}}{\sigma} = \frac{102.5-82.9}{9.8} = 2 \end{equation*}

    We could have found the same value using the upper bound.

    \begin{equation*} k = \frac{\text{upper bound} - \mu}{\sigma} = \frac{122.1-102.5}{9.8} = 2 \end{equation*}

    Therefore, the percent of values in this range is

    \begin{equation*} P \geq 1-\frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75 \text{ or } 75\% \end{equation*}

    Note that this is less than the 95% the empirical rule gave us. That is because we cannot be as certain how the data is distributed if we don't know the shape of the distribution.

  2. For the second part, we again start by finding the value of \(k\text{.}\)

    \begin{equation*} k = \frac{102.5 - 87.8}{9.8} = 1.5 \end{equation*}

    So our range is 1.5 standard deviations on either side of the mean. Plugging this in we get

    \begin{equation*} P \geq 1-\frac{1}{(1.5)^2} = 1-\frac{1}{2.25} \approx 0.5556. \end{equation*}

    So at least 55.6% of the data is within 1.5 standard deviations of the mean.

    Note that this problem would not be possible with the empirical rule, even if we knew the distribution was mound shaped. 1.5 standard deviations is not one of the three options given by the empirical rule.
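
Both parts of this solution follow the same two steps: find \(k\) from the range, then evaluate the bound \(1 - \frac{1}{k^2}\text{.}\) A minimal Python sketch of those steps is shown here. The function name `chebyshev_min_proportion` is hypothetical, and the sketch assumes the range is centered on the mean.

```python
def chebyshev_min_proportion(mu, sigma, lower, upper):
    """Minimum proportion of data in [lower, upper], a range centered on mu."""
    k = (upper - mu) / sigma
    # The bound applies to a range extending equally far on both sides of the mean.
    if abs((mu - lower) / sigma - k) > 1e-9:
        raise ValueError("range must be symmetric about the mean")
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


part_1 = chebyshev_min_proportion(102.5, 9.8, 82.9, 122.1)
part_2 = chebyshev_min_proportion(102.5, 9.8, 87.8, 117.2)
print(round(part_1, 4), round(part_2, 4))
```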

Chebyshev's inequality relates several things together:

  • the mean of the data set,

  • the standard deviation of the data set,

  • a range extending a certain number of standard deviations above and below the mean, and

  • a minimum percent of the data.

If we have any three of these values, we can use Chebyshev's inequality to find the fourth. Consider the following examples.

Monthly sales for a particular retail business follow an unknown distribution with a mean of \(\mu = \$92,000\) and a standard deviation of \(\sigma = \$12,500\text{.}\) Use Chebyshev's inequality to determine between what minimum and maximum amount the monthly sales figure will lie at least 75% of the time.

Solution

In this problem we are given the value of \(P\text{,}\) the proportion that lies within \(k\) standard deviations of the mean. So we will use it to find \(k\text{.}\)

\begin{align*} P \geq 1 - \frac{1}{k^2} \amp \Rightarrow 0.75 \geq 1 - \frac{1}{k^2}\\ \amp \Rightarrow 0.75 - 1 \geq - \frac{1}{k^2}\\ \amp \Rightarrow 0.25 \leq \frac{1}{k^2}\\ \amp \Rightarrow 4 \geq k^2\\ \amp \Rightarrow k \leq 2 \end{align*}

So the range of monthly sales is at most two standard deviations on either side of the mean, or from a minimum of \(\$92,000-2(12,500) = \$67,000\) to a maximum of \(\$92,000+2(12,500) = \$117,000\text{.}\)
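
Working backwards from a required proportion to a range can be sketched in Python as follows. The function name `chebyshev_interval` is hypothetical. The sketch takes the boundary case \(1 - \frac{1}{k^2} = P\) and solves it for \(k\text{.}\)

```python
from math import sqrt


def chebyshev_interval(mu, sigma, min_proportion):
    """Range around mu guaranteed to hold at least min_proportion of the data."""
    k = 1 / sqrt(1 - min_proportion)  # solves 1 - 1/k**2 = min_proportion
    return (mu - k * sigma, mu + k * sigma)


low, high = chebyshev_interval(92_000, 12_500, 0.75)
print(low, high)
```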

The metabolism rate of a drug is the rate at which the drug is eliminated from a patient's system. It is often measured in half-life, or the time it takes for 50% of the drug to be eliminated from the system. Suppose that for a certain drug, the average half-life in a general patient is known to be \(\mu = 5.5\) hours. A recent study found that in at least 99% of patients, the half-life was somewhere between 3.8 and 7.2 hours. Use Chebyshev's inequality to determine the standard deviation of the drug half-life.

Solution

As in Example 1.4.25, we are given a range of values. In that example, we used it to find \(k\text{,}\) but this required that we know the standard deviation \(\sigma\text{.}\) But we are also given a percent of the population in the range, as in Example 1.4.26. So we can start by finding \(k\) with this information.

\begin{align*} P \geq 1 - \frac{1}{k^2} \amp \Rightarrow 0.99 \geq 1 - \frac{1}{k^2}\\ \amp \Rightarrow 0.01 \leq \frac{1}{k^2}\\ \amp \Rightarrow 100 \geq k^2\\ \amp \Rightarrow k \leq 10 \end{align*}

Thus, we know that the maximum range will be from 10 standard deviations below the mean to 10 standard deviations above the mean. Or, using the minimum and expressing this symbolically,

\begin{align*} 3.8 = \mu - k\sigma \amp \Rightarrow 3.8 = 5.5 - 10\sigma\\ \amp \Rightarrow 3.8 - 5.5 = -10\sigma\\ \amp \Rightarrow \sigma = \frac{3.8 - 5.5}{-10} \approx 0.17. \end{align*}

Thus, the standard deviation is approximately 0.17 hours, or 10.2 minutes.
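
The same two steps, finding \(k\) from the percent and then solving \(\mu - k\sigma = 3.8\) for \(\sigma\text{,}\) can be sketched in Python. The function name `chebyshev_sigma` is hypothetical, and the sketch takes the boundary case \(1 - \frac{1}{k^2} = P\text{.}\)

```python
from math import sqrt


def chebyshev_sigma(mu, lower, min_proportion):
    """Standard deviation for which lower sits k standard deviations below mu."""
    k = 1 / sqrt(1 - min_proportion)  # solves 1 - 1/k**2 = min_proportion
    return (mu - lower) / k


sigma = chebyshev_sigma(5.5, 3.8, 0.99)
print(round(sigma, 2), "hours, or", round(sigma * 60, 1), "minutes")
```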

Figure 1.4.28. Using Chebyshev's Inequality I
Figure 1.4.29. Using Chebyshev's Inequality II

The distribution of ages for college students is known to be highly skewed to the right. Suppose that at a certain university the average age is 21.7 years with a standard deviation of 0.56 years.

Question: At least what percentage of the student body is between 20.3 and 23.1 years old?

Answer

84%

Chebyshev's inequality can be used with which of the following distribution shapes?

  • Mound-Shaped

  • Skewed Left

  • Skewed Right

  • Uniform

Answer

All of these

Suppose that the weight of a certain melon has an unknown distribution, but the mean is known to be 4.2 pounds and the standard deviation is 1.28 pounds.

Question: According to Chebyshev's inequality, at least 75% of melons will weigh between 1.64 pounds and what maximum weight?

Answer

6.76 pounds