Many textbooks and courses teach an imprecise method for calculating percentiles. Their method may or may not fit the definition of percentile, but it results in a jagged graph, which implies that error has been introduced into the results.
Let’s use the following sample data as an illustration: 1, 456, 599, 782, 4568, 5312, 9185, 16458, 21854, and 45602.
Now we’ll calculate the 50th percentile (also called the median) using the universally taught method: There are ten data points, so we take the average of the 5th and 6th data points. (4,568 + 5,312) ÷ 2 = 4,940. There is nothing wrong with this calculation, but note that we took the midpoint between the two values.
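As a quick check of the arithmetic above, here is a small Python sketch. The variable names are mine, chosen only for illustration; statistics.median is the standard-library function, which also averages the two middle values when the count is even.

```python
from statistics import median

# Sample data from the article, already sorted in ascending order.
data = [1, 456, 599, 782, 4568, 5312, 9185, 16458, 21854, 45602]

# Average of the 5th and 6th values (Python lists are 0-based,
# so these are data[4] and data[5]).
middle_pair_average = (data[4] + data[5]) / 2

print(middle_pair_average)  # 4940.0
print(median(data))         # 4940.0
```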
This calculation works because we were calculating the 50th percentile. However, it doesn’t necessarily work for other percentiles.
Now let’s calculate the 30th percentile using the commonly taught methods: The 30th percentile is the point at which 30% of the data is below that point and 70% of the data is above that point. In our sample data, which has 10 values, the 30th percentile falls somewhere between the 3rd and 4th values (i.e., somewhere between 599 and 782).
There are two commonly taught methods. One says, like we did with the median, to take the average of the two values. It computes the 30th percentile as (599 + 782) ÷ 2 = 690.5. The second commonly taught method is to take one of the two closest values. Which one to take is debatable, so I’ll choose 782. Others may have chosen 599.
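The three candidate answers described above can be reproduced in a few lines of Python. This is only a sketch of the commonly taught approaches as I have described them, with illustrative variable names.

```python
# Sample data from the article, already sorted in ascending order.
data = [1, 456, 599, 782, 4568, 5312, 9185, 16458, 21854, 45602]

# The 30th percentile falls between the 3rd and 4th values
# (0-based positions 2 and 3).
third, fourth = data[2], data[3]

# Method 1: average the two neighbouring values.
midpoint = (third + fourth) / 2

# Method 2: pick one of the two neighbours (which one is debatable).
candidates = [third, midpoint, fourth]

print(candidates)  # [599, 690.5, 782]
```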
The commonly taught methods result in three different possible answers. However, only one of those three answers, the one from the midpoint method, is correct. The definition of the 30th percentile is the point at which 30% of the data is smaller and 70% of the data is larger. We see that 690.5, computed in the previous paragraph, is correct: 3 of the values (30% of 10) are smaller than 690.5 and 7 of the values (70% of 10) are larger than 690.5.
The other two methods produce erroneous results. Consider 782. Three values are smaller and six values are bigger, so 3 of the other 9 values lie below it. That is the 33⅓ percentile, not the 30th percentile. Similarly, consider 599. Two values are smaller and seven values are bigger, so 2 of the other 9 values lie below it. That is the 22 2/9 percentile, not the 30th percentile.
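The counting argument above is easy to check by machine. The helper below, percent_below, is a hypothetical name of my own; it counts the values strictly smaller and strictly larger than a candidate and reports the share that is smaller, ignoring any value equal to the candidate itself.

```python
# Sample data from the article.
data = [1, 456, 599, 782, 4568, 5312, 9185, 16458, 21854, 45602]

def percent_below(values, candidate):
    """Return (smaller, larger, percent smaller) for a candidate value.

    Values equal to the candidate are left out of both counts,
    which matches the counting used in the article.
    """
    smaller = sum(1 for x in values if x < candidate)
    larger = sum(1 for x in values if x > candidate)
    return smaller, larger, 100 * smaller / (smaller + larger)

for candidate in (599, 690.5, 782):
    smaller, larger, percent = percent_below(data, candidate)
    print(candidate, smaller, larger, round(percent, 2))
```

Only 690.5 comes out at 30; the other two land near 22.22 and 33.33.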
Even though the midpoint method produces a technically correct result, graphing the percentiles results in a jagged line rather than a smooth line.
This jagged line implies that we are introducing error into our result. Read on to learn how to compute percentiles correctly.
The Correct Way to Calculate Percentiles
If we refer to the individual values as xi, we call i the index. The first (i.e., smallest) value in the list is x1. Its index number is 1. Similarly, the last value in the list is x10. Its index number is 10.
To calculate a percentile, it is easy to see which i to use if we have the data sitting right in front of us and the data set is small enough. However, we sometimes need to calculate i. It is calculated as i = p × (n-1) ÷ 100 + 1 where p is the percentile we are calculating and n is the number of values in the list. If the fractional part of i is zero, then our calculation tells us which xi is our answer. We are done.
However, if our calculation returns a non-zero fractional part, that tells us our answer lies between two values. For example, if we are calculating the 30th percentile for a list of 10 values, i = 30 × (10-1) ÷ 100 + 1 = 270 ÷ 100 + 1 = 2.7 + 1 = 3.7. This tells us our answer lies somewhere between x3 and x4.
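Here is the index formula as a short Python sketch. The function name fractional_index is my own invention for illustration. Because floating-point arithmetic can leave tiny rounding errors, the printed values are rounded for display.

```python
import math

def fractional_index(p, n):
    """1-based index i = p * (n - 1) / 100 + 1 from the article."""
    return p * (n - 1) / 100 + 1

i = fractional_index(30, 10)

# Split i into its whole part (which x to start from) and its
# fractional part (how far to travel toward the next x).
whole = math.floor(i)
fraction = i - whole

print(round(i, 6), whole, round(fraction, 6))  # 3.7 3 0.7
```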
Knowing that our answer lies between two x’s does not mean we can use the midpoint between those two x’s. That would give us the jagged line in the graph above. Out of the infinite number of numbers between the two x’s, which one gives us a nice smooth line? The answer is in the fractional portion of our calculated i. If our calculation tells us that i is 3.7, the “.7” tells us our answer lies 70% of the way from x3 to x4.
To finish things off, let’s calculate the 30th percentile of our example data: We calculated i as 3.7 above, which tells us our answer lies 70% of the way from 599 to 782. The difference between these two values is 782 – 599 = 183. 70% of 183 is 128.1. So the 30th percentile is 599 + 128.1 = 727.1, which is quite a bit different from the results of the commonly taught calculations.
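Putting the whole procedure together, here is a minimal Python sketch. The function name percentile and its error handling are my own assumptions, not part of any particular spreadsheet or library. As a cross-check, Python's standard statistics.quantiles with method='inclusive' (Python 3.8 and later) interpolates the same way, so its third decile should agree.

```python
import math
from statistics import quantiles

def percentile(values, p):
    """Percentile by linear interpolation, using i = p*(n-1)/100 + 1."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    i = p * (n - 1) / 100 + 1      # 1-based fractional index
    whole = math.floor(i)
    fraction = i - whole
    if whole >= n:                 # p == 100: there is no next value
        return float(ordered[-1])
    x_low = ordered[whole - 1]     # convert 1-based index to 0-based
    x_high = ordered[whole]
    return x_low + fraction * (x_high - x_low)

data = [1, 456, 599, 782, 4568, 5312, 9185, 16458, 21854, 45602]

p30 = percentile(data, 30)
print(round(p30, 6))  # 727.1

# Cross-check: the 3rd of the 9 decile cut points is the 30th percentile.
deciles = quantiles(data, n=10, method="inclusive")
print(round(deciles[2], 6))  # 727.1
```

The same function returns 4,940 for the 50th percentile, so it agrees with the median calculated at the start.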
Here is a graph of the percentiles calculated correctly (in green). The red line is a copy of the graph up above.
Nice smooth green line. Now I’m happy.
FYI, the formula taught here is the one used by the OpenOffice spreadsheet.
© Copyright 2016 by Warren Gaebel, BA, BCS. All rights reserved.