Distribution of Data: The Central Limit Theorem

Publication
Article
Pharmaceutical Technology, October 2019
Volume 43
Issue 10
Pages: 62–64

The central limit theorem concerning the distribution of means allows one to justify the assumption of the normal distribution.

Analytical people tend not to worry too much about the nature of the distribution of their data. Statisticians, on the other hand, do worry. Some 10 years ago, Rebecca Elliott, a senior statistician with Eli Lilly, gave an excellent presentation on some ways users drive statisticians crazy (1). This column is about just one method of preventing that from happening, as we do need statisticians occasionally.

The central limit theorem (CLT), or rather theorems, as there is more than one, is at the heart of much statistical methodology. The CLT is a lifesaver for analytical data, be it continuous or discrete. The good news is that the part that must be understood is very simple. The really good news is that, if the CLT did not exist, many familiar statistical methods would not be valid. The first lesson for analytical people to learn is that, from a statistical point of view, there is a world of difference between individual or replicate data from a sample and means of those data.

It is the mathematical model (shape) of the data population that determines the applicability (suitability) of the statistical methodology. A summary of some commonly met distributions is shown in Figure 1 (2).

It is visually apparent from these distributions that using methods based on the assumption of normality would be invalid in many instances. Fortunately, the CLT comes to our aid.

Figure 1: Some data distributions for continuous and discrete data (adapted from reference 2). (Figures courtesy of author)

Central limit theorem (CLT)

The simple part of the CLT is that, for any sample of N independent determinations, the distribution of the means of subsamples of size n tends towards a normal distribution, irrespective of the underlying population distribution. In addition, the overall or grand mean tends towards the population mean. The larger n is, the better the approximation. Unlikely though this seems, some simple calculations in Microsoft Excel can demonstrate this property. It is not well described in most books on statistics, although Basic Statistics and Pharmaceutical Statistical Applications does describe it well, albeit somewhat hidden away in its 800 pages (3). The following is one way to visualize the value of the CLT.

Let us consider a small data set of 15 integers that is far from normally distributed, namely 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, and 12, which are shown as a histogram in Figure 2.

Figure 2: Histogram of the rectangular distribution of the 15 numbers in our data set. (Figure courtesy of the author)

Let us decide that n is to be two; there are 105 possible combinations of two values from these 15 numbers, and hence 105 means. In Excel, this count is calculated using the combination formula =COMBIN(15,2). Sadly, there is currently no Excel function to list all 105 combinations, but fortunately there is a macro available on the Internet to do just that (4). Once the combinations are listed, the means can be calculated. The result is shown in Figure 3.

Figure 3: Histogram for the 105 means for n=2. (Figure courtesy of the author)
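For readers who prefer to check this outside Excel, the same 105 means can be generated in a few lines of Python. This is only a minimal sketch of the calculation described above, assuming matplotlib is available for the histogram; it is not part of the original Excel workflow.

    # Minimal sketch: enumerate all 105 pairs from the 15-value data set and
    # compute their means, mirroring =COMBIN(15,2) plus the listing macro (4).
    from itertools import combinations
    from statistics import mean
    import matplotlib.pyplot as plt  # assumed available for plotting

    # The rectangular data set shown in Figure 2
    data = [8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12]

    means_n2 = [mean(pair) for pair in combinations(data, 2)]
    print(len(means_n2))  # 105 means, matching =COMBIN(15,2)

    # Histogram corresponding to Figure 3
    plt.hist(means_n2, bins=9, edgecolor="black")
    plt.xlabel("Mean of n = 2 values")
    plt.ylabel("Frequency")
    plt.show()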

Looking good so far; so what about n=3? In Excel, the count is calculated from the formula =COMBIN(15,3), which gives 455 combinations. Modifying the macro works well and generates the listing of 455 combinations. We proceed as before by calculating the means of three values, resulting in the histogram shown in Figure 4.

Figure 4: Histogram for the 455 means for n=3. (Figure courtesy of the author)
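The sketch above extends directly to n=3. Reusing the data list and imports from that snippet (again an illustration, not the article's Excel procedure), changing the subset size reproduces the 455 means behind Figure 4:

    means_n3 = [mean(combo) for combo in combinations(data, 3)]
    print(len(means_n3))  # 455 means, matching =COMBIN(15,3)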

The approximation to the normal distribution becomes even more apparent. The approximation becomes good when n is in the region of 25 to 30, as was demonstrated in an earlier column using the t distribution (5). That particular calculation cannot be performed exhaustively in Excel: with N=50 data points and n=30, there would be 47,129,212,243,960 combinations, which, at one calculated mean per second, would take just under 1.5 million years, even if Excel allowed that many rows.
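As a sanity check on these figures, and as one possible way of illustrating the same convergence when exhaustive enumeration is impractical, one could fall back on random subsamples rather than all combinations. The sketch below is only a suggestion, not part of the original Excel workflow, and the 50-point data set it uses is hypothetical.

    import math
    import random
    from statistics import mean
    import matplotlib.pyplot as plt  # assumed available for plotting

    # The number of ways of choosing n=30 values from N=50, as quoted above
    print(math.comb(50, 30))                          # 47129212243960
    print(math.comb(50, 30) / (3600 * 24 * 365.25))   # roughly 1.49 million years at one mean per second

    # Rather than enumerating every combination, draw random subsamples of size 30
    # from a hypothetical 50-point data set and inspect the distribution of their means.
    random.seed(1)
    population = [random.randint(8, 12) for _ in range(50)]  # illustrative data only
    sample_means = [mean(random.sample(population, 30)) for _ in range(10_000)]

    plt.hist(sample_means, bins=20, edgecolor="black")
    plt.xlabel("Mean of n = 30 values")
    plt.ylabel("Frequency")
    plt.show()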

Conclusion

The central limit theorem concerning the distribution of means allows one to justify the assumption of the normal distribution, so that many of the statistical formulae that require normality can be applied to our data sets. As long as means and individual data are clearly differentiated, we will help our statistician colleagues remain sane.

References

1. R. Elliott, “Product Specifications: How to Drive a Statistician Crazy,” presentation at the Pharmaceutical Statistics 2008: Confronting Controversy Conference, Arlington, VA, 2008.
2. A. Damodaran, “Probabilistic Approaches: Scenario Analysis, Decision Trees and Simulations,” people.stern.nyu.edu/adamodar/pdfiles/papers/probabilistic.pdf, Figure 6A.15: Distributional Choices.
3. J.E. De Muth, Basic Statistics and Pharmaceutical Statistical Applications, 3rd ed. (CRC Press, 2014), p. 123.
4. A. Wyatt, “Listing Combinations,” tips.net.
5. C. Burgess, Pharmaceutical Technology 38 (6) (2014).

Citation

When referring to this article, please cite it as C. Burgess, "Distribution of Data: The Central Limit Theorem," Pharmaceutical Technology 43 (10) 2019.
