Distribution of Data: The Central Limit Theorem

Publication
Article
Pharmaceutical Technology, October 2019
Volume 43
Issue 10
Pages: 62–64

The central limit theorem concerning the distribution of means allows one to justify the assumption of the normal distribution.

Analytical people tend not to worry too much about the nature of the distribution of their data. Statisticians, on the other hand, do worry. Some 10 years ago, Rebecca Elliott, a senior statistician with Eli Lilly, gave an excellent presentation on some ways users drive statisticians crazy (1). This column is about just one method of preventing that from happening, as we do need statisticians occasionally.

The central limit theorem (CLT), or rather theorems, as there is more than one, is at the heart of much statistical methodology. The CLT is a lifesaver for analytical data, be it continuous or discrete. The good news is that the part that must be understood is very simple. The really good news is that, if the CLT did not exist, many familiar statistical methods would not be valid. The first lesson for analytical people to learn is that, from a statistical point of view, there is a world of difference between individual or replicate data from a sample and means of those data.

It is the mathematical model (shape) of the data population that determines the applicability (suitability) of the statistical methodology. A summary of some commonly met distributions is shown in Figure 1 (2).

It is visually apparent from these distributions that using methods based on the assumption of normality would be invalid in many instances. Fortunately, the CLT comes to our aid.

Figure 1: Some data distributions for continuous and discrete data (adapted from reference 2). (Figures courtesy of author)

Central limit theorem (CLT)

The simple part of the CLT is that, for any sample of N independent determinations, the distribution of the means of subsamples of size n tends towards a normal distribution, irrespective of the underlying population distribution. In addition, the overall or grand mean tends towards the population mean. The larger n is, the better the approximation. Unlikely though this seems, some simple calculations in Microsoft Excel can demonstrate this property. It is not well described in most books on statistics, although Basic Statistics and Pharmaceutical Statistical Applications does describe it well, albeit somewhat hidden away in its 800 pages (3). The following is one way to visualize the value of the CLT.

Let us consider a small data set of 15 integers that is far from normally distributed, namely 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, and 12, which are shown as a histogram in Figure 2.

Figure 2: Histogram of the rectangular distribution of the 15 numbers in our data set. (Figure courtesy of the author)

Let us decide that n is to be two; there are 105 possible combinations of two values from these 15 numbers, and hence 105 means. In Excel, this count is calculated using the combination formula =COMBIN(15,2). Sadly, there is currently no Excel function to list all 105 combinations, but fortunately there is a macro available on the Internet to do just that (4). Once the combinations are listed, the means can be calculated. The result is shown in Figure 3.

Figure 3: Histogram for the 105 means for n=2. (Figure courtesy of the author)
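For readers who prefer to check this outside Excel, the same 105 means can be generated in a few lines of Python. This is only a minimal sketch of the calculation described above, assuming matplotlib is available for the histogram; it is not part of the original Excel workflow.

    # Minimal sketch: enumerate all 105 pairs from the 15-value data set and
    # compute their means, mirroring =COMBIN(15,2) plus the listing macro (4).
    from itertools import combinations
    from statistics import mean
    import matplotlib.pyplot as plt  # assumed available for plotting

    # The rectangular data set shown in Figure 2
    data = [8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12]

    means_n2 = [mean(pair) for pair in combinations(data, 2)]
    print(len(means_n2))  # 105 means, matching =COMBIN(15,2)

    # Histogram corresponding to Figure 3
    plt.hist(means_n2, bins=9, edgecolor="black")
    plt.xlabel("Mean of n = 2 values")
    plt.ylabel("Frequency")
    plt.show()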

Looking good so far; so what about n=3? In Excel, the count is calculated from the formula =COMBIN(15,3), which gives 455 combinations. Modifying the macro works well and generates the listing of 455 combinations. We proceed as before by calculating the means of three values, resulting in the histogram shown in Figure 4.

Figure 4: Histogram for the 455 means for n=3. (Figure courtesy of the author)
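The sketch above extends directly to n=3. Reusing the data list and imports from that snippet (again an illustration, not the article's Excel procedure), changing the subset size reproduces the 455 means behind Figure 4:

    means_n3 = [mean(combo) for combo in combinations(data, 3)]
    print(len(means_n3))  # 455 means, matching =COMBIN(15,3)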

The approximation to the normal distribution becomes even more apparent. The approximation becomes good when n is in the region of 25 to 30, as was demonstrated in an earlier column using the t distribution (5). That particular calculation cannot be performed exhaustively in Excel: with N=50 data points and n=30, there would be 47,129,212,243,960 combinations, which, at one calculated mean per second, would take just under 1.5 million years, even if Excel allowed that many rows.
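As a sanity check on these figures, and as one possible way of illustrating the same convergence when exhaustive enumeration is impractical, one could fall back on random subsamples rather than all combinations. The sketch below is only a suggestion, not part of the original Excel workflow, and the 50-point data set it uses is hypothetical.

    import math
    import random
    from statistics import mean
    import matplotlib.pyplot as plt  # assumed available for plotting

    # The number of ways of choosing n=30 values from N=50, as quoted above
    print(math.comb(50, 30))                          # 47129212243960
    print(math.comb(50, 30) / (3600 * 24 * 365.25))   # roughly 1.49 million years at one mean per second

    # Rather than enumerating every combination, draw random subsamples of size 30
    # from a hypothetical 50-point data set and inspect the distribution of their means.
    random.seed(1)
    population = [random.randint(8, 12) for _ in range(50)]  # illustrative data only
    sample_means = [mean(random.sample(population, 30)) for _ in range(10_000)]

    plt.hist(sample_means, bins=20, edgecolor="black")
    plt.xlabel("Mean of n = 30 values")
    plt.ylabel("Frequency")
    plt.show()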

Conclusion

The central limit theorem concerning the distribution of means allows one to justify the assumption of the normal distribution, so that many of the statistical formulae that require normality can be applied to our data sets. As long as means and individual data are clearly differentiated, we will help our statistician colleagues remain sane.

References

1. R. Elliott, “Product Specifications: How to Drive a Statistician Crazy,” presentation at the Pharmaceutical Statistics 2008: Confronting Controversy Conference, Arlington, VA, 2008.
2. A. Damodaran, “Probabilistic Approaches: Scenario Analysis, Decision Trees and Simulations,” people.stern.nyu.edu/adamodar/pdfiles/papers/probabilistic.pdf, Figure 6A.15: Distributional Choices.
3. J.E. De Muth, Basic Statistics and Pharmaceutical Statistical Applications, 3rd ed. (CRC Press, 2014), p. 123.
4. A. Wyatt, “Listing Combinations,” tips.net.
5. C. Burgess, Pharmaceutical Technology 38 (6) (2014).

Citation

When referring to this article, please cite it as C. Burgess, "Distribution of Data: The Central Limit Theorem," Pharmaceutical Technology 43 (10) 2019.
