Precedents set in the historic Barr case continue to raise questions over suitable sample-size criteria.
Lynn D. Torbeck
The sample size needed to investigate an out-of-specification (OOS) result continues to be a point of contention within the pharmaceutical industry. The following quotes from the US vs. Barr Laboratories case, which covered OOS issues, provide good background to the discussion (1):
"... the number of retests performed before a firm concludes that an unexplained out of specification result is invalid or that a product is unacceptable is a matter of scientific judgment."
"Nevertheless, retesting cannot continue ad infinitum. Because such a practice is not scientifically valid ..."
"Such a conclusion cannot be based on 3 of 4 or 5 of 6 passing results, but possibly 7 of 8."
" ... retesting determinations will vary on a case by case basis, a necessary corollary of which is that an inflexible retesting rule, designed to be applied in every circumstance, is inappropriate."
The specific question here is "How big should the sample be?", which is the question most commonly asked of a professional statistician. There is no simple answer, because it depends on the information available. There are at least four possible approaches:
Scientific estimate
One approach to determining the sample size is to ask a trained analyst with industry experience to use his or her best scientific judgment to decide what an adequate sample size would be. Prior to the Barr case this was (and still is in some places) the standard operating procedure. The approach has a scientific basis in that the best person available uses all available information to estimate a size. It is not reproducible, however: different people would not arrive at the same number, and that is not scientific. Nevertheless, the Barr case quotes would seem to support this approach.
Seven out of eight
As in the Barr case, the seven out of eight rule is used by some companies in the absence of any other available information. However, so far there is little statistical justification for seven versus any other value. Some companies choose nine only so they can say they have improved upon the Barr criteria.
Statistical formula with historical estimates
As I tell my clients, the statistical answer to the sample-size question is: "We first need a prior estimate of the inherent variability, the variance under exactly the same conditions to be used, an estimate of the alpha risk level, the beta risk level, and the size of the difference to be detected."
The formula for the sample size for a difference from a mean is:

$$n = \frac{(t_\alpha + t_\beta)^2 \, S^2}{d^2} \qquad (1)$$

where n is the sample size, tα and tβ are the one-sided t-distribution values for the selected α and β risk levels, S² is the variance of the total product, process, or method, and d is the difference to be detected.
Four values are needed to calculate the sample size. The alpha and beta errors are standardized for most scientific and industrial applications to α = 0.05, β = 0.05 or 0.10. Thus, the t values are taken from the standard t table for α and β and a given number of degrees of freedom of the data used to estimate the variance. The other two values are more difficult to obtain.
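As a rough illustration of how the four values combine, the following sketch assumes the usual form of the calculation, n = (tα + tβ)² S² / d², rounded up to a whole number. The function name `sample_size`, the standard deviation of 1.5, and the difference of 2.0 are hypothetical and chosen only for illustration. The t values are the standard one-sided table entries for 29 degrees of freedom, which corresponds to a variance estimated from 30 historical results.

```python
import math


def sample_size(t_alpha, t_beta, s, d):
    """Return n = (t_alpha + t_beta)^2 * s^2 / d^2, rounded up.

    t_alpha, t_beta: one-sided t-table values for the chosen risk levels
    s: prior estimate of the standard deviation (same units as d)
    d: size of the difference to be detected (must be positive)
    """
    if d <= 0:
        raise ValueError("the difference to be detected must be positive")
    n = (t_alpha + t_beta) ** 2 * s ** 2 / d ** 2
    return math.ceil(n)


# One-sided t-table values for 29 degrees of freedom.
T_ALPHA_05 = 1.699  # alpha = 0.05
T_BETA_10 = 1.311   # beta = 0.10

# Hypothetical inputs: standard deviation of 1.5 and a difference of 2.0.
n = sample_size(T_ALPHA_05, T_BETA_10, s=1.5, d=2.0)
print(n)
```

Passing the t values in as arguments keeps the sketch free of any statistics library; in practice they would be read from a t table or computed by statistical software for the degrees of freedom actually available.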
According to FDA, "The number may vary depending upon the variability of the particular test ..." (2). This prior estimate of the variance of the method for a given product may be difficult to obtain. If the product, process, or method has been changed, only data collected since the last change are representative and should be used. Also, some products are made only a few times a year. There may be only three or four batches and thus three or four values, which is not enough to get a good estimate of the variance. If sufficient data do exist in historical records, then perhaps the estimate can be made. A sample size of 30 or more is preferred to obtain a reasonable estimate of the standard deviation.
The size of the difference to be detected is difficult to determine in advance because one does not know in advance how far out of specification any future OOS result may be.
If the specification is 95% and the OOS is 89%, then the difference to be detected is 6%. But if the OOS is 94.4%, the difference to be detected is 0.6%. These would give very different sample sizes.
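To make the contrast concrete, the sketch below runs both cases through the usual form of the calculation, n = (tα + tβ)² S² / d². The standard deviation of 2.0% is an assumption made purely for illustration, as is the function name `sample_size`. The t values are the one-sided table entries for 29 degrees of freedom at α = 0.05 and β = 0.10.

```python
import math

T_ALPHA = 1.699  # one-sided t, alpha = 0.05, 29 degrees of freedom
T_BETA = 1.311   # one-sided t, beta = 0.10, 29 degrees of freedom
S = 2.0          # assumed standard deviation in percent (hypothetical)
SPEC = 95.0      # specification limit in percent


def sample_size(d, s=S, t_alpha=T_ALPHA, t_beta=T_BETA):
    """Return n = (t_alpha + t_beta)^2 * s^2 / d^2, rounded up."""
    return math.ceil((t_alpha + t_beta) ** 2 * s ** 2 / d ** 2)


# The two OOS results discussed in the text.
results = {}
for oos in (89.0, 94.4):
    d = SPEC - oos  # difference to be detected
    results[oos] = sample_size(d)
    print(f"OOS = {oos}%  d = {d:.1f}%  n = {results[oos]}")
```

Under these assumed inputs the far-out result needs only a couple of observations, while the result just below the limit needs on the order of a hundred, because n grows with the inverse square of d.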
Thus, there seems to be an inherent and unintended conflict within the industry on sample size. One is not allowed to adjust the number of retests depending on the results obtained, but that is the very information we need to statistically and scientifically determine the sample size.
To determine the sample size in advance without knowing how far out of specification the OOS result will be, one would need to decide on a difference to detect in advance. But how to select this difference? Should it be the best guess of the analysts? How does one justify that guess? Should it be the bias in the method from the validation, if it exists? If the bias is large, the sample size would be small. If the bias is very small, the sample size will be large, as can be seen from the equation. This seems to be the opposite of what industry wants to achieve.
Statistical formula with sample
Equation 1 can also be used if a first sample (e.g., of seven results) is available to estimate the variance. With this variance estimate and the difference between the specification and the OOS result, the required sample size can be recalculated. If the recalculated size is greater than seven, additional samples would be taken to make up the shortfall.
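A minimal sketch of this two-stage idea follows, again assuming the usual form n = (tα + tβ)² S² / d². The seven retest values, the 95% specification, and the 94.4% OOS result are invented for illustration, and the name `FIRST_STAGE` is hypothetical. The t values are the one-sided table entries for 6 degrees of freedom (seven results minus one) at α = 0.05 and β = 0.10.

```python
import math
import statistics

# Hypothetical first-stage retest results, in percent.
FIRST_STAGE = [95.8, 96.4, 95.1, 96.9, 95.5, 96.2, 95.9]

T_ALPHA = 1.943  # one-sided t, alpha = 0.05, 6 degrees of freedom
T_BETA = 1.440   # one-sided t, beta = 0.10, 6 degrees of freedom

SPEC = 95.0      # specification limit in percent (assumed)
OOS = 94.4       # original OOS result in percent (assumed)

# Sample variance (n - 1 denominator) from the first-stage results.
variance = statistics.variance(FIRST_STAGE)

# Difference to be detected: specification minus the OOS result.
d = SPEC - OOS

# Recalculate the required sample size with the new variance estimate.
n_required = math.ceil((T_ALPHA + T_BETA) ** 2 * variance / d ** 2)

# Additional samples are needed only if the requirement exceeds stage one.
additional = max(0, n_required - len(FIRST_STAGE))

print(f"variance = {variance:.3f}, n required = {n_required}, "
      f"additional = {additional}")
```

The sketch reuses the 6-degree-of-freedom t values for the recalculation. A more careful treatment would account for the small first-stage sample when choosing the t values, so the result is best read as an approximation.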
Equation 1 assumes a continuous response that is normally distributed. Some data, such as for a limulus amebocyte lysate test, may be skewed, and colony counts are both discrete and skewed, so a different model and formula must be used to get the estimate. There are books and computer programs dedicated to determining the sample size in different situations.
Further, from a laboratory management point of view, should a different number of OOS retests be pursued for each method? Do the statistical and scientific advantages of different sample sizes outweigh the need for consistency for the analysts to prevent confusion and mistakes? Are we out of compliance if the analyst does eight retests when the method calls for seven?
Conclusion
To conclude, there seems to be an inherent conflict in the industry's position on sample size. Given this discussion, the seven out of eight criterion given in the Barr case may be as good as any.
Lynn D. Torbeck is a statistician at PharmStat Consulting, 2000 Dempster, Evanston, IL 60202, tel. 847.424.1314, LDTorbeck@PharmStat.com, www.PharmStat.com.
References
1. United States vs. Barr Laboratories, Inc. Civil Action No. 92-1744, US District Court for the District of New Jersey: 812 F. Supp. 458. 1993 US Dist. Lexis 1932; 4 Feb. 1993, as amended 30 Mar. 1993.
2. FDA, Investigating Out-of-Specification (OOS) Test Results for Pharmaceutical Production (Rockville, MD, Oct. 2006).
The author would like to extend an open-ended invitation to those interested in this issue to send their comments and solutions to LDTorbeck@PharmStat.com. Given adequate response, the information will be shared in a future column.