The benefits of zero tolerance as a test criterion have been oversold. A critical examination of zero tolerance reveals that many of the supposed benefits are not attainable. More important, inappropriate application of this criterion can have a deleterious effect on the assessment, control, and improvement of the quality of pharmaceutical products.
A zero-tolerance criterion is a requirement in the testing of pharmaceutical products that specifies that none of the test results from a small sample of a batch can be outside certain fixed limits. Zero tolerance is related to the concept of zero defects in quality control inspection and also to the concept of zero-accept-number attribute sampling plans (1). Such a criterion is contained in proposals under consideration for testing delivered-dose uniformity of orally inhaled and nasal drug products. Specifically, the zero-tolerance criterion of strongest current concern is contained in the US Food and Drug Administration's 2002 chemistry, manufacturing, and controls (CMC) guidance for nasal spray, inhalation solution, and suspension drug products and in the 1998 draft guidance for CMC for metered dose inhalers and dry powder inhalers (2, 3). For orally inhaled and nasal-spray drug products (OINDPs), the test for delivered-dose uniformity is a two-stage, multicriteria procedure with one of the criteria being that none of the units tested may have a measured amount of active ingredient falling outside 75–125% of label claim (i.e., there is "zero tolerance" for any result falling outside 75–125% of label claim).
A working group comprising members from FDA's Office of Pharmaceutical Science and the International Pharmaceutical Aerosol Consortium on Regulation and Science (IPAC-RS) is engaged in a cooperative effort to produce a mutually acceptable quality standard for delivered-dose uniformity of OINDPs. The current proposed procedure from IPAC-RS under consideration does not include the zero-tolerance criterion (4). Rather, controls are based on a linear combination of the mean and standard deviation using a tolerance-interval approach. This article presents a rationale that supports the elimination of the zero-tolerance criterion associated with the delivered-dose uniformity acceptance criteria for OINDPs from the issued and draft guidances.
Pharmacopeial tests versus quality control batch-release procedures
The zero-tolerance criterion contained in FDA's draft guidance is a variant of the test for uniformity of dosage units described in the United States Pharmacopeia (USP) (5). As a general rule, it is inadvisable to adopt a pharmacopeial test as a sampling and acceptance plan for batch release and stability testing because the operating characteristics of the test will almost certainly fail to provide the desired levels of quality assurance. To understand why pharmacopeial procedures are not necessarily suitable for routine batch release, it may be helpful to briefly review the differences between pharmacopeial testing and batch-release testing.
The methods and criteria in the USP tests are specific and prescriptive regarding the number of units to test, the number of stages to use, and the criteria to judge acceptability. This degree of specificity is entirely appropriate because the USP tests strictly apply to specimens of the drug product, not the batch. Although some USP tests appear to be sampling and acceptance plans, it is a misunderstanding to regard them as such (5). The USP carefully reminds readers that test results apply only to the specimen tested and that although some tests involve more than one dosage unit, the result obtained is a "singlet" determination. Thus, in USP testing, the sample size is actually always one, even though several units may be used to generate the result.
The USP cautions that extrapolation of the results to a larger population is neither specified nor proscribed, but it also states "any specimen tested ... shall comply." Thus a prudent manufacturer's quality control procedures, including any final sampling and acceptance plans, must be designed to provide a reasonable degree of assurance that any specimen from the batch in question, if tested according to the USP procedure, would comply. To provide the desired level of assurance, a manufacturer's procedures may need to use a different number of units and/or a different set of criteria than actually stated in the USP test. Torbeck provides an excellent discussion of the philosophical difference between USP tests and batch-release tests (6).
For testing delivered-dose uniformity of OINDPs, the number of units tested and the criteria of the USP test, particularly the zero-tolerance criterion, are too restrictive to be useful as a batch-release procedure.
Psychological barriers to eliminating zero tolerance
Support for the elimination of zero-tolerance acceptance criteria is not universal because of the erroneous belief that such a criterion provides assurance that a batch contains no nonconforming units. Zero tolerance has been considered a safety net, which unfortunately conveys the impression that nonconforming units are somehow eliminated (7). This safety net, however, is primarily an illusion because no test that relies on sampling a portion of the batch can guarantee the elimination of all nonconforming units in the remainder of the batch. Hence, we are left with the inescapable realization that not all nonconforming units can be found by any sampling and testing short of 100% screening, which itself still has some percentage of error and which, in any case, is impracticable for destructive tests such as the delivered-dose uniformity test. Imposing a zero-tolerance criterion cannot change this fact.
It has been said that as a safety net, zero tolerance is intended to reduce the likelihood that a unit in a batch will deviate substantially from label claim (7). In other words, zero tolerance is intended to reduce the likelihood that nonconforming units will exist in a batch. The laws of probability, however, are simply against achieving much success in such an endeavor using a zero-tolerance criterion as a tool. In a vast majority of cases, depending on the choice of zero-tolerance limits, the incidence of nonconforming units will be extremely low. Intuition and common sense tell us that most of the samples from such product will fail to contain those rare nonconforming units, so that the chances of finding and removing the nonconforming units, if they exist, are indeed very slim. Adding a zero-tolerance criterion does not materially improve the likelihood of detecting and removing a nonconforming unit. It merely provides a mathematical definition—one out of many possible definitions—of a nonconforming unit.
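The probability argument can be made concrete with a short calculation. The following is a minimal sketch, assuming units are drawn independently from a large batch so that a binomial model applies; the sample size of 10 and the incidence rates are hypothetical values chosen for illustration and are not taken from the guidances.

```python
def prob_sample_contains_nonconforming(p, n):
    """Probability that a random sample of n units contains at least one
    nonconforming unit when a fraction p of the batch is nonconforming.

    Assumes independent draws from a large batch (binomial model).
    """
    return 1.0 - (1.0 - p) ** n


# Hypothetical incidence rates and sample size, for illustration only.
for p in (0.0001, 0.001, 0.01):
    chance = prob_sample_contains_nonconforming(p, n=10)
    print(f"incidence {p:.2%}: chance a 10-unit sample contains one = {chance:.2%}")
```

Under these assumptions, when nonconforming units are rare, nearly every sample contains none of them, whether or not a zero-tolerance criterion is applied to that sample.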
Another psychological issue with a zero-tolerance criterion stems from the failure to clearly distinguish between the sample and the batch. Naturally, we recognize that what we should infer about the batch comes from what we observe in the sample. Nonetheless, there is the tendency to believe that "none in the sample" means that there are "none in the batch." In turn, this can lead to the false sense of security in believing that batches passing the zero-tolerance criterion are free from nonconforming units. Sometimes, the unfortunate result of this false sense of security is that truly effective measures such as defect prevention are not considered because they may be thought to be unnecessary.
Another particularly bothersome psychological barrier is the tendency to regard zero tolerance as the ultimate in due diligence. It may be that such a misconception stems from the idea that zero defects should be the norm in the pharmaceutical industry. In fact, zero defects actually should be the goal, and all reasonable measures should be instituted to ensure that drug products are maintained at the highest quality possible. The problem is that zero tolerance in a sampling and inspection plan is arguably one of the least effective means of accomplishing this goal. Thus, instead of constituting the ultimate in due diligence, zero tolerance may actually be just the opposite, especially when it is substituted for more effective measures directed toward prevention. In quality control, it is generally accepted that the purpose of final product sampling and acceptance is to verify that the control measures designed to ensure quality have been effective and not to screen out less-than-desirable product. Yet, this principle often seems to get lost or forgotten in discussions about zero tolerance.
Finally, some may feel that even if the zero-tolerance criterion is inefficient, it is useful as a type of "threat measure" or as a "punishing rod" that forces manufacturers to improve quality. To those who subscribe to such philosophy, it does not matter that the "rod" is inefficient and crude. According to them, it does its job because it threatens producers and stimulates them to improve.
Such an approach, however, overlooks some fundamental issues. First, the pass-or-fail decision of a zero tolerance–type test is often only weakly connected to the actual quality of a batch, which makes it more of a gambling tool than a quality control test or a stimulant for improvement.
Second, if the "rod" approach is pursued, those affected will do just enough to avoid the punishment and no more. A more efficient strategy from a public health as well as from an economic perspective is not a rod but rather a reward approach that provides incentives for thorough quality assessment. This approach also rewards products of superior performance, for example, with reduced testing, as designed in the IPAC-RS approach.
Finally, the primary purpose of pharmacopeial testing is not to stimulate improvements in quality. The purpose of such testing is to ensure that a product meets predefined standards and, thus, is fit for use. Improvements will certainly happen, but by means of completely different agents and factors, not as a result of incongruous testing requirements.
Technical reasons for eliminating the zero-tolerance criterion
The zero-tolerance criterion is similar to a zero-accept criterion in attribute sampling and acceptance plans. In this context, several features of zero-accept and zero-tolerance criteria either make them inapplicable to delivered-dose uniformity testing or constitute major drawbacks that make them a poor choice for this application.
Problems with converting continuous measurements to counts or classifications
Zero-accept criteria apply to sampling for attributes (e.g., yes–no, black–white, good–bad), in which the characteristic can be clearly counted or classified without error. When the results of testing are continuous data and subject to measurement variation, classification becomes dubious and highly problematic.
For example, consider the futility of classifying beads as black or white in a large container in which the color of the beads ranges continuously from dark gray to light gray. This example illustrates at least three reasons why counting or classifying should not be applied to continuous data such as delivered-dose uniformity testing. First, where does one place the limit that separates white from black or good from bad? And, even with a limit established, such as "75% of label claim," is there really any difference between a unit measuring 74% of label and one measuring 76% of label? Certainly, any difference that may exist does not warrant classifying one unit as bad and the other as good.
Second, the risk of misclassification is high when the measurement process cannot reliably distinguish between results that lie close to the limit. When the consequences of misclassification are nontrivial, such as in batch sampling and acceptance, it is especially egregious to classify continuous measurements into bins such as "good" and "bad." A producer faces a significant economic loss when batches containing only "good" units must be discarded because an assay result for a unit happens to fall just below the limit solely as a result of measurement variation. Conversely, it would be a gross disservice to patients if a batch containing a significant proportion of "bad" units was accepted just because the assay results for the units tested happened to fall above the limit. In either case, when significant measurement variation exists, the costs of misclassification are of sufficient magnitude to offset any small gain in simplicity that might be afforded by reducing continuous data to counts or classifications.
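The size of the misclassification risk near a limit can be sketched numerically. This is an illustration only, assuming that measurement error is normally distributed and unbiased; the measurement standard deviation of 3% of label claim is a hypothetical value, not a figure from the guidances or from any particular assay.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_measured_below(true_value, limit, meas_sd):
    """Probability that a unit with the given true content yields a measured
    result below the limit, assuming unbiased normal measurement error."""
    return normal_cdf((limit - true_value) / meas_sd)


LIMIT = 75.0    # lower zero-tolerance limit, % of label claim
MEAS_SD = 3.0   # hypothetical measurement standard deviation, % of label claim

# A "good" unit (true content 76%) that is measured as "bad"
good_called_bad = prob_measured_below(76.0, LIMIT, MEAS_SD)
# A "bad" unit (true content 74%) that is measured as "good"
bad_called_good = 1.0 - prob_measured_below(74.0, LIMIT, MEAS_SD)

print(f"true 76% reported below 75%: {good_called_bad:.1%}")
print(f"true 74% reported at or above 75%: {bad_called_good:.1%}")
```

Under these assumed values, units on either side of the limit are misclassified a substantial fraction of the time, which is the situation the paragraph above describes.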
Finally, when converting continuous data to attributes such as counts or classifications, much, if not most, of the useful information is discarded. Discussing potential criteria for batch quality, Flann states that the obvious problem with classifications using ranges (e.g., 75–125% of label claim) is that they imply all units within the range are pharmacologically equally satisfactory and that those outside the range are equally unsatisfactory (8). Flann further argues that because pharmacological quality, as measured by blood levels, is likely to be a continuous function of drug content, using an acceptance range as an intrinsic measure of unit quality and the percentage beyond as a measure of batch quality should be eliminated. From simply an intuitive standpoint, shouldn't we require that the selected control procedure be less likely to accept product that has units at 60% of label than product that has units at 80% of label? Unfortunately, attribute testing in the form of classification does not distinguish between such scenarios.
A zero-accept plan may not be the best choice
Even in cases in which attribute testing is applicable, a zero-accept criterion is only one of several available options when designing a workable acceptance procedure, and it is often one of the least desirable. Compare, for example, the operating characteristics of the three plans depicted in Figure 1.
Figure 1: Comparison of three alternative plans with different accept numbers. Each plan consists of obtaining a random sample of n units and counting the number of nonconforming units in that sample. If there are no more than c nonconforming units, then the batch is accepted.
In this example, one may suppose the plans are designed to achieve a 5% risk of accepting batches that contain 5% nonconforming units. Each plan consists of obtaining a random sample of n units and counting the number of nonconforming units in that sample. If there are no more than c nonconforming units, then the batch is accepted. Suppose for this example that the nonconformance is of a type such that a level of 0.5% is considered acceptable. Comparing the operating characteristics of these plans, the zero-accept plan (c = 0) has a large risk of rejecting "good" product (i.e., when the percent nonconforming is <0.5%). In such a case, the zero-accept plan with a sample size of 59 would be discarded in favor of a different plan with a larger sample size n and a larger accept number c. Thus, sampling plans with a zero-accept criterion often are not the best choice among available alternatives.
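The comparison can be reproduced approximately with a binomial calculation. The sketch below assumes a large batch (binomial model) and finds, for each accept number c, the smallest sample size n that holds the risk of accepting a batch with 5% nonconforming units to 5% or less. The accept numbers 1 and 2 are assumptions made for illustration; the text identifies only the zero-accept plan with n = 59.

```python
from math import comb


def prob_accept(n, c, p):
    """Probability of accepting a batch: at most c nonconforming units are
    found in a random sample of n units when a fraction p is nonconforming."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def smallest_n(c, rql=0.05, consumer_risk=0.05):
    """Smallest sample size for accept number c such that a batch at the
    rejectable quality level is accepted with probability <= consumer_risk."""
    n = c + 1
    while prob_accept(n, c, rql) > consumer_risk:
        n += 1
    return n


ACCEPTABLE_LEVEL = 0.005  # 0.5% nonconforming, considered acceptable in the example

for c in (0, 1, 2):  # accept numbers 1 and 2 are illustrative assumptions
    n = smallest_n(c)
    reject_good = 1.0 - prob_accept(n, c, ACCEPTABLE_LEVEL)
    print(f"c = {c}, n = {n}: risk of rejecting 0.5% product = {reject_good:.1%}")
```

With the consumer's risk held fixed, the plans with larger accept numbers require more units but reject acceptable product less often, which is why the zero-accept plan would be discarded in this example.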
A zero-accept plan may be counterproductive
A zero-accept criterion removes flexibility when designing a sampling and acceptance plan. When imposed without regard to the requirements of the situation or the realities of the manufacturing technology, maintaining a zero-accept criterion can lead to minimalist strategies. Consider the operating characteristics of the three plans depicted in Figure 2, all having a zero-accept criterion but different sample sizes.
Figure 2: Comparison of three zero-accept plans having different sample sizes. Note that all of the operating characteristic curves are concave upward, and increasing the sample size increases the risk of rejection, regardless of the quality of the batch.
One of the most obvious features of the operating characteristic curves is that they are all concave upward, which can be mathematically proven to be true for all zero-accept plans. This fact is of much greater significance than pure academic interest because increasing the sample size increases the risk of rejection, regardless of the quality level. This anomaly puts the producer in the untenable situation in which the probability of rejection depends not on the quality of the batch as it should, but on whether a large or small sample is taken. Thus, this feature of zero-accept plans eliminates the benefit normally achieved by taking large sample sizes, which is to simultaneously decrease both the risk of rejecting acceptable product (i.e., producer's risk) and the risk of accepting rejectable product (i.e., consumer's risk). In statistical terms, the efficiency of the test should typically increase with sample size. Zero-accept plans lack this desirable property.
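The trade-off can be illustrated by computing both risks for zero-accept plans of different sizes. This is a minimal sketch under a binomial model; the sample sizes 10, 30, and 59 and the quality levels 0.5% and 5% are assumptions carried over from the earlier example and are not the values plotted in Figure 2.

```python
def zero_accept_risks(n, acceptable_level, rejectable_level):
    """Return (producer's risk, consumer's risk) for a zero-accept plan of
    sample size n. The batch is accepted only if no nonconforming unit is
    found, so the acceptance probability at level p is (1 - p) ** n."""
    producer_risk = 1.0 - (1.0 - acceptable_level) ** n
    consumer_risk = (1.0 - rejectable_level) ** n
    return producer_risk, consumer_risk


# Hypothetical sample sizes and quality levels, for illustration only.
for n in (10, 30, 59):
    producer, consumer = zero_accept_risks(n, 0.005, 0.05)
    print(f"n = {n:2d}: producer's risk = {producer:.1%}, consumer's risk = {consumer:.1%}")
```

In this sketch, enlarging the sample lowers the consumer's risk only by raising the producer's risk; the two cannot be reduced together while the accept number stays at zero.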
Zero-accept plans lack flexibility and can force minimalist strategies
With zero-accept plans, it is almost impossible to achieve both a predetermined acceptable quality level (AQL) and a prespecified rejectable quality level (RQL). One must choose one and accept whatever the other happens to be. In quality control, the consumer risk (i.e., RQL) is often specified. In such cases, an imposed zero-accept criterion forces a producer to choose the smallest sample size that meets a specified RQL. Testing a larger number of units to better characterize the batch must be avoided, because doing so only increases the risk that the batch will fail the criterion by pure chance, even when the quality level of the batch is more than acceptable. On the other hand, even with the chosen minimal sample size, the risk of rejecting acceptable product still may be unacceptably high. Thus, the imposition of a zero-accept criterion may prevent a manufacturer from achieving a realistic level of risk. In this situation, whether batches are accepted or rejected becomes little more than a matter of chance, and a prudent manufacturer should and will perform such a testing procedure as few times as possible.
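The loss of flexibility follows from the fact that a zero-accept plan has only one free parameter, the sample size. The sketch below, assuming a binomial model, fixes n from a specified RQL and consumer's risk and then reports the quality level that the producer would need in order to be accepted 95% of the time. The 5% RQL, 5% consumer's risk, and 95% acceptance target are illustrative assumptions.

```python
import math


def zero_accept_sample_size(rql, consumer_risk):
    """Smallest n such that (1 - rql) ** n <= consumer_risk."""
    return math.ceil(math.log(consumer_risk) / math.log(1.0 - rql))


def implied_quality_level(n, prob_accept_target):
    """Fraction nonconforming p at which a zero-accept plan of size n is
    passed with the target probability, i.e. (1 - p) ** n = target."""
    return 1.0 - prob_accept_target ** (1.0 / n)


n = zero_accept_sample_size(rql=0.05, consumer_risk=0.05)
aql = implied_quality_level(n, prob_accept_target=0.95)
print(f"sample size fixed by the RQL: n = {n}")
print(f"quality level accepted 95% of the time: {aql:.3%} nonconforming")
```

Once n is set by the RQL, the acceptable quality level is whatever the formula returns; the producer cannot choose it independently.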
Inducing a minimalist strategy by insisting on zero tolerance has important negative consequences in the pharmaceutical industry. For example, to better characterize the process, as in a validation effort, it is generally desirable to gather more data than what is done in routine testing. Yet, if it is not possible to do so without increasing the risk of failing the validation, the optimal strategy is to conduct minimal testing to prevent getting penalized by chance.
Another example is in stability studies where a test is normally repeated over several time intervals. Even when the characteristic under study does not actually change over time, the risk of a stability failure increases as the number of time points, and hence observations, increases. Consequently, the prudent strategy is to perform the fewest possible number of tests, even though it might have been helpful to better characterize the actual stability of the particular analytical property in question.
The multiplicity problem—that is, the increased risk to fail because of repeated testing of the same batch—is a reality for all types of test plans and is not limited to zero-tolerance plans. A reasonable strategy for such situations would be to design a global test strategy in which the overall failure risk is controlled to a certain level rather than the failure risk of individual tests. Unfortunately, zero tolerance can make it difficult or impossible to design an effective test strategy.
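The effect of repeated testing can be sketched with a simple calculation. The example assumes that the tests at different time points are independent and that the batch has the same chance of failing each one; both are simplifying assumptions, and the 2% per-test failure risk and 5% overall target are hypothetical values.

```python
def overall_failure_risk(per_test_risk, num_tests):
    """Probability of at least one failure in num_tests independent tests,
    each with the same per-test failure risk."""
    return 1.0 - (1.0 - per_test_risk) ** num_tests


def per_test_risk_for_overall(overall_risk, num_tests):
    """Per-test failure risk that keeps the overall failure risk at the
    stated level across num_tests independent tests."""
    return 1.0 - (1.0 - overall_risk) ** (1.0 / num_tests)


# Hypothetical values, for illustration only.
for k in (1, 3, 6, 9):
    total = overall_failure_risk(0.02, k)
    allowed = per_test_risk_for_overall(0.05, k)
    print(f"{k} time points: overall risk = {total:.1%}; "
          f"per-test risk for a 5% overall risk = {allowed:.2%}")
```

The second function corresponds to the global strategy described above, in which the overall failure risk is fixed and the per-test risk is derived from it.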
Nonparametric versus parametric testing
Tests such as USP ‹905› "Uniformity of Dosage Units" that limit the number of units that can fall into certain ranges and include a zero-tolerance criterion (as opposed to proposals that use the location and variability of the results, possibly in the form of tolerance intervals) have been characterized as nonparametric testing, which differs from parametric testing. This classification, although useful, has led to some inaccurate characterizations and comparisons.
For example, it has been said that parametric tests rely on the assumption of normality and that nonparametric tests do not. In truth, normality is not always an essential assumption underlying a parametric test, even though the factors may in some cases be derived from tables of normal tolerance factors. It also has been claimed that so-called nonparametric tests are more robust because they are based on counting and classification. But one must take caution and ask, robust with respect to what? In fact, distributional assumptions are required when evaluating the performance of any and all plans, including nonparametric plans. Furthermore, the performance of all plans varies depending on the type of underlying distribution that is assumed.
Nonetheless, behind all evaluations, one should remember that the operating characteristic of any plan, parametric or nonparametric, is a hypothetical graph of the probability of acceptance under a given set of assumptions. The operating characteristic's features that are used in the design of the plan may or may not be realized in actual practice, depending on the true underlying nature of the property or the attribute in question. This limitation is inherent in all plans, regardless of the sample statistics that form the basis. Thus, the failure of a plan to maintain, for instance, a 5% probability of acceptance for batches having a stated level of nonconforming units is not necessarily more or less probable for a parametric plan than it is for a nonparametric plan.
Again, the operating characteristics of all plans vary depending on the underlying distribution used to evaluate the performance. It is a mistake to choose one test over another solely on the basis of whether the operating characteristics are "better" under any assumption. One should first determine which test provides greater discrimination with regard to the quality characteristic of interest and then evaluate the performance under reasonable distributional assumptions.
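The dependence of a counting criterion on the assumed distribution can be shown with a small calculation. This is a simplified sketch, not the full two-stage test: it considers only a single-stage zero-tolerance check on 10 units with limits of 75–125% of label claim, and it compares a normal and a Laplace (double-exponential) distribution that share the same mean and standard deviation. The mean of 100, the standard deviation of 10, the sample size of 10, and the choice of the Laplace distribution are all assumptions made for illustration.

```python
import math


def prob_outside_normal(half_width, sd):
    """P(|X - mean| > half_width) for a normal distribution."""
    z = half_width / sd
    return 1.0 - math.erf(z / math.sqrt(2.0))


def prob_outside_laplace(half_width, sd):
    """P(|X - mean| > half_width) for a Laplace distribution with the given
    standard deviation (scale b = sd / sqrt(2))."""
    scale = sd / math.sqrt(2.0)
    return math.exp(-half_width / scale)


def prob_pass_zero_tolerance(prob_outside, n):
    """Probability that none of n independent units falls outside the limits."""
    return (1.0 - prob_outside) ** n


HALF_WIDTH = 25.0  # limits of 75-125% around an assumed mean of 100%
SD = 10.0          # hypothetical standard deviation, % of label claim
N_UNITS = 10       # hypothetical sample size

pass_normal = prob_pass_zero_tolerance(prob_outside_normal(HALF_WIDTH, SD), N_UNITS)
pass_laplace = prob_pass_zero_tolerance(prob_outside_laplace(HALF_WIDTH, SD), N_UNITS)
print(f"normal:  probability of passing = {pass_normal:.1%}")
print(f"Laplace: probability of passing = {pass_laplace:.1%}")
```

Two batches with identical mean and standard deviation have different chances of passing the same counting criterion, so the operating characteristic of a so-called nonparametric plan also depends on the distribution assumed when it is evaluated.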
Conclusion
Overcoming the myths surrounding a zero-tolerance criterion is difficult. This should not be the case, considering that the technical arguments are quite compelling and conclusive. Notwithstanding, it appears that misunderstandings regarding the role of compendial testing and the role of sampling and acceptance for batch release and stability testing may persist for a while. The ability of final product sampling and testing to detect nonconforming product continues to be oversold. The naïve belief that protection is afforded by zero-accept or zero tolerance continues to be widely held and promulgated.
We believe, however, that the myths can be dispelled and that rational scientific thinking can prevail. At the same time, we realize that unless the psychological baggage is honestly reviewed, assessed, and neutralized, it will be difficult to get to the logical reasons for eliminating zero tolerance as a criterion in batch testing for release and stability. We recommend that everyone involved in the current task of devising and promoting a workable procedure for evaluating the delivered-dose uniformity of orally inhaled and nasal products be prepared to explain both in psychological and scientific terms why zero tolerance should be eliminated.
John R. Murphy, PhD, is an independent consultant. Kristi L. Griffiths, PhD,* is a research advisor at Eli Lilly and Company, Global Statistical Sciences, KY730, Lilly Corporate Center, Indianapolis, IN 46285, klgriff@lilly.com.
*To whom all correspondence should be addressed.
Submitted: June 1, 2005. Accepted: June 22, 2005.
References
1. E.L. Grant and R.S. Leavenworth, "Some Fundamental Concepts in Acceptance Sampling," in Statistical Quality Control (McGraw-Hill, New York, NY, 6th ed., 1988), pp. 393–425.
2. US Food and Drug Administration, Center for Drug Evaluation and Research, Guidance for Industry: Nasal Spray and Inhalation Solution, Suspension, and Spray Drug Products—Chemistry, Manufacturing, and Controls Documentation (docket 99D-1454) (FDA, Rockville, MD, 2002), http://www.fda.gov/cder/guidance/4234fnl.pdf.
3. FDA, Draft Guidance: Metered Dose Inhaler and Dry Powder Inhaler Drug Products—Chemistry, Manufacturing, and Controls Documentation (docket 98D-0997) (FDA, Rockville, MD, 1998), http://www.fda.gov/cder/guidance/2180dft.pdf.
4. International Pharmaceutical Aerosol Consortium on Regulation and Science, A Parametric Tolerance Interval Test for Improved Control of Delivered Dose Uniformity of Orally Inhaled and Nasal Drug Products (IPAC-RS, 2001), http://ipacrs.com/PDFs/IPAC-RS_DDU_Proposal.pdf.
5. USP 27–NF 22 (United States Pharmacopeial Convention, Rockville, MD, 2004), pp. 6–8 and 2396–2397.
6. L.D. Torbeck, "In Defense of USP Singlet Testing," Pharm. Technol. 29 (2), 105–106 (2005).
7. R.L. Williams et al., "Content Uniformity and Dose Uniformity: Current Approaches, Statistical Analyses, and Presentation of an Alternative Approach, with Special Reference to Oral Inhalation and Nasal Drug Products," Pharm. Res. 19, 359–366 (2002).
8. B. Flann, "Comparison of Criteria for Content Uniformity," J. Pharm. Sci. 63, 183–199 (1974).