
Statistical evaluation of analytical methods

T. C. Nelsen

USDA-ARS, 1815 No. University St., Peoria, Illinois 61604, USA


A chemical analytical procedure is a series of ordered steps, carried out in a properly equipped laboratory, to estimate the concentration of a specific analyte in a given material accurately and precisely. Analytical procedures must be reliable and accurate for scientific, trade and quality control purposes. Scientific experiments can be evaluated and compared internationally only if measurements are standardized. Commodities are valued and traded on the levels of their constituents (e.g., protein, starch, oil) and can be accepted or rejected on the levels of contaminants. Within a company, quality control cannot be achieved if different production facilities do not agree on measurements.


This presentation discussed the use of interlaboratory studies for the evaluation of analytical methods. The analytical methods discussed are sets of standardized procedures used to estimate the concentration of a specific analyte in a given material with a minimum of uncertainty. We assume that the laboratories included in the study are competent and fully capable. There is a true value (actual concentration) we are trying to estimate. The difference between our measurement result and the true value is the uncertainty in the measurement. This uncertainty is called the error in the method and can be divided into random error and systematic error. Random error is both above and below the true value and is usually considered to be the “noise” in the method. We often have difficulty in finding the causes of random error. Systematic error is a consistent bias in the system and can be eliminated or controlled if the cause can be found. We assume that the method developer has checked the method for selectivity, specificity, and linearity.

The precision of a method is a measure of the extent to which individual tests of the same concentration in the same material agree. Repeatability, r, is the internal precision of a method. Two single results obtained within a laboratory under repeatability conditions (same technician with the same instruments in the same laboratory at the same time) should not differ by more than r. Reproducibility, R, is the external precision of a method. Two single results obtained by two different laboratories under reproducibility conditions (different technicians with different instruments in different laboratories at different times) should not differ by more than R.
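The r and R limits can be applied as a simple numeric check on pairs of results. Here is a minimal sketch in Python; the limit values are hypothetical and would in practice come from the collaborative study of the method in question:

```python
# Hypothetical limit values for illustration; real r and R come from
# a collaborative study of the method in question.
r = 0.4   # repeatability limit (same lab, same conditions)
R = 1.1   # reproducibility limit (different labs)

def within_limit(result_a, result_b, limit):
    """Two single results should not differ by more than the limit."""
    return abs(result_a - result_b) <= limit

# Two results from the same laboratory, checked against r:
print(within_limit(10.2, 10.5, r))   # True: |10.2 - 10.5| = 0.3 <= 0.4
# Results from two different laboratories, checked against R:
print(within_limit(10.2, 11.5, R))   # False: 1.3 > 1.1
```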

Accuracy is the extent to which test results agree with the true value; a consistent difference from the true value is a bias. Analysts can be misled by consistent results if those results are consistently wrong.

A set of guidelines for standard symbols, terminology and procedures was established at the IUPAC Workshops on Harmonization of Collaborative Analytical Studies (IUPAC, 1987; AOAC, 1995). These Harmonization Guidelines establish the design of collaborative studies to adequately estimate Repeatability and Reproducibility. The material (sometimes called the matrix) is the medium that contains the analyte. A minimum of five materials is required. The materials should represent those in which the method is to be used. A method can be quite specific, like AOAC Official Method 994.08, “Aflatoxins in Corn, Almonds, Brazil Nuts, Peanuts, and Pistachio Nuts”, or more general, like AOAC Official Method 991.42, “Insoluble Dietary Fiber in Foods and Food Products”. Method 991.42 was tested in 22 different materials and the performance results are listed for each (AOAC, 2000). Materials must be homogeneous and stable. Non-homogeneity can cause outliers and will inflate the variance estimates.

The collaborative study requires 9 laboratories for statistical validation of the performance parameters. I suggest that you start with 12: leave room for error, non-participation or unforeseen difficulties. The labs should be representative of the labs where you expect your method to be used. I also suggest that you first run an unofficial “mini-collab” with 2 or 3 friendly labs to see if the method has any readily identifiable and fixable problems that did not occur in your own lab during development, and to ensure that the samples are homogeneous and stable. The data from this mini-collab can often be included in the analysis of the full collaborative study.

Decide on the concentrations where the method will be used. Prepare samples of analyte at levels to bracket and cover this “area of interest”. If zero is in the area of interest, prepare blanks and consider a series of tests to determine the levels of detection and quantitation (to be discussed later). Blanks used for calibration or for practice runs are not considered as one of the 5 materials. Materials with naturally occurring concentrations are preferred over spiked samples whenever possible.

Code the test samples at random so that they will not be analyzed in a set order. Prepare sets of blind or matched duplicates (a pair is considered one material). Duplicates are to be coded randomly. Duplicates are sufficient to estimate internal variance. Rather than designing a study with 5 triplicates, use those same resources to design the study with 7 or 8 duplicates. A matched pair (Youden Pair) is a pair of test samples that are within 5% of each other. Youden (AOAC, 1975) developed the matched pair concept to eliminate the human element in an analysis where the analyst knows that the samples are a set of blind duplicates. Design a data reporting form that you send along with the samples. You want the data to come back to you in the same format from each lab.
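The random blind coding of duplicate pairs can be sketched as follows. The material names, code range, and `code_samples` helper are all hypothetical, invented for illustration:

```python
import random

def code_samples(materials, seed=1):
    """Assign random blind codes to duplicate test samples.  Each
    material goes out as a blind duplicate pair; the codes carry no
    information about the material or the replicate number."""
    rng = random.Random(seed)   # fixed seed so the coordinator's key is reproducible
    samples = [(name, rep) for name in materials for rep in (1, 2)]
    rng.shuffle(samples)        # break any set analysis order
    codes = rng.sample(range(100, 1000), len(samples))   # unique 3-digit codes
    # The study coordinator keeps this key; the labs see only the codes.
    return dict(zip(codes, samples))

key = code_samples(["corn", "wheat", "soy", "rice", "oats"])   # 5 materials, 10 samples
for code in sorted(key):
    print(code, *key[code])
```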

When the data is returned to you for analysis, use Cochran’s Test to check for outliers in individual measurements and Grubbs’ Test to check entire laboratories (Wernimont, 1985). An excessive number of outliers is more than 2/9 of the data. A method which produces outliers at greater than the 2/9 ratio is considered to be unstable. An outlier can be the result of a simple mistake but it can sometimes be the result of an unusual chemical reaction or condition.
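The two outlier statistics can be sketched as below with invented duplicate data. Note that the computed statistics must still be compared against the tabulated critical values for the chosen number of laboratories and significance level:

```python
import statistics

def cochran_C(duplicates):
    """Cochran's C statistic: the largest within-lab variance as a
    fraction of the sum of all within-lab variances.  A large C
    (versus the tabulated critical value) flags one laboratory's
    duplicates as unusually scattered."""
    variances = [statistics.variance(pair) for pair in duplicates]
    return max(variances) / sum(variances)

def grubbs_G(lab_means):
    """Grubbs' statistic for a single suspect laboratory: the largest
    deviation of a lab mean from the grand mean, in standard deviations."""
    m = statistics.mean(lab_means)
    s = statistics.stdev(lab_means)
    return max(abs(x - m) for x in lab_means) / s

# Invented duplicate results from 5 labs on one material; lab 3's pair is suspect.
dups = [(10.1, 10.3), (9.9, 10.2), (10.0, 11.9), (10.2, 10.1), (9.8, 10.0)]
print(round(cochran_C(dups), 3))
print(round(grubbs_G([sum(p) / 2 for p in dups]), 3))
```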

I would like to see a ranks test (Wernimont, 1985) used more often. For each sample, list the data from all of the labs and sort from largest to smallest; the largest value is assigned rank 1, the second largest rank 2, and so on. Do this for each sample. If some labs consistently have the largest or smallest values, a bias may exist in the method. A rank-sum test can be run to see if a significant ranking bias exists. I have seen studies where the materials were not stable, and labs that ran the tests earlier reported higher values.
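A minimal sketch of the ranking step, with invented data and lab names (ties, which would share averaged ranks, are ignored here):

```python
def rank_sums(results_by_lab):
    """For each material, rank the labs' results from largest (rank 1)
    downward, then sum each lab's ranks across all materials.  A lab
    that is consistently high or low collects an extreme rank sum."""
    labs = list(results_by_lab)
    n_materials = len(next(iter(results_by_lab.values())))
    sums = dict.fromkeys(labs, 0)
    for i in range(n_materials):
        ordered = sorted(labs, key=lambda lab: results_by_lab[lab][i], reverse=True)
        for rank, lab in enumerate(ordered, start=1):
            sums[lab] += rank
    return sums

# Invented results for 4 labs on 3 materials; Lab D consistently reads low.
data = {"A": [10.2, 5.1, 20.3], "B": [10.0, 5.3, 20.6],
        "C": [10.4, 5.2, 20.4], "D": [9.5, 4.8, 19.7]}
print(rank_sums(data))   # D collects the maximum possible rank sum of 12
```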

The analysis of the data is usually done with Analysis of Variance (ANOVA) procedures (Wernimont, 1985; Delwiche et al., 2005). The primary calculations are the standard deviations for Repeatability (sr) and Reproducibility (sR). Reproducibility includes both the within- and between-laboratory variances; the between-laboratory variance is not reported but is used to calculate Reproducibility. In accordance with ISO Standard 5725, Repeatability is calculated as r = 2.8 x sr and Reproducibility as R = 2.8 x sR.
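For a single material analyzed as blind duplicates, the calculation can be sketched as a one-way ANOVA with laboratories as the groups. The duplicate data below are invented, and the `precision` helper is my own illustration, not an official AOAC computation:

```python
import statistics

def precision(pairs):
    """Estimate the repeatability and reproducibility standard deviations
    for one material from blind duplicates (one pair per laboratory)
    via one-way ANOVA with laboratories as groups, n = 2 per lab."""
    lab_means = [statistics.mean(pair) for pair in pairs]
    # Within-laboratory mean square: pooled variance of the duplicate pairs.
    ms_within = statistics.mean(statistics.variance(pair) for pair in pairs)
    # Between-laboratory mean square, with n = 2 results per laboratory.
    ms_between = 2 * statistics.variance(lab_means)
    s_r2 = ms_within
    s_L2 = max(0.0, (ms_between - ms_within) / 2)  # clamp a negative estimate to 0
    return s_r2 ** 0.5, (s_r2 + s_L2) ** 0.5       # (s_r, s_R)

# Invented duplicate results from 5 laboratories on one material.
pairs = [(10.1, 10.3), (9.9, 10.2), (10.4, 10.6), (10.2, 10.1), (9.8, 10.0)]
s_r, s_R = precision(pairs)
print(round(2.8 * s_r, 3), round(2.8 * s_R, 3))   # the r and R limits
```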

At this point the analyst often asks the statistician if the r and R values are reasonable. The statistician’s role in a collaborative study is to ensure that the performance parameters have been calculated correctly and not to pass judgment on the value of a chemical procedure.

A useful statistic for comparing different methods is the Relative Standard Deviation (RSD), the standard deviation expressed as a percentage of the mean: RSDr = 100 sr / mean and RSDR = 100 sR / mean. (This is the same statistic as the Coefficient of Variation, CV.)

A group of statisticians led by William Horwitz (Horwitz, 1998) took the results of several thousand method evaluations and found a general relationship between concentration and RSDR: Predicted RSDR = 2C^(-0.1505), where C is the estimated mean concentration expressed as a decimal mass fraction. They recommend that you calculate the actual RSDR from your data and a Predicted RSDR based on the mean concentration; dividing the RSDR from your data by the Predicted RSDR gives the HORRAT value. The HORRAT value should be between 0.5 and 2.0. If the value is less than 0.5, the study results are suspected of being too good to be true. A HORRAT value between 1.5 and 2.0 can be an indication of material instability, among other possible problems. A HORRAT value greater than 2.0 can cause the method to be judged unreliable and thus unacceptable. The RSDr is usually approximately 2/3 of the RSDR.
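The HORRAT calculation is a one-line ratio. A sketch, using a hypothetical method averaging 1% analyte with an observed RSDR of 5%:

```python
def horrat(observed_rsd_R, mean_concentration):
    """HORRAT: observed RSD_R divided by the Horwitz-predicted RSD_R.
    The concentration is a decimal mass fraction (1% = 0.01, 1 ppm = 1e-6)."""
    predicted = 2 * mean_concentration ** -0.1505
    return observed_rsd_R / predicted

# A hypothetical method averaging 1% analyte with an observed RSD_R of 5%:
value = horrat(5.0, 0.01)
print(round(value, 2))        # about 1.25
print(0.5 <= value <= 2.0)    # True: within the acceptable range
```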

Note that the Predicted RSDR increases rapidly as the concentration approaches zero. At a concentration of 1% you should expect a Predicted RSDR of around 4%; at 1 ppm it is around 16%, and at 1 ppb around 45%.

Measuring near zero has some special considerations for statisticians (Currie, 1968). We want to find the concentration at which you can say with some confidence (usually 95 or 99%) that the analyte is present (i.e., the concentration is not zero), even though you cannot yet estimate the true concentration. This is the limit of detection. The limit of quantitation is the concentration at which you can start to provide an estimate with some confidence. These two limits are calculated using the normally distributed variance at or near zero. The variance around zero is sometimes called instrument variation or, more simply, noise in the system. We can estimate this noise in one of two ways: by multiple measurements of a true zero or of a known concentration near zero, or by regression analysis of a series of analyses near zero.

A common problem we statisticians have when estimating this zero standard deviation is caused by analysts censoring the data. When a procedure produces a calculated estimate below zero, the analyst has a tendency to report the estimate as zero or “below detectable limits”. A negative concentration is an abstract concept, but we have to accept that a negative estimate can occur because of the calculations we use for our estimates. To estimate the variance around zero we need all of the data, on both sides of zero. If the negatives are adjusted to zero, the statistical effect is to shrink the estimate of the zero variance, and the limits of detection and quantitation will be calculated to be smaller than they really are.
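A minimal sketch of the blank-variance calculation, with invented blank measurements; the 3·s and 10·s multipliers are conventional choices for the detection and quantitation limits, not prescriptions from this study:

```python
import statistics

# Invented replicate measurements of a true blank.  The negative
# estimates must be kept: they are a legitimate product of the
# calibration arithmetic, not mistakes.
blanks = [0.04, -0.02, 0.01, -0.05, 0.03, 0.00, -0.01, 0.02, -0.03, 0.01]

s0 = statistics.stdev(blanks)   # noise around zero
lod = 3 * s0                    # limit of detection (conventional multiplier)
loq = 10 * s0                   # limit of quantitation (conventional multiplier)

# What happens if the analyst censors the negatives to zero:
censored = [max(b, 0.0) for b in blanks]
s0_censored = statistics.stdev(censored)
print(round(s0, 4), round(s0_censored, 4))   # the censored estimate is biased
```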

Guidelines are not as well established for qualitative methods. A qualitative method produces a simple yes/no, positive/negative or above/below result. There are four possible outcomes to a qualitative test: true positive, false positive, true negative, and false negative. I would like to see an Operating Characteristic (OC) curve used more often. First define the Action Level, the concentration at which a decision is made. I don’t like to see zero used as an action level; zero tends to be a concept rather than a number and is subject to change as our analytical abilities improve. The Area of Interest is a range of concentrations above and below the Action Level. Run a series of tests at several concentrations through the Area of Interest, usually at least 6 tests at each level. Calculate the percent positive results at each level and plot them on a graph with concentration on the x-axis and % positive on the y-axis. The result is commonly a sigmoid (S-shaped) curve centered on the Action Level. This curve readily shows the probabilities of correct results and of false positives or false negatives.
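The OC-curve calculation itself is just a percent-positive tally. A sketch with invented qualitative results and a hypothetical Action Level of 10 ppb:

```python
# Invented qualitative results: positive calls out of 6 runs at each
# concentration (ppb), with a hypothetical Action Level of 10 ppb.
runs = 6
positives = {0: 0, 2: 0, 5: 1, 8: 2, 10: 3, 12: 5, 15: 6, 20: 6}

# The OC curve: % positive at each concentration.
oc_curve = {conc: 100.0 * n / runs for conc, n in positives.items()}
for conc, pct in sorted(oc_curve.items()):
    print(f"{conc:>3} ppb: {pct:5.1f}% positive")

# Below the Action Level, % positive is the false-positive rate;
# above it, (100 - % positive) is the false-negative rate.
```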


In conclusion, if you wish to have your method officially accepted, I advise you to engage a statistician before you start and have that person review the appropriate literature. Check with the Official Methods Committee or its equivalent in the organization where you want your method listed. Design the study and run a preliminary (pilot) study. Finally, as always, look at your data to aid in interpretation.


IUPAC Workshop on Harmonization of Collaborative Analytical Studies (1987). Geneva, Switzerland, May 4-5. Pure & Appl. Chem. 60, 855-864 (1988); revised: Pure & Appl. Chem. 67, 331-343 (1995)

Guidelines for Collaborative Study Procedure to Validate Characteristics of a Method of Analysis. J. Assoc. Off. Anal. Chem. 72(4) (1989); revised: AOAC Official Methods Program, J. AOAC Int. 78(5), 143A-160A (1995)

Official Methods of Analysis of AOAC International (2000) 17th Ed., AOAC International, Gaithersburg, MD, USA

Youden, W. J. & Steiner, E. H. (1975). Statistical Manual of the AOAC

Wernimont, G.T. (1985) Use of Statistics to Develop and Evaluate Analytical Methods (William Spendley ed.).

Delwiche, S. R., Palmquist, D. E. & Lynch, J. M. (2005). Collaborative Studies for Cereals Analysis. Cereal Foods World 50(1), 9-17

Horwitz, W., Britton, P. & Chirtel, S. J. (1998). A Simple Method for Evaluating Data from an Interlaboratory Study. J. AOAC Int. 81(6), 1257-1265

Currie, L.A. (1968). Limits for Qualitative Detection and Quantitative Determination: Application to Radiochemistry, Anal. Chem. 40, 586-593
