P.Mean: Multiple comparisons (created 2007-06-13).

The use of multiple statistical tests in a wide range of contexts raises serious concerns. The proposed solutions to these concerns are very controversial. These pages discuss some of the concerns and the debate over the appropriate remedies. Articles are arranged by date with the most recent entries at the top. You can find outside resources at the bottom of this page.

2011

9. The Monthly Mean: How come you have to wait for replication when secondary outcome is statistically significant? (May/June 2011)

8. The Monthly Mean: Can you believe the results of a subgroup analysis? (January/February 2011)

7. The Monthly Mean: Two alternatives to the Bonferroni adjustment (January/February 2011)

2009

6. P.Mean: Do multiple time points require a Bonferroni adjustment? (created 2009-07-18). I'm a little confused as to when to apply the multiple comparisons correction. If I had a measure which compared blood pressure (say) between two groups after 7, 14 and 21 days post procedure, would I need to adjust for multiple comparisons of the order three?
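As a rough illustration of what a Bonferroni adjustment "of the order three" would mean in the question above, the sketch below compares each p-value with 0.05/3. The function name `bonferroni` and the three p-values are made up for illustration; they are not taken from the article, and the sketch does not settle whether the adjustment is appropriate for repeated time points.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold, adjusted p-values, and rejection flags."""
    m = len(p_values)
    threshold = alpha / m                            # each test is held to alpha / m
    adjusted = [min(1.0, p * m) for p in p_values]   # equivalent view: inflate p, cap at 1
    reject = [p <= threshold for p in p_values]
    return threshold, adjusted, reject


# Hypothetical p-values for days 7, 14 and 21 (illustrative only, not real data).
p_values = [0.030, 0.012, 0.200]
threshold, adjusted, reject = bonferroni(p_values)

for day, p, p_adj, r in zip((7, 14, 21), p_values, adjusted, reject):
    print(f"day {day}: p = {p:.3f}, adjusted p = {p_adj:.3f}, reject = {r}")
```

With these made-up values, only the day 14 comparison falls below the adjusted threshold of 0.05/3, even though two of the three unadjusted p-values are below 0.05.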

2008

5. P.Mean: Defending Bonferroni (created 2008-10-18). I had someone argue with some advice that I gave, which is a good thing. I had recommended the use of a Bonferroni comparison, and he argued that Bonferroni should not be used when making "independent" comparisons.

4. P.Mean: Can I please skip the Bonferroni adjustment? (created 2008-08-19). I ran multiple correlation analysis for abundances, richness (species density), and diversity of different growth forms in four different landscapes in Colombian Amazonia. My question is: Do I have to calculate a Bonferroni test to adjust for each probability?

Outside resources:

David Kent, Peter Rothwell, John Ioannidis, Doug Altman, Rodney Hayward. Assessing and reporting heterogeneity in treatment effects in clinical trials: a proposal. Trials. 2010;11(1):85. Abstract: "Mounting evidence suggests that there is frequently considerable variation in the risk of the outcome of interest in clinical trial populations. These differences in risk will often cause clinically important heterogeneity in treatment effects (HTE) across the trial population, such that the balance between treatment risks and benefits may differ substantially between large identifiable patient subgroups; the "average" benefit observed in the summary result may even be non-representative of the treatment effect for a typical patient in the trial. Conventional subgroup analyses, which examine whether specific patient characteristics modify the effects of treatment, are usually unable to detect even large variations in treatment benefit (and harm) across risk groups because they do not account for the fact that patients have multiple characteristics simultaneously that affect the likelihood of treatment benefit. Based upon recent evidence on optimal statistical approaches to assessing HTE, we propose a framework that prioritizes the analysis and reporting of multivariate risk-based HTE and suggests that other subgroup analyses should be explicitly labeled either as primary subgroup analyses (well-motivated by prior evidence and intended to produce clinically actionable results) or secondary (exploratory) subgroup analyses (performed to inform future research). A standardized and transparent approach to HTE assessment and reporting could substantially improve clinical trial utility and interpretability." [Accessed December 4, 2010]. Available at: http://www.trialsjournal.com/content/11/1/85.

D Howel, R Bhopal. Assessing cause and effect from trials: a cautionary note. Control Clin Trials. 1994;15(5):331-334. [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/8001354.

Anonymous. Carlo Emilio Bonferroni. Excerpt: "I have been intrigued by Bonferroni for some time. His inequalities are widely known, but his life and works are not. I have recently submitted an article for publication in an attempt to rectify this situation, and have also published a short biography in the recent Encyclopaedia of Biostatistics published by Wiley. Links from this page give some information about him. I have also prepared a bibliography of his works which extends the published ones by including volume and page references for most of the articles. The bibliography is also available in BibTex format, which I think should be self explanatory if you use a different reference manager. " [Accessed February 23, 2011]. Available at: http://www.aghmed.fsnet.co.uk/bonf/bonf.html.

N Stallard, P F Thall, J Whitehead. Decision theoretic designs for phase II clinical trials with multiple outcomes. Biometrics. 1999;55(3):971-977. Abstract: "In many phase II clinical trials, it is essential to assess both efficacy and safety. Although several phase II designs that accommodate multiple outcomes have been proposed recently, none are derived using decision theory. This paper describes a Bayesian decision theoretic strategy for constructing phase II designs based on both efficacy and adverse events. The gain function includes utilities assigned to patient outcomes, a reward for declaring the new treatment promising, and costs associated with the conduct of the phase II trial and future phase III testing. A method for eliciting gain function parameters from medical collaborators and for evaluating the designʼs frequentist operating characteristics is described. The strategy is illustrated by application to a clinical trial of peripheral blood stem cell transplantation for multiple myeloma." [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/11315037.

Ronald Feise. Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology. 2002;2(1):8. Abstract: "BACKGROUND: Readers may question the interpretation of findings in clinical trials when multiple outcome measures are used without adjustment of the p-value. This question arises because of the increased risk of Type I errors (findings of false "significance") when multiple simultaneous hypotheses are tested at set p-values. The primary aim of this study was to estimate the need to make appropriate p-value adjustments in clinical trials to compensate for a possible increased risk in committing Type I errors when multiple outcome measures are used. DISCUSSION: The classicists believe that the chance of finding at least one test statistically significant due to chance and incorrectly declaring a difference increases as the number of comparisons increases. The rationalists have the following objections to that theory: 1) P-value adjustments are calculated based on how many tests are to be considered, and that number has been defined arbitrarily and variably; 2) P-value adjustments reduce the chance of making type I errors, but they increase the chance of making type II errors or needing to increase the sample size. SUMMARY: Readers should balance a study's statistical significance with the magnitude of effect, the quality of the study and with findings from other studies. Researchers facing multiple outcome measures might want to either select a primary outcome measure or use a global assessment measure, rather than adjusting the p-value." [Accessed February 23, 2011]. Available at: http://www.biomedcentral.com/1471-2288/2/8.

S Greenland, J M Robins. Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology. 1991;2(4):244-251. Abstract: "Rothman (Epidemiology 1990;1:43-46) recommends against adjustments for multiple comparisons. Implicit in his recommendation, however, is an assumption that the sole objective of the data analysis is to report and scientifically interpret the data. We concur with his recommendation when this assumption is correct and one is willing to abandon frequentist interpretations of the summary statistics. Nevertheless, there are situations in which an additional or even primary goal of analysis is to reach a set of decisions based on the data. In such situations, Bayes and empirical-Bayes adjustments can provide a better basis for the decisions than conventional procedures." [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/1912039.

G G Swaen, O Teggeler, L G van Amelsvoort. False positive outcomes and design characteristics in occupational cancer epidemiology studies. Int J Epidemiol. 2001;30(5):948-954. Description: This article reviews a series of false positive conclusions in epidemiologic research. The authors find that failure to develop a specific a priori hypothesis led to a threefold greater risk of producing a false positive conclusion.

Bernadette Dijkman, Bauke Kooistra, Mohit Bhandari. How to work with a subgroup analysis. Can J Surg. 2009;52(6):515-522. Excerpt: "Surgical practice should principally be based on evidence originating from high-quality data such as randomized controlled trials (RCTs). Whereas these studies mostly investigate general and representative patient populations, clinical decisions most often depend on individual patient characteristics. To concede to the need of individually based guidelines, many RCTs report analyses on specific subgroups of patients.1,2 The main aim of a subgroup analysis is to identify either consistency of or large differences in the magnitude of treatment effect among different categories of patients. Determining whether the observed overall treatment effect is different across certain subgroups may justly provide some patients with its benefits and protect others from its harm." [Accessed February 23, 2011]. Available at: http://www.cma.ca/multimedia/staticContent/HTML/N0/l2/cjs/vol-52/issue-6/pdf/pg515.pdf.

J R Thompson. Invited commentary: Re: "Multiple comparisons and related issues in the interpretation of epidemiologic data". Am. J. Epidemiol. 1998;147(9):801-806. [Accessed February 23, 2011]. Available at: http://aje.oxfordjournals.org/content/147/9/801.long.

X. Sun, M. Briel, S. D. Walter, G. H. Guyatt. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ. 2010;340:c117. Excerpt: "Subgroup analyses in randomised controlled trials (RCTs) or in meta-analyses of RCTs examine whether treatment effects vary according to patient group, way of giving an intervention, or approach to measuring an outcome. Subgroup analyses are common and often associated with claims of difference of treatment effects between subgroups, termed 'subgroup effect', 'effect modification', or 'interaction between a subgroup variable and treatment'. A difference in effect between subgroups, if true, is likely to have important implications for clinical practice and policy making. Many subgroup claims are, however, subsequently shown to be false. Thus, investigators, clinicians, and policy makers face the challenge of whether or not to believe apparent differences in effect." [Accessed April 4, 2011]. Available at: http://www.bmj.com/cgi/doi/10.1136/bmj.c117.

T. C. Chamberlin. The Method of Multiple Working Hypotheses. The Journal of Geology. 1931;39(2):155-165. Excerpt: "This essay on methods of scientific thought appeared in the Journal of Geology, Volume V (1897), pages 837-48 under the heading, "Studies for Students." Since then it has come to be regarded by many as a classic of its kind and its influence has been far-reaching. Requests for copies of it are still frequently received, although the available supply was exhausted many years ago. In response to the continued demand and in the belief that the present generation should be familiar with it, this study for students is reprinted in its original form." [Accessed February 23, 2011]. Available at: http://www.jstor.org/stable/30060433.

B W Brown, K Russell. Methods of correcting for multiple testing: operating characteristics. Stat Med. 1997;16(22):2511-2528. Abstract: "We examine the operating characteristics of 17 methods for correcting p-values for multiple testing on synthetic data with known statistical properties. These methods are derived p-values only and not the raw data. With the test cases, we systematically varied the number of p-values, the proportion of false null hypotheses, the probability that a false null hypothesis would result in a p-value less than 5 per cent and the degree of correlation between p-values. We examined the effect of each of these factors on family-wise and false negative error rates and compared the false negative error rates of methods with an acceptable family-wise error. Only four methods were not bettered in this comparison. Unfortunately, however, a uniformly best method of those examined does not exist. A suggested strategy for examining corrections uses a succession of methods that are increasingly lax in family-wise error. A computer program for these corrections is available." [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9403953.

Carl E Counsell, Mike J Clarke, Jim Slattery, Peter A G Sandercock. The miracle of DICE therapy for acute stroke: fact or fictional product of subgroup analysis? BMJ. 1994;309(6970):1677-1681. Abstract: "Objective: To determine whether inappropriate subgroup analysis together with chance could change the conclusion of a systematic review of several randomised trials of an ineffective treatment. Design: 44 randomised controlled trials of DICE therapy for stroke were performed (simulated by rolling different coloured dice; two trials per investigator). Each roll of the dice yielded the outcome (death or survival) for that 'patient.' Publication bias was also simulated. The results were combined in a systematic review. Setting: Edinburgh. Main outcome measure: Mortality. Results: The 'hypothesis generating' trial suggested that DICE therapy provided complete protection against death from acute stroke. However, analysis of all the trials suggested a reduction of only 11% (SD 11) in the odds of death. A predefined subgroup analysis by colour of dice suggested that red dice therapy increased the odds by 9% (22). If the analysis excluded red dice trials and those of poor methodological quality the odds decreased by 22% (13, 2P=0.09). Analysis of 'published' trials showed a decrease of 23% (13, 2P=0.07) while analysis of only those in which the trialist had become familiar with the intervention showed a decrease of 39% (17, 2P=0.02). Conclusion: The early benefits of DICE therapy were not confirmed by subsequent trials. A plausible (but inappropriate) subset analysis of the effects of treatment led to the qualitatively different conclusion that DICE therapy reduced mortality, whereas in truth it was ineffective. Chance influences the outcome of clinical trials and systematic reviews of trials much more than many investigators realise, and its effects may lead to incorrect conclusions about the benefits of treatment." [Accessed February 23, 2011]. Available at: http://www.bmj.com/content/309/6970/1677.abstract.

D A Savitz, A F Olshan. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am. J. Epidemiol. 1995;142(9):904-908. [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/7572970.

Jason Hsu. Multiple Comparisons: Theory and Methods. 1st ed. Chapman and Hall/CRC; 1996. Excerpt: "Multiple Comparisons covers all-pairwise comparisons, multiple comparisons with the best, and multiple comparisons with a control. Confidence intervals methods and stepwise methods are described. Abuses and misconceptions are exposed, and the reader is guided to the correct method for each problem. Connections with bioequivalence, drug stability, and toxicity studies are discussed. Applications are illustrated with real data, analyzed by computer packages. Extension to the General Linear Model is provided. "

Martin Bland. Multiple significance tests and the Bonferroni correction. Excerpt: "If we test a null hypothesis which is in fact true, using 0.05 as the critical significance level, we have a probability of 0.95 of coming to a 'not significant' (i.e. correct) conclusion. If we test two independent true null hypotheses, the probability that neither test will be significant is 0.95 times 0.95 = 0.90 (Section 6.2). If we test twenty such hypotheses the probability that none will be significant is 0.95^20 = 0.36. This gives a probability of 1 - 0.36 = 0.64 of getting at least one significant result; we are more likely to get one than not. The expected number of spurious significant results is 20 times 0.05 = 1. Many medical research studies are published with large numbers of significance tests. These are not usually independent, being carried out on the same set of subjects, so the above calculations do not apply exactly. However, it is clear that if we go on testing long enough we will find something which is 'significant'. We must beware of attaching too much importance to a lone significant result among a mass of non-significant ones. It may be the one in twenty which we should get by chance alone." [Accessed February 23, 2011]. Available at: http://www-users.york.ac.uk/~mb55/intro/bonf.htm.
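The arithmetic in the Bland excerpt can be checked with a few lines of Python. This is only a sketch under the same assumption the excerpt makes, namely independent tests of true null hypotheses; the function names are chosen here for illustration and do not come from the source.

```python
def prob_at_least_one_significant(n_tests, alpha=0.05):
    """Chance of one or more false positives among n independent true nulls."""
    prob_none_significant = (1.0 - alpha) ** n_tests
    return 1.0 - prob_none_significant


def expected_false_positives(n_tests, alpha=0.05):
    """Expected count of spurious significant results among n true nulls."""
    return n_tests * alpha


for n in (1, 2, 20):
    none = 1.0 - prob_at_least_one_significant(n)
    print(f"{n:2d} tests: P(none significant) = {none:.2f}, "
          f"P(at least one) = {prob_at_least_one_significant(n):.2f}, "
          f"expected false positives = {expected_false_positives(n):.2f}")
```

As the excerpt notes, tests carried out on the same set of subjects are usually not independent, so these figures are a guide rather than an exact result.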

Rodney Hayward, David Kent, Sandeep Vijan, Timothy Hofer. Multivariable risk prediction can greatly enhance the statistical power of clinical trial subgroup analysis. BMC Medical Research Methodology. 2006;6(1):18. Abstract: "BACKGROUND: When subgroup analyses of a positive clinical trial are unrevealing, such findings are commonly used to argue that the treatmentʼs benefits apply to the entire study population; however, such analyses are often limited by poor statistical power. Multivariable risk-stratified analysis has been proposed as an important advance in investigating heterogeneity in treatment benefits, yet no one has conducted a systematic statistical examination of circumstances influencing the relative merits of this approach vs. conventional subgroup analysis. METHODS: Using simulated clinical trials in which the probability of outcomes in individual patients was stochastically determined by the presence of risk factors and the effects of treatment, we examined the relative merits of a conventional vs. a "risk-stratified" subgroup analysis under a variety of circumstances in which there is a small amount of uniformly distributed treatment-related harm. The statistical power to detect treatment-effect heterogeneity was calculated for risk-stratified and conventional subgroup analysis while varying: 1) the number, prevalence and odds ratios of individual risk factors for risk in the absence of treatment, 2) the predictiveness of the multivariable risk model (including the accuracy of its weights), 3) the degree of treatment-related harm, and 5) the average untreated risk of the study population. 
RESULTS: Conventional subgroup analysis (in which single patient attributes are evaluated "one-at-a-time") had at best moderate statistical power (30% to 45%) to detect variation in a treatmentʼs net relative risk reduction resulting from treatment-related harm, even under optimal circumstances (overall statistical power of the study was good and treatment-effect heterogeneity was evaluated across a major risk factor [OR = 3]). In some instances a multi-variable risk-stratified approach also had low to moderate statistical power (especially when the multivariable risk prediction tool had low discrimination). However, a multivariable risk-stratified approach can have excellent statistical power to detect heterogeneity in net treatment benefit under a wide variety of circumstances, instances under which conventional subgroup analysis has poor statistical power. CONCLUSION: These results suggest that under many likely scenarios, a multivariable risk-stratified approach will have substantially greater statistical power than conventional subgroup analysis for detecting heterogeneity in treatment benefits and safety related to previously unidentified treatment-related harm. Subgroup analyses must always be well-justified and interpreted with care, and conventional subgroup analyses can be useful under some circumstances; however, clinical trial reporting should include a multivariable risk-stratified analysis when an adequate externally-developed risk prediction tool is available." [Accessed February 23, 2011]. Available at: http://www.biomedcentral.com/1471-2288/6/18.

K J Rothman. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1(1):43-46. Abstract: "Adjustments for making multiple comparisons in large bodies of data are recommended to avoid rejecting the null hypothesis too readily. Unfortunately, reducing the type I error for null associations increases the type II error for those associations that are not null. The theoretical basis for advocating a routine adjustment for multiple comparisons is the "universal null hypothesis" that "chance" serves as the first-order explanation for observed phenomena. This hypothesis undermines the basic premises of empirical research, which holds that nature follows regular laws that may be studied through observations. A policy of not making adjustments for multiple comparisons is preferable because it will lead to fewer errors of interpretation when the data under evaluation are not random numbers but actual observations on nature. Furthermore, scientists should not be so reluctant to explore leads that may turn out to be wrong that they penalize themselves by missing possibly important findings." [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/2081237.

Karim Hirji, Morten Fagerland. Outcome based subgroup analysis: a neglected concern. Trials. 2009;10(1):33. Abstract: "BACKGROUND: A subgroup of clinical trial subjects identified by baseline characteristics is a proper subgroup while a subgroup determined by post randomization events or measures is an improper subgroup. Both types of subgroups are often analyzed in clinical trial papers. Yet, the extensive scrutiny of subgroup analyses has almost exclusively attended to the former. The analysis of improper subgroups thereby not only flourishes in numerous disguised ways but also does so without a corresponding awareness of its pitfalls. Comparisons of the grade of angina in a heart disease trial, for example, usually include only the survivors. This paper highlights some of the distinct ways in which outcome based subgroup analysis occurs, describes the hazards associated with it, and proposes a simple alternative approach to counter its analytic bias. RESULTS: Data from six published trials show that outcome based subgroup analysis, like proper subgroup analysis, may be performed in a post-hoc fashion, overdone, selectively reported, and over interpreted. Six hypothetical trial scenarios illustrate the forms of hidden bias related to it. That bias can, however, be addressed by assigning clinically appropriate scores to the usually excluded subjects and performing an analysis that includes all the randomized subjects. CONCLUSION: A greater level of awareness about the practice and pitfalls of outcome based subgroup analysis is needed. When required, such an analysis should maintain the integrity of randomization. This issue needs greater practical and methodologic attention than has been accorded to it thus far." [Accessed February 23, 2011]. Available at: http://www.trialsjournal.com/content/10/1/33.

H J Kim, M P Fay, E J Feuer, D N Midthune. Permutation tests for joinpoint regression with applications to cancer rates. Stat Med. 2000;19(3):335-351. Abstract: "The identification of changes in the recent trend is an important issue in the analysis of cancer mortality and incidence data. We apply a joinpoint regression model to describe such continuous changes and use the grid-search method to fit the regression function with unknown joinpoints assuming constant variance and uncorrelated errors. We find the number of significant joinpoints by performing several permutation tests, each of which has a correct significance level asymptotically. Each p-value is found using Monte Carlo methods, and the overall asymptotic significance level is maintained through a Bonferroni correction. These tests are extended to the situation with non-constant variance to handle rates with Poisson variation and possibly autocorrelated errors. The performance of these tests are studied via simulations and the tests are applied to U.S. prostate cancer incidence and mortality rates." [Accessed February 23, 2011]. Available at: http://shrpssb.umdnj.edu/shrpnwk/COURSE/binf7540/shankar/joinpoint.pdf.

P Buettner, C Garbe, I Guggenmoos-Holzmann. Problems in defining cutoff points of continuous prognostic factors: example of tumor thickness in primary cutaneous melanoma. J Clin Epidemiol. 1997;50(11):1201-1210. Abstract: "Continuous prognostic factors are often categorized by defining optimized cutoff points. One component of criticism of this approach is the problem of multiple testing that leads to an overestimation of the true prognostic impact of the variable. The present study focuses on another crucial point by investigating the dependence of optimized cutoff points on the observed distribution of the continuous variable. The continuous variable investigated was the vertical tumor thickness according to Breslow, which is known to be the most important prognostic factor in primary melanoma. Based on the data of 5093 patients, stratified random samples were drawn out of six artificially created distributions of tumor thickness. For each of these samples, Cox models were calculated to explore optimized cutoff points for tumor thickness together with other prognostic variables. The optimized cutoff points for tumour thickness varied considerably with the underlying distribution. Even in samples from the same distribution, the range of cutoff points was amazingly broad and, for some of the distributions, covered the whole region of possible values. The results of the present study demonstrate that optimized cutoff points are extremely data dependent and vary notably even if prerequisites are constant. Therefore, if the classification of a continuous prognostic factor is necessary, it should not be based on the results of one single study, but on consensus discussions including the findings of several investigations." [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9393376.

Kenneth J. Ottenbacher. Quantitative Evaluation of Multiplicity in Epidemiology and Public Health Research. American Journal of Epidemiology. 1998;147(7):615-619. Abstract: "Epidemiologic and public health researchers frequently include several dependent variables, repeated assessments, or subgroup analyses in their investigations. These factors result in multiple tests of statistical significance and may produce type 1 experimental errors. This study examined the type 1 error rate in a sample of public health and epidemiologic research. A total of 173 articles chosen at random from 1996 issues of the American Journal of Public Health and the American Journal of Epidemiology were examined to determine the incidence of type 1 errors. Three different methods of computing type 1 error rates were used: experiment-wise error rate, error rate per experiment, and percent error rate. The results indicate a type 1 error rate substantially higher than the traditionally assumed level of 5% (p < 0.05). No practical or statistically significant difference was found between type 1 error rates across the two journals. Methods to determine and correct type 1 errors should be reported in epidemiologic and public health research investigations that include multiple statistical tests." [Accessed February 23, 2011]. Available at: http://aje.oxfordjournals.org/content/147/7/615.abstract.

Rodney A. Hayward, David M. Kent, Sandeep Vijan, Timothy P. Hofer. Reporting Clinical Trial Results To Inform Providers, Payers, And Consumers. Health Aff. 2005;24(6):1571-1581. Abstract: "Results of randomized clinical trials are the preferred "evidence" for establishing the benefits and safety of medical treatments. We present evidence suggesting that the conventional approach to reporting clinical trials has fundamental flaws that can result in overlooking identifiable subgroups harmed by a treatment while underestimating benefits to others. A risk-stratified approach can dramatically reduce the chances of such errors. Since professional and economic incentives reward advocating treatments for as broad a patient population as possible, we suggest that payers and regulatory bodies might need to act to motivate prompt, routine adoption of risk-stratified assessments of medical treatments' safety and benefits." [Accessed December 4, 2010]. Available at: http://content.healthaffairs.org/cgi/content/abstract/24/6/1571.

E D Moreira Jr, Z Stein, E Susser. Reporting on methods of subgroup analysis in clinical trials: a survey of four scientific journals. Braz. J. Med. Biol. Res. 2001;34(11):1441-1446. Abstract: "Results of subgroup analysis (SA) reported in randomized clinical trials (RCT) cannot be adequately interpreted without information about the methods used in the study design and the data analysis. Our aim was to show how often inaccurate or incomplete reports occur. First, we selected eight methodological aspects of SA on the basis of their importance to a reader in determining the confidence that should be placed in the authorʼs conclusions regarding such analysis. Then, we reviewed the current practice of reporting these methodological aspects of SA in clinical trials in four leading journals, i.e., the New England Journal of Medicine, the Journal of the American Medical Association, the Lancet, and the American Journal of Public Health. Eight consecutive reports from each journal published after July 1, 1998 were included. Of the 32 trials surveyed, 17 (53%) had at least one SA. Overall, the proportion of RCT reporting a particular methodological aspect ranged from 23 to 94%. Information on whether the SA preceded/followed the analysis was reported in only 7 (41%) of the studies. Of the total possible number of items to be reported, NEJM, JAMA, Lancet and AJPH clearly mentioned 59, 67, 58 and 72%, respectively. We conclude that current reporting of SA in RCT is incomplete and inaccurate. The results of such SA may have harmful effects on treatment recommendations if accepted without judicious scrutiny. We recommend that editors improve the reporting of SA in RCT by giving authors a list of the important items to be reported." [Accessed February 23, 2011]. Available at: http://www.scielo.br/pdf/bjmbr/v34n11/4111.pdf.

Ben Goldacre. Researchers don't mean to exaggerate, but lots of things can distort findings. The Guardian. 2011. Excerpt: "It's possible people are not bothering to report a negative result alongside positive ones they found." [Accessed August 15, 2011]. Available at: http://www.guardian.co.uk/commentisfree/2011/aug/12/bad-science-exaggerated-study-results.

Rupert G. Miller Jr. Simultaneous Statistical Inference. 2nd ed. Springer; 1981. Description: Rupert Miller wrote the classic text on multiple comparisons adjustments back in 1981. The book does not include some recent methods, but it does provide a good overview. Dr. Miller also outlines several different philosophical perspectives on multiple comparison adjustments. This book is a gentle introduction to a specialized topic.

A J Sankoh, M F Huque, S D Dubey. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat Med. 1997;16(22):2529-2542. Abstract: "Confirmatory clinical trials often classify clinical response variables into primary and secondary endpoints. The presence of two or more primary endpoints in a clinical trial usually means that some adjustments of the observed p-values for multiplicity of tests may be required for the control of the type I error rate. In this paper, we discuss statistical concerns associated with some commonly used multiple endpoint adjustment procedures. We also present limited Monte Carlo simulation results to demonstrate the performance of selected p-value-based methods in protecting the type I error rate." [Accessed February 23, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9403954.

Rui Wang, Stephen W Lagakos, James H Ware, David J Hunter, Jeffrey M Drazen. Statistics in medicine--reporting of subgroup analyses in clinical trials. N. Engl. J. Med. 2007;357(21):2189-2194. Excerpt: "Medical research relies on clinical trials to assess therapeutic benefits. Because of the effort and cost involved in these studies, investigators frequently use analyses of subgroups of study participants to extract as much information as possible. Such analyses, which assess the heterogeneity of treatment effects in subgroups of patients, may provide useful information for the care of patients and for future research. However, subgroup analyses also introduce analytic challenges and can lead to overstated and misleading results.1-7 This report outlines the challenges associated with conducting and reporting subgroup analyses, and it sets forth guidelines for their use in the Journal. Although this report focuses on the reporting of clinical trials, many of the issues discussed also apply to observational studies." [Accessed February 23, 2011]. Available at: http://www.nejm.org/doi/full/10.1056/NEJMsr077003.

S T Brookes, E Whitley, T J Peters, et al. Subgroup analyses in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Excerpt: "Subgroup analyses are common in randomised controlled trials (RCTs). There are many easily accessible guidelines on the selection and analysis of subgroups but the key messages do not seem to be universally accepted and inappropriate analyses continue to appear in the literature. This has potentially serious implications because erroneous identification of differential subgroup effects may lead to inappropriate provision or withholding of treatment." [Accessed May 19, 2010]. Available at: http://www.hta.ac.uk/execsumm/summ533.shtml.

David I Cook, Val J Gebski, Anthony C Keech. Subgroup Analysis in Clinical Trials. The Medical Journal of Australia. 2004;180(6):289-291. Excerpt: "Clinical trials represent a major investment by investigators, sponsors and participants, and it is reasonable to attempt to gain the maximum information from them. Practitioners and regulatory agencies are keen to know whether there are subgroups of trial participants who are more (or less) likely to be helped (or harmed) by the intervention under investigation, and a recent survey of trials published over 3 months in four leading journals found that 70% included subgroup analyses.1,2 Furthermore, regulatory guidance documents (such as the Committee for Proprietary Medicinal Products September 2002 document Points to consider on multiplicity issues in clinical trials3) strongly encourage appropriate subgroup analyses. The results of subgroup analyses can also drive changes in practice guidelines. For example, the United States National Institutes of Health issued a clinical alert following the unexpected finding in the BARI (Bypass Angioplasty Revascularisation Investigation) trial that mortality after angioplasty in patients with diabetes was nearly double that after bypass-graft surgery (P = 0.003).4 Meaningful information from subgroup analyses within a randomised trial is restricted by multiplicity of testing and low statistical power. There is therefore a tension between our wish to identify heterogeneity in the responses of trial participants to trial interventions and our technical capacity for doing so. Surveys on the adequacy of the reporting of clinical trials consistently find the reporting of subgroup analysis to be characterised by poor practice.2,5-7 Item 18 of the CONSORT checklist (Box 1) deals with the multiplicity issues that arise in subgroup analysis.8" [Accessed February 23, 2011]. Available at: http://www.mja.com.au/public/issues/180_06_150304/coo10086_fm.html.

António Vaz Carneiro. Subgroup analysis in therapeutic trials. Rev Port Cardiol. 2002;21(3):339-346. Abstract: "Therapy in cardiology must be based on solid scientific evidence, obtained in randomized controlled trials (RCTs), since this is the best design that proves causality in medicine. The applicability of clinical trial results to the individual patient depends on a rigorous set of rules that can be summarized in the question "Could my patient have been enrolled in this trial?" If the answer to this question is affirmative, then the possibility of applying the trial results is greater. If it is negative, then the cardiologist should exercise caution in his or her decision. In an RCT--of whatever size--it is almost always possible to identify subgroups of patients that show significant differences in treatment effect: for example, studies have shown that, in patients with non-rheumatic atrial fibrillation, oral anticoagulants should be given to prevent stroke, except in those younger than 65 years with no additional risk factors, for whom aspirin is a better option. Subgroup analysis is important because, when the magnitude of the difference is both real and large, it may influence patient management. This analysis should be done with great care, since it has the potential to lead to major errors in data interpretation, identifying differences in treatment effects that are due to chance alone or, more frequently, have no clinical significance. In this article we present a set of guidelines that enable the cardiologist to assess the credibility of an analysis that shows apparent differences in treatment effects across subgroups." [Accessed February 23, 2011]. Available at: http://www.spc.pt/DL/RPC/artigos/608.pdf.

Xin Sun, Matthias Briel, Jason Busse, et al. Subgroup Analysis of Trials Is Rarely Easy (SATIRE): a study protocol for a systematic review to characterize the analysis, reporting, and claim of subgroup effects in randomized trials. Trials. 2009;10(1):101. Abstract: "BACKGROUND: Subgroup analyses in randomized trials examine whether effects of interventions differ between subgroups of study populations according to characteristics of patients or interventions. However, findings from subgroup analyses may be misleading, potentially resulting in suboptimal clinical and health decision making. Few studies have investigated the reporting and conduct of subgroup analyses and a number of important questions remain unanswered. The objectives of this study are: 1) to describe the reporting of subgroup analyses and claims of subgroup effects in randomized controlled trials, 2) to assess study characteristics associated with reporting of subgroup analyses and with claims of subgroup effects, and 3) to examine the analysis, and interpretation of subgroup effects for each studyʼs primary outcome. METHODS: We will conduct a systematic review of 464 randomized controlled human trials published in 2007 in the 118 Core Clinical Journals defined by the National Library of Medicine. We will randomly select journal articles, stratified in a 1:1 ratio by higher impact versus lower impact journals. According to 2007 ISI total citations, we consider the New England Journal of Medicine, JAMA, Lancet, Annals of Internal Medicine, and BMJ as higher impact journals. Teams of two reviewers will independently screen full texts of reports for eligibility, and abstract data, using standardized, pilot-tested extraction forms. We will conduct univariable and multivariable logistic regression analyses to examine the association of pre-specified study characteristics with reporting of subgroup analyses and with claims of subgroup effects for the primary and any other outcomes. 
DISCUSSION: A clear understanding of subgroup analyses, as currently conducted and reported in published randomized controlled trials, will reveal both strengths and weaknesses of this practice. Our findings will contribute to a set of recommendations to optimize the conduct and reporting of subgroup analyses, and claim and interpretation of subgroup effects in randomized trials." [Accessed February 23, 2011]. Available at: http://www.trialsjournal.com/content/10/1/101.

R John Simes, Val J Gebski, Anthony C Keech. Subgroup Analysis: Application To Individual Patient Decisions. The Medical Journal of Australia. 2004;180(9):467-469. Excerpt: "Clinical trials provide evidence of effectiveness of treatments as an average for a group of patients, yet, in clinical medicine, we usually wish to apply these results to individuals. Can we simply apply the overall trial result for each patient, or can the result be tailored to individual patients in some way? Consider a hypothetical example: a randomised trial comparing treatments A and B shows that treatment A is more effective than B among men (P < 0.001), but not among women (not significant). Does this mean men should receive the new treatment, but women should not?" [Accessed February 23, 2011]. Available at: http://www.mja.com.au/public/issues/180_09_030504/sim10218_fm.html.

Jerry Dallal. There must be something buried in here somewhere! Description: This webpage uses a simulation to illustrate what happens with one hundred simultaneous independent tests of significance. [Accessed February 23, 2011]. Available at: http://www.jerrydallal.com/LHSP/multtest.htm.

Thomas V Perneger. What's wrong with Bonferroni adjustments. BMJ. 1998;316(7139):1236-1238. [Accessed March 10, 2009]. Available at: http://www.bmj.com/cgi/content/full/316/7139/1236.

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-09-16. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. copyright law.

2008

3. Stats: When a client asks for a bad analysis (March 24, 2008). I received an email from someone who was being asked to perform a subgroup analysis that is likely to produce confusing and counter-intuitive results. I was asked to help draft some language to convince the client that this was a bad idea.

2004

2. Stats: Subgroup analysis (December 21, 2004). A recently published trial shows a logical approach for establishing the validity of a subgroup comparison.

1999

1. Stats: Bonferroni correction (September 3, 1999). Dear Professor Mean: I keep reading about something called a Bonferroni correction. Somehow this method keeps researchers from going on a fishing expedition. Could you explain what a Bonferroni correction is and why we want to keep scientists from fishing? -- Judicious John

