Documenting negative results in a research paper (created 2001-10-11, revised 2011-04-26).


Dear Professor Mean, I have just finished a well-designed research study and my results are negative. I'm worried about publication bias; most journals will only accept papers that show positive results. How do I document the negative findings in a research paper in a way that will convince a journal to accept my paper? -- Apprehensive Arturo

Dear Apprehensive,

Don't worry about publication bias. While it is true that a study with positive results is more likely to get published, it's only a tendency. Besides, we really don't know whether publication bias is caused by referees and journal editors. Some people suspect that the researchers themselves hold back on publishing negative results: a type of self-censorship.

Also please keep in mind that terms like "negative results" are simplistic, subjective, and ambiguous. There is good evidence, for example, that two people reading the same paper can often come up with different opinions about whether that study is positive or negative.

Still, your question is a good one. How do you document that the results from your negative study have credibility? It's a question that should be asked from the other side as well: if you read a negative study, what should you look for to decide whether the study has credibility?

Short answer

If you have a well-designed negative study and you want to get it published, be sure to stress the aspects of your study that are well designed. Most of these aspects are the same whether the study is positive or negative. For a negative study, though, you should emphasize two things:

  1. confidence intervals for all of your key outcome measures,
  2. justification of your sample size, ideally done prior to data collection.

More details

If you show that your sample size was adequate, either by demonstrating that your study had adequate power or by demonstrating that your confidence intervals have good precision, then your negative findings will have a lot of credibility.

Power/sample size calculation

You say your study was well-designed. Good! That means that you have a power or sample size calculation. Your power or sample size calculation is best done a priori (prior to the collection of data). If you only calculate power post hoc (after the data are collected), make sure that the effect size used in that calculation is based on what is considered a clinically relevant difference, and is not based on the difference that was observed in your study.

Post hoc power calculations that use the differences observed in the study are useless, because they tell you nothing more than what your p-value already told you. If you have a large p-value, then the post hoc power at the observed difference is always very low. If you have a small p-value, then it is always very high.
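To see this concretely, here is a minimal sketch, assuming a two-sided z-test, that computes the post hoc power at the observed difference using nothing but the p-value. (The same relationship holds, approximately, for t-tests.)

```python
# A minimal sketch, assuming a two-sided z-test: the "observed power" at
# the observed difference is a deterministic function of the p-value alone.
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    z_obs = norm.ppf(1 - p_value / 2)   # |z| statistic implied by the p-value
    z_crit = norm.ppf(1 - alpha / 2)    # critical value (1.96 for alpha = 0.05)
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

for p in [0.8, 0.5, 0.05, 0.01]:
    print(f"p = {p:>4}: observed power = {observed_power(p):.2f}")
# p = 0.8 gives power 0.06, p = 0.5 gives 0.10, p = 0.05 gives exactly 0.50,
# and p = 0.01 gives 0.73: large p-values always map to low observed power.
```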

When papers specify power or sample size calculations, their results are sometimes ambiguous. A 1998 paper on the use of various doses of an anti-emetic drug notes that

"Five hundred twenty patients were needed to detect a significant difference with 90% probability"

but they don't define what a significant difference is. When you write your paper, be sure to specify the following:

  1. the outcome variable you are basing the power calculation on,
  2. how much of a change you consider clinically relevant,
  3. any estimated values (standard deviations, baseline rates) that you used in the power calculation,
  4. where you got these estimates, and
  5. the power or the probability that your research design would detect that size difference.

It's also nice to provide a reference for the formula and/or the software you used for your power calculation.
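If you want a template, here is a hypothetical a priori calculation for a two-sample t-test, sketched with the statsmodels library; the clinically relevant difference and the standard deviation below are invented placeholders, not values from any real study.

```python
# Hypothetical a priori sample size calculation for a two-sample t-test.
# The clinically relevant difference and the SD are invented placeholders.
from statsmodels.stats.power import TTestIndPower

relevant_diff = 1.0   # smallest difference considered clinically relevant (assumed)
estimated_sd = 2.0    # SD estimate from a pilot study or the literature (assumed)
effect_size = relevant_diff / estimated_sd   # Cohen's d = 0.5

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.90, alternative='two-sided')
print(round(n_per_group))   # about 85 per group; in practice you would round up
```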

Confidence intervals

A second way to enhance the credibility of negative findings is to present your results using confidence intervals. The width of a confidence interval provides especially valuable information for a negative study. If the interval is so narrow that it excludes any clinically relevant difference, then you have demonstrated a clear lack of effect.

If instead the confidence interval is wide enough to drive a truck through, then you have a lot of uncertainty and ambiguity. A wide confidence interval shows that the negative findings might be real, or they might be caused by an inadequate sample size. This is a very unhappy situation, because it means that we will never know for sure why the study was negative.

Since your study was well-designed, all of your confidence intervals will be narrow enough to make definitive statements about the effect or lack of effect of your treatment. Well, maybe not all of your confidence intervals, but those for your primary outcome measures will be narrow.
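To make the truck-width idea concrete, here is a small sketch with invented numbers, assuming a simple normal approximation and supposing that differences smaller than 2 units are clinically irrelevant.

```python
# Illustration with invented numbers: the same observed difference gives a
# definitive interval with a large sample and an ambiguous one with a small
# sample. Assumes a simple normal approximation for the difference in means.
import math
from scipy.stats import norm

def ci_for_difference(diff, sd, n_per_group, level=0.95):
    """Normal-approximation CI for the difference between two group means."""
    se = sd * math.sqrt(2 / n_per_group)
    z = norm.ppf(1 - (1 - level) / 2)
    return diff - z * se, diff + z * se

print(ci_for_difference(0.3, sd=3.0, n_per_group=400))  # (-0.12, 0.72): excludes 2, a clear lack of effect
print(ci_for_difference(0.3, sd=3.0, n_per_group=10))   # (-2.33, 2.93): includes 2, hopelessly ambiguous
```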

Example

A paper in the New England Journal of Medicine describes

"a double-blind placebo controlled trial of multiple-allergen immunotherapy in 121 allergic children with moderate-to-severe perennial (year-round) asthma."

The authors conclude that

"immunotherapy with injections of allergens for over two years was of no discernible benefit in allergic children with perennial asthma who were receiving appropriate medical treatment."

Let's examine how they justify this negative finding.

Power calculation

In the methods section, the authors state that

"on the basis of response rates from the most comparable previous study, we estimated a priori that a sample of 60 subjects per group would be required for an alpha level of 0.05 and a beta level of 0.8."

I suspect that the authors meant to say either that the power was 0.8 or that the beta level was 0.2. It is clear from previous context that the authors are referring to the primary outcome variable, a 10-point medication score. They did not, however, define how much of a change in the medication score they considered relevant.

On the positive side, the authors did conduct this power calculation a priori (prior to data collection).

Confidence intervals

The other thing that the authors did which was very helpful was to present confidence intervals for all of their outcome measures. Since they made measurements at baseline and at the conclusion of therapy, the relevant confidence intervals involve change scores (differences between conclusion and baseline).

The immunotherapy group showed a decline of 1.4 units in the medication score, and the placebo group showed a decline of 1.2 units. In other words, the immunotherapy group showed a 0.2 unit larger decline than the placebo group. This is a bit confusing, because we are looking at a change in the change score. What it means is that although the immunotherapy group showed a decline, it was only slightly better than what a placebo would provide.

The 95% confidence interval for the change in change scores is -0.48 to 0.92. This tells us that even after accounting for sampling error, the largest improvement due to immunotherapy relative to placebo is still less than one unit.
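As a rough check, you can approximately reconstruct this interval from the summary statistics reported in the paper; the sketch below assumes change-score standard deviations of about 2 units (discussed below) and roughly 60 subjects per group.

```python
# Approximate reconstruction of the reported interval, assuming change-score
# SDs of about 2 units and roughly 60 subjects per group; the paper's exact
# interval (-0.48 to 0.92) was presumably computed from the raw data.
import math
from scipy.stats import norm

diff = 0.2                    # extra decline in medication score vs. placebo
se = 2.0 * math.sqrt(2 / 60)  # standard error of the difference in change scores
z = norm.ppf(0.975)
print(f"{diff - z*se:.2f} to {diff + z*se:.2f}")  # about -0.52 to 0.92
```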

I'm not an expert on asthma, but this seems like a clinically insignificant change. Generally, it is hard to get excited by changes less than one unit on these 5 or 10 point scales. In surveys about people's attitudes, for example, a one unit change usually implies a subtle change in adjectives (slightly disagree versus moderately disagree).

The medication score is a bit trickier to interpret, however. The authors describe the score as follows: "a score of 0 indicated no medication; 2, two to four doses of albuterol; 6, inhaled beclomethasone or alternate-day methylprednisolone; and 10, a high dose of methylprednisolone (>1 mg per kilogram per day)." So it appears that a one unit decrease at best implies either a lowering in dosage or moving to a less serious type of medication.

You can see another indication that this is a small change by looking at the relative size of the change. The baseline medication scores were around 5 for both groups, so the 0.2 unit change tells us that the immunotherapy group got 4% closer (.2/5) to the goal of no medication than the placebo group did. Even a full unit change seems unimpressive, since it represents only getting 20% closer.

Finally, it is sometimes useful to compute the effect size: the magnitude of the change divided by the standard deviation of the change. Both groups had a standard deviation of around 2 units, so a 0.2 unit change represents 0.1 standard deviations.

Jacob Cohen defines a small effect size as 0.2 standard deviations, and this is a difference so small that it would be imperceptible. For example, the difference in heights between 14 and 15 year old girls is roughly 0.2 standard deviations. So clearly, the difference that we did see is very small.

A full unit change in the medication score represents a 0.5 standard deviation difference, which Jacob Cohen describes as a medium effect size. So we see that this sample is large enough that we can clearly rule out the possibility of a medium effect size. Furthermore, the effect size that we did see is smaller than small.
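The effect size arithmetic is easy to verify. The sketch below uses the same approximate standard deviation of 2 units, and also converts the upper confidence limit into an effect size.

```python
# Effect size arithmetic for the medication score, assuming a change-score
# SD of about 2 units, as quoted above.
sd_change = 2.0

print(0.2 / sd_change)    # 0.10: the observed effect, half of Cohen's "small" (0.2)
print(1.0 / sd_change)    # 0.50: a full unit change would be Cohen's "medium"
print(0.92 / sd_change)   # 0.46: even the upper confidence limit falls short of "medium"
```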

You can apply this type of logic to any of the secondary endpoints as well. The researchers looked at how often the asthma patients needed to make medical contacts. For example, while both the placebo and the immunotherapy group had declines in the number of telephone calls in the past 60 days, the decline was larger in the immunotherapy group (0.22 versus 0.05, or a difference of 0.17). The 95% confidence interval is -0.49 to 0.16, showing that at best, immunotherapy led to an average of half a phone call less than placebo. Even the busiest physician would not notice an extra phone call every 120 days. The other measures of medical contacts (office visits, emergency room visits, and hospitalization) also had confidence intervals that are so narrow as to exclude any clinically relevant change.

NNT and NNH calculations

Another secondary endpoint is the proportion of patients showing partial remission (a medication score of 2 or less) and complete remission (a medication score of 0). Although these authors do not present an NNT (Number Needed to Treat) calculation, we can do so ourselves in order to get a feel for whether the differences seen between the immunotherapy and placebo groups are clinically relevant.

The authors report that 28% of the patients in the immunotherapy group showed partial remission, compared to 21% of the placebo group. This is an absolute change of 7%, a change that did not achieve statistical significance. We should still ask ourselves: even if this difference had achieved statistical significance, would it be enough of a difference to be clinically relevant?

You can compute the NNT by inverting the absolute change. In this case, NNT=14 (=1/.07). This tells you that you have to treat 14 patients with immunotherapy in order to see one additional partial remission on average, compared to placebo. This number is large, telling us that we have to treat a lot of patients to see a small number of successes.

The picture for complete remission is even more pessimistic. The rates of complete remission are 7.5% and 5.8% respectively, an absolute difference of only 1.7%. This difference also fails to achieve statistical significance. The number needed to treat is 59 (=1/.017). You would have to treat 59 patients on average to see one additional complete remission in medication usage.

It's interesting to compare this to the side effects seen with the immunotherapy. Systemic reactions occurred in 34% of the immunotherapy patients and only 7% of the placebo patients. This is a 27% difference, which did achieve statistical significance. Inverting this difference provides us with an estimate of NNH (Number Needed to Harm). Think of the number needed to harm as a measure of how often you will see additional side effects. For this endpoint, NNH=3.7, which tells you that you will see one additional systemic reaction on average for every four patients treated with immunotherapy instead of placebo.

The ratios between the NNT and NNH calculations are sometimes instructive as well. You will see 3.8 (=14/3.7) systemic reactions for every partial remission and 16 (=59/3.7) systemic reactions for every complete remission.
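Here is the NNT/NNH arithmetic from the last few paragraphs collected in one place, using the event rates quoted from the paper.

```python
# NNT/NNH arithmetic, using the event rates quoted from the paper.
def number_needed(rate_treated, rate_control):
    """Invert the absolute difference in event rates (NNT for benefits, NNH for harms)."""
    return 1 / abs(rate_treated - rate_control)

nnt_partial  = number_needed(0.28, 0.21)    # ~14: partial remissions
nnt_complete = number_needed(0.075, 0.058)  # ~59: complete remissions
nnh_systemic = number_needed(0.34, 0.07)    # ~3.7: systemic reactions

print(nnt_partial / nnh_systemic)   # ~3.9 systemic reactions per partial remission
                                    # (3.8 with the rounded values quoted above)
print(nnt_complete / nnh_systemic)  # ~16 systemic reactions per complete remission
```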

As someone without much medical expertise, I hesitate to try to judge the tradeoffs between improvements in partial and complete remission compared to additional risk for systemic reactions. A trained physician, however, can make useful judgements when the data are presented this way.

I should also caution that these NNT and NNH calculations should ideally be accompanied by confidence limits. The calculation of confidence limits, however, is a lot more complex than the calculation of the NNT and NNH values themselves.
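For a rough idea of what is involved, here is a sketch of one simple approach: build a Wald (normal approximation) interval for the difference in rates, then invert its endpoints. Better methods exist, and whenever the interval for the difference crosses zero, the interval for the NNT is unbounded and should be reported with great care. The group sizes of 60 below are the approximate values quoted earlier.

```python
# Sketch: Wald confidence interval for a difference in rates, inverted to get
# confidence limits for the NNT. This is the simplest method, not the best one.
import math
from scipy.stats import norm

def nnt_limits(p1, n1, p2, n2, level=0.95):
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(1 - (1 - level) / 2)
    lo, hi = diff - z * se, diff + z * se
    if lo <= 0 <= hi:
        return None   # interval for the difference crosses zero: NNT limits unbounded
    return 1 / hi, 1 / lo

# Partial remission, assuming roughly 60 subjects per group:
print(nnt_limits(0.28, 60, 0.21, 60))  # None: consistent with the nonsignificant difference
```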

Summary

Apprehensive Arturo has just finished a research study that is negative and worries that he won't be able to publish his results. Professor Mean assures him that he can publish his paper as long as he documents that the research was good. This documentation should include a power calculation conducted prior to data collection and/or confidence intervals to summarize the magnitude of his effects.

Further reading

The Lang and Secic book gives some good general advice about how to write up research results.

How to Report Statistics in Medicine.
Lang TA, Secic M.
Philadelphia, PA: American College of Physicians (1997).
ISBN: 0-943126-44-4.