Updated: When the F test is significant, but Tukey is not (created 2005-09-09)

Someone asked me how to interpret a one factor analysis of variance where the overall F test was significant, but the Tukey follow-up test comparing all four group means was not significant for any pair of means. The short answer is to report that the F test was significant and that Tukey was not. Point out that the Tukey follow-up test is conservative because it controls the overall alpha level across all of the pairwise comparisons. Most people understand what a conservative test is and they will accept that interpretation.

I suspect that one or more of the pairwise comparisons was borderline, so you might talk about that also. Or you could look at the unadjusted comparisons, because one of those almost has to be significant. These findings, of course, need to be interpreted with caution, because they were not statistically significant using a more conservative criterion.
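
Here is a minimal sketch of that situation in Python. The data are made up for illustration (four groups of 10 with hypothetical means and identical spread), and I am assuming SciPy 1.8 or later, which provides `scipy.stats.tukey_hsd`. The unadjusted comparisons are done as ordinary two-sample t tests, which is only one of several ways to run them.

```python
import numpy as np
from scipy import stats

# Hypothetical data: four groups of 10 with identical within-group spread.
spread = np.arange(10) - 4.5            # deviations -4.5, -3.5, ..., 4.5
means = [51.6, 51.4, 48.4, 48.6]        # made-up group means
groups = [m + spread for m in means]

# Overall F test from the one factor analysis of variance.
f_stat, f_p = stats.f_oneway(*groups)

# Tukey follow-up test on all six pairs of means.
tukey_p = stats.tukey_hsd(*groups).pvalue
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
smallest_tukey_p = min(tukey_p[i, j] for i, j in pairs)

# Unadjusted two-sample t tests on the same six pairs.
smallest_raw_p = min(
    stats.ttest_ind(groups[i], groups[j]).pvalue for i, j in pairs
)

print(f"Overall F test:         p = {f_p:.3f}")
print(f"Smallest Tukey p-value: p = {smallest_tukey_p:.3f}")
print(f"Smallest unadjusted p:  p = {smallest_raw_p:.3f}")
```

With these made-up numbers the F test and the best unadjusted comparison fall below 0.05, while the smallest Tukey p-value does not.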

Here's the longer and more technical answer. The F test examines whether all four means are equal. The alternative that is frequently stated, that at least one pair of means differs, is not quite accurate. The alternative is really that there is a linear contrast among the four means that is significantly different from zero. A pairwise difference is one example of a linear contrast, but there are other linear contrasts that Tukey does not look at.

For example, it might be that the first mean does not differ significantly from the third mean and the second mean does not differ significantly from the fourth mean, but an average of the first and second means differs significantly from an average of the third and fourth means. Or maybe the fourth mean is slightly smaller than each of the other means, but not by enough to be statistically significant for any pair. When you average the other three means, though, you get enough precision to show a statistically significant difference from the fourth mean. Perhaps it would be worthwhile to search for an interesting contrast of the means that differs from zero.
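
The first of those contrasts can be sketched in Python as follows. The data are the same kind of made-up numbers (hypothetical means, identical spread in each group), and the variable names are mine. I have added a Scheffe-adjusted p-value as well, on the assumption that a contrast found by searching through the data should be held to a stricter standard than one planned in advance.

```python
import numpy as np
from scipy import stats

# Hypothetical data: four groups of 10 with identical within-group spread.
spread = np.arange(10) - 4.5
means = [51.6, 51.4, 48.4, 48.6]        # made-up group means
groups = [m + spread for m in means]

k = len(groups)
n = np.array([len(g) for g in groups])
group_means = np.array([g.mean() for g in groups])
df_error = n.sum() - k
# Pooled within-group variance (the mean square error from the ANOVA).
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_error

# Contrast: average of means 1 and 2 minus average of means 3 and 4.
weights = np.array([0.5, 0.5, -0.5, -0.5])
estimate = weights @ group_means
std_error = np.sqrt(mse * np.sum(weights ** 2 / n))
t_stat = estimate / std_error

# p-value if the contrast was chosen before looking at the data.
planned_p = 2 * stats.t.sf(abs(t_stat), df_error)
# Scheffe-adjusted p-value, for a contrast found by searching.
scheffe_p = stats.f.sf(t_stat ** 2 / (k - 1), k - 1, df_error)

print(f"Contrast estimate = {estimate:.2f}, t = {t_stat:.2f}")
print(f"Planned p = {planned_p:.4f}, Scheffe p = {scheffe_p:.4f}")
```

Changing `weights` to `[1/3, 1/3, 1/3, -1]` would give the second contrast, the average of the first three means against the fourth.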

Another technical difficulty is that Tukey's test is based on the studentized range, which behaves slightly differently from the F test. There are situations (not too many) where the studentized range statistic is significant, but the F test is not. And there are situations (again not too many) where the F test is significant and the studentized range statistic is not. You just had the bad luck of encountering one of those rare situations. I don't think it helps to make such technical arguments in a paper, so it is simpler to admit that there is a discrepancy and attribute it to the conservative nature of the Tukey follow-up test.
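
For the curious, here is a sketch of the opposite discrepancy, again with made-up data. It assumes SciPy 1.7 or later for `scipy.stats.studentized_range`, and the studentized range statistic is computed by hand for equal group sizes. The hypothetical means put one group high, one low, and two in the middle, a pattern that tends to favor the studentized range over the F test.

```python
import numpy as np
from scipy import stats

# Hypothetical data: one high mean, one low mean, two in the middle.
spread = np.arange(10) - 4.5
means = [51.9, 50.0, 50.0, 48.1]        # made-up group means
groups = [m + spread for m in means]

k = len(groups)
n_per_group = len(spread)
df_error = k * (n_per_group - 1)
group_means = np.array([g.mean() for g in groups])
# Pooled within-group variance (the mean square error from the ANOVA).
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_error

# Overall F test.
f_stat, f_p = stats.f_oneway(*groups)

# Studentized range: largest mean minus smallest mean, divided by the
# standard error of a single group mean (equal group sizes assumed).
q_stat = (group_means.max() - group_means.min()) / np.sqrt(mse / n_per_group)
q_p = stats.studentized_range.sf(q_stat, k, df_error)

print(f"F = {f_stat:.2f}, p = {f_p:.3f}")
print(f"q = {q_stat:.2f}, p = {q_p:.3f}")
```

With these numbers the studentized range falls below 0.05 and the F test does not. Splitting the means into two clusters of two, as in the earlier examples of contrasts, pushes the discrepancy the other way.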