P.Mean: So you're thinking about a t-test (created 2016-06-18).

You've got your data and you've heard that you need to analyze it using a t-test. Congratulations on getting this far. I'm going to give some general guidance about how to approach a data analysis when you think you need a t-test. A t-test is pretty easy as far as statistical tests go, but it never hurts to talk to a statistician about this if you can.

First things first. Are you sure you want a t-test?

Do you really want a t-test or do you want something else? Most people consider using a t-test when they have a continuous outcome measure and they want to compare the average level of this continuous outcome measure between a treatment or exposure group and a control group. Let me be honest with you right now. I almost never use a formal t-test. Instead, I use something that is mathematically equivalent, a linear regression model with an indicator variable. I'll talk at length about this in a separate web page, but briefly, I like to use linear regression because it allows for easy extensions to the t-test, such as for risk-adjusted comparisons.
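To see the equivalence in action, here is a minimal sketch in plain Python. The data values and the function names pooled_t and regression_t are made up for illustration. It computes the traditional t statistic and the t statistic for the slope of a regression on a 0/1 indicator, and the two agree up to rounding error.

```python
import math
import statistics

def pooled_t(control, treated):
    """Traditional two-sample t statistic with a pooled variance."""
    n0, n1 = len(control), len(treated)
    pooled_var = ((n0 - 1) * statistics.variance(control) +
                  (n1 - 1) * statistics.variance(treated)) / (n0 + n1 - 2)
    se = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    return (statistics.mean(treated) - statistics.mean(control)) / se

def regression_t(control, treated):
    """t statistic for the slope when the outcome is regressed on a 0/1 indicator."""
    y = list(control) + list(treated)
    x = [0] * len(control) + [1] * len(treated)   # 0 = control, 1 = treated
    n = len(y)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx                              # equals the difference in means
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope / se_slope

# Made-up illustrative data, not from any real study.
control = [4.1, 5.0, 5.6, 6.2, 4.8]
treated = [5.9, 6.4, 7.1, 6.8, 7.7, 6.0]

print(pooled_t(control, treated), regression_t(control, treated))
```

The regression version is the one that extends easily: adding a covariate to the model is a small change, while the t-test formula has no place to put one.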

So is your outcome variable continuous? I define a continuous variable as one that can take on a large number of possible values. This is in contrast to a categorical variable, which can take on only a small number of possible values. So what's large enough to be considered a large number of possible values? There's no consensus in the research community about this. I often use the example of the Apgar score. The Apgar score is a number assigned to a newborn baby, typically one or five minutes after birth, that indicates how well the baby is doing. The Apgar score can be as low as 0 (very bad) or as high as 10 (very good) but it doesn't have any fractional values. So that represents 11 possible values. Is that large enough to call the Apgar score a continuous outcome variable?

Well, it depends who you talk to. If you talk to me, I'll tell you that an Apgar score doesn't meet the strict definition of a continuous variable, but it is close enough to allow you to use a t-test. Others might not agree. So if you have anyone who is supervising your research, make sure that they are comfortable with treating your outcome variable as if it is continuous. Their opinion is more important than my opinion.

If your outcome variable is categorical, you will probably end up using some type of logistic regression, and I'll talk about logistic regression in a separate web page.

The other type of outcome variable that is going to raise concerns is an ordinal outcome variable. The classic example of this is the use of one or more Likert scale items on a survey. Take a look, for example, at the SF-36 survey which measures general quality of life. The SF-36 has several subscales, such as physical functioning, which is the average of ten items where you are asked to evaluate whether your health limits certain physical activities like "climbing one flight of stairs" and "bathing or dressing yourself". You get to choose between three options: 1=Yes, limited a lot; 2=Yes, limited a little; or 3=No, not limited at all. Your physical functioning subscale is an average of these ten questions, rescaled so the lowest possible value is 0 and the highest possible value is 100. The health change subscale is a single question, "Compared to one year ago, how would you rate your health in general now?", with five possible responses: 1=much better now than one year ago; 2=somewhat better now than one year ago; 3=about the same; 4=somewhat worse now than one year ago; and 5=much worse now than one year ago.

There's some concern, perhaps, that there aren't enough values to make the scale continuous. The physical functioning subscale has 21 possible values which is probably okay, but the health change subscale has only 5 possible values, which is definitely a concern. There is, however, an even greater concern. Someone assigned consecutive integers to the successive values, which seems like the simplest thing to do, but there really is no empirical justification for the use of consecutive integers. You might consider that large limitations are of greater concern. So someone who is limited a little (2) on both climbing a flight of stairs and bathing or dressing yourself is not as badly off as someone who is not limited (3) in climbing a flight of stairs but is limited a lot in bathing or dressing yourself (1). A simple average would imply that a 1 combined with a 3 is equivalent to two 2's, and you know in your heart of hearts that this is not the case. Maybe you would be better off coding "Yes, limited a lot" as 0 instead of 1 to emphasize the greater deficit associated with that response.
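The arithmetic behind these two points can be sketched in a few lines of Python. This follows the description above (average the ten items, then rescale to run from 0 to 100) and is not taken from the official SF-36 scoring manual. The function name physical_functioning is made up for illustration.

```python
from itertools import product

def physical_functioning(items):
    """Average ten items coded 1-3, then rescale so the range is 0 to 100."""
    if len(items) != 10 or any(i not in (1, 2, 3) for i in items):
        raise ValueError("expected ten responses coded 1, 2, or 3")
    average = sum(items) / len(items)
    return (average - 1) / (3 - 1) * 100

# Enumerate every possible response pattern and collect the distinct scores.
scores = {physical_functioning(p) for p in product((1, 2, 3), repeat=10)}
print(len(scores))   # number of distinct values the subscale can take

# Two hypothetical respondents who differ only on the first two items.
limited_a_little_twice = [2, 2] + [3] * 8
limited_a_lot_once = [1, 3] + [3] * 8
print(physical_functioning(limited_a_little_twice),
      physical_functioning(limited_a_lot_once))
```

The two respondents get identical scores, which is exactly the equal-spacing assumption that bothers the critics of averaging ordinal data.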

Just because you can assign a number to a particular ordinal category level does not mean that this number is an accurate reflection of what that level really means to someone. So, the argument goes, ordinal data should never be averaged.

Now I, myself, would say "Hogwash!" We average ordinal data all the time and it seems to perform reasonably well. One example of this is your grade point average. Many schools assign the number 0 to a grade of F, 1 to a D, 2 to a C, 3 to a B, and 4 to an A. And then we average those numbers without a thought. You're assuming here that two B's are equivalent to an A and a C. I know that assumption is unsupportable. But I still find the grade point average to be a useful statistical summary.

Just as an aside, one of my colleagues said that they would NEVER hire someone who had a single failing grade on their transcript. What they are doing, in effect, is changing the 0 for an F all the way down to a negative infinity for an F. Now, I am not this harsh, but I wanted to pass the observation along.

For the most part, I think it is fine to apply a t-test to ordinal data. But my opinion does not really matter. What matters is the opinion of whoever is supervising your work. So ask your boss, "Do you think I am okay in averaging an ordinal scale item like this?" If your boss doesn't know or doesn't care, then my opinion may be important to you. But if your boss has a strong opinion, pro or con, don't worry what I think. There are two good alternatives if your boss is fussy about ordinal data: you can use a nonparametric test, or you can use ordinal logistic regression. I'm not a big fan of nonparametric tests, but I'll save my reservations about them for another day.

Well, that was a lot of thinking before we did any calculations. So what's the verdict? Are you still on board with running a t-test? Okay, but we're not done thinking about the problem.

Next: think about the assumptions of the t-test.

The t-test has three assumptions: independence, normality, and equality of variances. How important these assumptions are is another area of contention in the research community. Let's tackle these assumptions one at a time.

Independence.

The independence assumption means that how large or small the outcome variable is on one patient does not influence how large or small the outcome variable is on another patient. If your patients are randomly selected and they don't get together and collude somehow, then you should be okay here. One big exception is when you have a center effect. Suppose you are taking measurements at multiple health care sites. Usually, there will be a center effect, meaning that two patients from the same health care site will be more likely to have similar outcomes than two patients from different health care sites. There are several possible explanations for this. Two patients from the same health care site are going to be treated by the same set of doctors using the same equipment and that could produce at least a slight tendency for the results to be related. There might also be some geographic factors at work.

If some of your patients come from the same family, their results are going to be similar because family members share many of the same genes and live in a common environment. I had an extreme example of this in a study of breast feeding. The researchers recruited some twins into the study. I'm sure they were thinking "Oh boy, we get two pieces of data and only have one consent form to sign." The problem, of course, is that the day that one infant stops breast feeding is likely to be the same day that the other infant stops breast feeding, especially if it is because mom has to go back to work. Now this doesn't always happen, but the tendency is strong enough to violate the assumption of independence.

When you are not comfortable with the assumption of independence, you should consider some type of hierarchical or multilevel model.

Pairing or matching in a research study also violates the assumption of independence, but there is an easy fix for this, which you'll see below.

Normality.

The normality assumption means that the scatter of data values around the respective means in the two groups falls out like a bell shaped curve. People are all over the map on this assumption. Some people worry a lot about the normality assumption, some say that the normality assumption is only important if your sample size is small, and some say that the normality assumption is not that big a deal, even with small sample sizes.

Many researchers will cite the Central Limit Theorem here. The Central Limit Theorem says that even if the individual observations are not normally distributed, the distribution of the average will converge to a normal distribution as the sample size approaches infinity. Okay, okay, I know that you didn't collect an infinite sample size. How large does the sample size need to be in order for you to be "close enough" to a normal distribution? The widely quoted number here is 30. The claim is that if your sample size is 30 or larger for each group, the assumption of normality is not a concern, thanks to the Central Limit Theorem.

The problem is that the number 30 is certainly incorrect. With some types of distributions, a sample size of 10 per group is more than enough. With other types of distributions, a sample size of 100 or even 1,000 might not be enough.
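A small simulation makes the point. This is only a sketch under stated assumptions: the three distributions (uniform, exponential, and a lognormal with a log-scale standard deviation of 2), the 2,000 replications, and the use of skewness as the yardstick are all choices made for illustration. A skewness near zero means the distribution of the sample mean is roughly symmetric, as a bell shaped curve would be.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score (zero for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def skew_of_sample_means(draw, n, reps=2000, seed=1):
    """Simulate `reps` samples of size `n` and return the skewness of their means."""
    rng = random.Random(seed)
    means = [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return skewness(means)

# Illustrative distributions; none of these come from real data.
uniform_10 = skew_of_sample_means(lambda rng: rng.random(), n=10)
expo_30 = skew_of_sample_means(lambda rng: rng.expovariate(1.0), n=30)
lognormal_30 = skew_of_sample_means(lambda rng: rng.lognormvariate(0.0, 2.0), n=30)
lognormal_100 = skew_of_sample_means(lambda rng: rng.lognormvariate(0.0, 2.0), n=100)

print(uniform_10, expo_30, lognormal_30, lognormal_100)
```

With a symmetric distribution like the uniform, averages of just 10 values already look symmetric. With the heavily skewed lognormal, averages of 30 or even 100 values are still noticeably skewed.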

If you do decide to put your stock in with those who worry about the normality assumption, you still need to assess the type of deviation from normality. Asymmetry is important. A positively skewed or skewed right distribution (a distribution that is bunched up on the left and spread more widely on the right) is going to lead to problems, as is a negatively skewed or skewed left distribution, where the data bunches up on the right and spreads more widely on the left. Even with a large sample size, the t-test is inefficient with a skewed distribution. With a small sample size, a skewed distribution might produce a t-test that doesn't properly control the Type I error rate.

Just a quick review: the Type I error rate is the probability that you reject the null hypothesis when the null hypothesis is true. Think of it as a false positive rate. Researchers will take great care to ensure that the Type I error rate is small, typically 5%. If you think your Type I error rate is 5% but it is actually 2% or 20% instead, that's not a good thing.

Another serious issue is a distribution which is symmetric, but which produces a lot of outliers on the high end and on the low end. Again, the t-test will be inefficient in this setting and might have problems with proper control of the Type I error rate.

Other types of deviations from normality are usually less serious. You might, for example, see a bimodal distribution. The bell shaped curve has a single mode, so this is clearly a deviation from normality. But you'll find that if there is nothing else unusual other than the bimodality, that you can easily rely on the Central Limit Theorem. Of course, bimodality is an interesting property and you should comment on it and investigate what might be causing the bimodality. I'd certainly hunt for any factors in your data set that might produce this bimodality. Maybe the values for females tend to be much larger than the values for males in both your treatment group and your control group. Run a simple search of the Internet, and you'll find lots of images that show the Central Limit Theorem in action.

With one exception (discussed below), I generally do not worry too much about normality, even with small sample sizes. But you'll find that there is no consensus in the research community on whether it is okay to use a t-test when your data is not normally distributed.

Equality of variances.

The third assumption is equality of variances. If your treatment group is a lot more variable than your control group (or vice versa), then you have some potential problems. The t-test requires a standard error, and the simple formula for the standard error doesn't work well if one group is a lot more variable than the other. Actually, it works pretty well if the two groups have the same sample size. If one group has a much larger sample size AND it has much larger or much smaller variation, then you have problems. In particular, you can no longer guarantee the Type I error rate.
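Here is a rough simulation of that claim. It is a sketch under assumptions chosen for illustration: normally distributed data with no true difference between groups, a total sample size of 50, standard deviations of 1 and 3, and a hard-coded critical value for 48 degrees of freedom. The function names are hypothetical.

```python
import math
import random

# Two-sided 5% critical value of the t distribution with 48 degrees of
# freedom, hard-coded so the sketch needs no statistics library.
T_CRIT_48_DF = 2.0106

def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def pooled_t(a, b):
    """Traditional t statistic using the simple (pooled) standard error."""
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * sample_variance(a) +
              (n_b - 1) * sample_variance(b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (sum(a) / n_a - sum(b) / n_b) / se

def type_one_error(n_a, sd_a, n_b, sd_b, reps=4000, seed=1):
    """Share of simulated data sets (true difference zero) where the test rejects.

    The group sizes must add up to 50 to match the hard-coded critical value.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [rng.gauss(0, sd_a) for _ in range(n_a)]
        b = [rng.gauss(0, sd_b) for _ in range(n_b)]
        if abs(pooled_t(a, b)) > T_CRIT_48_DF:
            rejections += 1
    return rejections / reps

equal_spread = type_one_error(10, 1, 40, 1)        # assumption holds
small_group_noisy = type_one_error(10, 3, 40, 1)   # smaller group varies more
large_group_noisy = type_one_error(10, 1, 40, 3)   # larger group varies more
balanced_sizes = type_one_error(25, 3, 25, 1)      # unequal spread, equal sizes

print(equal_spread, small_group_noisy, large_group_noisy, balanced_sizes)
```

When the smaller group is the more variable one, the rejection rate climbs far above the nominal 5%. When the larger group is the more variable one, it drops far below 5%. With equal sample sizes it stays close to 5% even though the spreads differ.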

How much of a discrepancy in variation do you need to see before you start worrying about this? I typically look for one standard deviation that is about three times larger or smaller than the other. There's nothing magical about the number 3, of course, and often the pattern is more important than the ratio. If the group with the larger mean also has a larger standard deviation, I tend to be more likely to conclude that the variances are not equal. This pattern occurs in practice far more often than the reverse.

If you are worried about the equality of variances assumption, there is an alternate form of the t-test, called the Satterthwaite approximation, that works well. Only approximately, of course, but in practice it does very well. If you're curious, the Satterthwaite approximation uses a slightly different formula for the standard error and requires a tricky adjustment to the degrees of freedom.
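For the curious, the two changes can be sketched as follows. The standard error adds the two variances of the means instead of pooling them, and the degrees of freedom come from the Satterthwaite formula. The function name and the summary statistics in the example are made up for illustration.

```python
import math

def satterthwaite(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t statistic, degrees of freedom) without assuming equal variances."""
    var_a = sd_a ** 2 / n_a      # squared standard error of the first mean
    var_b = sd_b ** 2 / n_b      # squared standard error of the second mean
    se = math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return (mean_a - mean_b) / se, df

# Hypothetical summary statistics: a small, variable group against a
# large, tightly clustered group.
t_stat, df = satterthwaite(mean_a=12, sd_a=3, n_a=10, mean_b=10, sd_b=1, n_b=40)
print(t_stat, df)
```

In this example the adjusted degrees of freedom come out between 9 and 10, far fewer than the 48 that the traditional test would use. The degrees of freedom are usually not a whole number, which is part of what makes the adjustment tricky to do by hand.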

There is no consensus in the research community about how much you need to worry about the equality of variances assumption. Some researchers will tell you that you should always use the Satterthwaite approximation. It is valid whether the assumption of equality of variances is met or not. Some researchers will tell you that you should run a formal test of equality of variances. One of these, Levene's test, is quite popular. If Levene's test rejects the hypothesis of equality of variances, then use the Satterthwaite approximation. Otherwise, use the traditional form of the t-test. Still other researchers will tell you that you should use the Satterthwaite approximation when the sample sizes are unequal and use the traditional form of the t-test if the sample sizes are equal.

My recommendation is to use the traditional form of the t-test, unless you have a strong expectation, based on information that you know prior to data collection, that the equality of variances assumption is not met. I do not like the idea of running statistical tests in succession, with a branch to one test or another being dependent on the result of a previous test. I much prefer the idea of a single test that you select prior to data collection. The problem with the test-then-branch approaches is that they look innocuous, but they can often play havoc with your Type I error rate.

Boxplots.

You're probably worried about all these assumptions at this point, but one plot can address two of the three assumptions. If you're not already familiar with the boxplot, you should learn it now, because it provides a wonderful visual check of the normality and equality of variances assumptions.

A box plot displays five important summary statistics. The box of a box plot extends from the 25th percentile of your data to the 75th percentile. Split the box at the median. Then draw thin lines from the top of the box up to the maximum value and from the bottom of the box to the minimum value. The five summary statistics (minimum, 25th percentile, 50th percentile, 75th percentile, and maximum) split the data into four regions, each containing 25% of your data.
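The five numbers are easy to compute yourself. Here is a minimal sketch using Python's statistics module. The "inclusive" percentile method is one choice among several, and different software packages use slightly different percentile definitions, so your numbers may differ a bit from what your plotting program shows.

```python
import statistics

def five_number_summary(values):
    """Minimum, 25th percentile, median, 75th percentile, and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Made-up data: the integers 1 through 9.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)   # (1, 3.0, 5.0, 7.0, 9)
```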

There are some slight variations on the boxplots. Most boxplots will plot individual data points that are considered by the boxplot to be too extreme. Most will also display a plus sign or other symbol at the mean (which, of course, is not always equal to the median). Some will put notches in the box to indicate some type of confidence limits. These modifications are okay.

But beware! Some programs will corrupt the boxplots by having the whiskers extend only to the 10th and 90th percentiles. Or they will represent the box itself as some type of confidence interval. Stay away! It's not that their ideas are bad. It's just that if everyone uses a different definition for what the size of the box or the length of the whiskers means, you will end up sowing confusion and discord in the research community.

If you have a reasonably traditional boxplot, look first at the medians. Does one group's median appear to be smaller than the other group's 25th percentile? Larger than the other group's 75th percentile? I like to informally call this a 25% discrepancy. There's a point in your data where about half of one group's data values are smaller, but only 25% or maybe more than 75% of the other group's data is smaller. It's not that you should ignore discrepancies smaller than this. It's more that you should never ignore a discrepancy of 25%.

In certain extreme cases, the entire box of one group is higher/lower than the entire box of the other group. This is a 50% discrepancy, because there's a data value where only 25% of one group's data is smaller, but 75% of the other group is smaller. This is a Godzilla sized difference and is almost always worth a lot of attention.
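These informal rules can be written down as a small check. This is a sketch only: the labels "25%" and "50%" are the informal names used here, the function name discrepancy is made up, and the quartiles use the "inclusive" method from Python's statistics module, which may differ slightly from your plotting software.

```python
import statistics

def quartiles(values):
    return statistics.quantiles(values, n=4, method="inclusive")

def discrepancy(group_a, group_b):
    """Classify how far apart two boxes sit, using the informal rules above."""
    a_q1, a_med, a_q3 = quartiles(group_a)
    b_q1, b_med, b_q3 = quartiles(group_b)
    # One whole box sits above the other whole box.
    if a_q1 > b_q3 or b_q1 > a_q3:
        return "50%"
    # One group's median falls outside the other group's box.
    if not (b_q1 <= a_med <= b_q3) or not (a_q1 <= b_med <= a_q3):
        return "25%"
    return "minor"

# Made-up groups for illustration.
baseline = list(range(1, 10))        # 1 through 9
print(discrepancy(baseline, list(range(11, 20))))   # boxes do not overlap
print(discrepancy(baseline, list(range(5, 14))))    # median outside the other box
print(discrepancy(baseline, list(range(2, 11))))    # medians inside both boxes
```

A label of "minor" does not mean the difference is unimportant. It only means the boxplot alone does not shout at you.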

There is a series of boxplots in an in vitro study of neonatal stem cells that illustrates this nicely. These boxplots are taken out of order (and perhaps out of context), but you are welcome to view the original paper. The first boxplot shows a 50% discrepancy. The second shows a 25% discrepancy. The third shows a very minor discrepancy.

After comparing the medians, compare the size of the two boxes. Since the boxes stretch from the 25th to the 75th percentile, they represent the range of the middle 50% of the data. There's an official term for this, the interquartile range. If one box is a lot larger or smaller than the other, you have evidence of a substantial difference in variation between the two groups. Don't worry unless the size differential is pretty extreme, say a two or three fold ratio. If you see more than a three fold ratio, you may want to rethink whether the equality of variance assumption is okay.
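If you would rather compute the ratio than eyeball it, a sketch like the following does the job. The function names are made up, the data are invented, and the code does not guard against a box of zero size, which would cause a division by zero.

```python
import statistics

def iqr(values):
    """Interquartile range: the height of the box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def box_size_ratio(group_a, group_b):
    """Larger interquartile range divided by the smaller one."""
    a, b = iqr(group_a), iqr(group_b)
    return max(a, b) / min(a, b)

# Made-up data: the second group is the first group stretched by a factor of 4.
tight = [1, 2, 3, 4, 5, 6, 7, 8, 9]
spread_out = [4 * v for v in tight]
ratio = box_size_ratio(tight, spread_out)
print(ratio)   # 4.0, beyond the three fold rule of thumb
```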

Here's an example of a boxplot from a paper on malaria endemicity showing a greater than two fold difference in the size of the boxes.

Most box plots have a rule for identifying extreme values or outliers. For most data sets, the whiskers are going to be a bit longer than the boxes themselves. But whiskers that are disproportionately large indicate outliers in your data set. When the whiskers end up being more than 1.5 times longer than the box, most software will diagnose mild outliers and highlight individual data points. This limit, 1.5 box-widths away from the top or from the bottom of the box, is called the inner fence. Anything more than three box-widths away from the top or bottom of the box is considered a serious outlier and these limits are called the outer fence. A few outliers are not a concern, especially for large data sets. In fact, the inner fence produces, in my opinion, a few too many false alarms about outliers. But if you have a lot of outliers and/or the outliers look to be extreme, you may want to rethink whether the normality assumption is okay.
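The fence rule is simple enough to sketch in a few lines. The function name flag_outliers and the data are made up for illustration, and the quartile method is an assumption, since software packages differ slightly in how they compute percentiles.

```python
import statistics

def flag_outliers(values):
    """Return (mild, serious) outliers using the inner and outer fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box = q3 - q1
    inner_low, inner_high = q1 - 1.5 * box, q3 + 1.5 * box
    outer_low, outer_high = q1 - 3 * box, q3 + 3 * box
    serious = [v for v in values if v < outer_low or v > outer_high]
    # Mild outliers lie beyond an inner fence but not beyond an outer fence.
    mild = [v for v in values
            if (v < inner_low or v > inner_high) and outer_low <= v <= outer_high]
    return mild, serious

# Made-up data: nine ordinary values plus two large ones.
mild, serious = flag_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 25])
print(mild, serious)   # [18] [25]
```

Here the box runs from 3.5 to 8.5, so the upper inner fence is at 16 and the upper outer fence is at 23.5.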

Finally, look at the size of the bottom half of the box compared to the top half and compare the length of the bottom whisker to the length of the top whisker. If the bottom half of the box is squished tight and the top half is stretched out, and if the bottom whisker is puny but the top whisker is long, then you have evidence of a data set that is skewed to the right. The reverse pattern would indicate a data set that is skewed to the left. Very few boxplots are perfectly symmetric, so small discrepancies are fine. It's only when the bottom of the box is one third or less of the size of the top of the box, or when the whisker sizes show a similar ratio, that you should be concerned about the normality assumption. Of course, when the bottom pieces of the box plot are a lot larger than the top pieces, you have concerns about a left skewed distribution.
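Here is one way to turn that rule of thumb into a check. It is a sketch with several assumptions: whiskers run all the way to the minimum and maximum, the one third cutoff is applied to either the box halves or the whiskers, and the function name boxplot_shape is made up. It does not guard against a half-box or whisker of zero length.

```python
import statistics

def boxplot_shape(values):
    """Compare bottom and top pieces of the boxplot and label the shape."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_ratio = (median - q1) / (q3 - median)                 # bottom half / top half
    whisker_ratio = (q1 - min(values)) / (max(values) - q3)   # bottom / top whisker
    ratios = (box_ratio, whisker_ratio)
    squished_bottom = any(r <= 1 / 3 for r in ratios)
    squished_top = any(r >= 3 for r in ratios)
    if squished_bottom and not squished_top:
        return "skewed right"
    if squished_top and not squished_bottom:
        return "skewed left"
    if squished_bottom and squished_top:
        return "mixed signals"
    return "roughly symmetric"

# Made-up data sets for illustration.
stretched_high = [1, 2, 3, 3.5, 4, 6, 10, 15, 25]
print(boxplot_shape(stretched_high))                      # skewed right
print(boxplot_shape([-v for v in stretched_high]))        # skewed left
print(boxplot_shape([1, 2, 3, 4, 5, 6, 7, 8, 9]))         # roughly symmetric
```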

Here's an example of some boxplots illustrating a positively skewed distribution.

And here's an example from the same paper illustrating a symmetric distribution.

Transformations.

You do not have to take the data as it is given to you and analyze it directly. You can and should think about transforming your outcome variable prior to analysis. There are a wide range of transformations that you can choose from such as the square root transformation and the inverse transformation. You can even select from a family of transformations: the Box-Cox transformations. But the only transformation that I use regularly is the log transformation. The log transformation has several special features that make it worth considering for the t-test and even for more complicated tests.

The logarithm is undefined for zero and for negative values, so you can only consider this transformation if all of your data values are above zero. Some researchers will suggest that you can add a small constant to your data if you have some zero values, but I do not recommend this. The constant you add is quite arbitrary and your results could vary quite a bit depending on the small constant that you choose to add.

Lots of outcome variables have only positive values. Sometimes the physical measurement process prevents zero and negative values from occurring. Sometimes zero and negative values are incompatible with life. Sometimes it just happens that way.

For data values larger than 1, the log function squeezes those values together. The amount of squeezing increases the larger the data values.

For data values less than 1, the log function stretches those values apart, and the smaller the data values the more the stretching.

This relative squeezing and stretching has three potential benefits. First, it tends to reduce the impact of outliers that appear on the high end of the data. Since the largest values get squeezed the most, they get pulled back closer to the rest of your data. Second, if your data is positively skewed (packed closely on the low end and spread more widely on the high end), then the log transformation, which stretches on the low end and squeezes on the high end, will often transform the data to a nice symmetric bell shaped curve. Finally, if one group in your data is much more variable than the other, a log can sometimes squeeze the first group and stretch the second group enough to equalize the variation. This only works if the group with the larger standard deviation is also the group with the larger mean, but in my experience this pattern occurs far more often than the reverse.
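The third benefit is easy to demonstrate with made-up numbers. In this sketch the second group is just the first group multiplied by ten, so it has both the larger mean and the larger standard deviation. The helper name log_transform is hypothetical, and it refuses zero or negative values instead of adding an arbitrary constant.

```python
import math
import statistics

def log_transform(values):
    """Natural log of each value; refuse data where the log is undefined."""
    if any(v <= 0 for v in values):
        raise ValueError("the log is undefined for zero and negative values")
    return [math.log(v) for v in values]

# Made-up data for illustration.
low = [1, 2, 4, 8]
high = [10, 20, 40, 80]    # same pattern, ten times larger

raw_ratio = statistics.stdev(high) / statistics.stdev(low)
log_ratio = (statistics.stdev(log_transform(high)) /
             statistics.stdev(log_transform(low)))
print(raw_ratio, log_ratio)
```

On the original scale one standard deviation is ten times the other. After the log transformation the two standard deviations match, because multiplying by ten only shifts every logged value by the same amount. Real data will rarely be this tidy, but the direction of the effect is the same.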

There's one more interesting property of the log function. If you remember your high school math, the log of a product is equal to the sum of the logs.

log(ab) = log(a) + log(b)

This means that the log transformation will take multiplicative relationships and convert them to additive relationships. It doesn't make much of a difference in a t-test, but sometimes this allows you to more easily fit and interpret complex statistical models.
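A short sketch with invented numbers shows what this looks like when comparing two groups. If every treated value is three times a control value, the difference in average logs is log(3), and exponentiating that difference recovers the multiplier. The data and variable names here are made up for illustration.

```python
import math
import statistics

# Made-up data: each treated value is three times the matching control value.
control = [1, 2, 4, 8]
treated = [3, 6, 12, 24]

# A multiplicative effect on the original scale becomes an additive shift
# on the log scale.
diff_in_logs = (statistics.fmean([math.log(v) for v in treated]) -
                statistics.fmean([math.log(v) for v in control]))

# Back-transforming the difference gives the ratio of geometric means.
ratio = math.exp(diff_in_logs)
print(diff_in_logs, ratio)
```

So a difference between groups on the log scale can be read, after back-transformation, as a ratio on the original scale.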

The paired t-test.

Sometimes the data in your two groups is paired or matched.

The independent samples t-test.

If your data is not matched, then you should use an independent samples t-test.