WARNING: Statistics content ahead. Not my usual blog-fare.

In the early 20th century, a guy named Fisher figured out that if you wanted to know whether two (or *n*) samples came from populations with equal means, you could solve this problem by looking at all the different possible assignments of data to the groups and seeing how likely it is that the difference in means you observed could have happened by chance. (This is called a randomization test or a permutation test.)

The fundamental problem was that even with a relatively modest number of observations in just two groups, the number of possible assignments of observations to conditions grows very quickly with sample size. So, for instance, if you have 10 data points split evenly across 2 groups, you have to evaluate 252 possibilities (this was before computers, so this would have been tedious but do-able), but if you have 20 data points and two groups, you have 184,756 possibilities; not tractable with pencil-and-paper.
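You can verify the combinatorial explosion in a few lines of Python (the helper name here is my own; the counting itself is just `math.comb`):

```python
from math import comb

def n_assignments(n):
    # Number of distinct ways to split n observations into two
    # equal-sized groups: "n choose n/2".
    return comb(n, n // 2)

print(n_assignments(10))  # 252: tedious but doable by hand
print(n_assignments(20))  # 184756: hopeless with pencil and paper
```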

So, realizing this was not going to be tractable, Fisher invented the ANOVA (which pretty much puts him in the Smart Guy Hall of Fame), which is an ingenious bit of mathematics which makes use of the idea of sampling distributions. The ANOVA (and here I refer to only the between-subjects ANOVA; things are more complicated in the within-subjects case) and methods like it (for 2 groups the t-test is equivalent) do require some ancillary assumptions, namely:

[1] Equal variance in each group. This is because the within-groups error term is only a valid estimate of population variance when there is only one population variance to estimate.

[2] Normal sampling distributions for the parameters being estimated, namely the mean and the standard deviation. (Note: Or variance if you don’t like standard deviations. The variance will actually be a chi-square, not a normal, but the basic idea is the same.)

[3] Independence of errors. Generally speaking this is not an issue in between-subjects designs, as one subject’s score doesn’t affect another’s, but there may be some issues with condition effects. For now, I’m going to ignore this one.

Despite the fact that corrections exist, people violate assumption 1 to varying degrees more or less routinely. This is probably mostly OK since, as long as the sample sizes in the groups are roughly equal, all this *usually* costs you is power. (Everyone appears to be taught the adjustment in the *t*-test for violation of this assumption but for some odd reason this is uncommon for the ANOVA. SPSS has the same idiosyncrasy; it prints out the adjustment for an independent-samples *t* but not for a between-subjects ANOVA. This makes no sense; apparently JMP does this better—good for JMP!)

Anyway, let’s look at assumption 2. I have noticed that there appears to be something of a misconception about this assumption. You will, in fact, see it stated incorrectly (even in statistics textbooks) as a requirement that the distribution of the population data be normal. While in some sense this is technically true, that’s really not the best way to think about it. The assumptions in the ANOVA (and related tests like the t-test) are about the *sampling* distributions of the relevant statistics, not the population distributions. So, why is this important?

This is important because it means that the ANOVA is generally (though of course not always) a lot more robust than many people seem to believe. Your sample data *need not be perfectly normal* in order to use an ANOVA on them. Why not? Because what needs to be normal are the sampling distributions of the mean and the variance. And, in fact, the sampling distribution for the mean will indeed be normal, or at least very close to normal (assuming reasonable sample size), **always**. How do we know this? By the Central Limit Theorem. (If you don’t know what that is, then go back and re-read your intro statistics textbook; turns out that was one of the really important bits.)
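If you’d rather see the Central Limit Theorem at work than take my word for it, here’s a small Python simulation; the exponential population and the sample size of 30 are just illustrative choices on my part, nothing sacred:

```python
import random
import statistics

random.seed(1)

def skewness(xs):
    # Standardized third moment: 0 for a symmetric (e.g., normal) shape.
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

# Population: exponential(1), theoretical skewness 2 -- decidedly non-normal.
population = [random.expovariate(1.0) for _ in range(20000)]

# Sampling distribution of the mean for samples of n = 30.
means = [statistics.fmean(random.sample(population, 30)) for _ in range(5000)]

print(skewness(population))  # close to 2: the population is badly skewed
print(skewness(means))       # much closer to 0: near-normal already
```

The sample means are already close to symmetric even though the population they came from is anything but.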

OK, so how about the standard deviation? It’s true, there is no Central Limit Theorem for the standard deviation. We do know that if the population distribution is normal, the sampling distribution for the standard deviation will be normal. This is, I believe, the source of the misconceptions. Many people thus believe the population data must be normal or it really messes up the sampling distribution of the standard deviation. Technically speaking, it is correct that if the population distribution is non-normal, the sampling distribution of the standard deviation is not guaranteed to be normal.

So, therefore, if your sample data are non-normal, you cannot use the ANOVA, right? This belief certainly permeates at least psychology and HCI, because I keep getting papers to review where people behave this way.

First, there is a problem of inference from sample data. Just because your sample data are non-normal does **not** guarantee that the population from which you are sampling is non-normal. You have access to a sample, not a population, and the ANOVA assumption refers to the population. So you cannot automatically assume the ANOVA is invalid based purely on the shape of your sample distribution. Unfortunately, there is no way to know for sure if your skewed sample means a skewed population.

However, even if you could know, the situation is just not that clear-cut. It turns out that the sampling distribution of the standard deviation tends, in general, to be very normal-like, even for pretty funky population distributions, as long as the sample size is reasonable. I would encourage you to go to David Lane’s Hyperstat site and play with the Sampling Distribution demo therein. Try really wacky distributions, and draw 20,000 or more samples. Notice how normal the sampling distribution for the standard deviation is, even with an N as low as 16? This is, of course, an eyeball test, and those often miss the finer points, and sometimes those finer points matter. Yes, it is possible to generate sampling distributions of the standard deviation which are sufficiently non-normal to mess up the ANOVA. But it’s not easy. And, as it turns out, mostly what gets screwed up when you violate this assumption is the power of the test, not the Type I error rate. (Which we all know must be preserved at all costs, even though people routinely perform multiple tests without adjustment. Sorry, that’s a topic for another day.)
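If you’d rather simulate than point and click, here’s a rough Python stand-in for that demo; again, the exponential population is just one convenient “funky” distribution I picked, not the only one worth trying:

```python
import random
import statistics

random.seed(2)

def skewness(xs):
    # Standardized third moment: 0 for a symmetric (e.g., normal) shape.
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

# A "pretty funky" (strongly skewed) population: exponential(1).
population = [random.expovariate(1.0) for _ in range(20000)]

# Sampling distribution of the standard deviation for N = 16,
# built from 20,000 samples.
sds = [statistics.stdev(random.sample(population, 16)) for _ in range(20000)]

print(skewness(population))  # roughly 2: badly skewed population
print(skewness(sds))         # much smaller: the SD's sampling
                             # distribution is far less skewed
```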

So, what does this mean? Can you always use an unadjusted ANOVA? No, of course not. If your data are highly non-normal (badly kurtotic or skewed), have unequal variances, and different sample sizes, then you have a problem. But if all you have is, say, skewed data, your problem might not be that bad. To see this illustrated, check out the Robustness demo from the Hyperstat site: take two “severely” skewed populations with equal standard deviation and you’ll notice that Type I error rate is nearly always preserved. (As Box observed in the 1950s; see Box, G. E. P. (1953). Non-normality and tests on variances. *Biometrika, 40*, 318–335.)

However, no matter how bad your data are, the solution is **not** the Mann-Whitney U test or similar procedures. At least not anymore.

Why not? Essentially, what these tests do is throw away data. More precisely, what they do is transform your data into ranks (throwing away the interval information), then perform a randomization test on the ranks, because there is a closed-form (i.e., analytic) solution to the randomization test when using the number sequence 1..*n*. It’s a crappy, low-power alternative (when the ANOVA assumptions are met, anyway), but it indeed does not make the kind of distributional assumptions made by the ANOVA. But am I saying it should not be used even when the ANOVA assumptions are violated? Yes, I am.
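To make the “throwing away data” point concrete, here’s a minimal pure-Python sketch of the U statistic (the sample numbers are made up): two pairs of samples whose values differ wildly, but which share the same rank ordering, produce exactly the same U.

```python
def mann_whitney_u(a, b):
    # U = number of (a_i, b_j) pairs with a_i > b_j, counting ties as 0.5.
    # This depends only on the ordering of the values, i.e., their ranks.
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a for y in b
    )

# The group means differ by a couple of units here...
print(mann_whitney_u([1.0, 2.0, 5.0], [1.5, 3.0, 4.0]))      # U = 4.0
# ...and by over a thousand here, yet U is identical:
print(mann_whitney_u([1.0, 2.0, 5000.0], [1.5, 3.0, 4.0]))   # U = 4.0
```

The interval information (how *far apart* the scores are) never enters the computation; only the ranks do.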

While the Mann-Whitney U and its cousins were certainly reasonable things to do in 1956 when Siegel wrote his still-popular book on non-parametric statistics, I cannot imagine what the justification is for doing this kind of test in 2008. If you’re worried enough about distributional assumptions that you refuse to do the standard parametric test (which is fine but not necessary in many cases), then **do the bloody randomization test** on your raw data! Or do some bootstrap-based variant. We have computers now—fast ones—that can resample your data 20,000 times in the blink of an eye (in 2008, even a cheap laptop can do this). These resampling techniques also do not require any distributional assumptions about your data, but they are clearly more powerful than old-school non-parametric methods like the Mann-Whitney U. If Fisher had access to a fast computer in 1915 he probably never would have bothered with the ANOVA in the first place. We now have fast computers, so why on earth should we bother with watered-down alternatives which are simply cheapened versions of the randomization test that Fisher himself recognized as appropriate?
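And in case “do the bloody randomization test” sounds exotic: it’s about a dozen lines of Python. This is a Monte Carlo version (shuffling group labels rather than enumerating every possible split), and the sample data are made up for illustration:

```python
import random
import statistics

def randomization_test(a, b, reps=20000, seed=0):
    """Two-sample Monte Carlo randomization test on the raw data.

    Shuffle the group labels `reps` times, recompute the absolute
    difference in means each time, and report the proportion of
    shuffles at least as extreme as the observed difference
    (a two-tailed p-value, via the absolute value)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n = len(a)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:]))
        if diff >= observed:
            hits += 1
    return hits / reps

# Made-up scores for two groups of five subjects:
print(randomization_test([12, 15, 14, 18, 16], [9, 11, 10, 13, 8]))
```

No distributional assumptions anywhere in there, and the only approximation (Monte Carlo rather than exhaustive enumeration) gets better the more reps you throw at it.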

So, if I’m reviewing your manuscript and I see a Mann-Whitney U or anything of its ilk (Kruskal-Wallis, Friedman, etc.), I’m rejecting it (well, OK, maybe just giving it revise-and-resubmit if everything else looks good)—unless, of course, what you actually have is rank data. Since people rarely have rank data, you then have alternatives: either realize that the ANOVA is at least somewhat robust to assumption violations and go ahead and use it, or do a randomization or other resampling test. (I’m sure Rand Wilcox would argue for some other form of robust test; one of those would be fine as well.) The U and those like it are a wishy-washy bucket of “neither” and, given modern computing power, should be purged from statistics courses and textbooks. And you old-school statistics instructors out there: stop teaching these procedures as appropriate! You can teach them for historical context so your students know what they are when they see them, but stop telling students to actually perform these tests! They had their day in the sun, but that day has been over for a good twenty years now. Teach them permutation or bootstrap tests instead.

(Thanks to David Lane for helpful comments and suggestions on an earlier draft of this.)