WARNING: Statistics content ahead. Not my usual blog-fare.

In the early 20th century, a guy named Fisher figured out that if you wanted to know whether two (or *n*) samples came from populations with equal means, you could solve this problem by looking at all the different possible assignments of data to the groups and seeing how likely it is that the difference in means you observed could have happened by chance. (This is called a randomization test or a permutation test.)

The fundamental problem was that even with a relatively modest number of observations in just two groups, the number of possible assignments of observations to conditions grows very quickly with sample size. So, for instance, if you have 10 data points split evenly across 2 groups, you have to evaluate 252 possibilities (this was before computers, so this would have been tedious but do-able), but if you have 20 data points and two groups, you have 184,756 possibilities; not tractable with pencil-and-paper.
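You can verify the combinatorial explosion in a few lines of Python (the helper name here is my own; the counting itself is just `math.comb`):

```python
from math import comb

def n_assignments(n):
    # Number of distinct ways to split n observations into two
    # equal-sized groups: "n choose n/2".
    return comb(n, n // 2)

print(n_assignments(10))  # 252: tedious but doable by hand
print(n_assignments(20))  # 184756: hopeless with pencil and paper
```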

So, realizing this was not going to be tractable, Fisher invented the ANOVA (which pretty much puts him in the Smart Guy Hall of Fame), which is an ingenious bit of mathematics which makes use of the idea of sampling distributions. The ANOVA (and here I refer to only the between-subjects ANOVA; things are more complicated in the within-subjects case) and methods like it (for 2 groups the t-test is equivalent) do require some ancillary assumptions, namely:

[1] Equal variance in each group. This is because the within-groups error term is only a valid estimate of population variance when there is only one population variance to estimate.

[2] Normal sampling distributions for the parameters being estimated, namely the mean and the standard deviation. (Note: Or variance if you don’t like standard deviations. The variance will actually be a chi-square, not a normal, but the basic idea is the same.)

[3] Independence of errors. Generally speaking this is not an issue in between-subjects designs, as one subject’s score doesn’t affect another’s, but there may be some issues with condition effects. For now, I’m going to ignore this one.

Despite the fact that corrections exist, people violate assumption 1 to varying degrees more or less routinely. This is probably mostly OK since, as long as the sample sizes in the groups are roughly equal, all this *usually* costs you is power. (Everyone appears to be taught the adjustment in the *t*-test for violation of this assumption but for some odd reason this is uncommon for the ANOVA. SPSS has the same idiosyncrasy; it prints out the adjustment for an independent-samples *t* but not for a between-subjects ANOVA. This makes no sense; apparently JMP does this better—good for JMP!)

Anyway, let’s look at assumption 2. I have noticed that there appears to be something of a misconception about this assumption. You will, in fact, see it stated incorrectly (even in statistics textbooks) as a requirement that the distribution of the population data be normal. While in some sense this is technically true, that’s really not the best way to think about it. The assumptions in the ANOVA (and related tests like the t-test) are about the *sampling* distributions of the relevant statistics, not the population distributions. So, why is this important?

This is important because it means that the ANOVA is generally (though of course not always) a lot more robust than many people seem to believe. Your sample data *need not be perfectly normal* in order to use an ANOVA on them. Why not? Because what needs to be normal are the sampling distributions of the mean and the variance. And, in fact, the sampling distribution for the mean will indeed be normal, or at least very close to normal (assuming reasonable sample size), **always**. How do we know this? By the Central Limit Theorem. (If you don’t know what that is, then go back and re-read your intro statistics textbook; turns out that was one of the really important bits.)
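If you’d rather see the Central Limit Theorem at work than take my word for it, here’s a small Python simulation; the exponential population and the sample size of 30 are just illustrative choices on my part, nothing sacred:

```python
import random
import statistics

random.seed(1)

def skewness(xs):
    # Standardized third moment: 0 for a symmetric (e.g., normal) shape.
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

# Population: exponential(1), theoretical skewness 2 -- decidedly non-normal.
population = [random.expovariate(1.0) for _ in range(20000)]

# Sampling distribution of the mean for samples of n = 30.
means = [statistics.fmean(random.sample(population, 30)) for _ in range(5000)]

print(skewness(population))  # close to 2: the population is badly skewed
print(skewness(means))       # much closer to 0: near-normal already
```

The sample means are already close to symmetric even though the population they came from is anything but.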

OK, so how about the standard deviation? It’s true, there is no Central Limit Theorem for the standard deviation. We do know that if the population distribution is normal, the sampling distribution for the standard deviation will be normal. This is, I believe, the source of the misconceptions. Many people thus believe the population data must be normal or it really messes up the sampling distribution of the standard deviation. Technically speaking, it is correct that if the population distribution is non-normal, the sampling distribution of the standard deviation is not guaranteed to be normal.

So, therefore, if your sample data are non-normal, you cannot use the ANOVA, right? This belief certainly permeates at least psychology and HCI, because I keep getting papers to review where people behave this way.

First, there is a problem of inference from sample data. Just because your sample data are non-normal does **not** guarantee that the population from which you are sampling is non-normal. You have access to a sample, not a population, and the ANOVA assumption refers to the population. So you cannot automatically assume the ANOVA is invalid based purely on the shape of your sample distribution. Unfortunately, there is no way to know for sure if your skewed sample means a skewed population.

However, even if you could know, the situation is just not that clear-cut. It turns out that the sampling distribution of the standard deviation tends, in general, to be very normal-like, even for pretty funky population distributions, as long as the sample size is reasonable. I would encourage you to go to David Lane’s Hyperstat site and play with the Sampling Distribution demo therein. Try really wacky distributions, and draw 20,000 or more samples. Notice how normal the sampling distribution for the standard deviation is, even with an N as low as 16? This is, of course, an eyeball test, and those often miss the finer points, and sometimes those finer points matter. Yes, it is possible to generate sampling distributions of the standard deviation which are sufficiently non-normal to mess up the ANOVA. But it’s not easy. And, as it turns out, mostly what gets screwed up when you violate this assumption is the power of the test, not the Type I error rate. (Which we all know must be preserved at all costs, even though people routinely perform multiple tests without adjustment. Sorry, that’s a topic for another day.)
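If you’d rather simulate than point and click, here’s a rough Python stand-in for that demo; again, the exponential population is just one convenient “funky” distribution I picked, not the only one worth trying:

```python
import random
import statistics

random.seed(2)

def skewness(xs):
    # Standardized third moment: 0 for a symmetric (e.g., normal) shape.
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

# A "pretty funky" (strongly skewed) population: exponential(1).
population = [random.expovariate(1.0) for _ in range(20000)]

# Sampling distribution of the standard deviation for N = 16,
# built from 20,000 samples.
sds = [statistics.stdev(random.sample(population, 16)) for _ in range(20000)]

print(skewness(population))  # roughly 2: badly skewed population
print(skewness(sds))         # much smaller: the SD's sampling
                             # distribution is far less skewed
```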

So, what does this mean? Can you always use an unadjusted ANOVA? No, of course not. If your data are highly non-normal (badly kurtotic or skewed), have unequal variances, and different sample sizes, then you have a problem. But if all you have is, say, skewed data, your problem might not be that bad. To see this illustrated, check out the Robustness demo from the Hyperstat site: take two “severely” skewed populations with equal standard deviation and you’ll notice that Type I error rate is nearly always preserved. (As Box observed in the 1950s; see Box, G. E. P. (1953). Non-normality and tests on variances. *Biometrika, 40*, 318–335.)

However, no matter how bad your data are, the solution is **not** the Mann-Whitney U test or similar procedures. At least not anymore.

Why not? Essentially, what these tests do is throw away data. More precisely, what they do is transform your data into ranks (throwing away the interval information), then perform a randomization test on the ranks, because there is a closed-form (i.e., analytic) solution to the randomization test when using the number sequence 1..*n*. It’s a crappy, low-power alternative (when the ANOVA assumptions are met, anyway), but it indeed does not make the kind of distributional assumptions made by the ANOVA. But am I saying it should not be used even when the ANOVA assumptions are violated? Yes, I am.
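To make the “throwing away data” point concrete, here’s a minimal pure-Python sketch of the U statistic (the sample numbers are made up): two pairs of samples whose values differ wildly, but which share the same rank ordering, produce exactly the same U.

```python
def mann_whitney_u(a, b):
    # U = number of (a_i, b_j) pairs with a_i > b_j, counting ties as 0.5.
    # This depends only on the ordering of the values, i.e., their ranks.
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a for y in b
    )

# The group means differ by a couple of units here...
print(mann_whitney_u([1.0, 2.0, 5.0], [1.5, 3.0, 4.0]))      # U = 4.0
# ...and by over a thousand here, yet U is identical:
print(mann_whitney_u([1.0, 2.0, 5000.0], [1.5, 3.0, 4.0]))   # U = 4.0
```

The interval information (how *far apart* the scores are) never enters the computation; only the ranks do.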

While the Mann-Whitney U and its cousins were certainly reasonable things to do in 1956 when Siegel wrote his still-popular book on non-parametric statistics, I cannot imagine what the justification is for doing this kind of test in 2008. If you’re worried enough about distributional assumptions that you refuse to do the standard parametric test (which is fine but not necessary in many cases), then **do the bloody randomization test** on your raw data! Or do some bootstrap-based variant. We have computers now—fast ones—that can resample your data 20,000 times in the blink of an eye (in 2008, even a cheap laptop can do this). These resampling techniques also do not require any distributional assumptions about your data, but they are clearly more powerful than old-school non-parametric methods like the Mann-Whitney U. If Fisher had access to a fast computer in 1915 he probably never would have bothered with the ANOVA in the first place. We now have fast computers, so why on earth should we bother with watered-down alternatives which are simply cheapened versions of the randomization test that Fisher himself recognized as appropriate?
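And in case “do the bloody randomization test” sounds exotic: it’s about a dozen lines of Python. This is a Monte Carlo version (shuffling group labels rather than enumerating every possible split), and the sample data are made up for illustration:

```python
import random
import statistics

def randomization_test(a, b, reps=20000, seed=0):
    """Two-sample Monte Carlo randomization test on the raw data.

    Shuffle the group labels `reps` times, recompute the absolute
    difference in means each time, and report the proportion of
    shuffles at least as extreme as the observed difference
    (a two-tailed p-value, via the absolute value)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n = len(a)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:]))
        if diff >= observed:
            hits += 1
    return hits / reps

# Made-up scores for two groups of five subjects:
print(randomization_test([12, 15, 14, 18, 16], [9, 11, 10, 13, 8]))
```

No distributional assumptions anywhere in there, and the only approximation (Monte Carlo rather than exhaustive enumeration) gets better the more reps you throw at it.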

So, if I’m reviewing your manuscript and I see a Mann-Whitney U or anything of its ilk (Kruskal-Wallis, Friedman, etc.), I’m rejecting it (well, OK, maybe just giving it revise-and-resubmit if everything else looks good)—unless, of course, what you actually have is rank data. Since people rarely have rank data, you then have alternatives: either realize that the ANOVA is at least somewhat robust to assumption violations and go ahead and use it, or do a randomization or other resampling test. (I’m sure Rand Wilcox would argue for some other form of robust test; one of those would be fine as well.) The U and those like it are a wishy-washy bucket of “neither” and, given modern computing power, should be purged from statistics courses and textbooks. And you old-school statistics instructors out there: stop teaching these procedures as appropriate! You can teach them for historical context so your students know what they are when they see them, but stop telling students to actually perform these tests! They had their day in the sun, but that day has been over for a good twenty years now. Teach them permutation or bootstrap tests instead.

(Thanks to David Lane for helpful comments and suggestions on an earlier draft of this.)