Here's a new commentary on the concept of statistical significance, which I like because it's written in plain language and it's accessible to people who haven't taken statistics courses. I have discussed this before, but maybe not lately. It's bothered me ever since I studied environmental toxicology back in the 1980s.
In a pistachio shell, the basic idea of a P value is that you are looking at a difference between two groups of people (or other entities) that differ in some way. For example, one group has been exposed to some environmental agent and the other has not, and you want to know whether they differ in their risk for some outcome such as cancer. (That's a very common example.) You will almost certainly see that the incidence of cancer differs between the two groups -- it would be unlikely to be identical, just because of random chance. But is it different enough to conclude that the exposure really makes a difference? The P value is supposed to represent the probability of getting a difference at least as large as the one observed if there were no real difference between the groups.
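To make that concrete, here is a minimal sketch in Python. The counts are entirely made up, and the choice of Fisher's exact test (via scipy) is my own, just to illustrate what the P value is attached to.

    # Made-up counts, purely for illustration: 1,000 exposed and 1,000
    # unexposed people, with the number of cancers in each group.
    from scipy.stats import fisher_exact

    table = [[30, 970],   # exposed:   30 cancers, 970 without
             [18, 982]]   # unexposed: 18 cancers, 982 without

    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.3f}")
    # p_value is the probability of seeing a difference at least this
    # large if the exposure actually made no difference at all.

Nothing about that number, by itself, tells you whether the exposure matters; it only tells you how surprising the data would be if it didn't.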
Completely arbitrarily, at some point in the dark backward and abysm of time, it became conventional to conclude that the difference is real if P < .05, in other words if the probability of the observed difference, assuming no real difference, is less than 5%. If P > .05, you're supposed to say that the difference is "not statistically significant." In practice, however, investigators generally go further and conclude that no real difference exists.
This is wrong on both sides of the .05 divide, although lately most of the concern has been that many (indeed most) statistically significant conclusions are wrong. But as the linked essay argues, it works both ways. One common error of inference I have encountered as a reviewer occurs when investigators look at subgroups. For example, a finding is statistically significant in women but not in men, because P = .03 in women and P = .07 in men. The investigators then conclude that the agent is toxic for women but not for men. This is completely wrong. You have to test the interaction; in other words, you have to determine whether the effect actually differs between the groups. But even then, all you get is another P value.
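Here is a hedged sketch of what testing the interaction can look like. The logistic regression with an exposure-by-sex interaction term, the statsmodels library, and the simulated data are all my own choices for illustration, not anything prescribed by the commentary.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated data, purely illustrative: exposure raises risk a little,
    # and no difference between the sexes is built in.
    rng = np.random.default_rng(0)
    n = 2000
    exposure = rng.integers(0, 2, n)
    female = rng.integers(0, 2, n)
    risk = 0.02 + 0.02 * exposure
    outcome = rng.binomial(1, risk)
    df = pd.DataFrame({"outcome": outcome, "exposure": exposure, "female": female})

    # The exposure:female term is what actually asks whether the effect of
    # exposure differs between women and men.
    model = smf.logit("outcome ~ exposure * female", data=df).fit(disp=0)
    print(model.pvalues["exposure:female"])

With data like these, one subgroup can easily cross .05 while the other doesn't; the interaction term, not the pair of subgroup P values, is what addresses whether the sexes respond differently.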
"Significant" P values often lead to wrong conclusions for many reasons. A very common one is that the investigators made many comparisons, and one or two of them fell out with P < .05. But the real probability if you made, say, ten observations of getting one "significant" one is 50%. You haven't shown anything at all. On the other hand the conclusion might be true. If you're trying to decide whether, say, some exposure is toxic you don't want to conclude that it isn't just because your P value is .06.
So we need to change the culture of science around this issue. We know that science has managed to keep progressing in spite of its many cultural flaws, but we need to do better. This is one area where a good, robust self-examination is going on.