I've just read Bernoulli's Fallacy, by Aubrey Clayton. I couldn't put it down. I feel I need to discuss it here, but it's very technical and not easy to summarize for people who haven't studied probability theory and statistics. I will at least try to get the implications across.
The fallacy in question has its roots in Bernoulli's seminal thinking about sampling theory, but it is most easily explained in terms of 20th Century statistical innovations. Put succinctly, the p value is the probability of the observation given the hypothesis. It is not the probability of the hypothesis given the observation.
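To see that those are genuinely different numbers, here is a toy simulation of my own (not from the book): a coin is either fair or biased toward heads, with even odds up front, and we happen to see 8 or more heads in 10 flips.

```python
# A minimal sketch, entirely my own invention: a coin is either fair or biased
# toward heads (70%), each possibility equally likely beforehand. We flip it 10
# times and observe 8 or more heads. The probability of that observation given
# the "fair" hypothesis is not the probability that the coin is fair given the
# observation.
import numpy as np

rng = np.random.default_rng(0)
n_worlds, n_flips = 1_000_000, 10

fair = rng.random(n_worlds) < 0.5        # which simulated worlds have a fair coin
p_heads = np.where(fair, 0.5, 0.7)       # heads probability in each world
heads = rng.binomial(n_flips, p_heads)   # flip the coin 10 times in each world
extreme = heads >= 8                     # the "observation"

p_obs_given_fair = extreme[fair].mean()  # P(observation | fair)
p_fair_given_obs = fair[extreme].mean()  # P(fair | observation)

print(f"P(>= 8 heads | fair coin) = {p_obs_given_fair:.3f}")  # roughly 0.05
print(f"P(fair coin | >= 8 heads) = {p_fair_given_obs:.3f}")  # roughly 0.12 -- a different question
```

The first number is the kind of thing a p value measures; the second is the thing we actually want to know, and it depends on how plausible the hypotheses were to begin with.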
Since the early 20th Century, the principal method of scientific inference has been computing a p value for what is called the null hypothesis, i.e. that there is not really any difference between two (or more) groups. These could be, for example, the intervention and control arms of an experiment, or men and women who respond to a survey. The p value is ostensibly the probability that any observed difference is due solely to chance in the sampling process, and the convention is that if it is less than .05 (or 5%, same thing) you "reject the null hypothesis" and publish your finding.
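For concreteness, here is what the conventional procedure looks like in code. The data are invented and the test is a plain two-sample t test; this is a sketch of the ritual, not of any particular study.

```python
# A minimal sketch of the conventional procedure, with simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control      = rng.normal(loc=10.0, scale=2.0, size=50)  # hypothetical control arm
intervention = rng.normal(loc=11.0, scale=2.0, size=50)  # hypothetical intervention arm

t_stat, p_value = stats.ttest_ind(intervention, control)
print(f"p = {p_value:.4f}")

if p_value < 0.05:
    print("Convention: reject the null hypothesis (and try to publish).")
else:
    print("Convention: fail to reject the null hypothesis.")
```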
There are many problems with this conventional procedure. Clayton wrote a whole book about them, so I obviously can't do the subject justice in a blog post, though I have discussed a few of them here recently. The procedure is valid on its own terms only if the experiment or survey was done in an exemplary manner, which unfortunately is often not the case. But even granted impeccable methodology, the inference rule is never entirely valid; the most I can say is that it provides useful evidence about whatever your (non-null) hypothesis may be, and only under exceptional circumstances.
The first problem is that there may be alternative hypotheses that explain the observation; the p value gives you no information about which is correct. Another problem is that there are an infinite number of possible hypotheses. If you go about testing them at random, at least 5% of them will give you p < .05, and possibly get published, but the vast majority of these findings will be false. Problem 3 is that if you have a large enough sample, you will just about always get p < .05, even for differences that are trivial and of no practical importance. Problem 4 is that p > .05 doesn't mean the null hypothesis is true.
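Two of these problems are easy to see by simulation. In the sketch below (again, my own illustration, with made-up numbers), a thousand tests of hypotheses that are all truly null still yield roughly fifty "significant" results, and with a million observations per group a difference of one hundredth of a standard deviation comes out overwhelmingly "significant."

```python
# A sketch of two of the problems above, using simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Testing many hypotheses that are all truly null still yields ~5% "discoveries".
false_positives = 0
for _ in range(1000):
    a = rng.normal(size=30)
    b = rng.normal(size=30)  # drawn from the same distribution: the null is true
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(f"'Significant' results among 1000 true nulls: {false_positives}")  # typically around 50

# With enough data, a trivial difference becomes "significant".
a = rng.normal(loc=0.00, size=1_000_000)
b = rng.normal(loc=0.01, size=1_000_000)  # a hundredth of a standard deviation apart
print(f"p = {stats.ttest_ind(b, a).pvalue:.1e}")  # far below .05, yet the difference is negligible
```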
A common mistake -- and I can't believe this gets published, but it does -- is to conclude that because p < .05 for group A (say, people under 60) but p = .08 for group B (people over 60), the phenomenon affects younger people but not older people. This is totally nonsensical: the difference between "significant" and "not significant" is not itself evidence of a difference between the groups.
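To make that concrete, here is a sketch with made-up summary numbers. One group's effect clears the .05 threshold and the other's doesn't, but a direct test of the difference between the two effects shows they are statistically indistinguishable.

```python
# Illustrative numbers, not from the book: estimated effects and standard errors
# for two subgroups, and a direct test of whether the effects differ.
from scipy import stats

effect_a, se_a = 2.0, 0.90   # hypothetical under-60 group
effect_b, se_b = 1.8, 1.03   # hypothetical over-60 group

p_a = 2 * stats.norm.sf(abs(effect_a / se_a))  # about .03 -- "significant"
p_b = 2 * stats.norm.sf(abs(effect_b / se_b))  # about .08 -- "not significant"

# The legitimate question: do the two effects differ from each other?
diff = effect_a - effect_b
se_diff = (se_a**2 + se_b**2) ** 0.5
p_diff = 2 * stats.norm.sf(abs(diff / se_diff))

print(f"Group A: p = {p_a:.3f}")
print(f"Group B: p = {p_b:.3f}")
print(f"Difference between the two effects: p = {p_diff:.2f}")  # nowhere near .05
```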
Clayton attributes what is called the "replicability crisis" largely to this flawed form of reasoning. It turns out that several systematic efforts to reproduce published results in various scientific fields have found that on repeating the original procedure, most of the findings don't hold up. Even when they do, the effect sizes are usually smaller. Unfortunately, journal editors are usually not very interested in replication studies, and tenure and promotion depend on getting innovative findings.
That's not to say you shouldn't believe most scientific conclusions. The important ones do get retested and confirmed or overturned through various processes. But a lot of time and money are wasted going down false pathways, and some damage may be done to patients or the economy or other important interests until false findings are corrected. Clayton's solution is to stop doing the conventional forms of significance testing and to use what are called Bayesian methods in all circumstances. So I'll discuss Bayes in the next post, unless something really important comes up first.