Peter Gotzsche in the BMJ confirms something that I've always sorta kinda noticed, but every time I just got annoyed and figured it was an anomalous screwup, without putting together that it was actually fairly typical.

A while back I bored everybody with a discussion of the concept of statistical significance. Check it out if you are really interested in a statistical primer. It's a term everybody has heard a million times but most people, I would venture to say, don't know what it means. In a pistachio shell (and this statement is imprecise, but it's all that will fit): it is the probability that chance alone would produce a difference between two groups, taken as samples from a larger population, at least as big as the one observed, assuming there is no real difference between them. It's called the P value. P=.05 means that if the variable distinguishing the two groups had no effect, there would be only a 5% probability of getting a difference this large just from randomly picking people for group A who happen to differ from the people in group B.
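
That idea can be sketched with a toy permutation test. The measurements below are invented for illustration: shuffle the group labels over and over, and ask how often chance alone produces a gap as big as the one actually observed.

```python
# Sketch of what a P value measures, via a permutation test.
# The two groups of numbers are made up for illustration.
import random

random.seed(1)

group_a = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.4, 5.2]
group_b = [5.5, 5.9, 5.4, 6.0, 5.7, 5.6, 6.1, 5.8]

def mean(xs):
    return sum(xs) / len(xs)

observed = abs(mean(group_a) - mean(group_b))

# Shuffle the labels many times: how often does random assignment
# alone produce a gap at least as large as the observed one?
pooled = group_a + group_b
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    gap = abs(mean(pooled[:8]) - mean(pooled[8:]))
    if gap >= observed:
        count += 1

p_value = count / trials
print(f"observed gap = {observed:.2f}, P = {p_value:.4f}")
```

With groups this cleanly separated, nearly no shuffles beat the observed gap, so the P value comes out far below .05.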

For some reason, the hive mind decided at some point before I was born to call any P value less than or equal to .05 "statistically significant," and any value greater than that "not statistically significant." In other words, if a study shows there is a 94% chance that talking on a cellphone causes brain cancer, the author is supposed to conclude that talking on a cellphone does not cause brain cancer. (I kid you not.) On the other hand, if it's a large study, a result that is statistically significant could nevertheless represent a very tiny difference, one that nobody would really care about, even though it's "significant."

There are more problems. If you make several comparisons, chances are that one of them will appear to be statistically significant (i.e., P < .05) just by chance. As Gotzsche points out, if you make 200 comparisons in which there is, in fact, no real association, you will get at least one "significant" P value 99.996% of the time, which is awfully close to always.
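
Gotzsche's arithmetic is easy to check yourself:

```python
# With 200 independent comparisons and no real effects anywhere,
# the chance that at least one comes out "significant" at .05:
p_none = 0.95 ** 200
p_at_least_one = 1 - p_none
print(f"{p_at_least_one:.3%}")  # about 99.996%
```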

Well, it turns out, according to Gotzsche's review of published studies, that in abstracts of clinical trials, a significant P value for relative risk was the first result reported 70% of the time; and 84% of the time in both cohort and case-control studies. (In a clinical trial, you assign people to at least two different groups, hopefully randomly, one of which gets a treatment and the other doesn't; in a cohort study, you follow groups of people who happen to be different in some way over time; and in a case-control study, you match people who have some condition with people who don't, and look for differences in their pasts that might account for it. Again, imprecisely and in a pistachio shell.) Even higher percentages of abstracts gave at least one significant result (i.e., counting not just the first result reported).

Also, there were lots of studies that had P values at or just under .05, but hardly any that had values just above .05.

There are a whole lot of problems with this. First of all, an ethical requirement for conducting clinical trials is called "equipoise" -- you have to genuinely not know whether condition A is better than condition B. Therefore, significant results should not be found in such a high proportion of trials; there should be lots of "negative" findings as well. The high percentage of significant results is even more surprising given the small size of most trials, since the chances of finding a significant result go up with sample size.
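
That last point, that bigger trials are more likely to cross the .05 threshold for a given true effect, can be illustrated with a standard normal-approximation power calculation. The effect size and sample sizes here are invented for illustration.

```python
# Statistical power rises with sample size, for a fixed true effect.
# Effect size and group sizes below are made up for illustration.
from statistics import NormalDist

def power(n, effect=0.3, alpha=0.05):
    """Approximate power of a two-sample z-test with n people per
    group and a standardized true effect size `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n) ** 0.5          # standard error of the difference
    z = effect / se              # expected test statistic
    return 1 - NormalDist(mu=z).cdf(z_crit)

for n in (20, 80, 320):
    print(n, round(power(n), 2))
```

A small trial with a real but modest effect will usually come up "not significant"; the same effect in a trial sixteen times larger is detected almost every time.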

Second, most of the studies included multiple comparisons, but no attempt was made to correct for this. If you fish around in data, comparing various sub-groups (just the women, just the people under 40, just the Latinos, etc., and mixing and matching), you will almost always be able to find some statistically significant results -- but again, the P value in such cases is essentially meaningless because it doesn't reflect the true probability of a real difference.
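
A quick simulation makes the point: give the "treatment" no effect at all, test twenty invented subgroups, and spurious "significant" P values turn up regularly.

```python
# Data-dredging sketch: the treatment does nothing, yet testing
# many subgroups regularly yields "significant" P values.
# The subgroup count and sizes are made up for illustration.
import random
from statistics import NormalDist

random.seed(7)

def subgroup_p(n=50):
    # Both subgroup arms drawn from the SAME distribution:
    # any difference between them is pure chance.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = (1 / n + 1 / n) ** 0.5
    z = (sum(a) / n - sum(b) / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided P value

p_values = [subgroup_p() for _ in range(20)]
hits = sum(p < 0.05 for p in p_values)
print(f"{hits} of 20 null subgroups came out 'significant'")
```

On average one of the twenty null comparisons clears the .05 bar, and roughly two runs in three produce at least one such "finding" to write up.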

Third, in observational (cohort and case-control) studies it is common to construct multivariate models. If you fool around with the "control" variables, you can pretty much always get some sort of a significant result to come out. This is not exactly the same idea as the second point, but it's analogous.

Fourth, since in observational studies the groups are not randomly selected, they are bound to be different in some ways other than the variable you are looking at, and if you have a large enough sample size, you will just about always have a significant result. (People who use cell phones a lot are bound to be different from people who don't in other ways as well -- so how do you know it's really the cell phones that are causing cancer after all?)

Fifth, it's pretty weird that there are a lot of P values just under .05, and hardly any just above. The numbers on either side of the cutoff ought to be roughly equal. Gotzsche doesn't quite connect the dots, but the explanation for this is probably some combination of publication bias ("negative" findings are less likely to get published); cherry picking of results by authors, guided by the arbitrary .05 standard; and fudging or torturing of data to get the P under .05.

Sixth, when Gotzsche did his own calculations, he found that a lot of the published results were just plain wrong.

He concludes: "Significant results in abstracts are common but should generally be disbelieved."

What he doesn't quite get around to pointing out is that in most published biomedical research, the point is to evaluate the effectiveness of some drug or treatment. We should conclude that the literature consistently, systemically, relentlessly, overstates the effectiveness of treatments.

Yup.

## Wednesday, August 09, 2006

### Lies, damn lies . . .
