
Friday, July 23, 2021

On the bias of science: more on methods

As I have said, there are many different methods used by scientists. What I will focus on today is so-called Gaussian statistics and a number called the p-value. These have been discussed enough in the scholarly literature of late that many non-scientists probably have a general idea that there is some controversy associated with them, and think they at least understand the gist of it.


Just in case you don't really understand it, let me try to explain it as simply as I can. If you flip a coin multiple times, the most likely result will be an equal number of heads and tails. The second most likely will be one more head or one more tail, and so on, with the least likely being all heads or all tails. If you make a bar graph of the possible outcomes it will look something like this:

[Figure: bar chart of a binomial distribution with the normal curve superimposed, via Wikipedia]

You'll notice they've superimposed a smooth curve, which is called the normal distribution; that's what you get in the limit of an infinite number of coin flips. (The "tails" on each end actually extend asymptotically, but you can't show that in a finite graph.) It turns out that a lot of phenomena in the real world that aren't binary but continuous, such as people's height, look something like this. But that's not the important part. What's really important is that regardless of whether some quantity is itself normally distributed, if you take a whole lot of random samples, the means of those samples will be approximately normally distributed. One of the properties of the normal curve is called the standard deviation; it's a measure of how spread out the values are. About 2/3 (68%) of the values will be within 1 SD of the mean, and about 95% within 2 SD.
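If you want to see this for yourself, here's a little Python sketch (the population, sample size, and number of samples are just arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# A decidedly non-normal "population": skewed, exponentially distributed values
population = rng.exponential(scale=2.0, size=1_000_000)

# Draw a whole lot of random samples and record each sample's mean
sample_means = np.array([
    rng.choice(population, size=50).mean()
    for _ in range(10_000)
])

mean = sample_means.mean()
sd = sample_means.std()

# The sample means pile up in a roughly bell-shaped (normal) way, even though
# the population itself is skewed; the 68% / 95% rule holds for them
within_1sd = np.mean(np.abs(sample_means - mean) < 1 * sd)
within_2sd = np.mean(np.abs(sample_means - mean) < 2 * sd)
print(f"within 1 SD: {within_1sd:.2f}")   # roughly 0.68
print(f"within 2 SD: {within_2sd:.2f}")   # roughly 0.95
```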


So let's say you want to compare samples A and B and figure out whether they differ on some property C. You calculate the mean difference between A and B on the quantity C, and you can also estimate the SD of that mean difference (the standard error) from the SDs of A and B. (It's easy to look up the formulas for all this if you like, but it isn't necessary.) That means you can estimate the probability of seeing a mean difference at least as big as the one you observed if, in the total population, there isn't really any difference between A and B; in other words, if you just happened to see a difference because of random chance. That's called the p-value.
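In practice you'd let software do the arithmetic. Here's a minimal sketch with made-up data, using SciPy's standard two-sample t-test (the group sizes and numbers are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up measurements of quantity C for groups A and B
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=10.8, scale=2.0, size=40)

mean_difference = group_b.mean() - group_a.mean()

# Welch's t-test: the p-value is the probability of seeing a difference at
# least this big if the two populations actually have the same mean
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"mean difference: {mean_difference:.2f}")
print(f"p-value: {result.pvalue:.3f}")
```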


It could be anything from 0 to 1. So how do we decide if there really is a difference? Somewhere in the past (I could look up the name but who cares), somebody decided that it has to be .05 or less in order to draw a conclusion. This is completely arbitrary; there is no principled basis for it whatsoever. However, the whole world went along and we live with the consequences.


One consequence is that if the p-value is .06, or anything else above .05, you're supposed to conclude that there is no difference between A and B. This is obviously not a valid conclusion, but that's what you're required to write in your paper. However, since negative findings are less likely to be published, investigators really want to get to that magical <.05. There are various ways to do this. For example, you can decide that you shouldn't count "outliers," extreme values, on the grounds that they must be erroneous or there must be something weird about them. Or you can look after the fact for sub-groups within A and B for which the p-value is <.05 and draw conclusions about them.
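To see why after-the-fact subgroup hunting is a problem, here's a toy simulation (the number of subgroups and the group sizes are arbitrary). Both groups are drawn from exactly the same population, so there is no real difference at all; but if you test enough subgroups, p < .05 shows up surprisingly often:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_studies = 2_000
n_subgroups = 10    # e.g., age bands, sex, region... (arbitrary choice)
n_per_arm = 30      # subjects per subgroup per arm (arbitrary choice)

false_positives = 0
for _ in range(n_studies):
    # Every subgroup in both arms is drawn from the same population: no real effect
    found = False
    for _ in range(n_subgroups):
        a = rng.normal(size=n_per_arm)
        b = rng.normal(size=n_per_arm)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            found = True
            break
    false_positives += found

# With 10 independent subgroups, roughly 1 - 0.95**10, or about 40%, of
# studies will "find" a significant subgroup difference purely by chance
print(f"studies with at least one p < .05: {false_positives / n_studies:.0%}")
```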


I have actually seen published papers (and I have reviewed papers in which I had to point out that this is fallacious) that conclude that because the p-value is .06 for one subgroup and .04 for another, the difference exists for one but not the other. This is a profound and obvious error of inference: if you want to claim that an effect differs between subgroups, you have to test the difference between the subgroups directly, not compare their p-values. Yet it makes its way into the peer-reviewed literature. In other words, a whole lot of people with Ph.D.s and academic appointments don't understand this supposedly elementary concept.
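Here's a quick way to see why comparing p-values across subgroups is a trap. In this simulation both subgroups have exactly the same true effect; the only difference is that one subgroup happens to be smaller (the effect size and sample sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

effect = 0.5              # the SAME true effect in both subgroups
n_big, n_small = 120, 40  # one subgroup just happens to be smaller
n_sims = 5_000

opposite_sides = 0
for _ in range(n_sims):
    p_big = stats.ttest_ind(rng.normal(size=n_big),
                            rng.normal(effect, size=n_big)).pvalue
    p_small = stats.ttest_ind(rng.normal(size=n_small),
                              rng.normal(effect, size=n_small)).pvalue
    opposite_sides += (p_big < 0.05) != (p_small < 0.05)

# Even though the true effect is identical in both subgroups, the two tests
# frequently land on opposite sides of .05, which the fallacy would read as
# "the effect exists in one subgroup but not the other"
print(f"p-values on opposite sides of .05: {opposite_sides / n_sims:.0%}")
```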

 

Another problem is that before we even get to calculating the p-value, there are various reasons why you are likely to erroneously find what you are looking for. For example, somebody has to measure the quantity of interest, and if they know or can guess which group the subject is in, they will have an unconscious (hopefully) tendency to bias the measurement. You can adjust the eligibility requirements for your study to include people who are more likely to show a difference. You can stop collecting data once you see what you want to see. And there's more.
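That last trick, stopping the study as soon as the p-value dips below .05, is worth seeing in action. A toy simulation with arbitrary batch sizes, where there is no real difference between the groups at all:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n_sims = 2_000
max_n = 200   # maximum subjects per group (arbitrary)
batch = 10    # peek at the data after every 10 subjects per group (arbitrary)

stopped_with_result = 0
for _ in range(n_sims):
    a, b = [], []
    for _ in range(max_n // batch):
        a.extend(rng.normal(size=batch))   # no true difference between groups
        b.extend(rng.normal(size=batch))
        if stats.ttest_ind(a, b).pvalue < 0.05:
            stopped_with_result += 1       # "significant" result found; stop here
            break

# Peeking repeatedly and stopping at the first p < .05 pushes the false
# positive rate well above the nominal 5%
print(f"false positive rate with optional stopping: {stopped_with_result / n_sims:.0%}")
```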


For this reason, some people advocate that a conclusion should only be drawn if the p-value is considerably less than .05, such as .01. Other people want to do away with it entirely and only report the confidence interval: what is the 95% (roughly 2 SD) interval for the observed difference? While we argue about these matters, however, progress toward understanding the world is muddled and slowed down. It doesn't mean we aren't getting anywhere, but it means it's harder than it ought to be.
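For what it's worth, here's what reporting a confidence interval instead would look like, again with made-up numbers for the same kind of A-versus-B comparison as above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data for groups A and B
group_a = rng.normal(10.0, 2.0, size=40)
group_b = rng.normal(10.8, 2.0, size=40)

diff = group_b.mean() - group_a.mean()

# Standard error of the difference between two independent means
se = np.sqrt(group_a.var(ddof=1) / len(group_a) +
             group_b.var(ddof=1) / len(group_b))

# 95% confidence interval: the observed difference plus or minus about 2 SE
# (1.96 is the exact multiplier under the normal approximation)
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"observed difference: {diff:.2f}")
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```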


More to come.