
## Tuesday, April 27, 2010

### Isn't a "central limit" a contradiction in terms?

Maybe so, but the Central Limit Theorem is the key to Gaussian statistics. The theorem states:

The distribution of sample means from any population with a finite variance, regardless of the shape of the population distribution, will be approximately normal, with a mean equal to the population mean.

By a "sample," we mean cases from the population, be they individual people or whatever else the population consists of, chosen at random, or at least in a way that is unrelated to the probability of having the characteristic we are interested in. That is what we call an "unbiased" sample. Bias in this context doesn't have anything to do with prejudice; it's just a question of whether our probabilities are somehow distorted.

For this to be true, the sample also has to be large enough. As a rule of thumb, it doesn't really start to work until you have samples of at least 30, and you really need bigger samples than that to do anything useful.
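If you want to see the "large enough" part for yourself, here is a rough sketch in Python. It assumes a made-up characteristic that 10% of the population has (my number, purely for illustration), draws many random samples of a few different sizes, and measures how lopsided the pile of sample means is. The function names and the choice of skewness as the yardstick are mine, not part of the theorem.

```python
import random
import statistics

def sample_proportions(p, sample_size, n_samples, rng):
    """Proportion of cases with the characteristic in each of n_samples random samples."""
    return [
        sum(rng.random() < p for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

def skewness(values):
    """0 for a symmetric shape, positive for a long right-hand tail."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def skew_by_sample_size(p=0.1, sizes=(5, 30, 300), n_samples=4000, seed=3):
    # p=0.1 is a hypothetical prevalence chosen only for this illustration.
    rng = random.Random(seed)
    return {n: skewness(sample_proportions(p, n, n_samples, rng)) for n in sizes}

for n, skew in skew_by_sample_size().items():
    print(f"sample size {n:3d}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward zero as the samples get bigger, which is the distribution of sample means creeping toward the symmetric normal shape.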

So I suppose you would like me to explain the theorem. Well, okay then. One way of looking at it is to go back to our coin flips. The underlying distribution of coin flips in the world is not a normal distribution. It looks like this:

For obvious reasons, we call that a "rectangular" distribution. Half the flips are heads, and half the flips are tails. What our normal curve was actually showing us was the distribution of sample means from many samples of arbitrarily large size. The most likely result is half tails and half heads, with results farther and farther from the population mean increasingly unlikely.
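Here is a small simulation of that idea, offered as a sketch rather than anything official. It assumes a fair coin, 2,000 samples of 100 flips each (numbers I picked so it runs quickly), and prints a crude sideways histogram of the sample means. The function names are my own.

```python
import random
from collections import Counter

def sample_mean_of_flips(n_flips, rng):
    """Flip a fair coin n_flips times and return the proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

def simulate_flip_means(n_samples=2000, n_flips=100, seed=0):
    """Collect the proportion of heads from many independent samples."""
    rng = random.Random(seed)
    return [sample_mean_of_flips(n_flips, rng) for _ in range(n_samples)]

def text_histogram(means, width=50):
    """One row per observed sample mean, with a bar scaled to how often it occurred."""
    counts = Counter(round(m, 2) for m in means)
    peak = max(counts.values())
    rows = []
    for value in sorted(counts):
        bar = "#" * max(1, round(width * counts[value] / peak))
        rows.append(f"{value:.2f} {bar}")
    return "\n".join(rows)

flip_means = simulate_flip_means()
print(text_histogram(flip_means))
```

Each individual flip is still just heads or tails, but the bars should pile up around 0.50 and thin out on either side.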

In my magic bag, I have an infinite number of green and blue marbles; 80% are green. Obviously, this is not a normal distribution, but, if I take a million samples . . .

I get this distribution. Yep, it's a normal distribution with a mean of .8, the population mean.
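You don't need a magic bag to try this. The sketch below stands in for it with a random number generator: each draw is green with probability 0.8, independently of the others, which is what an infinite bag amounts to. I use 5,000 samples of 100 marbles instead of a million so it finishes quickly; those numbers and the function names are my assumptions.

```python
import random
import statistics

P_GREEN = 0.8  # share of green marbles in the bag, from the text

def green_proportion(sample_size, rng):
    """Draw sample_size marbles and return the proportion that are green."""
    greens = sum(rng.random() < P_GREEN for _ in range(sample_size))
    return greens / sample_size

def draw_many_samples(n_samples=5000, sample_size=100, seed=1):
    """Repeat the draw many times and keep each sample's proportion of green."""
    rng = random.Random(seed)
    return [green_proportion(sample_size, rng) for _ in range(n_samples)]

proportions = draw_many_samples()
print("mean of sample proportions:", round(statistics.mean(proportions), 3))
print("smallest and largest:", min(proportions), max(proportions))
```

The mean of the sample proportions should land very close to 0.8, with individual samples scattered on both sides of it.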

The standard deviation of this curve is called the “standard error of the proportion.” The bigger the sample, the smaller it is. We could figure it out by using the binomial formula, computing the probability of every possible value of the sample mean, and plugging them into the formula for the standard deviation . . . But there is a much simpler way. The formula for the Standard Error of the Proportion is:

$$SE_p = \sqrt{\frac{pq}{n}}$$

As you may recall, p is the probability of one outcome, q the probability of the other (so q = 1 − p), and what you don't know yet is that n is the sample size. So, bigger samples mean smaller standard errors, which is another way of saying that you can have more confidence that your sample mean is close to the population mean. Big samples are good, but they cost more. So they're a luxury we don't always have.
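To make that concrete, here is a sketch that assumes the formula in question is the usual one, the square root of p times q divided by n, and checks it against brute force for the marble bag (p = 0.8). The sample sizes, the 3,000 simulated samples, and the function names are my choices for illustration.

```python
import math
import random
import statistics

def standard_error_of_proportion(p, n):
    """sqrt(p * q / n), where q = 1 - p and n is the sample size."""
    q = 1 - p
    return math.sqrt(p * q / n)

def simulated_standard_error(p, n, n_samples=3000, seed=2):
    """Standard deviation of the sample proportions across many simulated samples."""
    rng = random.Random(seed)
    proportions = [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(n_samples)
    ]
    return statistics.pstdev(proportions)

for n in (30, 100, 400):
    formula = standard_error_of_proportion(0.8, n)
    simulated = simulated_standard_error(0.8, n)
    print(f"n={n:4d}  formula={formula:.4f}  simulated={simulated:.4f}")
```

The two columns should agree closely. Notice also that going from 100 to 400 marbles per sample only cuts the standard error in half: because n sits under a square root, you need four times the sample to double your precision, which is why big samples get expensive fast.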

Soon, I will march on with this. It will get more interesting, I hope, when we actually put it to use for a good cause.