
Monday, April 26, 2010

Gaussing onward . . .

Ana writes, "One of the BIG problems of statistics is that analysts use and abuse measures and stats developed for the ‘normal’ curve, applying them when not appropriate." I'm going to get to critical perspectives on the use and abuse of statistics, which should be more interesting, but I hope people will be patient while I continue my march through the basics.

We derived the normal curve from a binomial distribution, but of course we aren't only interested in binomials -- variables that can take only two values, such as coin flips or whether or not somebody tests positive for a disease. We're also interested in the distribution of variables that can have many values, or even continuous values such as height, weight and blood pressure. Our binomial distribution started out as coin flips, but what we actually wound up charting was an essentially continuous variable -- the proportion of heads in an arbitrarily large series of coin flips.
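
If it helps to see that concretely, here is a rough sketch in Python (my own illustration, not part of the original post -- the flip counts are arbitrary and I'm assuming numpy is available): simulate many series of coin flips and look at the proportion of heads in each series.

    # Simulate many series of coin flips; the head counts are binomial,
    # but the proportions of heads pile up in a bell shape around 0.5.
    import numpy as np

    rng = np.random.default_rng(0)
    n_flips = 100          # flips in each series
    n_series = 100_000     # how many series we simulate

    heads = rng.binomial(n=n_flips, p=0.5, size=n_series)  # heads in each series
    proportions = heads / n_flips                           # proportion of heads per series

    print(proportions.mean())  # very close to 0.5
    print(proportions.std())   # very close to sqrt(0.25 / n_flips) = 0.05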

In the Real World, it's quite rare to find a distribution that is perfectly normal like our coin flip example. Some real world distributions look sorta kinda normal, and many do not. (The reason the distribution of IQ scores appears very close to normal is that the test is intentionally designed and scored so that it will be, BTW. Many of the normal-looking curves you will see are really artificial.) Nevertheless, the normal distribution can be useful in thinking about real world populations that aren't normally distributed. We're heading toward the explanation for that, but first we need a few more basic ideas.

The Standard Deviation is a measure of dispersion – the tendency of values to cluster near the mean, or be more spread out. You can calculate it regardless of whether a distribution is normal, but for really weirdly skewed distributions it might be misleading, or at least not very informative. Here's how you compute it:

  • Calculate the mean: Add up all of the values in your population or sample, and then divide by the number of cases. The "mean" is a synonym for what we ordinarily call the average. We can abbreviate it "M". (Sometimes the Greek letter μ (mu) is used, but why be pretentious?)


  • For each case in your sample, subtract M from its value -- X. The result could be a negative or a positive number, obviously, depending on whether X is larger or smaller than M. Then square the result, i.e. calculate (X-M) * (X-M). This way, you'll convert all the numbers into positive numbers.


  • Add up all the resulting numbers, then divide by the total number of cases, N. You have now computed a new kind of average, the average of the squared distances from the mean of all your cases.


  • This is called the variance. Usually, people then take the square root of the whole thing, to make up for the fact that the deviations were all squared. The result is called the Standard Deviation, abbreviated sd. (There's a short code sketch of these steps right after this list, if that helps.)
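
Here's that sketch, in plain Python (my choice of language, not the post's); the handful of values is made up purely for illustration.

    # A minimal sketch of the steps above, using invented values.
    values = [4, 8, 6, 5, 3, 7, 9]
    N = len(values)

    # Step 1: the mean, M
    M = sum(values) / N

    # Steps 2 and 3: square each case's distance from the mean, then average the squares
    variance = sum((x - M) ** 2 for x in values) / N

    # Step 4: the square root of the variance is the standard deviation
    sd = variance ** 0.5

    print(M, variance, sd)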


We can write the formula for this easily enough. If formulas make you sick, just avert your eyes.

sd = √( Σ (X − M)² / N )

The thing that looks like a big "E" is actually the Greek letter sigma, which is like our "S". It stands for sum -- it tells you to add something up over all of your individual cases. Here, what you are adding up is (X − M) squared for each case; then you divide by N, and take the square root of the whole thing, just like I said. Tah Dah! Now you can read a fancy mathematical formula.

MAJOR IMPORTANT FACT: In a normal distribution, about two thirds (.6826) of all values are within one sd of the mean; about 95% (.9545) are within two sd of the mean; more than 99% (.9973) are within three sd of the mean. This is true of all normal distributions.
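
You can check those fractions yourself with a quick simulation; here's one way it might look (again assuming Python and numpy, which are my choices, not the post's).

    # Check the 68 / 95 / 99.7 rule on a simulated normal distribution.
    import numpy as np

    rng = np.random.default_rng(0)
    draws = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # mean 0, sd 1

    for k in (1, 2, 3):
        fraction = np.mean(np.abs(draws) <= k)
        print(f"within {k} sd of the mean: {fraction:.4f}")
    # prints numbers very close to 0.6826, 0.9545, and 0.9973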

Now we are getting close to the magical key. Next time!
