Comments on the previous post remind me that I don't always explain stuff as well as I should. Most of the time, when you read about the latest amazing discovery in social science (left-handed lesbian steamfitters are more likely to live in neighborhoods with lots of one-way streets, or whatever), it's based on some form of multivariate analysis, and the most common method used is called multiple linear regression. That's the case with the prediction of happiness I talked about yesterday, and the source of the confusion I created. So here, to cut through the fog of math, is my latest attempt at statistics 101.
A linear regression equation is a way of explaining, or predicting, some fact as the sum of the contributions of a bunch of other facts -- in our example, how happy Americans say they are in response to a survey as a function of some of their other basic attributes. The equation looks like this:
Y = C + aX1 + bX2 + cX3 + ... + (whatever letter)Xn
"Y" is called the "dependent variable" (DV) -- in this case happiness, which we are trying to predict or explain.
"C" is called the constant -- think of it as though everybody starts out with some fundamental endowment of happiness, to which the other variables add, or from which they subtract.
X1, X2, X3 up to Xn are called the "independent variables" (IV) -- in our example, they include age, amount of education, income, sex, employment status, etc. (Wait -- how do you express "sex" as a number? Patience -- I'll explain.)
a, b, c etc. up to whatever letter are called "coefficients" -- the amount by which you multiply each variable to get the best possible prediction.
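To make the arithmetic concrete, here is a minimal sketch of what the equation does for a single survey respondent. All of the variable names, the constant, and the coefficient values below are invented for illustration; they are not the numbers from the actual happiness study.

```python
# A made-up constant and made-up coefficients, purely for illustration.
C = 5.0                      # the constant: everyone's starting endowment
coefficients = {             # the a, b, c, ... for each independent variable
    "age": -0.01,
    "years_of_education": 0.05,
    "log_income": 0.30,
    "female": 0.15,          # dummy variable: 1 = female, 0 = male
}

respondent = {               # one respondent's independent variables (also invented)
    "age": 45,
    "years_of_education": 16,
    "log_income": 10.8,
    "female": 1,
}

# Y = C + a*X1 + b*X2 + ... : multiply each variable by its coefficient and add them up.
predicted_happiness = C + sum(coefficients[k] * respondent[k] for k in coefficients)
print(predicted_happiness)
```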
Now, the way you deal with so-called "nominal variables" -- variables which aren't numbers, but qualitative facts such as sex -- is that you make them what are called "dummy variables," coded as zero or one. For example, we might code men as 0 and women as 1. You could think of it as amount of femaleness, which is generally all or nothing, although I suppose in principle transgendered or androgynous people could be .5 or something. Anyway, if you're female you get 100% of the coefficient added to your score; if you're male you get nothing. (The coefficient could turn out to be negative, so it doesn't matter which category we initially choose to code as zero and which we choose to code as one.)
Where you have multiple mutually exclusive possibilities, such as employed, unemployed, student, homemaker, you have to code each of them separately, and you can get a "1" on only one of them. For mathematical reasons that you might be able to discern if you think about it hard enough but which are not worth trying to explain here, you have to leave one of the categories out entirely. That's called the "reference category." People who belong in that category get all zeroes.
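Here is a small sketch of dummy coding in practice, using pandas with invented data and column names. The point to notice is that sex becomes a single 0/1 column, while a four-category variable like employment status becomes three 0/1 columns, with one category left out as the reference.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "employment": ["employed", "unemployed", "student", "homemaker"],
})

# Sex becomes one 0/1 dummy; "male" is the reference category here.
df["female"] = (df["sex"] == "female").astype(int)

# Employment has four mutually exclusive categories, so it becomes three
# 0/1 columns; drop_first=True leaves one category out as the reference.
emp_dummies = pd.get_dummies(df["employment"], prefix="emp", drop_first=True, dtype=int)

print(pd.concat([df[["female"]], emp_dummies], axis=1))
# A person in the reference employment category gets all zeroes on the emp_ columns.
```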
That's easy to understand, but now we come to a couple of tricky things. The first is, how do you calculate the coefficients? Stop listening to your iPod for a minute and concentrate. The coefficients are chosen so as to minimize the average (or sum, same thing) of the squared distances of the data points from a line in an imaginary multidimensional space. If you think about a simple two-variable regression this isn't hard to visualize. You have a Y axis representing the dependent variable and an X axis representing the independent variable -- say, income as a function of education. The points tend to fall along a line, but they don't make anything like a perfect line. You pick the line that makes the average of the squared distances of the points as small as possible. The linear equation -- Y = C + aX -- specifies that line. In spaces of more than three dimensions it's impossible to visualize, but the mathematics works just fine. The computation is easier than you might expect, but it involves something you didn't study in school, called matrix algebra, which is essentially a method for solving a whole bunch of simultaneous equations.
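If you want to see that machinery without doing the matrix algebra on paper, here is a sketch using numpy on simulated data (the variable names and the "true" line are made up). The least-squares routine and the normal equations give the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
education = rng.integers(8, 17, size=200)                    # years of education (simulated)
income = 2 + 3.0 * education + rng.normal(0, 2, size=200)    # a "true" line plus scatter

# Design matrix: a column of ones picks up the constant C, then the IV.
X = np.column_stack([np.ones(education.size), education])
y = income

# lstsq finds the C and a that minimize the sum of squared vertical
# distances between the data points and the line Y = C + a*X.
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # should come out close to the "true" values 2 and 3

# The same answer via the matrix-algebra route -- the "normal equations"
# (X'X)b = X'y, which is just a small system of simultaneous equations.
print(np.linalg.solve(X.T @ X, X.T @ y))
```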
The second tricky part is that the coefficients are generally converted from their raw form into what are called Beta coefficients, which express each coefficient in standard deviation units (the "Z score" scale): the number of standard deviations by which the DV changes for each standard deviation change in the variable. (See my earlier post, The significance of significance, for an explanation of these terms.) The reason for doing this is that the units of the different variables aren't necessarily comparable. Education ranges from 0 to maybe 16 years, age (since we're only surveying adults) from 18 to maybe somewhere in the 90s, and income from the SSI minimum of a few thousand dollars a year to tens of millions. To properly understand the contribution of each variable to the DV, you need to express them in a way that makes the units irrelevant, i.e. as proportions of the total variability, however it is measured. (To get a better fit, they converted the income variable into the log of income, as I mentioned yesterday. The resulting regression equation is still linear, however, because the variable itself is still just a number, as is the coefficient.)
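For the curious, here is a sketch of how you get Beta coefficients: standardize every variable (the DV included) into z-scores and re-fit the same equation. The data and the "true" coefficients below are simulated, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
age = rng.uniform(18, 90, n)
education = rng.uniform(0, 16, n)
income = np.exp(rng.normal(10, 1, n))        # dollars, heavily skewed
log_income = np.log(income)                  # the log transform mentioned above
happiness = 2 + 0.01*age + 0.05*education + 0.3*log_income + rng.normal(0, 1, n)

def zscore(v):
    """Convert a variable to standard-deviation units."""
    return (v - v.mean()) / v.std()

# Fit the regression on z-scores of everything, DV included.
Xz = np.column_stack([np.ones(n), zscore(age), zscore(education), zscore(log_income)])
yz = zscore(happiness)

betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
print(betas[1:])   # Beta coefficients: SDs of happiness per SD of each variable
```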
Okay, forget about all that if you want to; here's the stuff that matters. When you put a new variable into the equation, you are likely to change the coefficients on the other variables. Here's an example. I attended a conference once where an investigator presented her findings and made the claim that Latina women are just as likely to have abortions as other women. I knew that wasn't true, and in fact she hadn't found any such thing. Latina women in her sample were considerably less likely to have abortions than non-Latina women, but they were also much more likely to be Catholic. Once she put being Catholic into the equation, the coefficient on being Latina became non-significant. As a matter of fact, if I remember correctly, it turned slightly positive. Howsomever, that obviously does not mean that Latina women are more likely to have abortions than other women; they are still less likely to have abortions, but now we know that it's because they are more likely to be Catholic. Duhh. It follows that those Latina women who are not Catholic are not less likely to have abortions, but they are a comparatively small percentage of all Latina women.
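To watch that coefficient shift happen, here is a toy simulation in the same spirit as the conference example. Every number in it is invented; the construction simply guarantees that group membership has no direct effect on the outcome while a correlated trait does.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000
group_a = rng.binomial(1, 0.3, n)                         # e.g., Latina yes/no
trait_b = rng.binomial(1, 0.2 + 0.6 * group_a)            # much more common in group A
outcome = 0.3 - 0.2 * trait_b + rng.normal(0, 0.1, n)     # only the trait matters directly

def fit(X, y):
    """Ordinary least squares with a constant; returns [C, coefficients...]."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Group membership alone: its coefficient comes out clearly negative.
print(fit(group_a, outcome))

# Add the trait to the equation: the coefficient on group membership
# collapses toward zero, because the trait is doing the work.
print(fit(np.column_stack([group_a, trait_b]), outcome))
```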
So, now to what I was trying to say about the happiness of women in yesterday's post. It is possible to have a situation in which the average woman surveyed reports being less happy than the average man, but when you calculate your regression equation, the coefficient on being female turns positive. That would be the case if women were more likely than men to have other characteristics associated with being unhappy, such as unemployment, limited education, poor health, and low incomes. It would mean that in spite of these disadvantages, a woman is less unhappy than a man would be in the same situation.
The coefficient on being female is indeed positive, but it does not follow that the average woman is in fact happier than the average man. What I was saying is that, just eyeballing the coefficients and flying by the seat of my pants, my guess is that the regression analysis is not deceptive in this way and that the average woman is in fact slightly happier than the average man. However, I could certainly be wrong about that.