MULTIPLE REGRESSION ANALYSIS

 

TABLE OF CONTENTS

Introduction
Dummy Variables
Interactions with Dummy Variables
Linear Transformations of non-Linear relationships
Transformations using Squared Terms
Transformations using the natural Logarithm
Example: Further Exploring STD Symptoms Variables
Exercises

INTRODUCTION

Let's quickly review what we know about simple regression analysis. In general form, the simple linear regression model has one independent variable (X) and one dependent variable (Y). In multiple regression, the dependent variable Y is assumed to be a function of a set of k independent variables - X1, X2, X3, ..., Xk. This yields a new regression equation - an extension of the one we saw in simple regression:

Y = a + b1X1 + b2X2 + ... + bkXk

As with the simple regression equation, the interpretation of each of these coefficients is straightforward. Each "b" is a partial slope coefficient. Put differently, each "b" coefficient is the slope of the relationship between that particular independent variable X and the dependent variable Y when all other independent variables in the model are "held constant," that is, kept at fixed values. For example, the b1 coefficient refers to the slope between X1 and the dependent variable Y when all other variables in the equation, X2, X3, etc., do not change. Similarly, the value for b2 is the slope for the relationship between X2 and the dependent variable Y when all other variables, X1, X3, etc., do not change. As in simple regression, the "a" refers to the intercept, also known as the constant. This is the value of predicted Y (yhat) when all of the independent variables, X1, X2, X3, etc., are equal to zero. Thus, multiple regression allows us to state the relationship between two main variables while controlling for other factors; the resulting coefficients are also known as partial effects.

It should be obvious how useful this approach can be for quantitative social researchers, since we are often interested in social phenomena that go beyond a basic bivariate relationship. To expand on our earlier example, we might be interested in whether the relationship between age of first sex and age varies by gender, or by degree of literacy. This type of question requires multiple regression. This new approach allows us to investigate the initial relationship while controlling for a third, a fourth, or any number of additional factors.

In the following sections, we will investigate in depth the relationship between sexage, age, educ, and gender. In particular, we are hypothesizing that the age of first sex is dependent on an individual's cohort (represented through the age variable), their highest level of education and their gender.

First, open your data again, and check what the distribution of education values looks like:

use bais.dta, clear

keep if rec_per==1

tab educ

Remember also that we need to deal with outliers and missing values not yet coded as missing:

keep if sexage<=age

Let's type:
 

corr sexage educ
 

             |   sexage     educ
-------------+------------------
      sexage |   1.0000
        educ |   0.0664   1.0000

The correlation between educ and sexage is 0.0664, which is weak but suggests that the more years of education an individual has, the higher the age at first sex. However, we do not know how much of a difference education makes; we just know that it is positively associated with age of first sex. To understand this relationship further, we need to estimate the regression of sexage on education.

We accomplish this by typing:
 

reg sexage educ
      Source |       SS       df       MS              Number of obs =    2005
-------------+------------------------------           F(  1,  2003) =    8.87
       Model |  72.6737934     1  72.6737934           Prob > F      =  0.0029
    Residual |   16405.025  2003  8.19022714           R-squared     =  0.0044
-------------+------------------------------           Adj R-squared =  0.0039
       Total |  16477.6988  2004  8.22240457           Root MSE      =  2.8619
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |    .055436   .0186102     2.98   0.003     .0189386    .0919333
       _cons |   17.95426    .177452   101.18   0.000     17.60625    18.30227
------------------------------------------------------------------------------

Do you remember how to interpret these results? Let's review the basic regression equation:

Y = a + bX

In our case, this equation becomes:

(predicted sexage) = 17.95426 + .055436(educ)

We can immediately interpret the slope coefficient for education as the number or fraction of years by which the predicted age of first sex increases for each additional year of education. Judging from the size of the t-value (2.98) and its p-value (0.003), we can tell that the coefficient is statistically significantly different from zero.

The constant, as discussed before, reflects the value of the dependent variable Y when the independent variables are equal to zero. While this property is technically useful in the calculation of the regression coefficients and of predicted Y values, its actual value is not always of use. Obviously we do not want to ignore it, but we also do not need to dwell on it since it is often not very interpretable. In our current case, it literally says that when education level is zero, predicted age of first sex is 17.95426. The constant is significantly different from zero, as indicated by the t-stat. If, however, we had centered our education variable around the sample's education mean, then the "zero" value would actually be the average level of education. Interpreting the constant in that case would be more useful.

Moving along, the R-squared for this regression tells us that education accounts for less than half of one percent (0.44%) of the variation around the mean of sexage. Although we would caution against falling into the trap of maximizing the R-squared when running regressions, we would probably all agree that a regression with such a low R-squared is not picking up any strong linear relationship between education and age of first sex. If we leave the analysis at that, what implications might this apparent lack of relationship have for government policy towards HIV/AIDS prevention and control?
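The idea of centering education on its sample mean can be sketched in Stata as follows. This is only an illustration: educ_c is a hypothetical variable name that does not exist in bais.dta, and the mean is taken over the cases used in the regression.

```stata
* Re-run the simple regression so that e(sample) marks the cases used
regress sexage educ

* Mean of educ over the estimation sample (stored in r(mean))
quietly summarize educ if e(sample)

* educ_c is a hypothetical name for the mean-centered education variable
generate educ_c = educ - r(mean)

* Same slope as before; the constant is now predicted sexage at mean education
regress sexage educ_c
```

The slope and R-squared should be unchanged by centering; only the constant moves, and it now describes an individual with an average level of education rather than one with zero years of schooling.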


Let's now try graphing the regression equation:

predict fsexhat

graph twoway scatter sexage educ || line fsexhat educ, ylabel(0(5)40) ytick(0(5)40) xlabel(0(5)25) xtick(0(5)25)



                                                    Figure 1
 

 

 

Issues of Parsimony and Saturation

When thinking about introducing variables into a model, it is important to keep the notions of parsimony and saturation in mind. That is, we should always strive to include ONLY the variables that make sense and that are efficient at capturing the desired social phenomenon. Model building is often a balancing act between parsimony and saturation. When we say that a model is "saturated," we mean that the model has too many variables - it is over-specified. A model that is over-specified or saturated can often predict each case in the sample perfectly because the model is using up all the degrees of freedom. Therefore, when selecting variables for a model, it is prudent to include only the most necessary variables; otherwise we risk over-specifying the model. With that in mind, let's proceed.
 

Introducing a third variable

At this point, we can consider including our first control variable. It is likely that the age at first sex depends not only on years of education, but also on age. By including age in our model, we acknowledge that sexage is also a function of age. It is important to include this factor because age at first sex may differ across cohorts, and cohorts also differ in how much education they received. If you remember our earlier discussion on how to interpret coefficients, each coefficient in a regression model is a partial effect, meaning that the coefficient reflects the effect of a variable while holding the others constant. In this case it means that when we include age, our coefficient for educ will be the effect of education while holding age constant. Holding age constant does not mean setting it to zero: we are not saying that the coefficient of education is the value for a newborn (age 0). Rather, think of "controlling" as comparing individuals of the same age who have different levels of education. Enough theory, let's try running the multiple regression model now:

reg sexage educ age

      Source |       SS       df       MS              Number of obs =    2005
-------------+------------------------------           F(  2,  2002) =  116.46
       Model |  1717.23701     2  858.618505           Prob > F      =  0.0000
    Residual |  14760.4617  2002  7.37285801           R-squared     =  0.1042
-------------+------------------------------           Adj R-squared =  0.1033
       Total |  16477.6988  2004  8.22240457           Root MSE      =  2.7153
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1343502   .0184308     7.29   0.000     .0982047    .1704957
         age |   .0837989   .0056109    14.94   0.000     .0727951    .0948027
       _cons |   14.59331   .2810493    51.92   0.000     14.04213    15.14449
-----------------------------------------------------------------------------

Compare our old equation (from above):

(predicted sexage) = 17.95426 + .055436(educ)
--> {R-squared = 0.0044}

To our new multiple regression equation:

(predicted sexage) = 14.59331 + .1343502(educ) + .0837989(age)
--> {R-squared = 0.1042}

Right away we should notice the effect that age has on our model. Notice that the effect of education, controlling for age, is more strongly positive now: for a given age, each additional year of education is associated with first sex occurring .1343502 of a year later. Another way of thinking about these new results is that in the initial model, the "true" effect of education was being masked by the effect of age, which we did not include in the simple regression set-up. Since the coefficient on age is positive and the coefficient on education increased when we included age, the relationship between age and education must be negative: the older an individual is, the less education they are likely to have (you can check this simply by running the correlation of education against age).
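The masking argument can be checked numerically. The sketch below is illustrative only: the scalar names b_educ, b_age and d_age_educ are hypothetical, and the commands assume the data are still restricted as above.

```stata
* Multiple regression: keep the two partial slopes
regress sexage educ age
scalar b_educ = _b[educ]
scalar b_age  = _b[age]

* How age moves with education, in the same estimation sample
regress age educ if e(sample)
scalar d_age_educ = _b[educ]

* Sign check suggested in the text
corr educ age if e(sample)

* Partial slope of educ plus the part routed through age
display b_educ + b_age*d_age_educ
```

In ordinary least squares the last line should reproduce, up to rounding, the educ coefficient from the simple regression (.055436). A negative slope of age on educ is what pulls the simple coefficient below the partial one.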

The R-squared has also increased to 10%, implying that the variation in education and age is enough to explain 10% of the variation in sexage, in our sample.

The addition of a single regressor to the bivariate model probably does not seem that difficult, but as we move forward, you will realize that this is merely the tip of the iceberg.
 

Now that you have been introduced to multiple regression, try the following exercises:

  1. What is the relationship between the number of births that a woman has had, her years of education and the age at first sex? Use regression analysis to answer this question and, as we did at the end of module 6, restrict your analysis to the sample of women who are at least 40 years of age.
Question 1 Answer

 

 

DUMMY VARIABLES

Thus far we have focused on using continuous variables in our regressions. We can extend regression analysis to include categorical variables such as gender, general satisfaction, urban area, etc. But how do you include variables whose values are arbitrary? Can we calculate the average gender of a country? How about the average urban setting? The answer is no, but let's find out how these types of variables are useful in regression analysis.
 

What Makes a Dummy Variable a "Dummy" variable?

No, "dummy" variables are not "stupid" variables; in fact, they are quite smart and useful! A dummy variable has two properties that make it a "dummy variable." First, it is categorical and non-ordinal (i.e., categories have no rank order). Thus, the number values associated with each category serve only to identify the various groups/categories it represents, not to assign value or order to any one category. Second, and this is what makes a dummy variable a "dummy variable," it is binary in the sense that it has only two values - 0 and 1. Technically, a variable like literacy or location may have more than two values, but when this type of variable is used in a regression, a coefficient is calculated for each category while all the other categories are equal to zero. Thus, if done correctly, even a multi-category variable can be used as a set of dummy variables because, in the end, it is broken up into 0s and 1s.

Dummy variables are useful because they allow us to control for membership within a particular category or group. If we neglected to split a categorical variable into several dummy variables when using it in a regression, we would get invalid results because regression analysis assumes variables to be continuous unless told otherwise. Therefore, if you include a categorical variable like gender in a regression, Stata (or any other statistical program) would treat it as simply another numeric variable and would not realize that those numbers have no mathematical meaning - Stata does not know if the values in a variable are arbitrary or not. Regression analysis revolves around the use of means and standard deviations, but with categorical variables, means and standard deviations have no meaning.
 


How NOT to use Categorical variables

Let's try the following example of what NOT to do. Let's continue with our previous example of the effect of education on age of first sex. This time let's include literacy in the regression model without considering the fact that it is a categorical variable. We might think that literacy matters separately from education, as not all individuals with the same level of education are necessarily literate to the same degree. First, let's tabulate literacy to see its categories:

tab literacy, missing
             literacy |      Freq.     Percent        Cum.
----------------------+-----------------------------------
         Reads easily |      1,708       71.83       71.83
Reads with difficulty |        348       14.63       86.46
        Does not read |        319       13.41       99.87
                    . |          3        0.13      100.00
----------------------+-----------------------------------
                Total |      2,378      100.00

 

Let's regress it now:
 

reg sexage educ literacy

      Source |       SS       df       MS              Number of obs =    2002
-------------+------------------------------           F(  2,  1999) =    6.62
       Model |  108.385409     2  54.1927043           Prob > F      =  0.0014
    Residual |  16356.6076  1999    8.182395           R-squared     =  0.0066
-------------+------------------------------           Adj R-squared =  0.0056
       Total |   16464.993  2001  8.22838231           Root MSE      =  2.8605
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0757481   .0209941     3.61   0.000     .0345756    .1169207
    literacy |   .3551987   .1694974     2.10   0.036     .0227887    .6876088
       _cons |   17.35601    .336485    51.58   0.000     16.69611    18.01591
-----------------------------------------------------------------------------

After reviewing these results, how would you interpret the literacy coefficient? Would it make sense to say that for every unit increase in literacy, while controlling for education (educ), there is a .3551987 increase in age at first sex? The answer is NO. This is similar to saying that the average literacy in Botswana is 1.41. What would 1 unit of literacy mean? Your guess is as good as mine.

 

The Correct Way

Let's try this same example, except this time we'll do it correctly. To do this we need to call upon a few of our newly found skills. First, we need to split the literacy variable into multiple dummy variables. There are two main ways to accomplish this task. Here we will cover the more familiar way (tab varname, gen(varname)) and then below you will be introduced to a new command that will make it easier - the xi command. We covered this first command in an earlier session:

tab literacy, gen(litid)
[Note: litid will be automatically numbered with sequential numbers]

Then we tabulate our new litid variables to make sure the command worked by typing:

tab1 litid1 litid2 litid3
[Note: tab1 tells Stata to tabulate each variable separately instead of cross tabulating all of them together in one big matrix]
 

tab1 litid1 litid2 litid3 
-> tabulation of litid1  
literacy==R |
eads easily |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        667       28.08       28.08
          1 |      1,708       71.92      100.00
------------+-----------------------------------
      Total |      2,375      100.00
-> tabulation of litid2  
literacy==R |
  eads with |
 difficulty |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,027       85.35       85.35
          1 |        348       14.65      100.00
------------+-----------------------------------
      Total |      2,375      100.00
-> tabulation of litid3  
literacy==D |
    oes not |
       read |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,056       86.57       86.57
          1 |        319       13.43      100.00
------------+-----------------------------------
      Total |      2,375      100.00

Great, our command worked as it should. Each new litid variable is coded as 1 for all people who are of that degree of literacy, and 0 for everyone else. For example, there are 348 individuals who read with difficulty, and 2027 individuals who don't.

Now it's time to run the regression with our newly created dummy variables. We do this by typing:
 

reg sexage age educ litid2 litid3

      Source |       SS       df       MS              Number of obs =    2002
-------------+------------------------------           F(  4,  1997) =   58.38
       Model |  1723.72804     4  430.932011           Prob > F      =  0.0000
    Residual |   14741.265  1997  7.38170504           R-squared     =  0.1047
-------------+------------------------------           Adj R-squared =  0.1029
       Total |   16464.993  2001  8.22838231           Root MSE      =  2.7169
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0830834   .0056512    14.70   0.000     .0720005    .0941663
        educ |   .1426391   .0204928     6.96   0.000     .1024496    .1828287
      litid2 |   .2159715   .1955753     1.10   0.270    -.1675816    .5995246
      litid3 |   .0076977   .4722996     0.02   0.987    -.9185539    .9339493
       _cons |   14.51124   .2956527    49.08   0.000     13.93142    15.09106
------------------------------------------------------------------------------

Our new regression line can be stated as:

(predicted sexage) = 14.51124 + .1426391(educ) + .0830834(age) + .2159715(litid2) + .0076977(litid3)

By now, you should be able to interpret the basic regression equation. This new equation is simply an extension of the first regression equation discussed earlier. Let's quickly review it. This equation tells us that for every additional year of education, age at first sex increases by .1426391 of a year, while controlling for age and literacy. It also tells us that for every additional year of age, sexage increases by about .0830834 while controlling for education and literacy. Now, the literacy coefficients tell us that for litid2 (reads with difficulty) there is an added effect of .2159715 of a year over the omitted category (litid1 - reads easily) while controlling for education and age. Similarly, for litid3 (does not read) there is an added effect of .0076977 over individuals who are fully literate, while controlling for education and age.

In general, the litid coefficients show us the effect that literacy has on the age at first sex, after controlling for education and age. None of the literacy effects are significantly different from zero - which implies that there are no significant differences in the age at first sex between individuals at different levels of literacy.
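Looking at the two literacy t-tests one at a time is informative, but a joint test asks the question more directly. A minimal sketch, assuming the litid dummies created above are still in memory:

```stata
* Model from above, with "reads easily" (litid1) as the omitted category
regress sexage age educ litid2 litid3

* Joint F test that both literacy coefficients are zero
test litid2 litid3
```

The test command reports a single F statistic and p-value for the hypothesis that literacy, as a whole, adds nothing once age and education are in the model.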
 


Omitted/Reference Categories

There is one important point to keep in mind when interpreting a multiple regression that uses dummy variables. Notice that only 2 litid dummy variables were included in the equation. Why is this necessary? If we were to include all three dummy variables, we would over-specify the model: the three dummies always sum to 1, so together they are perfectly collinear with the constant and their coefficients cannot be estimated separately. Whenever we use dummy variables, there should always be an omitted category (also known as the reference category); in this case the omitted category is "reads easily" (litid1).

Being "omitted" does not mean that the equation is ignoring that group of people; rather, we are telling Stata to explicitly show us only the coefficients for litid2 and litid3. In fact, the value for the omitted category (litid1) can be recovered from the results above. If you remember our description of what the constant is, you will realize that litid1 can be derived from it. The constant in this case is analogous to a "reservoir" into which all omitted categories get lumped. If the constant represents the value of our dependent variable Y when all regressors are equal to zero, then it describes the cases for which litid2 and litid3 are both zero. And who is not in the litid2 or litid3 categories? Correct, litid1 (fully literate individuals).

It is important to realize that we did not drop any cases by omitting the litid1 category, we simply "shifted" them into the constant and used them as a comparison group. If we were using another set of dummy variables, gender for example, we would have to choose the reference category for that variable as well. If we chose men as our reference category, we would get a coefficient for women, but not for men. The coefficient for men would be found in the constant. If both gender and literacy were included in a regression model as dummy variables, two omitted categories would be captured and represented by the constant - in our case it would have been literate males.
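To see that the choice of reference category changes only how the results are expressed, one can re-run the model with a different dummy left out. This sketch assumes the litid1-litid3 variables created with tab literacy, gen(litid) are still in memory.

```stata
* Reference category: reads easily (litid1)
regress sexage age educ litid2 litid3

* Difference between "reads with difficulty" and "does not read"
lincom litid2 - litid3

* Reference category: does not read (litid3)
regress sexage age educ litid1 litid2
```

The litid2 coefficient in the second regression should equal the lincom estimate from the first, and the R-squared and predicted values should be identical in both. No cases are gained or lost; only the comparison group changes.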


A Short Cut: The "xi" Option

Although the tab varname, gen(varname) command is useful in creating dummy variables, it is not the only way. Stata provides us with an easier and more convenient short-cut to specify a categorical variable in a regression equation. The xi command tells Stata to treat the specified variable(s) as categorical and to expand them into dummy variables for us. It can be used as a prefix to most estimation commands, such as regress, logistic, probit, etc. Let's try it.

First, we will create and label a new gender variable that is consistent with dummy variable coding - 0s and 1s. Note, however, that xi does not require this recoding; we do it so that the resulting dummy is easy to read.

tab gender
tab gender, nol
recode gender 1=0 2=1
label def gender 1 "Female" 0 "Male"
label val gender gender


We have recoded and relabeled the gender variable as 0=male and 1=female.

Now we move on to using the xi command. We continue with our sexage and education example, but now we will be controlling for age, literacy, and gender. By doing so, we are stating that age of first sex depends not only on education, but also on age, gender, and literacy. This time, however, we will declare 'does not read' (literacy==3) as the reference category. We do this by issuing the char varname[omit] statement before the regress command. This statement is useful when using xi because Stata, by default, selects the first category of the specified variable as the reference category. The xi: prefix is placed at the beginning of the regression command, and the variables you want Stata to expand into their constituent categories are "tagged" with an "i." in front of each one. See below:

char literacy[omit] 3
xi:reg sexage educ age i.literacy i.gender


Notice that an "i." is included for the variables literacy and gender. Also remember that we have told Stata to treat category 3 of the literacy variable as the reference category and since we have not specified a specific reference category for gender, Stata will omit its first category - 0, men.

 
      Source |       SS       df       MS              Number of obs =    2002
-------------+------------------------------           F(  5,  1996) =   47.55
       Model |  1752.59423     5  350.518847           Prob > F      =  0.0000
    Residual |  14712.3988  1996  7.37094127           R-squared     =  0.1064
-------------+------------------------------           Adj R-squared =  0.1042
       Total |   16464.993  2001  8.22838231           Root MSE      =  2.7149
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1380934   .0206063     6.70   0.000     .0976813    .1785055
         age |   .0835832   .0056527    14.79   0.000     .0724973    .0946691
_Iliteracy_1 |   .0351196   .4724508     0.07   0.941    -.8914288    .9616681
_Iliteracy_2 |   .2043912   .4818247     0.42   0.671    -.7405408    1.149323
  _Igender_1 |   -.247329   .1249804    -1.98   0.048    -.4924347   -.0022234
       _cons |   14.65515   .5151142    28.45   0.000     13.64493    15.66537
------------------------------------------------------------------------------

What do the results tell us? Right away we should be able to tell that our model explains over 10% of the variation in our dependent variable. Next, we should notice that the coefficient for gender is negative. This tells us that, relative to the omitted category (gender=0, men), individuals in the reported category (women) have, on average, a lower predicted age of first sex after controlling for all other variables. Overall, the model tells us that if we know a person's level of education, their age, their gender, and literacy, we are likely to guess their age of first sex 10.6% better than simply guessing the mean sexage in the sample.

Let's consider what our new equation looks like:

(predicted sexage) = 14.65515 + .1380934(educ) +.0835832(age) + .0351196(literate=1, else=0) + .2043912(reads with difficulty=1, else=0) -.247329(female=1, else=0)

The new equation allows us to calculate, for example, the predicted age of first sex for a 50 year old woman who has 10 years of education but reads with difficulty, or the age of first sex for a 25 year old literate man with 16 years of education. All we need to do is plug in the number of years of education, the age, and either a 1 or a 0 for whether the person falls within the particular category or not. Let's try it.

(50 yr old woman with 10 yrs of ed who reads with difficulty: predicted first sex) = 14.65515 + .1380934(educ) +.0835832(age) + .0351196(literate=1, else=0) + .2043912(reads with difficulty=1, else=0) -.247329(female=1, else=0)

= 14.65515 + .1380934(10) +.0835832(50) + .0351196(0) + .2043912(1) -.247329(1)

--> ANSWER = 20.17231


For a 25 year old literate man with 16 years of education, the predicted equation is the following:

(25 yr old literate man with 16 years of ed: predicted sexage) = 14.65515 + .1380934(educ) +.0835832(age) + .0351196(literate=1, else=0) + .2043912(reads with difficulty=1, else=0) -.247329(female=1, else=0)

= 14.65515 + .1380934(16) +.0835832(25) + .0351196(1) + .2043912(0) -.247329(0)


--> ANSWER = 18.98934
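Both hand calculations can be reproduced with Stata's display command, which works as a calculator. The first two lines use the rounded coefficients from the table; the last line is a sketch that assumes the xi: regression above is the most recent estimation, so that the stored coefficients _b[...] are available under the names shown in the output.

```stata
* Hand calculation for the 50 year old woman, using the rounded coefficients
display %9.5f 14.65515 + .1380934*10 + .0835832*50 + .2043912*1 - .247329*1

* Hand calculation for the 25 year old literate man
display %9.5f 14.65515 + .1380934*16 + .0835832*25 + .0351196*1

* Same first prediction from the stored estimates (run right after the xi: regression)
display _b[_cons] + _b[educ]*10 + _b[age]*50 + _b[_Iliteracy_2] + _b[_Igender_1]
```

The first two lines should reproduce the answers worked out above; the third may differ in the last decimal places because it uses the unrounded coefficients.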

What, if anything, do these predicted values assume? Any ideas? How about assuming that each of the non-categorical variables in our equation has a linear relationship with the dependent variable? Does it make sense that older individuals are likely to uniformly be having first sex at older ages? We might think that the relationship between cohort and first sex is non-linear: that as you look at successively younger cohorts, the age at first sex declines, but not linearly. Perhaps the age at first sex falls at a slower and slower rate. We will learn how to control for this curvilinear effect later in this section.
 

Note on Extrapolating Beyond the Data

Let's try calculating the following predicted sexage:

What is the predicted age at first sex for a 90 year old illiterate male with 20 years of education? We can easily carry out the calculations for this question:

(predicted sexage) = 14.65515 + .1380934(20) +.0835832(90) + .0351196(0) + .2043912(0) -.247329(0)

--> (predicted sexage) = 14.65515 + .1380934(20) + .0835832(90)

--> ANSWER = 24.93951

Do you see any problems with this example? Does our age variable include people over the age of 64? NO. Extrapolating beyond the available data points is never a good idea because our results apply only to the range of cases used to estimate the model. It is possible that our observed relationship holds for 90 year olds with 20 years of education, but it is also possible that it does not. The point is that without those actual cases in the calculation of the model it is impossible to know. Therefore, we suggest that you never try to extrapolate, that is, predict values beyond the data points used in the model.
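A quick way to see how far a hypothetical person lies from the data is to inspect the range of each regressor among the cases actually used. This sketch assumes a regression has just been run, so that e(sample) is defined.

```stata
* Range of the regressors among the cases used in the last regression
summarize age educ if e(sample)

* Count cases anywhere near the hypothetical 90 year old
count if age >= 90 & !missing(age)
```

If the minimum and maximum reported by summarize do not bracket the values being plugged into the equation, the prediction is an extrapolation.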

Try these exercises to make sure you understand the basics of interpreting dummy variables in multiple regression analysis.

2. What is the predicted age of first sex for a 30 year old literate woman, with 5 years of education?
Question 2 Answer
 
3. What is the predicted age of first sex for a 45 year old man, who reads with difficulty, and has 5 years of education?
Question 3 Answer

 

 

INTERACTIONS WITH DUMMY VARIABLES

Thus far, we have only dealt with the additive effects of dummy variables. That is, the assumption has been that for each independent variable Xi, the amount of change in our dependent variable Y is the same, regardless of the values of the other independent variables in the equation. This assumption allows us to interpret the partial coefficients as the effect of a variable while controlling for the other independent variables in the model.

The additive assumption, however, does not always hold. In such cases, the partial effect of a given independent variable cannot be interpreted as the effect of the variable while all others are being held constant. Instead, the effect depends on the specific values of other independent variables in the model. In these cases it is hypothesized that the independent variable Xi is linearly related to the dependent variable Y, but that this linear relationship depends on a different independent variable in the model. Interactions are perhaps best visualized and understood in the case of dummy variables.

For instance, in our example below, we interact education and gender. In effect, what we are testing with an interacted model is whether the linear relationship between an independent variable Xi and the dependent variable Y depends on the values of a different independent variable in the model. More intuitively, by interacting education and gender, we are testing whether the effect of education on the age of first sex is different for men than for women.

In general, we can illustrate what we mean by the additive effect of dummy variables in regression with the graph below. Each category of a dummy variable has its own line, as depicted in the graph. For instance, we can imagine the predicted effect of education on sexage looking like the lines below. As it stands, this first graph suggests that the effect of gender is the same at all education levels: the lines for males and females differ only in their level, and both have the same slope for each unit change in X1. In the graph, Y = sexage, X1 = education, X2 = gender, b1 is the coefficient for education, and b2 is the coefficient for gender.

 

                                                       Graph 1  

In the second graph, we find a hypothetical interaction effect. We can imagine this effect to be similar in form to that of the interaction between education and gender. That is, the effect of education (the slope of the line) depends on the gender of the individual. In this case, the upper-most line on the graph has a steeper slope than the line below it, so the effect of education depends on the value of another independent variable, here the gender of the individual.



                                                      Graph 2

 

Let's now investigate how the theory measures up to empirical findings. Creating an interaction term with Stata's xi prefix is as easy as inserting an asterisk "*" between the two variables you wish to interact. In essence, this tells Stata to multiply these two variables together and to include the product alongside the two main effects. Alternatively, you can generate each interaction term yourself by creating a variable that multiplies the two desired variables together. In the immediate example below, we use the easier of these two approaches; the second approach is used in the STD symptoms example later in this module.

First, we choose to use "Does not read at all" as the reference category (literacy==3).
Then, to interact education and gender, we simply include an asterisk between education & i.gender.

char literacy[omit] 3
xi:reg sexage age i.gender*educ i.literacy

 
      Source |       SS       df       MS              Number of obs =    2002
-------------+------------------------------           F(  6,  1995) =   44.03
       Model |  1925.50914     6   320.91819           Prob > F      =  0.0000
    Residual |  14539.4839  1995  7.28796184           R-squared     =  0.1169
-------------+------------------------------           Adj R-squared =  0.1143
       Total |   16464.993  2001  8.22838231           Root MSE      =  2.6996
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0874985    .005678    15.41   0.000     .0763631     .098634
  _Igender_1 |  -1.799912   .3421138    -5.26   0.000     -2.47085   -1.128975
        educ |   .0589642   .0261485     2.25   0.024      .007683    .1102455
_IgenXeduc_1 |   .1735204   .0356236     4.87   0.000     .1036571    .2433836
_Iliteracy_1 |   -.028932    .469968    -0.06   0.951    -.9506115    .8927474
_Iliteracy_2 |   .1402214    .479286     0.29   0.770    -.7997321    1.080175
       _cons |    15.3257   .5303835    28.90   0.000     14.28553    16.36586
------------------------------------------------------------------------------

As with the previous regression results, we find coefficients for the main effects of educ, age, _Iliteracy_1, _Iliteracy_2, and _Igender_1, but now we also find the interaction effects of years of education and gender (_IgenXeduc_1).
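For comparison, the interaction term can also be built by hand. The lines below are a minimal sketch under two assumptions: that gender is coded 0/1 (as the dummy name _Igender_1 suggests), and that genXeduc is simply a name chosen here for illustration.

```stata
* Assumption: gender takes the values 0 and 1
* genXeduc is a hypothetical name for the hand-made interaction term
gen genXeduc = gender*educ

* Same reference category for literacy as above
char literacy[omit] 3
xi: reg sexage age gender educ genXeduc i.literacy
```

If the coding assumption holds, the coefficients on gender, educ and genXeduc should match those reported for _Igender_1, educ and _IgenXeduc_1 above.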

When interpreting interaction effects, it is important to keep in mind that the main effects of the interacted variables can no longer be interpreted on their own. Once the interaction is in the model, the coefficient on educ is the effect of education only for the group with _Igender_1 = 0, and the coefficient on _Igender_1 is the gender difference only at zero years of education. We still use the main-effect coefficients, together with the interaction coefficient, to calculate any estimated yhat value. For example, if we were interested in calculating the predicted sexage for a literate female aged 35 with 12 years of education, we compute the following:


predicted sexage = 15.3257 + 35(.0874985) - 1.799912(1) + 12(.0589642) + 12(.1735204) + 1(-.028932) + 0(.1402214)

predicted sexage = 19.349, or about 19.35 years
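The same arithmetic can be checked in Stata. This is a minimal sketch: the first line uses the coefficients as printed in the table, and the second uses the stored estimates, which assumes it is run immediately after the regression above.

```stata
* Prediction by hand from the displayed coefficients
display 15.3257 + 35*.0874985 - 1.799912*1 + 12*.0589642 + 12*.1735204 + 1*(-.028932) + 0*.1402214

* Same calculation from the stored estimates (full precision)
display _b[_cons] + 35*_b[age] + _b[_Igender_1] + 12*_b[educ] + 12*_b[_IgenXeduc_1] + _b[_Iliteracy_1]
```

Both lines should return roughly 19.35; small differences in the last decimals come from rounding in the printed coefficients.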


4. How would you interpret the interaction effect?
Question 4 Answer

 

 

LINEAR TRANSFORMATIONS OF NON-LINEAR RELATIONSHIPS

Thus far, we have assumed linear relationships for all of our regression models. In fact, a linear relationship is a basic assumption of linear regression analysis. Empirically, however, variables are often not associated in a linear fashion. Yet this reality hardly precludes regression analyses from accurately predicting and describing real-world phenomena. In this section we will show you two basic approaches to handling such cases. By using a quadratic term or by taking the natural logarithm of a variable, we can transform a non-linear relationship into an approximately linear one and often greatly improve the fit of a regression line.

Note: Logarithmic and quadratic transformations are not restricted to multiple regression. However, we have placed them in the multiple regression module because they are rather advanced topics and should only be addressed after one has a clear understanding of all of the material in the lessons prior to this section.

 

Transformations using Squared Terms

A frequently used squared transformation is the square of age. Researchers often include both age and age squared (the variable age2 in the commands below) in regression models because this allows the effect of a one-year increase in age to change as a person gets older. That is, the effect of age is not likely to remain the same as we get older. By including age2, the effect of age is allowed to vary across years of age.

gen age2=age*age

regress sexage age
predict yhat1, xb
line sexage yhat1 age, sort



                                                    Figure 2

 

regress sexage age age2
predict yhat2, xb
line sexage yhat2 age, sort



                                                    Figure 3

 

This graph allows us to see the effect of the squared term - age2.
How would we interpret the output from a regression of sexage on age and age2, among other variables?

char literacy[omit] 3

xi:reg sexage age age2 i.gender*educ i.literacy

      Source |       SS       df       MS              Number of obs =    2002
-------------+------------------------------           F(  7,  1994) =   45.34
       Model |  2260.85646     7  322.979494           Prob > F      =  0.0000
    Residual |  14204.1366  1994  7.12343859           R-squared     =  0.1373
-------------+------------------------------           Adj R-squared =  0.1343
       Total |   16464.993  2001  8.22838231           Root MSE      =   2.669
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .3039383   .0320409     9.49   0.000     .2411012    .3667754
        age2 |  -.0029846    .000435    -6.86   0.000    -.0038377   -.0021315
  _Igender_1 |  -1.766521   .3382652    -5.22   0.000    -2.429912   -1.103131
        educ |   .0483948   .0258975     1.87   0.062    -.0023942    .0991839
_IgenXeduc_1 |   .1665963   .0352336     4.73   0.000     .0974977    .2356949
_Iliteracy_1 |   .0235548    .464696     0.05   0.960    -.8877857    .9348954
_Iliteracy_2 |   .1645597   .4738585     0.35   0.728    -.7647501    1.093869
       _cons |   11.90497   .7235444    16.45   0.000     10.48599    13.32395
------------------------------------------------------------------------------

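In terms of our coefficients, we find that each year of education increases age of first sex by about 0.05 of a year for the reference gender category (_Igender_1 = 0), although this coefficient is only marginally significant (p = 0.062). We also find that age increases sexage up to the age of about 50.9 and thereafter decreases it, because a quadratic ax^2 + bx + c turns over at x = -b/(2a). Here a is the age2 coefficient and b is the age coefficient, so the turning point is -.3039383 / (2 x -.0029846) = 50.92.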
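The turning point can be computed directly in Stata. The following is a minimal sketch; the last two commands assume they are run immediately after the regression above, and nlcom is assumed to be available (Stata 8 or later).

```stata
* Turning point of the quadratic in age: -b_age / (2*b_age2),
* using the coefficients as printed in the table
display -.3039383/(2*(-.0029846))

* Same calculation from the stored estimates
display -_b[age]/(2*_b[age2])

* nlcom also reports a standard error and confidence interval
nlcom -_b[age]/(2*_b[age2])
```

The result is roughly 50.9 years.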
 

 

Transformations Using the Natural Logarithm

Often it is desirable to run a regression using the natural logarithm (the logarithm to the base e) of a variable instead of the variable itself. For instance, if the graph of the dependent variable on the independent variable shows that the relationship is not linear, taking the logarithm of one or both of the variables can sometimes produce a linear relationship. Therefore, although a linear relationship might not exist between two variables, a linear relationship might exist between the natural logarithms of the two variables. Logarithmic transformation also lessens the influence of outliers (which can sometimes drastically affect the slope of the regression line) because the natural logarithm of a variable is much less sensitive to extreme observations than is the variable itself.

As an aside: Income is a variable that is often transformed using its natural log, although we are not fortunate enough to have income as a variable in this data. When we do the log transformation, the impact of each additional pula decreases as income increases. That is, after a certain point more money does not make that much more of a difference. For example, earning 3 billion pula a year versus earning 2 billion pula will probably not have much of an effect on how many beers we drink, but earning 1000 pula per year versus only 100 pula is likely to make a huge difference.
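The pula example can be made concrete with a few lines of Stata. This is only an illustrative sketch: income is not in this data set, so the gen line is commented out and the variable names income and lnincome are hypothetical.

```stata
* Hypothetical: how a logged income variable would be created
* gen lnincome = ln(income)

* Going from 100 to 1000 pula is a large step on the log scale
display ln(1000) - ln(100)

* Going from 2 billion to 3 billion pula is a much smaller step
display ln(3000000000) - ln(2000000000)
```

The first difference is about 2.30 and the second about 0.41, even though the second gap is far larger in pula.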

 

 

EXAMPLE: FURTHER EXPLORING STD SYMPTOMS VARIABLES

Now that we have some background in multiple regression, let's look at another example in more detail. Information about the observable symptoms of STD's is important for individuals to have, as it can help them to know when it is necessary to seek treatment for themselves. It is also crucial to be able to recognize these symptoms in one's sexual partners, in order that relevant protection measures can be chosen. Finally, it has been found that individuals are also more likely to contract HIV when they have other STD's, than when they don't.

In module 5, we constructed a composite measure of knowledge about the signs of STD's, using questions Q404 and Q405 in the questionnaire: wscore and mscore. We scored individuals on how many answers they volunteered. Let's investigate the information that people have about the signs of STD's in a man. Do you think men or women are likely to score better on this measure? Let's open the data, and tab to find out:

keep if rec_per==1
replace gender=0 if gender==2
lab def gender 0 "Female" 1 "Male"
lab val gender gender
egen mscore=robs(stdsign1m-stdsign11m)
egen wscore=robs(stdsign1w-stdsign11w)
tab mscore gender, row
           |   sex of respondent
    mscore |    Female       Male |     Total
-----------+----------------------+----------
         0 |     1,030        800 |     1,830 
           |     56.28      43.72 |    100.00 
-----------+----------------------+----------
         1 |       341        239 |       580 
           |     58.79      41.21 |    100.00 
-----------+----------------------+----------
         2 |       400        363 |       763 
           |     52.42      47.58 |    100.00 
-----------+----------------------+----------
         3 |       282        286 |       568 
           |     49.65      50.35 |    100.00 
-----------+----------------------+----------
         4 |        96        102 |       198 
           |     48.48      51.52 |    100.00 
-----------+----------------------+----------
         5 |        29         44 |        73 
           |     39.73      60.27 |    100.00 
-----------+----------------------+----------
         6 |         7         10 |        17 
           |     41.18      58.82 |    100.00 
-----------+----------------------+----------
         7 |         7         10 |        17 
           |     41.18      58.82 |    100.00 
-----------+----------------------+----------
         8 |         6         13 |        19 
           |     31.58      68.42 |    100.00 
-----------+----------------------+----------
         9 |         2         14 |        16 
           |     12.50      87.50 |    100.00 
-----------+----------------------+----------
        10 |         3         18 |        21 
           |     14.29      85.71 |    100.00 
-----------+----------------------+----------
        11 |         2          9 |        11 
           |     18.18      81.82 |    100.00 
-----------+----------------------+----------
     Total |     2,205      1,908 |     4,113 
           |     53.61      46.39 |    100.00 

Note that at the lower scores, the female proportion is larger than the male proportion, while at the higher scores, the ranking reverses. Men appear to score higher than women on this question. What is the difference in mean scores between men and women, on this question?

sort gender
by gender: sum mscore
_______________________________________________________________________________
-> gender = Female
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      mscore |    2205    1.235828   1.514505          0         11
_______________________________________________________________________________
-> gender = Male
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      mscore |    1908    1.619497   2.009124          0         11

 

Women have a lower average score than men (1.24 versus 1.62). We could also have done this another way:

tab gender, sum(mscore)
tab gender, sum(wscore)
. tab gender, sum(mscore)
     sex of |          Summary of mscore
 respondent |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     Female |   1.2358277   1.5145048        2205
       Male |   1.6194969   2.0091238        1908
------------+------------------------------------
      Total |   1.4138099   1.7714565        4113
. tab gender, sum(wscore)
     sex of |          Summary of wscore
 respondent |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     Female |   1.3981859   1.4908056        2205
       Male |   1.2368973   1.8666789        1908
------------+------------------------------------
      Total |   1.3233649    1.677408        4113
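The gaps in these means (about 0.38 points on mscore and about 0.16 points on wscore) can also be tested formally. A minimal sketch using a two-sample t test, assuming gender has been recoded to 0/1 as above:

```stata
* Two-sample t tests of the difference in mean scores by gender
ttest mscore, by(gender)
ttest wscore, by(gender)
```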

 

So, women seem to have higher average knowledge scores than men on the question about signs of STD in women, and lower knowledge scores than men on the question about signs of STD in men. Is this knowledge gap correlated with any other individual-level variables?

corr mscore age if gender==0

             |   mscore      age
-------------+------------------
      mscore |   1.0000
         age |   0.1229   1.0000
 
corr mscore age if gender==1

             |   mscore      age
-------------+------------------
      mscore |   1.0000
         age |   0.2079   1.0000

 

More knowledge is associated with being older, but more strongly for men (r = 0.21) than for women (r = 0.12). What about education?

corr mscore educ if gender==0
             |   mscore     educ
-------------+------------------
      mscore |   1.0000
        educ |   0.3082   1.0000
 
corr mscore educ if gender==1
             |   mscore     educ
-------------+------------------
      mscore |   1.0000
        educ |   0.3120   1.0000

 

It's good to see that more education is correlated with a higher score on the knowledge test for signs of STD in men! This linear relationship is slightly stronger for men than for women, although this might be because men get more education than women. Could we find out whether men obtain more education than women on average?

tab gender, sum(educ)
            |    Summary of number of years in
     sex of |               school
 respondent |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     Female |   8.1032122   3.0844942        1899
       Male |   8.0471272   3.6845547        1549
------------+------------------------------------
      Total |   8.0780162   3.3669331        3448

We find that this suspicion is not confirmed; men have slightly less education on average than women (8.05 versus 8.10 years). To investigate the effect of education on mscore without the contaminating effects of gender, we need to run a multiple regression. We need to be able to control for the effects of gender when examining the effect of education on the score. What other variables do you think would be important for explaining an individual's score on this question?

We will include gender, age, age-squared, education, location and whether you have had any information about HIV.

replace hivinfo=. if hivinfo==7
replace hivinfo=0 if hivinfo==2
lab def yesno 0 "NO" 1 "YES"
lab val hivinfo yesno
gen age2=age*age
xi: reg mscore age age2 educ gender i.location hivinfo
 
      Source |       SS       df       MS              Number of obs =    3236
-------------+------------------------------           F(  7,  3228) =   73.07
       Model |  1445.79312     7  206.541874           Prob > F      =  0.0000
    Residual |  9124.31411  3228  2.82661528           R-squared     =  0.1368
-------------+------------------------------           Adj R-squared =  0.1349
       Total |  10570.1072  3235  3.26742109           Root MSE      =  1.6813
------------------------------------------------------------------------------
      mscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     .07016   .0120684     5.81   0.000     .0464974    .0938226
        age2 |  -.0007176   .0001849    -3.88   0.000    -.0010802   -.0003551
        educ |   .1290284   .0099732    12.94   0.000     .1094739    .1485828
      gender |   .5141404   .0599439     8.58   0.000     .3966085    .6316723
_Ilocation_2 |  -.1008049   .0802571    -1.26   0.209    -.2581648    .0565551
_Ilocation_3 |   .0487271   .0718908     0.68   0.498    -.0922292    .1896833
     hivinfo |   .3013267   .0674745     4.47   0.000     .1690296    .4336239
       _cons |    -1.1434   .1776206    -6.44   0.000    -1.491661   -.7951396
------------------------------------------------------------------------------

Here, it appears that age, education, gender and hivinfo are all significantly and positively related to how much you know about signs of STD in a man. The negative coefficient on age2 indicates that the positive effect of age weakens as people get older. Being a man increases your score by about 0.51 points, while having had some information about HIV increases your score by about 0.3 points. Living in an urban village is associated with a reduction in your score, although this coefficient is not statistically different from zero.

Do you think that someone who reports having an unusual discharge in the past 12 months would be likely to get a higher than average or lower than average score on this question? We can test this hypothesis, by including a dummy for discharge. In addition, it's plausible that someone who has heard about STD's is also more likely to score better on the question. By including a dummy variable for std, we can check whether this is the case.

replace std=. if std==7
replace std=0 if std==2
lab val std yesno
replace discharge=0 if discharge==2
lab val discharge yesno
xi: reg mscore age age2 educ gender i.location hivinfo std discharge
 
      Source |       SS       df       MS              Number of obs =    2352
-------------+------------------------------           F(  9,  2342) =   72.86
       Model |  1664.55046     9  184.950051           Prob > F      =  0.0000
    Residual |  5945.02054  2342  2.53843747           R-squared     =  0.2187
-------------+------------------------------           Adj R-squared =  0.2157
       Total |    7609.571  2351  3.23673799           Root MSE      =  1.5932
------------------------------------------------------------------------------
      mscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0050802   .0156432     0.32   0.745    -.0255956    .0357561
        age2 |   .0000858   .0002187     0.39   0.695    -.0003431    .0005147
        educ |   .0874022   .0105325     8.30   0.000     .0667483    .1080562
      gender |   .6330814    .067364     9.40   0.000     .5009821    .7651808
_Ilocation_1 |  -.0790373   .0793899    -1.00   0.320     -.234719    .0766445
_Ilocation_2 |  -.1604419   .0838224    -1.91   0.056    -.3248158     .003932
     hivinfo |   .2933726   .0781833     3.75   0.000     .1400569    .4466882
         std |   1.572655   .1209486    13.00   0.000     1.335478    1.809833
   discharge |   .4511467   .1486412     3.04   0.002     .1596646    .7426287
       _cons |  -1.104994   .2149919    -5.14   0.000    -1.526588   -.6833998
------------------------------------------------------------------------------

The R-squared increases in this regression (from 0.1368 to 0.2187), meaning that we explain more of the variation in the mscore variable using the set of variables including std and discharge than using the set excluding them. Note, however, that the number of observations also falls from 3,236 to 2,352 because of missing values on the new variables, so the two models are not estimated on exactly the same cases. Individuals who report that they have heard of STD's score over 1.5 points higher than those who have not, holding the other variables constant. The two new variables are also both significant at the 1% level.
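Because the two regressions above use different numbers of observations, a cleaner comparison of R-squared fits both models on the same cases. One way to do this, sketched below, is to run the larger model first and then restrict the smaller model to its estimation sample with e(sample).

```stata
* Fit the larger model first so that e(sample) flags the cases it uses
xi: reg mscore age age2 educ gender i.location hivinfo std discharge

* Refit the smaller model on exactly those cases
xi: reg mscore age age2 educ gender i.location hivinfo if e(sample)
```

The R-squared values from these two runs can then be compared directly.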

It is possible that the effect of some of the X-variables on your score is different, whether you are male or female, and whether we are considering the variable wscore or mscore. Let's create an interaction term between gender and education, to deal with this possibility for one X variable:

gen interact1=gender*educ
xi: reg mscore age age2 educ gender i.location hivinfo std discharge interact1
 
      Source |       SS       df       MS              Number of obs =    2352
-------------+------------------------------           F( 10,  2341) =   65.66
       Model |  1666.79608    10  166.679608           Prob > F      =  0.0000
    Residual |  5942.77492  2341  2.53856255           R-squared     =  0.2190
-------------+------------------------------           Adj R-squared =  0.2157
       Total |    7609.571  2351  3.23673799           Root MSE      =  1.5933
------------------------------------------------------------------------------
      mscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0050771   .0156435     0.32   0.746    -.0255995    .0357537
        age2 |   .0000912   .0002188     0.42   0.677    -.0003378    .0005203
        educ |   .0971307   .0147624     6.58   0.000      .068182    .1260795
      gender |   .7833041   .1733459     4.52   0.000     .4433767    1.123232
_Ilocation_1 |  -.0817018   .0794423    -1.03   0.304    -.2374865    .0740829
_Ilocation_2 |  -.1608324   .0838255    -1.92   0.055    -.3252124    .0035476
     hivinfo |    .293136   .0781856     3.75   0.000     .1398157    .4464562
         std |    1.57158    .120957    12.99   0.000     1.334386    1.808774
   discharge |   .4491987   .1486593     3.02   0.003     .1576811    .7407162
   interact1 |  -.0176671   .0187841    -0.94   0.347    -.0545024    .0191681
       _cons |  -1.189945   .2331991    -5.10   0.000    -1.647243   -.7326465
------------------------------------------------------------------------------

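Here, the interaction term is negative, meaning that the total impact of an extra year of education on your score if you are male is .0971307 - .0176671 = .0794636, compared with .0971307 if you are female. Thus, an extra year of education appears to add more to a woman's score than to a man's score on the question about signs of STD's in men. Note, however, that the interaction coefficient is not statistically significant (p = 0.347), so we cannot conclude that the effect of education really differs between men and women on this question.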
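Rather than adding the coefficients by hand, Stata's lincom command can compute the education slope for men along with its standard error and confidence interval. A minimal sketch, assuming it is run immediately after the regression that includes interact1:

```stata
* Education slope for men (gender==1): coefficient on educ plus interaction
lincom educ + interact1

* Education slope for women (gender==0): coefficient on educ alone
lincom educ
```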

Do you think we would observe the reverse relationship if we were investigating wscore?

xi: reg wscore age age2 educ gender i.location hivinfo std discharge interact1
      Source |       SS       df       MS              Number of obs =    2352
-------------+------------------------------           F( 10,  2341) =   51.40
       Model |  1241.22005    10  124.122005           Prob > F      =  0.0000
    Residual |  5653.21362  2341  2.41487126           R-squared     =  0.1800
-------------+------------------------------           Adj R-squared =  0.1765
       Total |  6894.43367  2351  2.93255367           Root MSE      =   1.554
------------------------------------------------------------------------------
      wscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0026582   .0152577     0.17   0.862    -.0272618    .0325781
        age2 |   .0000892   .0002134     0.42   0.676    -.0003293    .0005077
        educ |   .1022932   .0143983     7.10   0.000     .0740585    .1305279
      gender |   .3103542     .16907     1.84   0.067    -.0211883    .6418968
_Ilocation_1 |  -.0108336   .0774828    -0.14   0.889    -.1627756    .1411084
_Ilocation_2 |  -.1916175   .0817578    -2.34   0.019    -.3519428   -.0312923
     hivinfo |   .2382015    .076257     3.12   0.002     .0886632    .3877399
         std |   1.438455   .1179734    12.19   0.000     1.207112    1.669798
   discharge |   .4702784   .1449924     3.24   0.001     .1859516    .7546052
   interact1 |  -.0492751   .0183208    -2.69   0.007    -.0852018   -.0133485
       _cons |  -.8026062   .2274468    -3.53   0.000    -1.248624    -.356588
------------------------------------------------------------------------------

The effect of one more year of education on your score if you are a man is thus .1022932 - .0492751 = .0530181, which is again lower than the 0.10 point increase in score for a woman with one more year of education. This time the interaction coefficient is statistically significant (p = 0.007). In both regressions, the estimated marginal effect of a year of education on men's knowledge about signs of STD's is lower than the marginal effect of a year of education on women's knowledge of these signs, although only the difference for wscore is statistically significant.

Sometimes, researchers think that the marginal effects of all variables on the dependent variable are likely to be different for men and women. We can generate the relevant coefficients for this flexible functional form by creating interaction terms for every variable, and including them in the regression as well. However, this is a very long-winded way to proceed, and hinders interpretation, so instead we will run our original regression over the separate samples of men and women:

For the question about signs of STD's in women:

(A)
xi: reg wscore age age2 educ i.location hivinfo std discharge if gender==0
      Source |       SS       df       MS              Number of obs =    1360
-------------+------------------------------           F(  8,  1351) =   45.26
       Model |  668.796355     8  83.5995443           Prob > F      =  0.0000
    Residual |  2495.24409  1351  1.84696083           R-squared     =  0.2114
-------------+------------------------------           Adj R-squared =  0.2067
       Total |  3164.04044  1359  2.32821225           Root MSE      =   1.359
------------------------------------------------------------------------------
      wscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0021618    .017557    -0.12   0.902    -.0366037    .0322801
        age2 |   .0002212   .0002444     0.91   0.366    -.0002582    .0007005
        educ |   .1104107   .0133668     8.26   0.000     .0841887    .1366327
_Ilocation_1 |   .0087054   .0915275     0.10   0.924     -.170846    .1882568
_Ilocation_2 |  -.0558838   .0912835    -0.61   0.541    -.2349566     .123189
     hivinfo |   .1474282   .0860742     1.71   0.087    -.0214254    .3162819
         std |     1.3919   .1417423     9.82   0.000     1.113841    1.669959
   discharge |   .5356415   .1569342     3.41   0.001     .2277803    .8435027
       _cons |  -.8038681   .2487702    -3.23   0.001    -1.291886   -.3158503
------------------------------------------------------------------------------
(B)
xi: reg wscore age age2 educ i.location hivinfo std discharge if gender==1
      Source |       SS       df       MS              Number of obs =     992
-------------+------------------------------           F(  8,   983) =   22.63
       Model |  576.990063     8  72.1237579           Prob > F      =  0.0000
    Residual |  3132.96861   983  3.18715016           R-squared     =  0.1555
-------------+------------------------------           Adj R-squared =  0.1487
       Total |  3709.95867   991  3.74365153           Root MSE      =  1.7853
------------------------------------------------------------------------------
      wscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0121779   .0270297     0.45   0.652    -.0408647    .0652206
        age2 |  -.0001408   .0003807    -0.37   0.711    -.0008878    .0006062
        educ |   .0470636   .0160778     2.93   0.003     .0155129    .0786143
_Ilocation_1 |  -.0307476    .133258    -0.23   0.818    -.2922505    .2307552
_Ilocation_2 |  -.4094038   .1521044    -2.69   0.007    -.7078905   -.1109171
     hivinfo |   .3612295    .139291     2.59   0.010     .0878876    .6345714
         std |   1.484963   .1992645     7.45   0.000     1.093931    1.875996
   discharge |    .394767   .2834684     1.39   0.164    -.1615057    .9510398
       _cons |  -.5658047   .3518007    -1.61   0.108    -1.256171    .1245619
------------------------------------------------------------------------------

For the question about signs of STD's in men:

(C)
xi: reg mscore age age2 educ i.location hivinfo std discharge if gender==0
      Source |       SS       df       MS              Number of obs =    1360
-------------+------------------------------           F(  8,  1351) =   34.61
       Model |   562.25936     8    70.28242           Prob > F      =  0.0000
    Residual |  2743.78476  1351  2.03092876           R-squared     =  0.1701
-------------+------------------------------           Adj R-squared =  0.1652
       Total |  3306.04412  1359  2.43270354           Root MSE      =  1.4251
------------------------------------------------------------------------------
      mscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0127081   .0184106    -0.69   0.490    -.0488246    .0234084
        age2 |   .0003497   .0002562     1.36   0.173     -.000153    .0008523
        educ |   .1121646   .0140167     8.00   0.000     .0846677    .1396615
_Ilocation_1 |  -.0442112   .0959776    -0.46   0.645    -.2324926    .1440701
_Ilocation_2 |  -.0103502   .0957218    -0.11   0.914    -.1981296    .1774293
     hivinfo |   .2862898   .0902592     3.17   0.002     .1092263    .4633533
         std |   1.192142    .148634     8.02   0.000     .9005638    1.483721
   discharge |   .4550357   .1645645     2.77   0.006      .132206    .7778654
       _cons |  -.7638876   .2608656    -2.93   0.003    -1.275633   -.2521419
------------------------------------------------------------------------------
(D)
xi: reg mscore age age2 educ i.location hivinfo std discharge if gender==1
      Source |       SS       df       MS              Number of obs =     992
-------------+------------------------------           F(  8,   983) =   39.63
       Model |  1009.16253     8  126.145316           Prob > F      =  0.0000
    Residual |  3128.98969   983  3.18310243           R-squared     =  0.2439
-------------+------------------------------           Adj R-squared =  0.2377
       Total |  4138.15222   991  4.17573382           Root MSE      =  1.7841
------------------------------------------------------------------------------
      mscore |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0266837   .0270126     0.99   0.323    -.0263253    .0796926
        age2 |  -.0002282   .0003804    -0.60   0.549    -.0009748    .0005183
        educ |   .0632624   .0160675     3.94   0.000     .0317318     .094793
_Ilocation_1 |   -.184016   .1331734    -1.38   0.167    -.4453528    .0773207
_Ilocation_2 |  -.4350578   .1520078    -2.86   0.004    -.7333549   -.1367607
     hivinfo |   .2756452   .1392025     1.98   0.048     .0024769    .5488135
         std |    1.96974   .1991379     9.89   0.000     1.578956    2.360524
   discharge |   .4981576   .2832883     1.76   0.079    -.0577618    1.054077
       _cons |  -.7945734   .3515772    -2.26   0.024    -1.484501   -.1046453
------------------------------------------------------------------------------

Let's concentrate on the variables that are statistically significant at the 1% or 5% level. Education has a much smaller effect on the scores of men than on those of women, for both score variables; for mscore, the educ coefficient is 0.063 in (D) against 0.112 in (C). This confirms what we saw earlier in the model with one interaction term for gender*education: the marginal effect of a year of education on male scores is smaller than the marginal effect of that same year on female scores.

Having heard about STD's (std) raises men's scores more than women's when it comes to knowledge of the symptoms in both sexes (compare equation (C) with (D), and (A) with (B)). Perhaps men receive better-quality information than women. It might also be that the form in which information about STD's was conveyed allowed men to absorb these facts more easily.
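The gaps described in the two paragraphs above can be given a rough size check using only the numbers printed in tables (C) and (D). The sketch below assumes that the two subsamples are independent (they do not overlap, since each respondent appears in only one of them) and large enough for a normal approximation. It is a back-of-the-envelope comparison, not a substitute for a formal test in a pooled model.

```stata
* Rough z-ratio for the difference between two coefficients estimated on
* separate, non-overlapping samples: (b1 - b2) / sqrt(se1^2 + se2^2).
* All numbers are copied from the mscore regressions (C) and (D).

* educ: .1121646 (s.e. .0140167) in (C), .0632624 (s.e. .0160675) in (D)
display (.1121646 - .0632624) / sqrt(.0140167^2 + .0160675^2)

* std: 1.96974 (s.e. .1991379) in (D), 1.192142 (s.e. .148634) in (C)
display (1.96974 - 1.192142) / sqrt(.1991379^2 + .148634^2)
```

The two ratios come out at roughly 2.3 for educ and 3.1 for std. Both exceed 1.96 in absolute value, which suggests that the differences between the two groups are unlikely to be due to sampling error alone.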

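However, being a male living in an urban village significantly reduces the score on both variables (in (D), for instance, the significant location dummy has a coefficient of -0.435 with p = 0.004, while neither location dummy is significant in (C)). This informational gap between urban-village dwellers and other individuals does not seem to be present for women.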

The regressions we have run indicate that many variables can have different effects for different groups of individuals: in this case, men and women. Sometimes these differences can be captured with interaction terms; at other times, we may want to specify completely separate models for each group.
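As a minimal sketch of the interaction-term route, the commands below pool men and women into one regression and let the slopes on educ and std differ by gender. The variable names gender_educ and gender_std are hypothetical and introduced here only for illustration; the sketch assumes that gender is coded 0/1, as in the if conditions used for (C) and (D).

```stata
* Hypothetical interaction variables (names are illustrative only)
gen gender_educ = gender*educ
gen gender_std  = gender*std

* Pooled model: gender shifts the intercept, and the two interaction
* terms let the educ and std slopes differ between the groups
xi: reg mscore age age2 educ i.location hivinfo std discharge gender gender_educ gender_std

* Joint F-test of the hypothesis that both slopes are equal across groups
test gender_educ gender_std
```

In this specification only educ and std are allowed to vary by gender. Running completely separate models, as in (C) and (D), is equivalent to interacting gender with every regressor, and it also lets the residual variance differ between the two groups.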

 

 

EXERCISES

  1. Is the correlation between age and age at first sex different for individuals with different religious beliefs?
  2. What is the correlation between level of education and whether an individual has ever heard or seen information about HIV?
  3. What is the average difference in the age of first partner for men and women? Think about how you would answer this question using the tab, sum command and using simple regression.
  4. What is the relationship between the number of people in the household and the number of rooms in the house? If you had to run a regression (using the entire data set, not just those individuals in the individual questionnaire), what would be your dependent variable? Why?
  5. Let's suppose that the number of people in a family determines the size of the house that it lives in. If so, how many more rooms (on average) is a family likely to acquire for each additional person? Please show the graph for this relationship.
  6. How is the size of a house (in terms of number of rooms) affected by family size and whether the head of the household works or not? Do these variables significantly explain/predict changes in total school expenditure? Why or why not? [You will have to create a variable for 'whether the head of the household works', and remember to keep only one observation per household in your regression.]

 
