TABLE OF CONTENTS
Introduction
Dummy Variables
Interactions with Dummy Variables
Linear Transformations of non-Linear relationships
Transformations using Squared Terms
Transformations using the natural Logarithm
Example: Further Exploring STD Symptoms Variables
Exercises
INTRODUCTION
Let's quickly review what we know about simple regression analysis. In general
form, the simple linear regression model has one independent variable (X) and
one dependent variable (Y). In multiple regression, the dependent variable Y is
assumed to be a function of a set of k independent variables: X1, X2, X3, ..., Xk.
This yields a new regression equation, an extension of the one we saw in
Simple Regression:
Y = a + b1X1 + b2X2 + ... + bkXk
As with the simple regression equation, the interpretation of each of these
coefficients is straightforward. Each "b" is a partial slope coefficient. Put
differently, each "b" coefficient is the slope of the relationship between that
particular independent variable X and the dependent variable Y when all other
independent variables in the model are "held constant," that is, fixed at some
value (they do not need to equal zero). For example, the b1 coefficient refers
to the slope between X1 and the dependent variable Y when all other variables
in the equation, X2, X3, etc., are held constant. Similarly, the value for b2
is the slope for the relationship between X2 and the dependent variable Y when
all other variables, X1, X3, etc., are held constant. As in simple regression,
the "a" refers to the intercept, also known as the constant. This value is the
value of predicted Y (yhat) when all of the independent variables, X1, X2, X3,
etc., are equal to zero. Thus, multiple regression allows us to state the
relationship between two main variables while controlling for other factors -
also known as partial effects.
It should be obvious how useful this approach can be for quantitative social
researchers, since we are often interested in social phenomena that go beyond a
basic bivariate relationship. To expand on our example before, we might be
interested in whether the relationship between age of first sex and age varies
by gender, or by degree of literacy. This type of question requires multiple
regression. This new approach will allow us to investigate the initial
relationship while controlling for a 3rd, a 4th, and an x-number of factors.
In the following sections, we will investigate in depth the relationship between
sexage, age, educ, and gender. In particular, we are hypothesizing that the age
of first sex is dependent on an individual's cohort (represented through the age
variable), their highest level of education and their gender.
First, open your data again, and check what the distribution of education values
looks like:
use bais.dta, clear
keep if rec_per==1
tab educ
Remember also that we need to deal with outliers and missing values not yet
coded as missing:
keep if sexage<=age
Let's type:
corr sexage educ
| sexage educ
-------------+------------------
sexage | 1.0000
educ | 0.0664 1.0000
The correlation between educ and sexage is 0.0664, which is a weak correlation,
but suggests that the more years of education an individual has, the higher the
age at first sex. However, we do not know how much of a difference education
makes; we just know that it is positively associated with age of first sex. To
understand this relationship further, we need to estimate the regression of
sexage on education.
We accomplish this by typing:
reg sexage educ
Source | SS df MS Number of obs = 2005
-------------+------------------------------ F( 1, 2003) = 8.87
Model | 72.6737934 1 72.6737934 Prob > F = 0.0029
Residual | 16405.025 2003 8.19022714 R-squared = 0.0044
-------------+------------------------------ Adj R-squared = 0.0039
Total | 16477.6988 2004 8.22240457 Root MSE = 2.8619
------------------------------------------------------------------------------
sexage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .055436 .0186102 2.98 0.003 .0189386 .0919333
_cons | 17.95426 .177452 101.18 0.000 17.60625 18.30227
------------------------------------------------------------------------------
Do you remember how to interpret these results? Let's review the basic
regression equation:
Y = a + bX
In our case, this equation becomes:
(predicted sexage) = 17.95426 + .055436(educ)
We can immediately interpret the slope coefficient for education as the number
or fraction of years by which the predicted age of first sex increases for each
additional year of education: here, about .055 of a year. Judging from the size
of the t-value (2.98), we can tell that the coefficient is statistically
significantly different from zero.
The constant, as discussed before, reflects the value of the dependent variable
Y when the independent variables are equal to zero. While this property is
technically useful in the calculation of the regression coefficients and
calculation of predicted Y values, its actual value is not always of use.
Obviously we do not want to ignore it, but we also do not need to dwell on it
since it is often not very interpretable. In our current case, it literally says
that when education level is zero, predicted age of first sex is 17.95426. The
constant is significantly different from zero, as indicated by the t-stat. If,
however, we had centered our education variable around the sample's education
mean, then the "zero" value would actually be the average level of education.
Interpreting the constant in that case would be more useful. Moving along, the
R-squared for this regression (0.0044) tells us that education accounts for less
than half of one percent of the variation around the mean of sexage. Although we
would caution not to fall into the trap of maximizing the R-squared when we are
running regressions, we would probably all agree that this regression with such
a low R-squared is not picking up any strong linear relationship between
education and age of first sex. If we leave the analysis at that, what
implications might this apparent lack of relationship have for government policy
towards HIV/AIDS prevention and control?
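The centering idea mentioned above can be tried directly. The following is a
minimal sketch, assuming the data are loaded and cleaned as before; educ_c is a
hypothetical variable name introduced here for illustration and is not part of
the data set.

```stata
* Mean of education among the cases that enter the regression
* (those with non-missing sexage and educ)
quietly summarize educ if !missing(sexage, educ)

* educ_c is a hypothetical name: education measured as distance from the mean
generate educ_c = educ - r(mean)

* Same regression as before, with centered education
regress sexage educ_c
```

Centering only shifts the X axis, so the slope on educ_c should match the slope
on educ. The constant changes: it becomes the predicted age at first sex for an
individual with the average level of education.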
Let's now try graphing the regression equation:
predict fsexhat
graph twoway scatter sexage educ || line fsexhat educ, ylabel(0(5)40) ytick(0(5)40) xlabel(0(5)25) xtick(0(5)25)

Figure 1
Issues of Parsimony and Saturation
When thinking about introducing variables into a model, it is important to
keep the notions of parsimony and saturation in mind. That is, we should always
strive to include ONLY the variables that make sense and that are efficient at
capturing the desired social phenomenon. Model building is often a balancing act
between parsimony and saturation. When we say that a model is "saturated," we
mean that the model has too many variables - it is overspecified. A model that
is overspecified or saturated can often predict each case in the sample
perfectly because the model is using up all the degrees of freedom. Therefore,
when selecting variables for a model, it is prudent to include only the most
necessary variables, or we risk overspecifying the model. With that in mind,
let's proceed.
Introducing a third variable
At this point, we can consider including our first control variable. It is
likely that the age at first sex is not only dependent on years of education,
but also on age. By including age in our model, we acknowledge that sexage is
also a function of age. It is important to include this factor because both age
at first sex and years of education may differ across cohorts. If you remember
our earlier discussion on how to interpret coefficients, each coefficient in a
regression model is a partial effect, meaning that the coefficient reflects the
effect of a variable while holding the others constant. In this case it means
that when we include age, our coefficient for educ will be the effect of
education while holding age constant. Do not read "holding constant" as setting
age to zero: we are not saying that the coefficient of education is the value
for a newborn (age 0). Rather, think of "controlling" as comparing individuals
of the same age (who may have very different levels of education). Enough
theory, let's try running the multiple regression model now:
reg sexage educ age
Source | SS df MS Number of obs = 2005
-------------+------------------------------ F( 2, 2002) = 116.46
Model | 1717.23701 2 858.618505 Prob > F = 0.0000
Residual | 14760.4617 2002 7.37285801 R-squared = 0.1042
-------------+------------------------------ Adj R-squared = 0.1033
Total | 16477.6988 2004 8.22240457 Root MSE = 2.7153
------------------------------------------------------------------------------
sexage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1343502 .0184308 7.29 0.000 .0982047 .1704957
age | .0837989 .0056109 14.94 0.000 .0727951 .0948027
_cons | 14.59331 .2810493 51.92 0.000 14.04213 15.14449
-----------------------------------------------------------------------------
Compare our old equation (from above):
(predicted sexage) = 17.95426 + .055436(educ)
--> {R-squared = 0.0044}
To our new multiple regression equation:
(predicted sexage) = 14.59331 + .1343502(educ) + .0837989(age)
--> {R-squared = 0.1042}
Right away we should notice the effect that age has on our model. Notice that
the effect of education, controlling for age, is more strongly positive now:
this means that for a given age, individuals with more education tend to have
first sex at older ages (by .1343502 of a year for each additional year of
education). Another way of thinking about these new results is that in the
initial model, the "true" effect of education was being masked by the effect of
age, which we did not include in the simple regression set-up. Since age has a
positive effect on sexage and the coefficient on education increased when we
included age, the relationship between age and education must be negative - the
older an individual is, the more likely they are to have less education (you can
check this simply, by running the correlation of education against age).
The R-squared has also increased to 10%, implying that the variation in
education and age is enough to explain 10% of the variation in
sexage, in our sample.
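The comparison above can be reproduced in a few commands. This is a sketch
rather than part of the original analysis; the stored-estimate names simple and
withage are hypothetical labels chosen for illustration.

```stata
* Sign of the education-age relationship among cases with a valid sexage
corr educ age if !missing(sexage)

* Re-fit both models quietly and store the results under hypothetical names
quietly regress sexage educ
estimates store simple
quietly regress sexage educ age
estimates store withage

* Coefficients, standard errors, N and R-squared side by side
estimates table simple withage, b(%9.4f) se stats(N r2)
```

Placing the two models next to each other makes it easier to see how the educ
coefficient and the R-squared change once age is added.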
The addition of a single regressor to the bivariate model probably does not seem
that difficult, but as we move forward, you will realize that this is merely the
tip of the iceberg.
Now that you have been introduced to multiple regression, try the following exercises:
- What is the relationship between the number of births that a woman has had, her years of education and the age at first sex? Use regression analysis to answer this question and, as we did at the end of module 6, restrict your analysis to the sample of women who are at least 40 years of age.
- Question 1 Answer
DUMMY VARIABLES
Thus far we have focused on using continuous variables in our regressions. We
can extend regression analysis to include categorical variables such as gender,
general satisfaction, urban area etc. But how do you include variables whose
values are arbitrary? Can we calculate the average gender of a country? How
about the average urban setting? The answer is no, but let's find out how these
types of variables are useful in regression analysis.
What Makes a Dummy Variable a "Dummy" variable?
No, "dummy" variables are not "stupid" variables; in fact they are quite smart
and useful! A dummy variable has two properties that make it a "dummy variable."
First, it is categorical and non-ordinal (i.e., categories have no rank order).
Thus, the number values associated with each category serve only to identify the
various groups/categories it represents, but not to assign value or order to any
one category. The second, and this is what makes a dummy variable a "dummy
variable," is that it is binary in the sense that it has only two values - 0 and
1. Technically, a variable like literacy or location may have more than two
categories, but when this type of variable is used in a regression, it is split
into one 0/1 variable per category, and a coefficient is calculated for each
included category. Thus, if done correctly, even a multi-category variable can
be used as a set of dummy variables because in the end, it is broken up into 0s
and 1s.
Dummy variables are useful because they allow us to control for membership
within a particular category or group. If we neglected to split a categorical
variable into several dummy variables when using it in a regression, we would
get misleading results because regression analysis treats variables as
continuous unless told otherwise. Therefore, if you include a categorical
variable like gender in a regression, Stata (or any other statistical program)
would treat it as simply another numeric variable and would not realize that
those numbers have no mathematical meaning - Stata does not know if the values
in a variable are arbitrary or not. Regression analysis revolves around the use
of means and standard deviations, but with categorical variables, the means and
standard deviations of the category codes have no meaning.
How NOT to use Categorical variables
Let's try the following example of what NOT to do. Let's continue with our
previous example of the effect of education on age of first sex. This time let's
include literacy in the regression model without considering the fact that it is
a categorical variable. We might think that literacy matters separately from
education, as not all individuals with the same level of education are
necessarily literate to the same degree. First, let's tabulate literacy to see
its categories:
. tab literacy, missing
literacy | Freq. Percent Cum.
----------------------+-----------------------------------
Reads easily | 1,708 71.83 71.83
Reads with difficulty | 348 14.63 86.46
Does not read | 319 13.41 99.87
. | 3 0.13 100.00
----------------------+-----------------------------------
Total | 2,378 100.00
Let's regress it now:
reg sexage educ literacy
Source | SS df MS Number of obs = 2002
-------------+------------------------------ F( 2, 1999) = 6.62
Model | 108.385409 2 54.1927043 Prob > F = 0.0014
Residual | 16356.6076 1999 8.182395 R-squared = 0.0066
-------------+------------------------------ Adj R-squared = 0.0056
Total | 16464.993 2001 8.22838231 Root MSE = 2.8605
------------------------------------------------------------------------------
sexage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .0757481 .0209941 3.61 0.000 .0345756 .1169207
literacy | .3551987 .1694974 2.10 0.036 .0227887 .6876088
_cons | 17.35601 .336485 51.58 0.000 16.69611 18.01591
-----------------------------------------------------------------------------
After reviewing these results, how would you interpret the literacy
coefficient? Would it make sense to say that for every unit increase in
literacy, while controlling for education (educ), there is a .3551987 increase
in age at first sex? The answer is NO. This is similar to saying that the
average literacy in Botswana is 1.41. What would 1 unit of literacy mean?
Your guess is as good as mine.
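To see why the numbers behind literacy carry no quantity, it can help to look
at the raw codes. A short sketch using only standard Stata commands:

```stata
* Show the numeric codes stored behind the value labels
tab literacy, nolabel

* Stata will happily report a mean and standard deviation of those codes,
* even though they only identify categories
summarize literacy
```

The mean reported by summarize is computed from the category codes, so it would
change if the categories were numbered differently.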
The Correct Way
Let's try this same example, except this time we'll do it correctly. To do this
we need to call upon a few of our newly found skills. First, we need to split
the literacy variable into multiple dummy variables. There are two main ways to
accomplish this task. Here we will cover the more familiar way
(tab varname, gen(varname)) and then below you will be introduced to a new
command that will make it easier - the xi command. We covered this first command
in an earlier session:
tab literacy, gen(litid)
[Note: litid will be automatically numbered with sequential numbers]
Then we tabulate our new litid variables to make sure the command worked by
typing:
tab1 litid1 litid2 litid3
[Note: tab1 tells Stata to tabulate each variable separately instead of
cross tabulating all of them together in one big matrix]
tab1 litid1 litid2 litid3
-> tabulation of litid1
literacy==R |
eads easily | Freq. Percent Cum.
------------+-----------------------------------
0 | 667 28.08 28.08
1 | 1,708 71.92 100.00
------------+-----------------------------------
Total | 2,375 100.00
-> tabulation of litid2
literacy==R |
eads with |
difficulty | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,027 85.35 85.35
1 | 348 14.65 100.00
------------+-----------------------------------
Total | 2,375 100.00
-> tabulation of litid3
literacy==D |
oes not |
read | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,056 86.57 86.57
1 | 319 13.43 100.00
------------+-----------------------------------
Total | 2,375 100.00
Great, our command worked as it should. Each new litid variable is coded as 1
for all people who are of that degree of literacy, and 0 for everyone else. For
example, there are 348 individuals who read with difficulty, and 2,027
individuals who do not fall in that category.
Now it's time to run the regression with our newly created dummy variables. We
do this by typing:
reg sexage age educ litid2 litid3
Source | SS df MS Number of obs = 2002
-------------+------------------------------ F( 4, 1997) = 58.38
Model | 1723.72804 4 430.932011 Prob > F = 0.0000
Residual | 14741.265 1997 7.38170504 R-squared = 0.1047
-------------+------------------------------ Adj R-squared = 0.1029
Total | 16464.993 2001 8.22838231 Root MSE = 2.7169
------------------------------------------------------------------------------
sexage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0830834 .0056512 14.70 0.000 .0720005 .0941663
educ | .1426391 .0204928 6.96 0.000 .1024496 .1828287
litid2 | .2159715 .1955753 1.10 0.270 -.1675816 .5995246
litid3 | .0076977 .4722996 0.02 0.987 -.9185539 .9339493
_cons | 14.51124 .2956527 49.08 0.000 13.93142 15.09106
------------------------------------------------------------------------------
Our new regression line can be stated as:
(predicted sexage) = 14.51124 + .1426391(educ) + .0830834(age)
+.2159715(litid2)+ .0076977(litid3)
By now, you should be able to interpret the basic regression equation. This new
equation is simply an extension of the first regression equation discussed
earlier. Let's quickly review it. This equation tells us that for every
additional year of education, age at first sex increases by .1426391 of a year,
while controlling for age and literacy. It also tells us that for every
additional year of age, sexage increases by about .0830834 while controlling for
education and literacy. Now, the literacy coefficients tell us that for litid2
(reads with difficulty) there is an added effect of .2159715 of a year over the
omitted category (litid1 - reads easily) while controlling for education and
age. Similarly, for litid3 (does not read) there is an added effect of .0076977
over individuals who are fully literate, while controlling for education and
age.
In general, the litid coefficients show us the effect that literacy has on the
age at first sex, after controlling for education and age. None of the literacy
effects are significantly different from zero - which implies that there are no
significant differences in the age at first sex between individuals at different
levels of literacy.
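The t-tests above look at each literacy dummy separately. One way to test both
at once, which is not part of the original analysis, is a joint F-test using
Stata's test command right after the regression. A minimal sketch:

```stata
* Re-run the model so that its results are the active estimates
regress sexage age educ litid2 litid3

* Joint test that both literacy coefficients equal zero
test litid2 litid3
```

A large p-value from this test would be consistent with the conclusion that
literacy adds little once education and age are controlled for.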
Omitted/Reference Categories
There is one important point to keep in mind when interpreting a multiple
regression that uses dummy variables. Notice that only 2 litid dummy variables
were included in the equation. Why would this be necessary? Because the three
dummies add up to 1 for every individual, including all three alongside the
constant would make them perfectly collinear, and the model could not be
estimated. Whenever we use dummy variables, there should always be an omitted
category (also known as the reference category); in this case the omitted
category is "reads easily" (litid1). Being "omitted" does not mean that the
equation is ignoring that group of people. Rather, Stata only explicitly shows
us the coefficients for litid2 and litid3, and each of them is measured relative
to litid1. In fact, the value for the omitted category (litid1) can be known
from the results above. If you remember our description of what the constant is,
you will realize that litid1 is captured by it. The constant represents the
value of our dependent variable Y when all regressors are equal to zero, and
that includes litid2 = 0 and litid3 = 0. And who is not in the litid2 or litid3
categories? Correct, litid1 (fully literate individuals).
It is important to realize that we did not drop any cases by omitting the litid1
category; we simply "shifted" them into the constant and used them as a
comparison group. If we were using another set of dummy variables, gender for
example, we would have to choose the reference category for that variable as
well. If we chose men as our reference category, we would get a coefficient for
women, but not for men. The value for men would be absorbed into the constant.
If both gender and literacy were included in a regression model as dummy
variables, two omitted categories would be captured and represented by the
constant - in our case it would have been literate males.
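The role of the reference category can be checked by re-running the model with
a different dummy left out. The sketch below assumes the litid1-litid3 dummies
created earlier with tab literacy, gen(litid).

```stata
* Reference category = "reads easily" (litid1 left out), as in the text
regress sexage age educ litid2 litid3

* Same model with "does not read" (litid3) as the reference instead
regress sexage age educ litid1 litid2

* All three dummies at once: they are collinear with the constant,
* so Stata omits one of them and notes this in the output
regress sexage age educ litid1 litid2 litid3
```

The first two regressions fit the data equally well; only the comparison group
changes, so the literacy coefficients and the constant change while the educ
and age coefficients stay the same.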
A Short Cut: The "xi" Option
Although the tab varname, gen(varname) command is useful in creating dummy
variables, it is not the only way. Stata provides us with an easier and more
convenient short-cut to specify a categorical variable in a regression equation.
The xi prefix tells Stata to treat the specified variable(s) as categorical and
to create the dummy variables for us. It can be placed in front of estimation
commands like regress, logistic, probit, etc. Let's try it.
First, we will recode and relabel the gender variable so that it is consistent
with dummy variable coding - 0s and 1s. Note, however, that xi could also handle
gender in its original coding; we simply choose to recode it.
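* Inspect gender with and without its value labels
tab gender
tab gender, nol
* Recode from 1/2 to 0/1 so that 0 = male and 1 = female
recode gender 1=0 2=1
* Attach matching labels; replace overwrites a label named gender if one exists
label def gender 1 "Female" 0 "Male", replace
label val gender gender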
We have recoded and relabeled the gender variable as 0=male and 1=female.
Now we move on to using the xi command. We continue with our sexage and
education example, but now we will be controlling for age, literacy, and gender.
By doing so, we are stating not only that age of first sex depends on education,
but also on age, gender, and literacy. This time, however, we will be declaring
'does not read' (literacy==3) as the reference category. We do this by issuing
the char varname[omit] statement before the regress command. This statement is
useful when using xi because Stata, by default, selects the first category in
the specified variable as the reference category. The xi: prefix is placed at
the beginning of the regression command, and each variable you want Stata to
expand into its constituent categories is "tagged" with an "i." in front of it.
See below:
char literacy[omit] 3
xi:reg sexage educ age i.literacy i.gender
Notice that an "i." is included for the variables literacy and gender. Also
remember that we have told Stata to treat category 3 of the literacy variable as
the reference category, and since we have not specified a reference category for
gender, Stata will omit its first category - 0, men.
Source | SS df MS Number of obs = 2002
-------------+------------------------------ F( 5, 1996) = 47.55
Model | 1752.59423 5 350.518847 Prob > F = 0.0000
Residual | 14712.3988 1996 7.37094127 R-squared = 0.1064
-------------+------------------------------ Adj R-squared = 0.1042
Total | 16464.993 2001 8.22838231 Root MSE = 2.7149
------------------------------------------------------------------------------
sexage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1380934 .0206063 6.70 0.000 .0976813 .1785055
age | .0835832 .0056527 14.79 0.000 .0724973 .0946691
_Iliteracy_1 | .0351196 .4724508 0.07 0.941 -.8914288 .9616681
_Iliteracy_2 | .2043912 .4818247 0.42 0.671 -.7405408 1.149323
_Igender_1 | -.247329 .1249804 -1.98 0.048 -.4924347 -.0022234
_cons | 14.65515 .5151142 28.45 0.000 13.64493 15.66537
------------------------------------------------------------------------------
What do the results tell us? Right away we should be able to tell that our
model explains over 10% of the variation in our dependent variable. Next,
we should notice that the coefficient for gender is negative. This tells us
that, in relation to the omitted category (gender=0 - men), the reported
category (women) has a lower predicted age of first sex, by about a quarter of a
year, after controlling for all other variables! Overall, the model tells us
that if we know a person's level of education, their age, their gender, and
literacy, our squared prediction error for their age of first sex is about 10.6%
smaller than if we simply guessed the mean sexage in the sample.
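In Stata 11 and later, factor-variable notation offers an alternative to the xi
prefix. The following is a sketch of the same model under that assumption about
the Stata version; it is not the approach used in this module.

```stata
* ib3.literacy makes category 3 ("does not read") the base category;
* i.gender uses the lowest value (0, men) as the base by default
regress sexage educ age ib3.literacy i.gender
```

With this notation no _I* variables are created, and the coefficients are
labelled by category in the output rather than named _Iliteracy_1 and so on.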
Let's consider what our new equation looks like:
(predicted sexage) = 14.65515 + .1380934(educ) +.0835832(age) +
.0351196(literate=1, else=0) + .2043912(reads with difficulty=1, else=0)
-.247329(female=1, else=0)
The new equation allows us to calculate, for example, the predicted age of first
sex for a 50 year old woman who has 10 years of education but reads with
difficulty, or the age of first sex for a 25 year old literate man with 16 years
of education. All we need to do is plug in the number of years of education, the
age, and either a 1 or a 0 for whether the person falls within the particular
category or not. Let's try it.
(50 yr old woman with 10 yrs of ed who reads with difficulty: predicted first
sex) = 14.65515 + .1380934(educ) +.0835832(age) + .0351196(literate=1, else=0) +
.2043912(reads with difficulty=1, else=0) -.247329(female=1, else=0)
= 14.65515 + .1380934(10) +.0835832(50) + .0351196(0) + .2043912(1)
-.247329(1)
--> ANSWER = 20.17231
For a 25 year old literate man with 16 years of education, the predicted
equation is the following:
(25 yr old literate man with 16 years of ed: predicted sexage) = 14.65515 +
.1380934(educ) +.0835832(age) + .0351196(literate=1, else=0) + .2043912(reads
with difficulty=1, else=0) -.247329(female=1, else=0)
= 14.65515 + .1380934(16) +.0835832(25) + .0351196(1) + .2043912(0) -.247329(0)
--> ANSWER = 18.98934
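These hand calculations can be checked in Stata with the display command. The
sketch below first uses the rounded coefficients from the table and then the
stored coefficients, which assumes the xi regression above was the last model
estimated.

```stata
* 50 year old woman, 10 years of education, reads with difficulty
display 14.65515 + .1380934*10 + .0835832*50 + .0351196*0 + .2043912*1 - .247329*1

* 25 year old man, 16 years of education, reads easily
display 14.65515 + .1380934*16 + .0835832*25 + .0351196*1 + .2043912*0 - .247329*0

* The same first prediction from the stored coefficients of the xi regression
display _b[_cons] + _b[educ]*10 + _b[age]*50 + _b[_Iliteracy_2] + _b[_Igender_1]
```

The first two lines should reproduce the answers above up to rounding. Using
_b[] avoids retyping coefficients and the rounding that comes with it.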
What, if anything, do these predicted values assume? Any ideas? How about
assuming that each of the non-categorical variables in our equation have a
linear relationship with the dependent variable? Does it make sense that older
individuals are likely to uniformly be having first sex at older ages? We might
think that the relationship between cohort and first sex is non-linear: that as
you look at successively younger cohorts, the age at first sex declines, but not
linearly. Perhaps this age at first sex falls at a slower and slower rate. We
will learn how to control for this curvilinear effect later in this section.
Note on Extrapolating Beyond the Data
Let's try calculating the following predicted sexage:
What is the predicted age at first sex for a 90 year old illiterate male with 20
years of education? We can easily carry out the calculations for this question:
(predicted sexage) = 14.65515 + .1380934(20) +.0835832(90) + .0351196(0) +
.2043912(0) -.247329(0)
-> (predicted sexage) = 14.65515 + .1380934(20) +.0835832(90)
= ANSWER = 24.93951
Do you see any problems with this example? Does our age variable include people
over the age of 64? NO. Extrapolating beyond the available data points is never
a good idea because our results apply only to the range of cases used to
calculate the model. It is possible that our observed relationship holds for 90
year olds with 20 years of education, but it is also possible that it does not.
The point is that without those actual cases in the calculation of the model it
is impossible to know. Therefore, we suggest that you never try to extrapolate,
that is, predict values beyond the data points used in the model.
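Before plugging values into the equation, it is worth checking the range of
each continuous variable in the estimation sample. A minimal sketch, to be run
right after the regression:

```stata
* e(sample) marks the observations that were used in the last regression
summarize age educ if e(sample)
```

Any age or education value outside the reported minimum and maximum would be an
extrapolation.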
Try these exercises to make sure you understand the basics of interpreting
dummy variables in multiple regression analysis.
- 2. What is the predicted age of first sex for a 30 year old literate woman, with 5 years of education?
- Question 2 Answer
- 3. What is the predicted age of first sex for a 45 year old man, who reads with difficulty, and has 5 years of education?
- Question 3 Answer
INTERACTIONS WITH DUMMY VARIABLES
Thus far, we have only dealt with the additive effects of dummy variables.
That is, the assumption has been that for each independent variable Xi,
the amount of change in our dependent variable Y is the same, regardless of the
values of the other independent variables in the equation. This assumption
allows us to interpret the partial coefficients as the effect of a variable
while controlling for the other independent variables in the model.
The additive assumption, however, does not always hold. In such cases, the
partial effect of a given independent variable cannot be interpreted as the
effect of the variable while all others are being held constant; instead, the
effect depends on the specific values of other independent variables in the
model. In these cases it is hypothesized that the independent variable Xi is
linearly related to the dependent variable Y, but that the slope of that linear
relationship depends on a different independent variable in the model.
Interactions are perhaps best visualized and understood in the case of dummy
variables.
For instance, in our example below, we interact education and gender. In effect,
what we are testing with an interacted model is whether or not the linear
relationship between an independent variable Xi and the dependent variable Y is
dependent on the values of a different independent variable in the model. More
intuitively, by interacting education and gender, we are testing whether the
effect of education on the age of first sex is different for men than for women.
In general, we can illustrate what we mean by the additive effect of dummy
variables in regression with the graph below. Each category of a dummy variable
has its own line in the graph. For instance, we can imagine the predicted effect
of education on sexage looking like the lines below. As it stands, this first
graph suggests that the effect of gender is the same at all education levels:
the lines for males and females differ only in height, and both have the same
slope for each unit change in X1. In the graph, Y = sexage, X1 = education, and
the coefficients are b1 for education and b2 for gender.

Graph 1
In the second graph, we find a hypothetical interaction effect. We can imagine this effect to be similar in form to that of the interaction between education and gender. That is, the effect of education (slope of the line) depends on the particular gender of the individual. In this case, we find that the upper-most line on the graph has a steeper slope than the line below it, thus the effect of education depends on the value of the other variable -- in this case, the gender of the individual.

Graph 2
Let's now investigate how the theory measures up to empirical findings. When
the xi prefix is used, creating an interaction term is as easy as inserting an
asterisk "*" between the two variables you wish to interact. In essence, this
tells Stata to multiply these two variables together and to include their main
effects as well. Alternatively, you can generate each interaction term yourself
by creating a variable that multiplies the two desired variables together. In
the immediate example below, we use the easier of these two approaches.
First, we choose to use "Does not read at all" as the reference category
(literacy==3).
Then, to interact education and gender, we simply include an asterisk between
education & i.gender.
char literacy[omit] 3
xi: reg sexage age i.gender*educ i.literacy
Source | SS df MS Number of obs = 2002
-------------+------------------------------ F( 6, 1995) = 44.03
Model | 1925.50914 6 320.91819 Prob > F = 0.0000
Residual | 14539.4839 1995 7.28796184 R-squared = 0.1169
-------------+------------------------------ Adj R-squared = 0.1143
Total | 16464.993 2001 8.22838231 Root MSE = 2.6996
------------------------------------------------------------------------------
sexage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0874985 .005678 15.41 0.000 .0763631 .098634
_Igender_1 | -1.799912 .3421138 -5.26 0.000 -2.47085 -1.128975
educ | .0589642 .0261485 2.25 0.024 .007683 .1102455
_IgenXeduc_1 | .1735204 .0356236 4.87 0.000 .1036571 .2433836
_Iliteracy_1 | -.028932 .469968 -0.06 0.951 -.9506115 .8927474
_Iliteracy_2 | .1402214 .479286 0.29 0.770 -.7997321 1.080175
_cons | 15.3257 .5303835 28.90 0.000 14.28553 16.36586
------------------------------------------------------------------------------
As with the previous regression results, we find coefficients for the main
effects of educ, age, _Iliteracy_1, _Iliteracy_2, and _Igender_1, but now we
also find the interaction effect of years of education and gender
(_IgenXeduc_1).
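The two ways of building an interaction term described above can be compared with a short sketch. It does not use the survey data: the data set below is simulated, all numbers in it are made up, and gender is assumed to be a 0/1 dummy. The variable name genXeduc is hypothetical.

```stata
* Simulated stand-in data: every number below is made up for illustration
clear
set seed 1234
set obs 500
generate gender = runiform() < 0.5        // assumed coding: 0/1 dummy
generate educ   = floor(13*runiform())    // 0 to 12 years of schooling
generate sexage = 15 + 0.06*educ - 1.8*gender + 0.17*gender*educ + rnormal(0, 2.5)

* Approach 1: let xi build the dummy and the interaction term
xi: regress sexage i.gender*educ
scalar b_xi = _b[_IgenXeduc_1]

* Approach 2: build the product term by hand (genXeduc is a hypothetical name)
generate genXeduc = gender*educ
regress sexage gender educ genXeduc
scalar b_manual = _b[genXeduc]

display "xi interaction:     " b_xi
display "manual interaction: " b_manual
```

Both regressions contain the same three regressors, so the two interaction coefficients should agree up to rounding. The choice between the approaches is mostly one of convenience.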
When interpreting interaction effects, it is important to keep in mind that the
main effects of the interacted variables can no longer be interpreted on their
own. Each one is now conditional on the other: the coefficient on educ is the
effect of education only when _Igender_1 equals zero, and the coefficient on
_Igender_1 is the gender difference only when educ equals zero. We still need
these coefficients, however, to calculate any estimated yhat value. For example,
if we were interested in calculating the predicted sexage for a literate female
aged 35 with 12 years of education, we would compute the following:
predicted sexage = 15.3257 + 35(.0874985) + 1(-1.799912) + 12(.0589642) +
(1 x 12)(.1735204) + 1(-.028932) + 0(.1402214)
predicted sexage = 19.349 (approximately)
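The hand calculation above can be checked in Stata with display. This is only a sketch: the coefficients are typed in from the output table, and the values chosen for the person (age 35, _Igender_1 = 1, educ = 12, _Iliteracy_1 = 1) are the ones used in the text. Run it from a do-file, since the /// line continuation is not accepted at the interactive prompt.

```stata
* Coefficients copied from the regression output above.
* Assumed values: age = 35, _Igender_1 = 1, educ = 12,
* _Iliteracy_1 = 1, _Iliteracy_2 = 0 (the same values used in the text).
scalar yhat = 15.3257               /// constant
    + .0874985*35                   /// age
    + (-1.799912)*1                 /// _Igender_1
    + .0589642*12                   /// educ
    + .1735204*(1*12)               /// _IgenXeduc_1 = _Igender_1 * educ
    + (-.028932)*1                  /// _Iliteracy_1
    + .1402214*0                    //  _Iliteracy_2
display "predicted sexage = " %7.3f yhat
```

Note that the interaction term enters as the product of the gender dummy and years of education, so it drops out entirely for anyone whose gender dummy is zero.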
- 4. How would you interpret the interaction effect?
- Question 4 Answer
LINEAR TRANSFORMATIONS OF NON-LINEAR RELATIONSHIPS
Thus far, we have assumed linear relationships for all of our regression
models. In fact, a linear relationship is a basic requirement for regression
analysis. Empirically, however, variables are often not associated in a linear
fashion. Yet this reality hardly precludes regression analyses from accurately
predicting and describing real-world phenomena. In this section we will show
you two basic approaches to achieving that. By using a quadratic term or by
taking the natural logarithm of a term, we can transform non-linear
relationships into approximately linear ones and often improve the fit of a
regression line considerably.
Note: Logarithmic and quadratic transformations are not restricted to multiple
regression. However, we have placed them in the multiple regression module
because they are rather advanced topics and should only be addressed after one
has a clear understanding of all of the material in the lessons prior to this
section.
Transformations using Squared Terms
A commonly used squared transformation is the square of age. Researchers often
include both age and age2 in regression models because doing so allows the
effect of a one-year increase in age to change as a person gets older. That is,
the effect of age is not likely to remain the same as we get older. By including
age2, the effect of age is allowed to vary across years of age.
gen age2=age*age
regress sexage age
predict yhat1, xb
line sexage yhat1 age, sort

Figure 2
regress sexage age age2
predict yhat2, xb
line sexage yhat2 age, sort

Figure 3
This graph allows us to see the effect of the squared term, age2.
How would we interpret the output from a regression of sexage on age and age2,
among other variables?
char literacy[omit] 3
xi: reg sexage age age2 i.gender*educ i.literacy
Source | SS df MS Number of obs = 2002
-------------+------------------------------ F( 7, 1994) = 45.34
Model | 2260.85646 7 322.979494 Prob > F = 0.0000
Residual | 14204.1366 1994 7.12343859 R-squared = 0.1373
-------------+------------------------------ Adj R-squared = 0.1343
Total | 16464.993 2001 8.22838231 Root MSE = 2.669
------------------------------------------------------------------------------
sexage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .3039383 .0320409 9.49 0.000 .2411012 .3667754
age2 | -.0029846 .000435 -6.86 0.000 -.0038377 -.0021315
_Igender_1 | -1.766521 .3382652 -5.22 0.000 -2.429912 -1.103131
educ | .0483948 .0258975 1.87 0.062 -.0023942 .0991839
_IgenXeduc_1 | .1665963 .0352336 4.73 0.000 .0974977 .2356949
_Iliteracy_1 | .0235548 .464696 0.05 0.960 -.8877857 .9348954
_Iliteracy_2 | .1645597 .4738585 0.35 0.728 -.7647501 1.093869
_cons | 11.90497 .7235444 16.45 0.000 10.48599 13.32395
------------------------------------------------------------------------------
In terms of our coefficients, we find that each year of education increases
age of first sex by about 0.05 of a year for the reference gender category (an
effect that is only marginally significant, p = 0.062), and that age increases
sexage up to the age of about 50.9 and decreases it thereafter. This is because
a quadratic ax^2 + bx + c turns over at x = -b/2a, which for our age and age2
coefficients is -.3039383/(2 x -.0029846) = 50.92.
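The turning-point arithmetic can be reproduced with a few scalars. This sketch simply types in the age and age2 coefficients from the table above. The slope formula b_age + 2*b_age2*age is the derivative of the quadratic with respect to age, and the ages 20 and 60 are arbitrary illustrative choices.

```stata
* Coefficients copied from the regression output above
scalar b_age  = .3039383
scalar b_age2 = -.0029846

* Age at which the fitted curve peaks: -b/(2a)
scalar peak = -b_age/(2*b_age2)
display "turning point (years of age): " %6.2f peak

* Slope of predicted sexage with respect to age at two illustrative ages
scalar slope20 = b_age + 2*b_age2*20
scalar slope60 = b_age + 2*b_age2*60
display "slope at age 20: " %7.4f slope20
display "slope at age 60: " %7.4f slope60
```

The slope is positive before the turning point and negative after it, which is what "allowing the effect of age to vary" means in practice. Directly after the regression, nlcom -_b[age]/(2*_b[age2]) should give the same turning point together with a standard error.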
Transformations Using the Natural Logarithm
Often it is desirable to run a regression using the natural logarithm (to the
base e) of a variable instead of the variable itself. For instance, if the graph
of the dependent variable on the independent variable shows that the
relationship is not linear, making one or both of the variables logarithmic can
sometimes produce a linear relationship. Therefore, although a linear
relationship might not exist between two variables, a linear
relationship might exist between the natural logarithms of the two variables.
Logarithmic transformation also lessens the influence of outliers (which can
sometimes drastically affect the slope of the regression line) because the
natural logarithm of a variable is much less sensitive to extreme observations
than is the variable itself.
As an aside: income is a variable that is often transformed using its natural
log, although we are not fortunate enough to have income as a variable in these
data. When we use the log transformation, the impact of each additional unit of
income decreases as income increases. That is, after a certain point more money
does not make that much more of a difference. For example, earning 3 billion
pula a year rather than 2 billion will probably not have much of an effect on
how many beers we drink, but earning 1000 pula per year rather than 100 is
likely to make a huge difference.
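The pula example can be made concrete by comparing the two income gaps on the log scale. The sketch below uses the four income figures from the text. The second half creates a logged variable from a simulated income variable, which is purely hypothetical because the survey data contain no income measure.

```stata
* Income gaps taken from the example in the text (pula per year)
scalar gap_rich = ln(3e9) - ln(2e9)     // 2 billion versus 3 billion
scalar gap_poor = ln(1000) - ln(100)    // 100 versus 1000
display "log gap, 2 vs 3 billion: " %6.3f gap_rich
display "log gap, 100 vs 1000:    " %6.3f gap_poor

* Creating a logged variable; "income" here is simulated and hypothetical
clear
set seed 42
set obs 200
generate income   = exp(rnormal(8, 1))
generate lnincome = ln(income)
summarize income lnincome
```

On the log scale the jump from 100 to 1000 pula is several times larger than the jump from 2 to 3 billion, even though the second gap is vastly larger in pula. A regression on logged income therefore treats proportional changes, not absolute ones, as equal steps.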
EXAMPLE: FURTHER EXPLORING STD SYMPTOMS VARIABLES
Now that we have some background in multiple regression, let's look at another example in more detail. Information about the observable symptoms of STD's is important for individuals to have, as it can help them to know when it is necessary to seek treatment for themselves. It is also crucial to be able to recognize these symptoms in one's sexual partners, in order that relevant protection measures can be chosen. Finally, it has been found that individuals are also more likely to contract HIV when they have other STD's, than when they don't.
In module 5, we constructed a composite measure of knowledge about the signs of STD's, using questions Q404 and Q405 in the questionnaire: wscore and mscore. We scored individuals on how many answers they volunteered. Let's investigate the information that people have about the signs of STD's in a man. Do you think men or women are likely to score better on this measure? Let's open the data, and tab to find out:
keep if rec_per==1
replace gender=0 if gender==2
lab def gender 0 "Female" 1 "Male"
lab val gender gender
egen mscore=robs(stdsign1m-stdsign11m)
egen wscore=robs(stdsign1w-stdsign11w)
tab mscore gender, row
| sex of respondent
mscore | Female Male | Total
-----------+----------------------+----------
0 | 1,030 800 | 1,830
| 56.28 43.72 | 100.00
-----------+----------------------+----------
1 | 341 239 | 580
| 58.79 41.21 | 100.00
-----------+----------------------+----------
2 | 400 363 | 763
| 52.42 47.58 | 100.00
-----------+----------------------+----------
3 | 282 286 | 568
| 49.65 50.35 | 100.00
-----------+----------------------+----------
4 | 96 102 | 198
| 48.48 51.52 | 100.00
-----------+----------------------+----------
5 | 29 44 | 73
| 39.73 60.27 | 100.00
-----------+----------------------+----------
6 | 7 10 | 17
| 41.18 58.82 | 100.00
-----------+----------------------+----------
7 | 7 10 | 17
| 41.18 58.82 | 100.00
-----------+----------------------+----------
8 | 6 13 | 19
| 31.58 68.42 | 100.00
-----------+----------------------+----------
9 | 2 14 | 16
| 12.50 87.50 | 100.00
-----------+----------------------+----------
10 | 3 18 | 21
| 14.29 85.71 | 100.00
-----------+----------------------+----------
11 | 2 9 | 11
| 18.18 81.82 | 100.00
-----------+----------------------+----------
Total | 2,205 1,908 | 4,113
| 53.61 46.39 | 100.00
Note that at the lower scores, the female proportion is larger than the male proportion, while at the higher scores, the ranking reverses. Men appear to score higher than women on this question. What is the difference in mean scores between men and women, on this question?
sort gender
by gender: sum mscore
_______________________________________________________________________________ -> gender = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
mscore | 2205 1.235828 1.514505 0 11
_______________________________________________________________________________ -> gender = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
mscore | 1908 1.619497 2.009124 0 11
The females have a lower average score than men. We could also have done this another way:
tab gender, sum(mscore)
tab gender, sum(wscore)
. tab gender, sum(mscore)
sex of | Summary of mscore
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Female | 1.2358277 1.5145048 2205
Male | 1.6194969 2.0091238 1908
------------+------------------------------------
Total | 1.4138099 1.7714565 4113
. tab gender, sum(wscore)
sex of | Summary of wscore
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Female | 1.3981859 1.4908056 2205
Male | 1.2368973 1.8666789 1908
------------+------------------------------------
Total | 1.3233649 1.677408 4113
So, women seem to have higher average knowledge scores than men on the question about signs of STD in women, and lower knowledge scores than men on the question about signs of STD in men. Is this knowledge gap correlated with any other individual-level variables?
corr mscore age if gender==0
| mscore age
-------------+------------------
mscore | 1.0000
age | 0.1229 1.0000
corr mscore age if gender==1
| mscore age
-------------+------------------
mscore | 1.0000
age | 0.2079 1.0000
More knowledge is associated with being older, and more strongly for men (r = 0.21) than for women (r = 0.12). What about education?
corr mscore educ if gender==0
| mscore educ
-------------+------------------
mscore | 1.0000
educ | 0.3082 1.0000
corr mscore educ if gender==1
| mscore educ
-------------+------------------
mscore | 1.0000
educ | 0.3120 1.0000
It's good to see that more education is correlated with a higher score on the knowledge test for signs of STD in men! This linear relationship is slightly stronger for men than for women, although this might be because men get more education than women. Could we find out whether men obtain more education than women on average?
tab gender, sum(educ)
| Summary of number of years in
sex of | school
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Female | 8.1032122 3.0844942 1899
Male | 8.0471272 3.6845547 1549
------------+------------------------------------
Total | 8.0780162 3.3669331 3448
We find that this suspicion is not confirmed; men have slightly less education on average, than women. To investigate the effect of education on mscore without the contaminating effects of gender, we need to run a multiple regression. We need to be able to control for the effects of gender when examining the effect of education on the score. What other variables do you think would be important for explaining an individual's score on this question?
We will include gender, age, age-squared, education, location and whether you have had any information about HIV.
replace hivinfo=. if hivinfo==7
replace hivinfo=0 if hivinfo==2
lab def yesno 0 "NO" 1 "YES"
lab val hivinfo yesno
gen age2=age*age
xi: reg mscore age age2 educ gender i.location hivinfo
Source | SS df MS Number of obs = 3236
-------------+------------------------------ F( 7, 3228) = 73.07
Model | 1445.79312 7 206.541874 Prob > F = 0.0000
Residual | 9124.31411 3228 2.82661528 R-squared = 0.1368
-------------+------------------------------ Adj R-squared = 0.1349
Total | 10570.1072 3235 3.26742109 Root MSE = 1.6813
------------------------------------------------------------------------------
mscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .07016 .0120684 5.81 0.000 .0464974 .0938226
age2 | -.0007176 .0001849 -3.88 0.000 -.0010802 -.0003551
educ | .1290284 .0099732 12.94 0.000 .1094739 .1485828
gender | .5141404 .0599439 8.58 0.000 .3966085 .6316723
_Ilocation_2 | -.1008049 .0802571 -1.26 0.209 -.2581648 .0565551
_Ilocation_3 | .0487271 .0718908 0.68 0.498 -.0922292 .1896833
hivinfo | .3013267 .0674745 4.47 0.000 .1690296 .4336239
_cons | -1.1434 .1776206 -6.44 0.000 -1.491661 -.7951396
------------------------------------------------------------------------------
Here, it appears that age, education, gender and hivinfo are all significantly related to how much you know about signs of STD in a man. The effects of education, gender and hivinfo are positive, as is the effect of age, although the negative coefficient on age2 means that the gain from each additional year of age shrinks as people get older. Being a man increases your score by 0.51 points, while having had some information about HIV increases your score by about 0.3 points. Living in an urban village is associated with a reduction in your score, although this coefficient is not statistically different from zero.
Do you think that someone who reports having an unusual discharge in the past 12 months would be likely to get a higher than average or lower than average score on this question? We can test this hypothesis, by including a dummy for discharge. In addition, it's plausible that someone who has heard about STD's is also more likely to score better on the question. By including a dummy variable for std, we can check whether this is the case.
replace std=. if std==7
replace std=0 if std==2
lab val std yesno
replace discharge=0 if discharge==2 lab val discharge yesno
xi: reg mscore age age2 educ gender i.location hivinfo std discharge
Source | SS df MS Number of obs = 2352
-------------+------------------------------ F( 9, 2342) = 72.86
Model | 1664.55046 9 184.950051 Prob > F = 0.0000
Residual | 5945.02054 2342 2.53843747 R-squared = 0.2187
-------------+------------------------------ Adj R-squared = 0.2157
Total | 7609.571 2351 3.23673799 Root MSE = 1.5932
------------------------------------------------------------------------------
mscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0050802 .0156432 0.32 0.745 -.0255956 .0357561
age2 | .0000858 .0002187 0.39 0.695 -.0003431 .0005147
educ | .0874022 .0105325 8.30 0.000 .0667483 .1080562
gender | .6330814 .067364 9.40 0.000 .5009821 .7651808
_Ilocation_1 | -.0790373 .0793899 -1.00 0.320 -.234719 .0766445
_Ilocation_2 | -.1604419 .0838224 -1.91 0.056 -.3248158 .003932
hivinfo | .2933726 .0781833 3.75 0.000 .1400569 .4466882
std | 1.572655 .1209486 13.00 0.000 1.335478 1.809833
discharge | .4511467 .1486412 3.04 0.002 .1596646 .7426287
_cons | -1.104994 .2149919 -5.14 0.000 -1.526588 -.6833998
------------------------------------------------------------------------------
The R-squared increases in this regression (from 0.137 to 0.219), suggesting that we explain more of the variation in the mscore variable using the set of variables including std and discharge than using the set of variables excluding them. Note, however, that the two regressions are estimated on different samples (3,236 versus 2,352 observations), so the comparison is only approximate. Individuals who report that they have heard of STD's score over 1.5 points higher than those who have not heard of STD's before. The two new variables are both significant at the 1% level, while age and age2 are no longer significant once they are included.
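One way to put two such models on an equal footing is to re-estimate the smaller model on exactly the observations used by the larger one, since the two regressions above are based on 3,236 and 2,352 observations respectively. The sketch below illustrates the idea with simulated data. The variables y, x1 and x2 and the flag insample are hypothetical and stand in for the survey variables.

```stata
* Simulated data; x2 is given missing values on purpose
clear
set seed 99
set obs 300
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + x1 + x2 + rnormal()
replace x2 = . if runiform() < 0.3

* Larger model: rows with missing x2 are dropped automatically
regress y x1 x2
scalar n_big  = e(N)
scalar r2_big = e(r2)
generate byte insample = e(sample)      // 1 for rows used in this regression

* Smaller model, restricted to the same rows
regress y x1 if insample
scalar n_small  = e(N)
scalar r2_small = e(r2)

display "N:  " n_small " and " n_big
display "R2: " %6.4f r2_small " versus " %6.4f r2_big
```

With both models fitted to the same rows, any difference in R-squared reflects the added variable and not a change in who is included.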
It is possible that the effect of some of the X-variables on your score is different, whether you are male or female, and whether we are considering the variable wscore or mscore. Let's create an interaction term between gender and education, to deal with this possibility for one X variable:
gen interact1=gender*educ
xi: reg mscore age age2 educ gender i.location hivinfo std discharge interact1
Source | SS df MS Number of obs = 2352
-------------+------------------------------ F( 10, 2341) = 65.66
Model | 1666.79608 10 166.679608 Prob > F = 0.0000
Residual | 5942.77492 2341 2.53856255 R-squared = 0.2190
-------------+------------------------------ Adj R-squared = 0.2157
Total | 7609.571 2351 3.23673799 Root MSE = 1.5933
------------------------------------------------------------------------------
mscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0050771 .0156435 0.32 0.746 -.0255995 .0357537
age2 | .0000912 .0002188 0.42 0.677 -.0003378 .0005203
educ | .0971307 .0147624 6.58 0.000 .068182 .1260795
gender | .7833041 .1733459 4.52 0.000 .4433767 1.123232
_Ilocation_1 | -.0817018 .0794423 -1.03 0.304 -.2374865 .0740829
_Ilocation_2 | -.1608324 .0838255 -1.92 0.055 -.3252124 .0035476
hivinfo | .293136 .0781856 3.75 0.000 .1398157 .4464562
std | 1.57158 .120957 12.99 0.000 1.334386 1.808774
discharge | .4491987 .1486593 3.02 0.003 .1576811 .7407162
interact1 | -.0176671 .0187841 -0.94 0.347 -.0545024 .0191681
_cons | -1.189945 .2331991 -5.10 0.000 -1.647243 -.7326465
------------------------------------------------------------------------------
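Here, the interaction term is negative, meaning that the total impact of an extra year of education on your score if you are male is .0971307 - .0176671 = .0794636, compared with .0971307 if you are female. Taken at face value, an extra year of education adds more to a woman's score than to a man's score on the question about signs of STD's in men. However, the interaction coefficient is not statistically significant (p = 0.347), so we cannot conclude from this regression that the two slopes really differ.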
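Adding the main effect and the interaction coefficient by hand gives the slope for men but no standard error. A possible way to get both is lincom, run directly after the regression. The sketch below uses simulated data with made-up coefficients, with gender assumed to be coded 1 for men and 0 for women as in this example. The variable score is a hypothetical stand-in for mscore.

```stata
* Simulated data; all values are made up for illustration only
clear
set seed 12345
set obs 500
generate gender    = runiform() < 0.5        // 1 = male, 0 = female
generate educ      = floor(13*runiform())    // 0 to 12 years of schooling
generate interact1 = gender*educ
generate score     = 0.5 + 0.10*educ + 0.6*gender - 0.03*interact1 + rnormal()

regress score educ gender interact1

* Slope of education for men = coefficient on educ + coefficient on interact1
lincom educ + interact1
scalar male_slope = r(estimate)
display "education slope for men: " %7.4f male_slope
```

The lincom output also reports a confidence interval for the combined slope, which is more informative than the point estimate alone when the interaction term is imprecisely estimated.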
Do you think we would observe the reverse relationship if we were investigating wscore?
xi: reg wscore age age2 educ gender i.location hivinfo std discharge interact1
Source | SS df MS Number of obs = 2352
-------------+------------------------------ F( 10, 2341) = 51.40
Model | 1241.22005 10 124.122005 Prob > F = 0.0000
Residual | 5653.21362 2341 2.41487126 R-squared = 0.1800
-------------+------------------------------ Adj R-squared = 0.1765
Total | 6894.43367 2351 2.93255367 Root MSE = 1.554
------------------------------------------------------------------------------
wscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0026582 .0152577 0.17 0.862 -.0272618 .0325781
age2 | .0000892 .0002134 0.42 0.676 -.0003293 .0005077
educ | .1022932 .0143983 7.10 0.000 .0740585 .1305279
gender | .3103542 .16907 1.84 0.067 -.0211883 .6418968
_Ilocation_1 | -.0108336 .0774828 -0.14 0.889 -.1627756 .1411084
_Ilocation_2 | -.1916175 .0817578 -2.34 0.019 -.3519428 -.0312923
hivinfo | .2382015 .076257 3.12 0.002 .0886632 .3877399
std | 1.438455 .1179734 12.19 0.000 1.207112 1.669798
discharge | .4702784 .1449924 3.24 0.001 .1859516 .7546052
interact1 | -.0492751 .0183208 -2.69 0.007 -.0852018 -.0133485
_cons | -.8026062 .2274468 -3.53 0.000 -1.248624 -.356588
------------------------------------------------------------------------------
The effect of one more year of education on your score if you are a man is thus = .1022932 -.0492751 = .0530181, which is still lower than the 0.10 point increase in score for a woman with one more year of education. The marginal effect of a year's worth of education on the knowledge of men about signs of STD's in males and females is lower than the marginal effect of a year's worth of education on the knowledge of women about these signs.
Sometimes, researchers think that the marginal effects of all variables on the dependent variable are likely to be different for men and women. We can generate the relevant coefficients for this flexible functional form by creating interaction terms for every variable, and including them in the regression as well. However, this is a very long-winded way to proceed, and hinders interpretation, so instead we will run our original regression over the separate samples of men and women:
For the question about signs of STD's in women:
(A)
xi: reg wscore age age2 educ i.location hivinfo std discharge if gender==0
Source | SS df MS Number of obs = 1360
-------------+------------------------------ F( 8, 1351) = 45.26
Model | 668.796355 8 83.5995443 Prob > F = 0.0000
Residual | 2495.24409 1351 1.84696083 R-squared = 0.2114
-------------+------------------------------ Adj R-squared = 0.2067
Total | 3164.04044 1359 2.32821225 Root MSE = 1.359
------------------------------------------------------------------------------
wscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0021618 .017557 -0.12 0.902 -.0366037 .0322801
age2 | .0002212 .0002444 0.91 0.366 -.0002582 .0007005
educ | .1104107 .0133668 8.26 0.000 .0841887 .1366327
_Ilocation_1 | .0087054 .0915275 0.10 0.924 -.170846 .1882568
_Ilocation_2 | -.0558838 .0912835 -0.61 0.541 -.2349566 .123189
hivinfo | .1474282 .0860742 1.71 0.087 -.0214254 .3162819
std | 1.3919 .1417423 9.82 0.000 1.113841 1.669959
discharge | .5356415 .1569342 3.41 0.001 .2277803 .8435027
_cons | -.8038681 .2487702 -3.23 0.001 -1.291886 -.3158503
------------------------------------------------------------------------------
(B)
xi: reg wscore age age2 educ i.location hivinfo std discharge if gender==1
Source | SS df MS Number of obs = 992
-------------+------------------------------ F( 8, 983) = 22.63
Model | 576.990063 8 72.1237579 Prob > F = 0.0000
Residual | 3132.96861 983 3.18715016 R-squared = 0.1555
-------------+------------------------------ Adj R-squared = 0.1487
Total | 3709.95867 991 3.74365153 Root MSE = 1.7853
------------------------------------------------------------------------------
wscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0121779 .0270297 0.45 0.652 -.0408647 .0652206
age2 | -.0001408 .0003807 -0.37 0.711 -.0008878 .0006062
educ | .0470636 .0160778 2.93 0.003 .0155129 .0786143
_Ilocation_1 | -.0307476 .133258 -0.23 0.818 -.2922505 .2307552
_Ilocation_2 | -.4094038 .1521044 -2.69 0.007 -.7078905 -.1109171
hivinfo | .3612295 .139291 2.59 0.010 .0878876 .6345714
std | 1.484963 .1992645 7.45 0.000 1.093931 1.875996
discharge | .394767 .2834684 1.39 0.164 -.1615057 .9510398
_cons | -.5658047 .3518007 -1.61 0.108 -1.256171 .1245619
------------------------------------------------------------------------------
For the question about signs of STD's in men:
(C)
xi: reg mscore age age2 educ i.location hivinfo std discharge if gender==0
Source | SS df MS Number of obs = 1360
-------------+------------------------------ F( 8, 1351) = 34.61
Model | 562.25936 8 70.28242 Prob > F = 0.0000
Residual | 2743.78476 1351 2.03092876 R-squared = 0.1701
-------------+------------------------------ Adj R-squared = 0.1652
Total | 3306.04412 1359 2.43270354 Root MSE = 1.4251
------------------------------------------------------------------------------
mscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0127081 .0184106 -0.69 0.490 -.0488246 .0234084
age2 | .0003497 .0002562 1.36 0.173 -.000153 .0008523
educ | .1121646 .0140167 8.00 0.000 .0846677 .1396615
_Ilocation_1 | -.0442112 .0959776 -0.46 0.645 -.2324926 .1440701
_Ilocation_2 | -.0103502 .0957218 -0.11 0.914 -.1981296 .1774293
hivinfo | .2862898 .0902592 3.17 0.002 .1092263 .4633533
std | 1.192142 .148634 8.02 0.000 .9005638 1.483721
discharge | .4550357 .1645645 2.77 0.006 .132206 .7778654
_cons | -.7638876 .2608656 -2.93 0.003 -1.275633 -.2521419
------------------------------------------------------------------------------
(D)
xi: reg mscore age age2 educ i.location hivinfo std discharge if gender==1
Source | SS df MS Number of obs = 992
-------------+------------------------------ F( 8, 983) = 39.63
Model | 1009.16253 8 126.145316 Prob > F = 0.0000
Residual | 3128.98969 983 3.18310243 R-squared = 0.2439
-------------+------------------------------ Adj R-squared = 0.2377
Total | 4138.15222 991 4.17573382 Root MSE = 1.7841
------------------------------------------------------------------------------
mscore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0266837 .0270126 0.99 0.323 -.0263253 .0796926
age2 | -.0002282 .0003804 -0.60 0.549 -.0009748 .0005183
educ | .0632624 .0160675 3.94 0.000 .0317318 .094793
_Ilocation_1 | -.184016 .1331734 -1.38 0.167 -.4453528 .0773207
_Ilocation_2 | -.4350578 .1520078 -2.86 0.004 -.7333549 -.1367607
hivinfo | .2756452 .1392025 1.98 0.048 .0024769 .5488135
std | 1.96974 .1991379 9.89 0.000 1.578956 2.360524
discharge | .4981576 .2832883 1.76 0.079 -.0577618 1.054077
_cons | -.7945734 .3515772 -2.26 0.024 -1.484501 -.1046453
------------------------------------------------------------------------------
Let's concentrate on the variables which are statistically significant at the 1% or 5% level. Here, education has a much smaller effect on the score of men than women, for both of the score variables. This confirms what we saw earlier in the model with one interaction term for gender*education: that the marginal effect of education on the male scores is smaller than the marginal effect of that same year of education for the female scores.
Having heard about STD's (std) increases the scores of men more than women when it comes to information about the symptoms in both sexes (compare equations (C) with (D) and (A) with (B)). Perhaps this implies that the men receive better quality of information than women. It might also imply that the form in which information about STD's was conveyed allowed men to more easily absorb these facts.
However, being a male living in an urban village significantly reduces your score on both variables, whereas this informational gap between urban village-dwelling and other-dwelling individuals does not seem to be present for women.
This set of regressions that we have run indicates that many variables could have different effects for different groups of individuals: in this case, men and women. Sometimes, these differences may be captured in interaction terms, whereas at other times, we may want to specify completely separate models for each of these groups.
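The link between the two strategies can be seen in a small sketch: when every regressor is interacted with gender, the interacted model reproduces the slopes from the separate regressions. The data below are simulated with made-up coefficients, and score and genXeduc are hypothetical names. With more covariates, each one would need its own interaction term for the match to be exact.

```stata
* Simulated data; all values are made up for illustration only
clear
set seed 2024
set obs 400
generate gender   = runiform() < 0.5         // 1 = male, 0 = female
generate educ     = floor(13*runiform())     // 0 to 12 years of schooling
generate score    = 1 + 0.11*educ + gender*(0.3 - 0.05*educ) + rnormal()
generate genXeduc = gender*educ

* Separate regressions for women and for men
regress score educ if gender==0
scalar b_women = _b[educ]
regress score educ if gender==1
scalar b_men = _b[educ]

* One regression with gender interacted with every regressor
regress score educ gender genXeduc
scalar b_women_int = _b[educ]
scalar b_men_int   = _b[educ] + _b[genXeduc]

display "women: " %7.4f b_women " (separate)  " %7.4f b_women_int " (interacted)"
display "men:   " %7.4f b_men   " (separate)  " %7.4f b_men_int   " (interacted)"
```

The coefficients match, but the standard errors generally do not, because the interacted model assumes a single residual variance for both groups while the separate regressions estimate one for each.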
EXERCISES
- Is the correlation between age and age at first sex different for individuals with different religious beliefs?
- Exercise 1 Answer
- What is the correlation between level of education and whether an individual has ever heard or seen information about HIV?
- Exercise 2 Answer
- What is the average difference in the age of first partner for men and women? Think about how you would answer this question using the tab, sum command and using simple regression?
- Exercise 3 Answer
- What is the relationship between the number of people in the household and the number of rooms in the house? If you had to run a regression (using the entire data set, not just those individuals in the individual questionnaire), what would be your dependent variable? Why?
- Exercise 4 Answer
- Let's suppose that the number of people in a family determines the size of the house that it lives in. If so, an additional person is likely to make a family acquire how many more rooms (on average)? Please show the graph for this relationship.
- Exercise 5 Answer
- How is the size of a house (in terms of number of rooms) affected by family size and whether the head of the household works or not? Do these variables significantly explain/predict changes in total school expenditure? Why or why not? [You will have to create a variable for 'whether the head of the household works', and remember to keep only one observation per household in your regression.]
- Exercise 6 Answer