TABLE OF CONTENTS
Introduction
Correlation of Variables
Outliers
Simple Regression
Understanding Regression Output Tables
Graphing the Regression Equation
Putting It All Together
Exercises
INTRODUCTION
In Module 5, we learned methods using Stata that allowed us to
determine whether two variables were statistically related or independent of one
another. While this is indeed important, it is often necessary to take our
analysis a few steps further to determine the actual relationship between
variables.
In this module, we will cover the first two methods commonly used to determine
the relationship between two variables. The first is correlation analysis, which
simply measures the strength or degree of association between two continuous
variables. The second is simple regression analysis, which allows us to
determine how one variable changes in relation to the change in another
variable. Multiple regression, which lets us explain how one variable changes in
response to a change in another variable while keeping all other relevant
variables constant, follows once simple regression is understood.
In general in regression analysis, we are interested in causal relationships:
whether variable X has an effect on variable Y. As such, it is often useful to
think of variable X as the "independent" or "explanatory" variable and to think
of variable Y as the "dependent" variable or as the "effect".
To motivate the questions and examples in this module we will focus on a
specific policy topic.
The policy questions which this module will focus on deal with risky sexual
behavior: specifically, the age at first sex. The UN AIDS report on Botswana
points out that indicators of sexual behavior amongst young people are
particularly important for AIDS programs, as these individuals are more amenable
to behavioral change than adults. In constructing a baseline picture of behavior
in Botswana in 2001, we might want to know what variables are important in
affecting the choices young people make about when to start having sex, or how
long to remain celibate. For example, we could be interested in:
- whether the age at first sex has been increasing or decreasing in successive cohorts
- whether more highly educated people have sex earlier or later
- whether literacy (which enables the person to read information about HIV/AIDS and STDs rather than just hearing this information from others) affects the age at first sex.
In this module, we will concentrate on the relationship between age at first sex
(sexage) and current age (age). Take a minute to think about these two
variables. Which one do you think is the independent variable? How about the
dependent variable? Remember that the independent variable is the variable that
is likely to "cause" or help "explain" the dependent variable. In this case, we
are predicting that the age at which you chose to start having sex depends on
your current age, or rather your age cohort. It is certainly plausible that
individuals in different cohorts (age groups) face different 'norms' about what
is an acceptable age to start having sex. In addressing questions about whether
young people are changing behavior in response to increased awareness about
HIV/AIDS, we might want to know whether younger cohorts have increased the age
at which they first have sex. It is of course much less plausible that the age
at which you first had sex affects your current age cohort. Thus, age is our
independent variable, and sexage is our dependent variable.
For now, let's concentrate on the first method we mentioned, correlation
analysis. Then we will proceed to simple regression.
CORRELATION OF VARIABLES
Consider this statement: "Someone who is currently 50 years old probably had sex
first when they were much older than someone who is currently 30 years old." In
certain contexts - e.g. 30 years ago, in societies with very conservative norms
about appropriate sexual behavior - this might be a reasonable statement to
make. However, being the researchers that we are, we want to confirm our
intuition with empirical facts. Since we are dealing with two continuous
variables and we presume a linear relationship, the appropriate measure of
association is a Pearson correlation, which in Stata we perform with the
correlate command (or corr for short).
The Pearson correlation measures the degree to which variables are related or,
in other words, the degree to which they co-vary. When using correlation in our
analysis, we must make the
assumption that the relationship between our two variables is linear. If we
suspect otherwise, we should make the proper adjustments to the variable that
does not meet the assumption (we will cover this in more detail later). Overall,
the initial use of the correlate command in Stata is a good way to start
investigating whether your intuition about a relationship is remotely correct.
What would you expect about the relationship between age at first sex and
current age, or age group? Will the relationship be strongly positive, strongly
negative, or very weak? Will the relationship be linear or non-linear?
Make sure that you have opened the BAIS data file. Now type:
corr sexage age
Stata produces the following results:
. corr sexage age
(obs=2380)
             |   sexage      age
-------------+------------------
      sexage |   1.0000
         age |   0.2828   1.0000
What does the output mean? A correlation value can range from -1 to +1,
with 0 indicating that there is no linear association and ±1 being a perfect
linear association. Technically speaking, if the correlation value is low (near
0), it does not necessarily mean that there is no association whatsoever, but
rather that there is no LINEAR association.
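To see why a correlation near 0 does not rule out a relationship, the short sketch below may help. It uses simulated data rather than the BAIS file, and the variable names x and y are made up for the illustration. Here y is determined exactly by x, yet because the relationship is U-shaped rather than linear, the Pearson correlation comes out at essentially zero.

```stata
* Illustration only: simulated data, not the BAIS file
* Set the data currently in memory aside so that nothing is lost
preserve
clear
set obs 101
* x runs from -50 to 50
generate x = _n - 51
* y depends on x exactly, but not linearly
generate y = x^2
* The correlation is essentially zero
corr x y
* The scatter plot shows the U-shaped relationship that corr misses
scatter y x
* Bring back the data we set aside
restore
```

This is the reason it is worth looking at a scatter plot as well as the correlation value.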
A correlation value of 0.2828, as in our results above, is positive but fairly
weak. This means that the linear association between our two variables is not
very strong. As the values of age increase, so do the sexage values, on
average. Put more plainly, older respondents tend to report having had first
sexual intercourse at a slightly older age. The interpretation that older
generations have more conservative norms about what is appropriate sexual
behavior seems to be somewhat borne out in this result, although the linear
association is admittedly weak.
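The corr output does not tell us whether a correlation is statistically distinguishable from zero. As a small sketch, the pwcorr command reports the same coefficient and, with the sig and obs options, adds the significance level and the number of observations used:

```stata
* Pairwise correlation with its significance level and number of observations
pwcorr sexage age, sig obs
```

With only two variables, pwcorr and corr use the same observations, so the coefficient itself should match the corr result.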
The correlation approach is a very simple first step in investigating
the statement: "Someone who is currently 50 years old probably had sex first
when they were much older than someone who is currently 30 years old." Our
Stata correlation estimate suggests that the direction of this statement is
right, although a correlation this weak says little about whether the
difference is large. Stata can do much more. We can go further and estimate by
how much age at first sex changes with current age. To do this, we will call
upon the regress command (or reg for short).
Before we move on, however, try the following questions:
- What does it mean when two variables render a correlation of 0.5000?
- What is the correlation between age at first sex and years of education?
OUTLIERS
Before we continue on to simple regression analysis, it is a good idea to spend
a few minutes reviewing the issue of outliers again, as well as the
often-encountered issue of answers coded as 'not applicable' or 'did not
respond'.
Firstly, as a matter of cleaning the data, we need to be sure that there are no
answers coded as negative values or very high values - these are often for 'not
applicable' or 'did not respond' answers. If you get your data in raw format,
you are bound to run into many instances of such coding. If observations such as
these are not set to missing, their values will disturb any relationship we want
to measure.
Secondly, we must be extremely mindful of possible outliers and their adverse
effects on the relationship we observe between two continuous variables. This is
particularly true when using methods that rely on the mean of any given
variable, as is the case in both correlation and regression analysis. As we
saw in an earlier module, means are extremely sensitive to outliers, whether
the extreme values are unusually high or unusually low. Therefore, we will
spend some time investigating how our two variables age and sexage are
distributed. The quickest way to do that is to graph these variables in one
scatter plot. Let's try it:
keep if rec_per==1
First, let's use the command above to restrict the data to only the respondents
who have answered the individual questionnaire. This will simplify things for
us. Now we are free to create our scatter plot using:
scatter sexage age

Figure 1
Or, we can get a bit more sophisticated and try a few new options:
scatter sexage age, ylabel(5(5)55) ytick(5(5)55) xlabel(5(5)64) xtick(5(5)64)

Figure 2
Both scatter plots display the same information; however, the second one is easier to read. The additional options:
[ylabel(5(5)55) ytick(5(5)55) xlabel(5(5)64) xtick(5(5)64)]
tell Stata to place labels and tick marks every 5 units along the y- and
x-axes, which gives us a more detailed display of both axes. From the
additional information provided by this new graph, we can quickly see that most
data points are clustered together.
However, there is at least one dot which seems out of place: the one at
about 17 on the x-axis (age) and 52 on the y-axis (sexage). This is probably a miscoded piece of information. Note that logically, it should not be possible to record
an age at first sex greater than your current age.
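Rather than reading suspicious points off the graph, we can ask Stata to find them. The sketch below counts and lists the respondents whose recorded age at first sex is greater than their current age. The extra !missing(sexage) condition is there because Stata treats a missing value as larger than any number, so without it every respondent with a missing sexage would be listed as well.

```stata
* How many respondents report an age at first sex above their current age?
count if sexage > age & !missing(sexage)
* Show those cases
list sexage age if sexage > age & !missing(sexage)
```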
To see the effect of this observation on the mean of sexage, type:
sum sexage
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      sexage |      2380    18.69958    3.282816          7         51
The mean is 18.69958. We will see that this statistic is slightly distorted by
the outlier observation. Note that if we had a set of observations coded as 99
(e.g. that code could be for 'Did not answer'), these observations would
similarly disturb the mean of sexage and we would have
to set them to missing before continuing.
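As a sketch of how such a code could be set to missing, suppose (hypothetically, this is not a statement about how the BAIS data are actually coded) that 99 stood for 'Did not answer' in sexage. The mvdecode command converts a numeric code into Stata's missing value:

```stata
* Hypothetical example: treat the code 99 in sexage as missing
mvdecode sexage, mv(99)
* The same thing can be done with replace:
* replace sexage = . if sexage == 99
```

Always check the codebook before doing this, since a value such as 99 may be a legitimate answer for some variables.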
Now we can remove the most obvious outliers: the cases where the recorded age
at first sex is greater than the respondent's current age. Since each of
these outliers is probably moving the mean away from the median, we will remove
these cases and recalculate the graphs and the correlation between sexage and
age. These observations, as well as the ones which have missing values, will
provide us with no extra information about the relationship between age and
sexage. To set these values to missing, type:
replace sexage=. if sexage>age
Remember that after we are done with this exercise, we must reload the original
data set to recover the values we have just set to missing. Unless you want to
keep these changes permanently, you should NOT save the data over the original
data file.
Now we can proceed with the calculations. To do so we type:
corr sexage age
             |   sexage      age
-------------+------------------
      sexage |   1.0000
         age |   0.2945   1.0000
Then we type:
scatter sexage age, ylabel(5(5)50) ytick(5(5)50) xlabel(5(5)64) xtick(5(5)64)

Figure 3
The new correlation is slightly stronger than the previous one, but not
substantially so. The scatter plot indicates a positive relationship, although
it is still weak and does not seem to be entirely linear. Do you think we have
cleaned up enough of the outliers?
To check what the mean of sexage looks like with these observations removed,
type:
sum sexage
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      sexage |      2378    18.68587    3.216644          7         50
The mean is now 18.68587, which is slightly lower than before.
Formally testing for outliers
Some fields in social research embrace an active approach to the
handling of outliers, whereas others take a more hands-off approach. Neither
approach is superior to the other; after all, both are efforts to minimize
the effects of extreme values. The active approach controls for the ill effects
by eliminating cases from the models, whereas the hands-off approach often uses
more robust estimation procedures, which can handle extreme values in the data.
For our purposes, we will only eliminate the missing values as well as the most
obvious outliers, for two reasons: 1) an in-depth study of how to formally handle
outliers is beyond the scope of this course, and 2) although we advocate the
use of more robust procedures to handle possible outliers, those procedures are
also beyond the scope of this course. Therefore, we will stay on the middle
ground and only eliminate the most obvious outliers for our regression models.
SIMPLE REGRESSION
Simple OLS regression (Ordinary Least Squares regression) is a procedure that
determines the best-fitting straight line between two variables. In essence,
the OLS regression line is the line that minimizes the sum of squared errors,
that is, the squared vertical distances between the observed values of the
dependent variable and the line. It is beyond the scope of this website to teach you the finer
points and intricacies of regression analysis; however, we will provide useful
examples to give you a feel for what it is in general. Our main purpose here
will be to show you how to use Stata to calculate the regression line between
two variables and how to interpret the results. If you are not clear on what
exactly regression is or would like to have a deeper understanding of it, we
suggest that you take a course in statistics as it relates to your field of
interest.
In general, the simplest relationship between an independent and dependent
variable can be expressed in the linear formula,
Y = a + bX
where Y is the dependent variable and X is the independent variable. The
coefficient "b" is referred to as the slope and tells us how a 1 unit change in
X will change the value of Y. The coefficient "a" tells us the value of Y when
the independent variable X is zero. On an X-by-Y graph, the coefficient "a" is
where the regression line crosses the y-axis.
In the case of sexage and age, the equation can be written as follows:
sexage = a + b(age)
This equation suggests that there is a linear relationship between our two
variables. If we find a positive b coefficient, our equation will suggest that
as age increases by 1 unit, there will be a corresponding increase of b in age
at first sex (sexage); if we find a negative b coefficient, our equation will
suggest that as age increases by 1 unit, there will be a corresponding decrease
in age at first sex.
First RE-OPEN the data in its original format:
use bais.dta, clear
keep if rec_per==1
Remember that we can type help followed by the name of a command to learn more
about that command in Stata. If we type
help regress
we will get a full description of the regression command, its options and its
syntax. For our purposes we need to type:
reg sexage age
Stata gives us the results table below:
      Source |       SS       df       MS              Number of obs =    2380
-------------+------------------------------           F(  1,  2378) =  206.79
       Model |   2051.1011     1   2051.1011           Prob > F      =  0.0000
    Residual |  23587.0985  2378  9.91888078           R-squared     =  0.0800
-------------+------------------------------           Adj R-squared =  0.0796
       Total |  25638.1996  2379  10.7768809           Root MSE      =  3.1494

------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0778464   .0054135    14.38   0.000     .0672308     .088462
       _cons |   16.09526   .1922678    83.71   0.000     15.71823    16.47229
------------------------------------------------------------------------------
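For readers who like to see where the slope comes from: in a simple regression, the OLS slope equals the covariance between the two variables divided by the variance of the independent variable. The sketch below checks this against the regress result. It relies on the results that correlate stores when the covariance option is used (r(cov_12) for the covariance and r(Var_2) for the variance of the second variable listed, here age).

```stata
* Slope by hand: covariance(sexage, age) / variance(age)
quietly correlate sexage age, covariance
display r(cov_12) / r(Var_2)
* Slope as estimated by regress, for comparison
quietly regress sexage age
display _b[age]
```

The two numbers displayed should agree, since both commands use the observations where sexage and age are both non-missing.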
UNDERSTANDING REGRESSION OUTPUT TABLES
What do all these numbers mean? The output shown here is actually three tables
in one. There is a small table in the upper left, a list of information in the
upper right, and a larger table across the bottom. The smaller table in the
upper left-hand corner is called the analysis of variance (ANOVA) table. We are
not particularly interested in this portion of the results here.
For the purposes of understanding the basic relationship between sexage and
age, we will focus on three pieces of information provided by the output above.
First, let's remember the basic linear regression equation:
Y = a + bX
or in our case:
sexage = a + b(age)
If we plug the results into their appropriate spot in the equation, we get:
(predicted sexage)i = 16.09526 + .0778464(age)i
In words, this equation is telling us that for every one-year increase in
age, the predicted age at first sex (sexage) increases by about 0.08 of a year,
or roughly one month. This increase is statistically significant, as indicated
by the 0.000 probability (P>|t|) associated with this coefficient. In addition,
the constant (_cons) tells us that when our independent variable age equals
zero, the predicted age at first sex is 16.095. No respondent is anywhere near
age zero, so the constant is best read as the point that anchors the line
rather than as a meaningful prediction. The other important piece of
information is the R-squared, which equals 0.08. In essence, this value tells
us that the age variable accounts for about 8% of the variation of sexage
around its mean.
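In a simple regression with one independent variable, the R-squared is simply the square of the Pearson correlation between the two variables. With the numbers above, 0.2828 squared is approximately 0.0800. The following sketch lets Stata confirm this from its stored results:

```stata
* R-squared stored by the most recent regression
quietly regress sexage age
display e(r2)
* Square of the Pearson correlation between the same two variables
quietly correlate sexage age
display r(rho)^2
```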
The Case of Simple Regression
Now we can use this formula to calculate predicted values of sexage for any
given value of age.
So, looking at the regression results table above, we arrive at:
(predicted sexage)i = 16.09526 + .0778464(age)i
What does this equation really tell us? Suppose we were interested in estimating
the age at which someone who is currently 40 years old first had sex. Using the
equation above, we plug in 40 for (age)i and solve for the resulting
(predicted sexage)i.
In this case, we predict that an individual currently aged 40 years would have
first had sex at the age of 19.2091. It is important to realize that a
regression equation will never perfectly fit the observed values. Therefore, the
value of sexage that our calculation gives is just that, a prediction. That is
why we place the word "predicted" in front of the dependent variable sexage.
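The calculation for a 40 year old can be done in Stata itself. The sketch below does it twice: first by typing the coefficients from the table, and then by using the coefficients Stata keeps in memory after a regression (_b[_cons] and _b[age]), which avoids retyping and rounding. Both should give a value of about 19.21.

```stata
* By hand, using the coefficients from the regression table
display 16.09526 + .0778464*40
* Using the coefficients stored by the most recent regression
quietly regress sexage age
display _b[_cons] + _b[age]*40
```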
A useful step after any regression is to create a variable in Stata
that equals the predicted value of your dependent variable given your
independent variable(s). We use the predict command to calculate each predicted
sexage. The predict command works from the most recently estimated regression,
so it must be run after the regression command whose predictions we want. Thus,
we would type the following:
reg sexage age
predict fsexhat
Note that we named our new variable fsexhat,
which includes the suffix "hat" as part of the new name. This is a common
practice because the hat sign, ^, in regression equations, is often used to
indicate estimated values. Let's see what this new estimated variable looks like.
Type:
list sexage fsexhat
Here is a partial view of what the resulting table should look like:
     +-------------------+
     | sexage    fsexhat |
     |-------------------|
  1. |      .   19.59835 |
  2. |     22   18.89773 |
  3. |      .   20.45466 |
  4. |      .   19.67619 |
  5. |      .   17.49649 |
     |-------------------|
  6. |      .    17.3408 |
  7. |      .   17.10726 |
  8. |     15   18.11926 |
  9. |     17   18.19711 |
 10. |     19    18.3528 |
     |-------------------|
 11. |      .   18.11926 |
 12. |     18   18.89773 |
 13. |     18   17.57434 |
 14. |      .   18.04142 |
 15. |     20    18.5085 |
     |-------------------|
 16. |     19   17.80788 |
 17. |      .   17.49649 |
 18. |     21   18.43065 |
 19. |     22   18.43065 |
 20. |     30   19.83188 |
     |-------------------|
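You can readily see that none of our predictions is exactly correct. Notice also that Stata calculates a predicted value for every respondent whose age is known, even when sexage itself is missing. Nevertheless, the regression results tell us that knowing an individual's age lets us account for about 8% of the variation in sexage that remains when we simply guess 18.70, the sample mean of sexage, for everyone.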
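To see how far off each prediction is, we can ask predict for the residuals, that is, the observed sexage minus the predicted sexage. In the sketch below, fsexres is just a name chosen for this illustration.

```stata
* Re-run the regression so that predict has results to work from
quietly regress sexage age
* Residual = observed sexage minus predicted sexage
predict fsexres, residuals
* The residuals average out to (essentially) zero
summarize fsexres
* Compare observed values and residuals for the first ten respondents
list sexage fsexres in 1/10
```

The standard deviation of the residuals should be close to the Root MSE reported in the regression output, which gives a sense of the typical size of our prediction errors.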
Before we move on, let's try another practice question:
- How much would you expect the age of first sex to change by as education increases?
GRAPHING THE REGRESSION EQUATION
Having obtained a predicted value of the dependent variable sexage, we can plot
the fitted relationship on top of the scatter plot. In this instance, the
command would be:
graph twoway scatter sexage age || line fsexhat age, ylabel(5(5)55) ytick(5(5)55) xlabel(5(5)64) xtick(5(5)64)

Figure 4
As you can see, the graph above is very similar to the earlier scatter plots.
The difference now is that we have added the regression line.
Do you see a problem here? Remember our conversation about outliers? Let's put
all of our newly acquired knowledge to use.
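As an aside, Stata can also draw the fitted line without a predicted variable being created first. In the sketch below, the lfit plot type estimates the regression of sexage on age itself and overlays the resulting line on the scatter plot:

```stata
* Scatter plot with the fitted regression line drawn by lfit
twoway (scatter sexage age) (lfit sexage age), ylabel(5(5)55) xlabel(5(5)64)
```

Creating the predicted variable with predict is still worthwhile when we want to inspect or reuse the predicted values themselves.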
PUTTING IT ALL TOGETHER
As we decided above, we should only drop the most obvious
outliers. First, let's reload our data to make sure we have all the original
cases, and then clean up the outliers. Remember also that we only want to keep
information from people who answered the individual level questionnaire. We can
do this by typing:
use bais.dta, clear
keep if rec_per==1
replace sexage=. if sexage>age
Then let's calculate the new regression line. What difference does the removal
of the outliers make? Let's find out.
reg sexage age
predict fsexhat
      Source |       SS       df       MS              Number of obs =    2378
-------------+------------------------------           F(  1,  2376) =  225.71
       Model |  2133.62661     1  2133.62661           Prob > F      =  0.0000
    Residual |  22460.7186  2376  9.45316441           R-squared     =  0.0868
-------------+------------------------------           Adj R-squared =  0.0864
       Total |  24594.3452  2377  10.3468007           Root MSE      =  3.0746

------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0794569   .0052888    15.02   0.000     .0690857    .0898281
       _cons |    16.0266   .1879012    85.29   0.000     15.65814    16.39507
------------------------------------------------------------------------------
Removing the outliers has made some difference. The new R-squared
increases from 8% to almost 9% (still not very high, which we expected given the
correlation analysis we performed above). Our new slope (0.0795, up from
0.0778) should also be more accurate now that it is not being unduly influenced
by the implausible observations. The constant is
slightly lower at 16.0266.
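To view the two sets of results side by side rather than scrolling between tables, one option is to store each regression and then tabulate them together. The sketch below starts again from the original data; m_raw and m_clean are simply names chosen for this illustration.

```stata
* Regression on the original data
use bais.dta, clear
keep if rec_per==1
quietly regress sexage age
estimates store m_raw
* Regression after setting the implausible values to missing
replace sexage = . if sexage > age
quietly regress sexage age
estimates store m_clean
* Coefficients, standard errors, sample size and R-squared side by side
estimates table m_raw m_clean, se stats(N r2)
```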
Let's now plot the new graph using our newly constructed fsexhat variable:
scatter sexage age || lfit fsexhat age, ylabel(5(5)50) ytick(5(5)50) xlabel(5(5)64) xtick(5(5)64)

Figure 5
Spend a few minutes studying these new results. What did we learn about the
relationship between sexage and age? How did it change from before? Is this
the best we can do?
In fact, this is not the best we can do! Can you think of any other variables
that could play a significant role in this bivariate relationship? How about
gender? How about the level of education a person has, or whether they are
literate (because some people may not have completed schooling yet)? Each of
those variables is likely to alter our initial simple regression relationship
between sexage and age, because this relationship is likely to depend on
gender, education, literacy and other variables.
This type of analysis, however, requires more than simple regression between two
variables; it requires what is known as multiple regression. We turn to multiple
regression next. But first, try your new knowledge on the exercise questions
below. Make sure you understand simple regression before you move on to the
more complex multiple regression.
EXERCISES
So now that we have learned quite a bit about regression analysis, it's time to put our knowledge to the test!
1.) What is the correlation between how old a person is and their level of education? Are older people more or less educated than younger people?
2.) We might think that those who have more years of education would choose to postpone the age of first sex, as they are more likely to be aware of various STDs and other risks. Investigate this hypothesis by regressing sexage on educ. Is there a statistically significant relationship between age of first sex and level of education?