
SIMPLE REGRESSION ANALYSIS

 

TABLE OF CONTENTS

Introduction
Correlation of Variables
Outliers
Simple Regression
Understanding Regression Output Tables
Graphing the Regression Equation
Putting It All Together
Exercises

 

 

 

 

 

 

INTRODUCTION

In Module 5, we learned methods using Stata that allowed us to determine whether two variables were statistically related or independent of one another. While this is indeed important, it is often necessary to take our analysis a few steps further to determine the actual relationship between variables.

In this module, we will cover two methods commonly used to describe the relationship between two variables. The first is correlation analysis, which measures the strength of the linear association between two continuous variables. The second is simple regression analysis, which allows us to determine how one variable changes in relation to a change in another variable. Multiple regression, which lets us explain how one variable changes in response to a change in another while holding all other relevant variables constant, is the subject of Module 7.

In general in regression analysis, we are interested in causal relationships: whether variable X has an effect on variable Y. As such, it is often useful to think of variable X as the "independent" or "explanatory" variable and to think of variable Y as the "dependent" variable or as the "effect".

To motivate the questions and examples in this module, we will focus on a specific policy topic: risky sexual behavior and, specifically, the age at first sex. The UN AIDS report on Botswana points out that indicators of sexual behavior amongst young people are particularly important for AIDS programs, as these individuals are more amenable to behavioral change than adults. In constructing a baseline picture of behavior in Botswana in 2001, we might want to know which variables are important in affecting the choices young people make about when to start having sex, or how long to remain celibate. For example, we could be interested in:

  1. whether the age at first sex has been increasing or decreasing in successive cohorts
  2. whether more highly educated people have sex earlier or later
  3. whether literacy (which enables the person to read information about HIV/AIDS and STDs rather than just hearing this information from others) affects the age at first sex.

In this module, we will concentrate on the relationship between age at first sex (sexage) and age. Take a minute to think about these two variables. Which one do you think is the independent variable? How about the dependent variable? Remember that the independent variable is the variable that is likely to "cause" or help "explain" the dependent variable. In this case, we are predicting that the age at which you chose to start having sex depends on your current age, or rather your age cohort. It is certainly plausible that individuals in different cohorts (age groups) face different 'norms' about what is an acceptable age to start having sex. In addressing questions about whether young people are changing behavior in response to increased awareness about HIV/AIDS, we might want to know whether younger cohorts have increased the age at which they first have sex. It is of course much less plausible that the age at which you first had sex affects your current age cohort - thus, age is our independent variable, and sexage is our dependent variable.

For now, let's concentrate on the first method we mentioned, correlation analysis. Then we will proceed to simple regression.

 

 

CORRELATION OF VARIABLES

Consider this statement: "Someone who is currently 50 years old probably had sex first when they were much older than someone who is currently 30 years old." In certain contexts - e.g. 30 years ago, in societies with very conservative norms about appropriate sexual behavior - this might be a reasonable statement to make. However, being the researchers that we are, we want to confirm our intuition with empirical facts. Since we are dealing with two continuous variables and we presume a linear relationship, the appropriate measure of association is a Pearson correlation, which in Stata we perform with the correlate command (or corr for short).

The Pearson correlation measures the degree to which variables are related or in other words, the degree to which they co-vary. When using correlation in our analysis, we must make the assumption that the relationship between our two variables is linear. If we suspect otherwise, we should make the proper adjustments to the variable that does not meet the assumption (we will cover this in more detail later). Overall, the initial use of the correlate command in Stata is a good way to start investigating whether your intuition about a relationship is remotely correct.

What would you expect about the relationship between age at first sex and current age, or age group? Will the relationship be strongly positive, strongly negative, or very weak? Will the relationship be linear or non-linear?

Make sure that you have opened the BAIS data file. Now,

corr sexage age

Stata produces the following results:

. corr sexage age
(obs=2380)

             |   sexage      age
-------------+------------------
      sexage |   1.0000
         age |   0.2828   1.0000

What does the output mean? A correlation value can range from -1 to +1, with 0 indicating that there is no linear association and ±1 indicating a perfect linear association. Technically speaking, if the correlation value is low (near 0), it does not necessarily mean that there is no association whatsoever, but rather that there is no LINEAR association.

A correlation value of 0.2828, as in our results above, is positive but fairly weak. This means that the linear association between our two variables is not very strong: as the values of age increase, the sexage values tend to increase as well. Put more plainly, older individuals tend to report having first had sex at a slightly older age. The interpretation that older generations have more conservative norms about what is appropriate sexual behavior seems to be somewhat borne out by this result, although the linear association is admittedly weak.
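
If you would like a little more detail than corr provides, the following is a minimal sketch using two other built-in Stata commands on the same variables. pwcorr reports the same Pearson coefficient, and its obs and sig options add the number of observations and a significance level; spearman computes a rank-based correlation that does not rely on the relationship being linear.

* Pearson correlation with the number of observations and a significance level
pwcorr sexage age, obs sig

* Rank-based (Spearman) correlation, which does not assume linearity
spearman sexage age

If the Spearman value were very different from the Pearson value, that would be one hint that the relationship is not well described by a straight line.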

The correlation approach is a very simplistic initial approach to investigating the statement: "Someone who is currently 50 years old probably had sex first when they were much older than someone who is currently 30 years old." Our Stata correlation estimate is consistent with this statement, although the association is weak. Stata can do much more. We can go further and estimate by how much age at first sex changes with current age. To do this, we will call upon the regress command (or reg for short).

Before we move on, however, try the following questions:

  1. What does it mean when two variables render a correlation of 0.5000?

     Question 1 Answer

  2. What is the correlation between age at first sex and years of education?

     Question 2 Answer

 

 

 

OUTLIERS

Before we continue on to simple regression analysis, it is a good idea to spend a few minutes reviewing the issue of outliers again, as well as the often-encountered issue of answers coded as 'not applicable' or 'did not respond'.

Firstly, as a matter of cleaning the data, we need to be sure that there are no answers coded as negative values or very high values - these are often for 'not applicable' or 'did not respond' answers. If you get your data in raw format, you are bound to run into many instances of such coding. If observations such as these are not set to missing, their values will disturb any relationship we want to measure.
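
As a hedged illustration only: suppose the codebook told us that 99 in sexage meant 'did not respond' and that negative values meant 'not applicable'. These codes are assumptions made for the example, not the documented BAIS coding, so always check the codebook first. Setting such codes to missing might then look like this:

* Count suspicious values first (the cut-offs here are hypothetical)
count if sexage < 0 | (sexage >= 90 & sexage < .)

* Set the assumed 'did not respond' code to missing
mvdecode sexage, mv(99)

* Set any negative codes to missing
replace sexage = . if sexage < 0

The condition sexage < . in the count command keeps values that are already missing from being counted, because Stata treats missing as larger than any number.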

Secondly, we must be extremely mindful of possible outliers and their adverse effects on the relationship we observe between two continuous variables. This is particularly true when using methods that rely on the mean of any given variable, as is the case in both correlation and regression analysis. If we remember from an earlier module, means are extremely sensitive to outliers, whether positively or negatively skewed. Therefore, we will spend some time investigating how our two variables age and sexage are distributed. The quickest method to accomplish that is to graph these variables in one scatter plot. Let's try it:
 

keep if rec_per==1   

First, let's use the command above to restrict the data to only the respondents who have answered the individual questionnaire. This will simplify things for us. Now we are free to create our scatter plot using:
 

scatter sexage age


                                                     Figure 1

 

Or, we can get a bit more sophisticated and try a few new options:

scatter sexage age , ylabel(5(5)55) ytick(5(5)55) xlabel(5(5)64) xtick(5(5)64)



                                                    Figure 2

Both scatter plots display the same information; however, the second one is easier to read. The additional options:

[ylabel(5(5)55) ytick(5(5)55) xlabel(5(5)64) xtick(5(5)64)]

tell Stata to place labels and tick marks at every 5 units along the y-axis (from 5 to 55) and along the x-axis (from 5 up to 60, the last multiple of 5 below 64), giving us a more detailed display of both axes. From the additional information provided by this new graph, we can quickly see that most data points are clustered together.

However, there is at least one dot which seems out of place: the one at about 17 on the x-axis (age) and just over 50 on the y-axis (sexage). This is probably a miscoded piece of information, since logically it should not be possible to record an age at first sex greater than your current age.
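
To look at such cases directly rather than reading them off the graph, we can count and list them. This is a sketch rather than part of the original exercise. The extra condition sexage < . is needed because Stata treats missing values as larger than any number, so sexage > age on its own would also pick up respondents with no recorded sexage.

* How many respondents report an age at first sex above their current age?
count if sexage > age & sexage < .

* Show those observations
list sexage age if sexage > age & sexage < .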

To see the effect of this observation on the mean of sexage, type:

sum sexage
 

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      sexage |    2380    18.69958   3.282816          7         51

The answer given is 18.69958. We will see that this statistic is not exactly correct, because of the outlier observation. Note that if we had a set of observations coded as 99 (e.g. that code could be for 'Did not answer'), these observations would similarly disturb the mean of sexage and we would have to set them to missing before continuing.

Now, we can consider removing the sets of more obvious outliers. Since each of these outliers is probably moving the mean away from the median, we will remove these cases and recalculate the graphs and the correlation between sexage and age. These observations, as well as the ones which have missing values, will provide us with no extra information about the relationship between age and sexage. To clean these variables, type:

replace sexage=. if sexage>age
 

Remember that after we are done with this exercise, we must reload the original data set to recover the values we have just set to missing. Unless you want to keep these changes permanently, you should NOT save the data over the original data file.
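
An alternative to reloading the file, sketched here under the assumption that you think of it before making any changes, is Stata's preserve and restore pair. preserve keeps a temporary copy of the data as they are at that moment, and restore brings that copy back.

* Run this BEFORE cleaning, while the data are still in their original state
preserve

* ... cleaning and analysis commands go here ...

* When finished, return to the data as they were when preserve was typed
restore

Either way, the file on disk is untouched as long as you do not save over it.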

Now we can proceed with the calculations. To do so we type:

corr sexage age

             |   sexage      age
-------------+------------------
      sexage |   1.0000
         age |   0.2945   1.0000

 

Then we type:

scatter sexage age , ylabel(5(5)50) ytick(5(5)50) xlabel(5(5)64) xtick(5(5)64)



                                                Figure 3

The new correlation is slightly stronger than the previous one, but not substantially so. The scatter plot indicates a positive relationship, although it does not seem to be entirely linear. Do you think we have cleaned up enough of the outliers?

To check what the mean of sexage looks like with these observations removed, type:

sum sexage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      sexage |    2378    18.68587   3.216644          7         50

 

The answer here is now 18.68587 which is slightly lower than before.

 

Formally testing for outliers

Some fields in social research embrace an active approach to the handling of outliers, whereas others take a more hands-off approach. Neither approach is superior to the other; after all, both are efforts to minimize the effects of extreme values. The active approach controls for the ill effects of outliers by eliminating cases from the models, whereas the hands-off approach often relies on more robust estimation procedures which can handle extreme values in the data.

For our purposes, we will only eliminate the missing values and the most obvious outliers, for two reasons: 1) an in-depth study of how to formally handle outliers is beyond the scope of this course, and 2) although we advocate the use of more robust procedures to handle possible outliers, those procedures are also beyond the scope of this course. Therefore, we will stay on the middle ground and only eliminate the most obvious outliers for our regression models.

 

 

SIMPLE REGRESSION

Simple OLS (Ordinary Least Squares) regression is a procedure that determines the best-fitting straight line between two variables. In essence, the OLS regression line is the line that minimizes the sum of squared errors, that is, the squared vertical distances between the observed values of the dependent variable and the line. It is beyond the scope of this website to teach you the finer points and intricacies of regression analysis; however, we will provide useful examples to give you a feel for what it is in general. Our main purpose here will be to show you how to use Stata to calculate the regression line between two variables and how to interpret the results. If you are not clear on what exactly regression is or would like to have a deeper understanding of it, we suggest that you take a course in statistics as it relates to your field of interest.

In general, the simplest relationship between an independent and dependent variable can be expressed in the linear formula,

Y = a + bX

where Y is the dependent variable and X is the independent variable. The coefficient "b" is referred to as the slope and tells us by how much Y changes for a 1 unit change in X. The coefficient "a" tells us the value of Y when the independent variable X is zero. On an X-by-Y graph, the coefficient "a" is where the regression line intersects the y-axis.

In the case of sexage and age, the equation can be written as follows:

sexage = a + b(age)

This equation suggests that there is a linear relationship between our two variables. If we were to find a positive b coefficient, our equation would suggest that as age increases by 1 unit, age at first sex (sexage) increases by b units; if we find a negative b coefficient, it would suggest that as age increases by 1 unit, age at first sex decreases by the absolute value of b.

First RE-OPEN the data in its original format:

use bais.dta, clear
keep if rec_per==1

Remember that we can type help followed by the name of any command to learn more about that command in Stata. If we type help regress we will get a full description of the regression command, its options and its syntax. For our purposes we need to type:

reg sexage age

Stata gives us the results table below:
 

      Source |       SS       df       MS              Number of obs =    2380
-------------+------------------------------           F(  1,  2378) =  206.79
       Model |   2051.1011     1   2051.1011           Prob > F      =  0.0000
    Residual |  23587.0985  2378  9.91888078           R-squared     =  0.0800
-------------+------------------------------           Adj R-squared =  0.0796
       Total |  25638.1996  2379  10.7768809           Root MSE      =  3.1494
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0778464   .0054135    14.38   0.000     .0672308     .088462
       _cons |   16.09526   .1922678    83.71   0.000     15.71823    16.47229
------------------------------------------------------------------------------
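
If you are curious where the coefficients in this table come from, they can be reproduced from the covariance, variance and means of the two variables. The sketch below assumes the regression above has just been run on the same data; mean_age is a name made up for this example.

* Covariance matrix of the two variables
corr sexage age, covariance

* Slope: b = cov(sexage, age) / var(age)
display r(cov_12) / r(Var_2)

* Intercept: a = mean(sexage) - b * mean(age), using the regression sample
quietly reg sexage age
quietly sum age if e(sample)
scalar mean_age = r(mean)
quietly sum sexage if e(sample)
display r(mean) - _b[age] * mean_age

The two displayed values should agree with the age and _cons entries in the Coef. column.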

 

 

UNDERSTANDING REGRESSION OUTPUT TABLES

What do all these numbers mean? The output shown here is actually three tables in one. There is a small table in the upper left, a list of information in the upper right, and a larger table across the bottom. The smaller table in the upper left hand corner is called the analysis of variance (ANOVA) table. We are not particularly interested in this portion of the results here.

For the purposes of understanding the basic relationship between sexage and age, we will focus on three pieces of information provided by the output above. First, let's remember the basic linear regression equation:

Y = a + bX                

or in our case:

sexage = a + b(age)

If we plug the results into their appropriate spot in the equation, we get:

(predicted sexage)i = 16.09526  + .0778464(age)i

In words, this equation tells us that for every one-year increase in age, predicted age at first sex (sexage) increases by about 0.08 years. This coefficient is statistically significant, as indicated by the p-value of 0.000 (that is, less than 0.0005) reported in the P>|t| column. In addition, the constant (_cons) tells us that when our independent variable age equals zero, predicted age at first sex is 16.095; since an age of zero lies well outside the ages observed among respondents, the constant is best read as the intercept of the line rather than as a meaningful prediction. The other important piece of information is the R-squared, which equals 0.08. In essence, this value tells us that we can account for about 8% of the variation around the mean of sexage with the age variable.
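
To see where the R-squared comes from, we can recompute it from the ANOVA portion of the output. This is a small sketch, assuming reg sexage age is the most recent estimation command. R-squared is the Model sum of squares divided by the Total sum of squares, and in a simple regression it also equals the square of the correlation coefficient we found earlier.

* From the results Stata stores after regress
quietly reg sexage age
display e(mss) / (e(mss) + e(rss))

* The same ratio typed in from the printed table: Model SS / Total SS
display 2051.1011 / 25638.1996

* The squared correlation from the corr output
display 0.2828^2

All three values come out at approximately 0.08.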

 


The Case of Simple Regression

Now we can use this formula to make actual predicted estimates of sexage for any given value of age.

So, looking at the regression results table above, we arrive at:

(predicted sexage)i = 16.09526  + .0778464(age)i

What does this equation really tell us? What if we were interested in estimating at what age a current 40 year old first had sex? Using the equation above, we plug in 40 for (age)i and solve for the resulting (predicted sexage)i. In this case, one predicts that an individual currently aged 40 years would have first had sex at the age of 19.2091. It is important to realize that a regression equation will never perfectly fit the observed values. Therefore, the estimated value of sexage that our calculation predicts is just that, a prediction. That is why we place the word predicted in front of the dependent variable sexage.
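
The arithmetic for the 40 year old can be checked in Stata itself. The following sketch shows three ways of doing so; the second and third assume that reg sexage age on the original data is the most recent estimation command.

* By hand, from the printed coefficients
display 16.09526 + .0778464*40

* From the coefficients Stata stores after regress
display _b[_cons] + _b[age]*40

* The same prediction with a standard error and confidence interval
lincom _cons + 40*age

The lincom version is a reminder that the predicted value is itself an estimate with some uncertainty around it.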

A useful step after any regression equation is to create a variable in Stata that equals the predicted value of your dependent variable given your independent variable(s). We use the predict command to calculate each predicted sexage. The predict command uses the results of the most recent estimation command, so it must be run after the regression and before any other model is estimated. Thus, we would type the following:

reg sexage age
predict fsexhat


Note that we named our new variable fsexhat, which includes the suffix "hat" as part of the new name. This is a common practice because the hat sign, ^, in regression equations, is often used to indicate estimated values. Let's see what this new estimated variable looks like. Type:

list sexage fsexhat

Here is a partial view of what the resulting table should look like:

      +-------------------+
      | sexage    fsexhat |
      |-------------------|
   1. |      .   19.59835 |
   2. |     22   18.89773 |
   3. |      .   20.45466 |
   4. |      .   19.67619 |
   5. |      .   17.49649 |
      |-------------------|
   6. |      .    17.3408 |
   7. |      .   17.10726 |
   8. |     15   18.11926 |
   9. |     17   18.19711 |
  10. |     19    18.3528 |
      |-------------------|
  11. |      .   18.11926 |
  12. |     18   18.89773 |
  13. |     18   17.57434 |
  14. |      .   18.04142 |
  15. |     20    18.5085 |
      |-------------------|
  16. |     19   17.80788 |
  17. |      .   17.49649 |
  18. |     21   18.43065 |
  19. |     22   18.43065 |
  20. |     30   19.83188 |
      |-------------------|

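You can readily see that none of our predictions exactly match the observed values. Note also that fsexhat is filled in even where sexage is missing, because the prediction only requires a value of age. Nevertheless, the regression results tell us that by knowing an individual's age, we can reduce the squared error of our guesses of sexage by about 8% compared with simply guessing 18.70, the sample mean of sexage, for everyone.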
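
To see how far off each prediction is, we can also ask predict for the residuals, that is, the observed sexage minus the predicted sexage. This is a sketch that assumes reg sexage age is still the most recent estimation command; fsexres is a variable name made up for this example.

* Residual = observed sexage - predicted sexage
predict fsexres, residuals

* Compare observed values, predictions and residuals for the first 20 cases
list sexage fsexhat fsexres in 1/20

* Summarize the residuals
sum fsexres

Residuals are missing wherever sexage is missing, since there is no observed value to compare the prediction with.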

Before we move on, let's try another practice question:

  1. How much would you expect the age of first sex to change by as education increases?

     Question 3 Answer

 

 

 

GRAPHING REGRESSION EQUATIONS

Having obtained a predicted value of the dependent variable sexage, we can plot this relation with the scatterplot graphing command. In this instance, the command would be:

graph twoway scatter sexage age || line fsexhat age, ylabel(5(5)55) ytick(5(5)55) xlabel(5(5)64) xtick(5(5)64)

 
                                                Figure 4

As you can see, the above graph is very similar to the scatter plots above. The difference now though, is that we have a regression line. Do you see a problem here? Remember our conversation about outliers? Let's put all of our newly acquired knowledge to use.
 

 

 

PUTTING IT ALL TOGETHER

Recall from above that we should only drop the most obvious outliers. First, let's reload our data to make sure we have all the original cases, and then clean up the outliers. Remember also that we only want to keep information from people who answered the individual level questionnaire. We can do this by typing:

use bais.dta, clear

keep if rec_per==1

replace sexage=. if sexage>age

Then let's calculate the new regression line. What difference does the removal of the outliers make? Let's find out.

reg sexage age

predict fsexhat

 

      Source |       SS       df       MS              Number of obs =    2378
-------------+------------------------------           F(  1,  2376) =  225.71
       Model |  2133.62661     1  2133.62661           Prob > F      =  0.0000
    Residual |  22460.7186  2376  9.45316441           R-squared     =  0.0868
-------------+------------------------------           Adj R-squared =  0.0864
       Total |  24594.3452  2377  10.3468007           Root MSE      =  3.0746
------------------------------------------------------------------------------
      sexage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0794569   .0052888    15.02   0.000     .0690857    .0898281
       _cons |    16.0266   .1879012    85.29   0.000     15.65814    16.39507
------------------------------------------------------------------------------

Removing the outliers has made some difference. The R-squared increases from 8% to almost 9% (still not very high, which we expected given the correlation analysis we performed above). The slope rises slightly, from about 0.078 to about 0.079, now that it is no longer being pulled by the most obvious outliers. The constant is slightly lower at 16.0266.

Let's now plot the new graph using our newly constructed fsexhat variable:

scatter sexage age || lfit fsexhat age , ylabel(5(5)50) ytick(5(5)50) xlabel(5(5)64) xtick(5(5)64)


                                            Figure 5
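
As an aside, Stata can draw the fitted line without a predicted variable at all. In this sketch, lfit is given the original variables and estimates the line of sexage on age itself, so it should trace the same line as the regression above:

* Scatter plot with a fitted line estimated directly from sexage and age
twoway (scatter sexage age) (lfit sexage age), ylabel(5(5)50) ytick(5(5)50) xlabel(5(5)64) xtick(5(5)64)

This form is convenient for a quick look, while predict is still needed whenever we want the predicted values as a variable in the data.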

Spend a few minutes studying these new results. What did we learn about the relationship between sexage and age? How did it change from before? Is this the best we can do?

In fact, this is not the best we can do! Can you think of any other extraneous variables that could play a significant role in this bivariate relationship? How about gender? How about the level of education a person has, or whether they are literate (because some people may not have completed schooling yet)? Each of those variables is likely to alter our initial simple regression relationship between sexage and age, because this relationship is likely to depend on gender, education, literacy and other variables.

This type of analysis, however, requires more than simple regression between two variables; it requires what is known as multiple regression. We turn to multiple regression next. But first, try your new knowledge on the exercise questions below. Make sure you understand simple regression before you move on to the more complex multiple regression.

 

 

EXERCISES

So now that we have learned quite a bit about regression analysis, it's time to put our knowledge to the test!

1.) What is the correlation between how old a person is and their level of education? Are older people more or less educated than younger people?   

Exercise 1 Answer

2.) We might think that those who have more years of education would choose to postpone the age of first sex, as they are more likely to be aware of various STDs and other risks. Investigate this hypothesis by regressing sexage on educ. Is there a statistically significant relationship between age of first sex and level of education?

Exercise 2 Answer
3.) Do those who have first sex at an earlier age have more or fewer total births? You will have to use the nobirth variable for this exercise, and restrict the sample to women only, by using the gender variable. Is there a significant relationship? If not, what other variables do you think would be important in explaining total number of births?
 
Exercise 3 Answer

4.) Now use a simple regression model to determine the relationship between the number of live births and years of education for women. What is your prior expectation about the sign of the relationship? Is your prior borne out in the data? What happens if you restrict the regression to women who are probably past childbearing age - over age 40? How much of the variation in number of births do we explain using only education, for women over age 40?

Exercise 4 Answer

 
