Module 8

 

BINARY DEPENDENT VARIABLE ANALYSIS

 

TABLE OF CONTENTS

Introduction
Linear Probability Models
Basic Issues in Discrete Outcome Analysis
Probit Models
Logit Models
Model Selection: Logit or Probit
Additional Resources

INTRODUCTION

As you may have noticed from looking at the questionnaires in the BAIS, many questions capture information which cannot be expressed in the form of a continuous variable. Let's open the data and the individual questionnaire, and have a look at some cases.

There are questions like:

Example 1:
Q304: "What is your relationship to [NAME OF MOST RECENT/NEXT MOST RECENT PARTNER]?"

Answer:                            ......Husband/Wife
                                        ......Girlfriend/boyfriend/not living with you
                                        ......Someone whom you paid or who paid you for sex
                                        ......Casual acquaintance
                                        ......Other


{Variable name: part1, part2, part3}
 


Example 2:
Q309: "Did you use a condom the first time you had sex with this partner?"

Answer:                             ......YES/NO/Don't recall


{Variable name: part1con1, part2con1, part3con1}

 

Example 3:
Q314: "The last time you had sex, did you or this partner do anything to delay or avoid pregnancy?"

Answer:                              ......YES/NO/Don't recall


{Variable name: part1preg, part2preg, part3preg}


Example 4:
Q604: "If a teacher has HIV/AIDS but is not sick, should she be allowed to continue teaching in school?"

Answer:                                ......YES/NO/Don't know


{Variable name: hivteach}


These variables are coded either as dummy variables, as in the case of hivteach, or as categorical variables, as in the case of the part1, part2 and part3 variables. Even though they are coded numerically (check this with "tab varname, nolabel"), the numbers have no natural meaning. There is, for example, no way to numerically rank the answers to any of the questions above.
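As a small illustration of the check mentioned above, the following sketch tabulates one of the partner variables twice, once with value labels and once with the underlying codes. It assumes bais.dta is open and that part1 carries value labels; the exact labels and codes you see depend on the data.

```stata
* Tabulate with value labels (the text of each answer category)
tab part1
* Tabulate again showing the numeric codes stored in the data
tab part1, nolabel
```

Comparing the two tables shows which number stands for which answer, and makes it clear that the ordering of the codes is arbitrary.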

Many of these types of variables are likely to answer interesting questions we might have. For example, we might care about the factors which explain why someone uses a condom the first time they have sex with a partner. Here, the dependent variable is 0 if the respondent did not use a condom the first time they had sex with this partner, and 1 if they did. Or, we might care about the factors affecting whether a person thinks that HIV positive teachers should be allowed to continue teaching. In this case, Y==0 for a NO response and Y==1 for a YES response.

Although we can identify X and Y variables which are important for answering policy-related questions, we will see shortly that the linear regression framework we have been using to link the dependent and independent variables is not really the right model to use when we have categorical outcome variables.

Before we continue, try the next question:

1. Can you locate questions at the individual and household level which generate data in categorical variable format? Which are coded as dummy variables?

 

As focus questions for today, we will try to understand the relationship between X variables including:
                    (i) observable features of individuals (eg: gender, age, education)
                    (ii) observable features of their homes (eg: access to basic services) and
                    (iii) features of a community (eg: access to information services);

and Y variables, or outcome variables, which are connected to:
                    (i) risky sexual behavior and
                    (ii) individual treatment of and attitudes towards others with HIV/AIDS.

Most of the outcome variables will be categorical or dummy variables. Finding out more about what factors influence the decisions to undertake more or less risky behavior in the face of the AIDS epidemic is certainly useful information for policy-makers. In addition, the government is concerned about not ostracizing people with HIV/AIDS from society, and so determinants of attitudes towards and treatment of others are also of prime importance for policy-makers to understand. We hope to be able to analyze the relationship between information that individuals report having about AIDS, and their attitudes towards and treatment of others with HIV/AIDS.

In this module, we will describe the basic problem with using the linear regression framework to model categorical outcome variables, and introduce the two types of models used instead: the logit and probit models. Both of these models deal better with dummy dependent variables, and we can use them to answer many of the questions we might be interested in.

Before we get to the logit and probit models, we look more carefully at why the linear regression framework doesn't always work well for modeling outcomes which are binary (either 0 or 1, in the case of dummy variables). To do this, we will consider the Linear Probability Model.
 

 

 

LINEAR PROBABILITY MODELS (LPM)

Stata will run a linear regression on any variables you give to it: as long as there is one Y variable and at least one X variable, the regression coefficients will be generated. However, to get sensible results out of the linear regression, you need to put sensible things into it!

Let's try this example. Suppose there is a particular group of individuals (characterized by age group, gender, or district of residence perhaps) that we suspect is not getting enough information about HIV/AIDS. There is a question in the survey which asks "Have you heard or seen information about HIV/AIDS?", and the answers are captured as Yes==1, No==2 and User-missing==7. Open bais.dta to check this.

First, let's get the variable into shape to serve as a dummy variable:
 

lab def yesno 0 "NO" 1 "YES"
tab hivinfo, nol
recode hivinfo (2=0) (7=.)
lab val hivinfo yesno
 
We'll also create a female dummy:
gen female=.
replace female=0 if gender==1
replace female=1 if gender==2
lab def female  0 "Male" 1 "Female"
lab val female female
keep if rec_per==1

Since we are going to focus on individual-level information, we can select only those observations for which the individual questionnaire was answered. Note that, if we need any household level variables in our analysis (like household size), we would need to re-open the data and start with the entire sample again.
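The female dummy built above with gen and replace can also be created in a single line. This is only a sketch: it assumes gender is coded 1 for male and 2 for female, as in the commands above, and female2 is a hypothetical variable name used just for the comparison.

```stata
* One-line version of the female dummy
* Observations with any other gender code are left missing
gen female2 = (gender==2) if gender==1 | gender==2
* Check that the two versions agree for every observation
assert female2 == female
```

If the assert command returns without an error, the two dummies are identical and female2 can be dropped.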

As a preliminary investigation, we might ask: do females have more information about HIV than males? Do people in rural areas have less information? Do individuals with more education also have more information? We can investigate some of these bivariate relationships using simple tab, sum commands:

tab educ, sum(hivinfo)

  number of |  Summary of have you heard or seen
   years in |           info about HIV
     school |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   .77142857   .42604296          35
          2 |    .6025641   .49253502          78
          3 |    .6173913   .48815098         115
          4 |   .67379679   .47008123         187
          5 |   .63829787   .48177621         188
          6 |   .64705882   .47876551         272
          7 |    .6916221   .46223568         561
          8 |    .6984127   .46016604         189
          9 |   .72508591   .44685504         582
         10 |   .71717172   .45094341         396
         11 |   .77027027   .42353041          74
         12 |   .82716049   .37869332         324
         13 |   .82352941   .38501337          51
         14 |   .95454545   .21070705          44
         15 |   .89583333   .30870928          48
         16 |   .96296296   .19245009          27
         17 |   .88235294   .32703497          34
         18 |           1           0           8
         19 |   .83333333   .40824829           6
         20 |           1           0           9
         21 |           1           0           3
         22 |   .66666667   .57735027           3
         23 |           1           0           1
         25 |           1           0           1
------------+------------------------------------
      Total |   .72002472   .44905616        3236
 
tab location, sum(hivinfo)
            |  Summary of have you heard or seen
location of |           info about HIV
  household |        Mean   Std. Dev.       Freq.
------------+------------------------------------
      Urban |   .80585366   .39573517        1025
  Urban Vil |     .708061   .45490223         918
      Rural |   .67206478   .46959691        1729
------------+------------------------------------
      Total |   .71840959   .44983593        3672
 
tab female, sum(hivinfo)
            |  Summary of have you heard or seen
            |           info about HIV
     female |        Mean   Std. Dev.       Freq.
------------+------------------------------------
       Male |   .72948328   .44436185        1645
     Female |   .70942279   .45414077        2027
------------+------------------------------------
      Total |   .71840959   .44983593        3672


All of these commands give us the proportion of people in each category who answered YES to the question. For example, in the last table above, almost 71% of females have heard or seen information about HIV. This is the beauty of dummy variables: any average which is generated out of values of 1's (yes's) and 0's (no's) will be the proportion of 1's in your sample. [Note that we would have generated very different results if we had not first recoded the hivinfo variable to be a dummy variable.] These tabs will give us some picture about the factors affecting HIV knowledge, but we want to answer the question more rigorously, controlling for all other variables we think might matter at the same time.
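A quick way to convince yourself that the mean of a dummy variable is a proportion is to compare the two commands below. This is a sketch that assumes hivinfo has already been recoded to 0/1 as described earlier.

```stata
* The mean reported here is the share of YES (1) answers
summ hivinfo
* The percent column for YES, divided by 100, should match that mean
tab hivinfo
```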

Remembering what we know about how to generate dummy variables out of categorical variables, let's run the following regression. We are not saying here that being of a certain age or education determines whether you have information about HIV or not. Rather, you can think of the regression output as telling us how much more or less likely a person is to have information about HIV, given that they have a certain characteristic (they live in a certain area, are of a certain age, etc.).
 

xi: reg hivinfo age female educ i.location
 

      Source |       SS       df       MS              Number of obs =    3236
-------------+------------------------------           F(  5,  3230) =   32.74
       Model |  31.4682868     5  6.29365736           Prob > F      =  0.0000
    Residual |  620.874111  3230  .192221087           R-squared     =  0.0482
-------------+------------------------------           Adj R-squared =  0.0468
       Total |  652.342398  3235  .201651437           Root MSE      =  .43843
------------------------------------------------------------------------------
     hivinfo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0044326   .0006194     7.16   0.000     .0032182     .005647
      female |  -.0156337   .0156182    -1.00   0.317    -.0462562    .0149888
        educ |   .0171452   .0023696     7.24   0.000     .0124992    .0217913
_Ilocation_2 |  -.0772342   .0208552    -3.70   0.000     -.118125   -.0363434
_Ilocation_3 |  -.1056572    .018639    -5.67   0.000    -.1422026   -.0691117
       _cons |    .538162    .031462    17.11   0.000     .4764744    .5998495
------------------------------------------------------------------------------

Before we look at the output, remember how we interpreted coefficients in the linear regression model with continuous variables? We said, for example, that for each additional year of education an individual has, the age of first sex changes by b_ed years, where b_ed is the estimated coefficient on education. Here, the outcome variable is either YES (1) or NO (0). So we cannot interpret the coefficients as the change in Y for a given change in X, because Y can only take on two values. Instead, we will think about how a change in each variable affects the PROBABILITY that a person reported YES or NO.

For example, the _Ilocation_3 variable has a coefficient of about -0.11. This means that compared to the base group (location==1, or urban area), someone living in a rural area is about 11 percentage points less likely to report YES, that they have heard or seen some HIV information. Consider the age coefficient: it tells us that for each additional year of age, the probability of a person having heard some information about HIV rises by 0.44 percentage points. A twenty-five year gap between two otherwise identical individuals will mean that the older person is about 25*0.0044=0.11 ==> 11 percentage points more likely to report YES.
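The arithmetic in this interpretation can be reproduced from the stored regression results rather than by copying numbers from the table. The sketch below assumes it is run immediately after the hivinfo regression above, so that the _b[] values refer to that model.

```stata
* Change in the probability of YES for one extra year of age
di _b[age]
* Implied gap for a 25-year age difference (about 0.11)
di 25*_b[age]
* Rural relative to the urban base group (about -0.11)
di _b[_Ilocation_3]
```

Multiplying each result by 100 gives the effect in percentage points.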

This is why linear regression with a dummy dependent variable is known as a Linear Probability Model - because the predicted outcome is a probability number, while the form of the relationship between the X and Y variables is assumed to be linear.

Notice the R-squared here; it's really small. This means that we cannot explain much of the variation in the YES/NO answer to this question, if we use these particular X-variables. However, since we are using a dummy dependent variable, we should not expect the R-squared to be very high. Why is this? Since the actual values are only ever 0 or 1, and the predicted values can be ANY number, including non-integers, the fit of the regression line to the actual data will never be exceptionally good. We'll see a picture further down, which will clarify this explanation.

Before we move on, are there any other possible X-variables which might play a role in affecting whether an individual has heard/seen any information about HIV?

Now, to illustrate the problem with linear regression in the context of dummy-dependent variables, let's predict the outcome variable for each individual:
 

predict infohat
summ infohat

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     infohat |      3448    .7127905    .1016177   .4783424   1.163583

 

Note that the mean of infohat is 0.713. What information are we getting from this statistic? Simply that the predicted proportion of individuals in the sample who answered yes to the question is about 0.71. The actual mean in the sample (the proportion who actually did answer YES) was also about 0.71.

Now, remember that the outcome variable could only take on values of 0 and 1. What has happened in the predictions? We have predicted values of PROBABILITY which are greater than 1!! The maximum value for infohat, which is a probability number, is 1.1635. In fact, with certain values for some of the X-variables, we might have predicted a negative infohat. Neither of these situations is plausible (probability greater than 1, or negative), and this is the central problem with using linear regression for dummy dependent outcome variables.
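To see how many predictions actually fall outside the valid range for a probability, one option is to count them directly. This is a sketch that assumes infohat has just been created with the predict command above.

```stata
* Stata treats missing values as larger than any number,
* so exclude them when counting predictions above 1
count if infohat > 1 & infohat < .
* Predictions below 0
count if infohat < 0
```

With the summary statistics shown above (minimum of about 0.48), we would expect the second count to be zero for this particular model.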

Let's try to see this problem with another example, which deals with how individual characteristics and access to information affects attitudes. Suppose we want to know what factors affect whether an individual would want an HIV-positive teacher to remain teaching in school. What kinds of X-variables do you think will matter?

Let's do some cleaning first. The variable we need is hivteach:

tab hivteach, nol
recode hivteach (2=0) (9=.)
lab val hivteach yesno
xi: reg hivteach age female educ i.location
predict teachhat
 
      Source |       SS       df       MS              Number of obs =    3077
-------------+------------------------------           F(  5,  3071) =  149.68
       Model |  144.235258     5  28.8470516           Prob > F      =  0.0000
    Residual |   591.83949  3071  .192718818           R-squared     =  0.1960
-------------+------------------------------           Adj R-squared =  0.1946
       Total |  736.074748  3076  .239296082           Root MSE      =    .439
------------------------------------------------------------------------------
    hivteach |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0059667   .0006419     9.30   0.000     .0047082    .0072252
      female |    .087467   .0160206     5.46   0.000     .0560547    .1188792
        educ |   .0545308   .0024291    22.45   0.000     .0497681    .0592936
_Ilocation_2 |  -.0953094   .0213145    -4.47   0.000    -.1371015   -.0535172
_Ilocation_3 |  -.1068599   .0191193    -5.59   0.000    -.1443478   -.0693719
       _cons |    .017347   .0322079     0.54   0.590    -.0458043    .0804983
------------------------------------------------------------------------------
 
How do we interpret these coefficients?

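Each coefficient gives us the change in the conditional probability of a YES response for a one-unit change in the explanatory variable, holding the other variables constant. So, for example, each additional year of education is associated with a rise of about 5.5 percentage points (0.0545) in the probability of a YES response, and females are about 8.7 percentage points more likely than males to answer YES.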

 

If we look at the predicted values from our regression:

summ teachhat
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    teachhat |      3448    .5844359    .2233902   .0246846   1.560158


We find the maximum prediction is again above 1. There is nothing about the structure of the LPM which restricts predicted values to be within suitable bounds for probabilities: between 0 and 1. Let's see this in graphical form.

sort hivteach teachhat
gen num=_n                         /*this generates a numerical label for each prediction*/
scatter hivteach teachhat num
scatter hivteach teachhat num, ylabel(0(0.25)1.5) xlabel(0(500)4000) ti("Fig 1: responses, & predicted probabilities from the LPM")

 

The graph without the frills looks like this:

                                                                Figure 1

 

While if we run the second version, we get this more polished graph:

                                                                Figure 2

What are the two flat blue lines in the graph? They represent the actual answers that people gave to the question: "Should HIV positive teachers be allowed to continue working in schools?" These responses could only be coded as 0 or 1. They are two flat lines in the graph, because we sorted the data first, by hivteach and teachhat. The sort command in Stata will sort the observations along whichever variable you specify first: in this case, all the 0's were ordered before all the 1's, and then within each group (the 0's, and 1's), the data was further sorted by the value of teachhat. You can check this is the case by doing

browse hivteach teachhat, nol

Notice that some of the predicted values for Y are outside of the [0,1] range. We could adjust these particular values by hand: that is, for all those observations which have predicted values above 1, set their predictions equal to 1 (and similarly, set the prediction equal to 0 if the prediction is negative). Alternatively, we could choose to use a model which does not allow any predictions to be outside of the [0,1] interval. There are indeed several widely-used models which solve this problem for us. They are what we turn to next. We will come back to how the LPM can still be useful, in conjunction with any of the following alternatives.
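The "by hand" adjustment described above might look like the following sketch. It assumes teachhat exists from the predict command earlier; teachhat_clip is a hypothetical name for the adjusted variable.

```stata
* Start from a copy of the LPM predictions
gen teachhat_clip = teachhat
* Cap predictions above 1 (missing values are excluded)
replace teachhat_clip = 1 if teachhat > 1 & teachhat < .
* Raise any negative predictions to 0
replace teachhat_clip = 0 if teachhat < 0
* Compare the original and adjusted predictions
summ teachhat teachhat_clip
```

This fixes the range but is an ad hoc patch: it does nothing about the assumption that the effect of each X variable is the same everywhere, which is what the alternative models address.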

 

Before we continue though, try the following questions.

2. Is there reason to expect that individuals would have different attitudes on the hivteach question, depending on which district they live in? Are there some districts you would expect to be more accepting of teachers with HIV? Test your prior expectations, by running a version of the above model which uses the district variable instead of location. Do we get much more information out of running the regression in this way, or are location and district highly correlated?
 
 
3. Do you think that students would have particularly strong feelings for or against allowing HIV positive teachers to continue in school? Create a student dummy variable using the information in labor market status (lmstat), and include this as a regressor in your linear probability model. Are students more or less likely to answer the question in the affirmative? Is this relationship between being a student and the response to the question a statistically significant relationship?

 

 

 

BASIC ISSUES IN DISCRETE OUTCOME ANALYSIS

Instead of modeling the relationship between X and Y as linear, we want a function that will impose a non-linear relationship between the independent and dependent variables, and will restrict the predicted values of Y to be between 0 and 1.

The functions we will use are specific cases of cumulative distribution functions, or CDFs. A CDF gives the probability that a random variable is less than or equal to a given value, so its output always lies between 0 and 1. The logit model is based on the logistic distribution function, while the probit model is based on the familiar normal distribution function. Both of these models have attractive properties which the LPM does not:
 

  1. as X increases, Y increases but never moves outside of the [0,1] interval; and
  2. the relationship between the Y variable and X independent variables is assumed to be non-linear. That is, when there is a positive relationship between X and Y, for a particular observation, Yi approaches 0 more and more slowly as Xi gets small, and Yi approaches 1 more and more slowly as Xi gets larger and larger. In calculus-speak, this means dY/dX is not constant.


It is perhaps easiest to see what is going on with each of these models, in contrast to the LPM, in the figure below:


Notice that the way our data points are arranged, there is a positive relationship between X and Y: higher values of X are associated with higher values of Y, even though Y can only be 0 or 1. The LPM simply fits a straight line through the points, and so may generate predicted values which are outside of the [0,1] range for probabilities. The logit and probit, on the other hand, 'squash' the relationship between X and Y to fit inside the [0,1] range. Consider the right hand extreme of these lines: as we approach Y=1, any changes in X have almost no impact on the probability of Y =1, since Y is already almost there. In this way, the logit and probit models generate sensible predictions for probability numbers.

The main difference between the logistic CDF and the normal CDF is that the former has slightly fatter tails (there is more mass in the tails of the logistic distribution than in the normal distribution). The choice of which to use is generally a matter of taste or convenience of interpretation. We will return to a discussion of choice of model below. However, before we go into the probit model in more detail, it is useful to explicitly write out the form of the dummy-dependent variable models we have been considering.
 

The model set up for the LPM was as follows:

                        Y = bX + e                                                       (A)                               
                        Y = 0 if response is NO
                            = 1 if response is YES
                        X = independent variable(s)
                        b = coefficient on the independent variable(s)
                        e = stochastic error term

 


Recall that the slope of the LPM line in the diagram is simply b (if we have just one independent variable).

In contrast, the set-up for both of the non-linear models is the following:

                        Y = F(Xb) + e                                                    (B)
                        Y = 0 if response is NO
                            = 1 if response is YES
                        X = independent variable(s)
                        b = coefficient on the independent variable(s)
                        e = stochastic error term
                        F(.) = a CDF, which takes the value of Xb and transforms it into a probability number.
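
For reference, the two choices of F(.) used in this module can be written out explicitly. These are the standard textbook definitions rather than anything specific to this data; z stands for the index Xb, f(.) is the derivative of F(.), and b_k is the coefficient on the k-th X variable.

```latex
\begin{align*}
\text{Probit:}\quad F(z) &= \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt \\
\text{Logit:}\quad F(z) &= \Lambda(z) = \frac{e^{z}}{1 + e^{z}} \\
\text{Marginal effect:}\quad \frac{\partial\, P(Y = 1 \mid X)}{\partial X_k} &= f(Xb)\, b_k
\end{align*}
```

Because f(Xb) changes with Xb, the marginal effect of an X variable is not constant in these models, unlike the LPM in equation (A), where it is simply b.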


 

This is not the place to go into more detailed theory, but if you have never seen probit or logit models before, there are several references at the end of this module which you might want to look at. At this point, we need to know three things: Stata will estimate whichever model you ask for; it will use a technique called maximum likelihood estimation to retrieve the coefficients; and the output is interpreted somewhat differently from the output of the linear regression framework.

Let's go to some examples.

 

 

 

PROBIT MODELS

In this model, the CDF we use [F(.) in equation B above] is the standard normal distribution function. We'll use the probit to analyze the effect of age, education, gender, urban residence and knowledge about HIV on the probability of an individual responding YES to the question "Should an HIV positive teacher be allowed to continue teaching?". We will stay with this example for the moment, to compare our results to the results from the LPM.

char location[omit] 3 
xi: probit hivteach age educ female i.location hivinfo
i.location        _Ilocation_1-3      (naturally coded; _Ilocation_3 omitted)
Iteration 0:   log likelihood = -2063.6213
Iteration 1:   log likelihood = -1720.5449
Iteration 2:   log likelihood = -1704.3112
Iteration 3:   log likelihood = -1704.2015
Iteration 4:   log likelihood = -1704.2015
Probit estimates                                  Number of obs   =       3074
                                                  LR chi2(6)      =     718.84
                                                  Prob > chi2     =     0.0000
Log likelihood = -1704.2015                       Pseudo R2       =     0.1742
------------------------------------------------------------------------------
  hivteach |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0192975   .0020391     9.46   0.000     .0153008    .0232941
        educ |   .1853501   .0092052    20.14   0.000     .1673082    .2033919
      female |   .2409078   .0503665     4.78   0.000     .1421912    .3396244
_Ilocation_1 |   .3112355   .0609569     5.11   0.000     .1917622    .4307089
_Ilocation_2 |   .0237823   .0607588     0.39   0.695    -.0953028    .1428674
     hivinfo |   .1952305    .055225     3.54   0.000     .0869916    .3034695
       _cons |  -2.082644   .1052039   -19.80   0.000    -2.288839   -1.876448
------------------------------------------------------------------------------

 

Below are some notes related to the probit output:

 

Probit Output - Note 1

By using the char varname[omit] number command, we ensured that the base (or omitted) category for area of residence is rural. The individual who represents the base category for the regression is a male individual, with no education and zero years of age, who has not heard or seen any information about HIV.
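
In more recent versions of Stata, the same choice of base category can be made without xi: or char. The sketch below assumes Stata 11 or later, where factor-variable notation is available; it is an alternative to the commands used in this module, not a replacement for them.

```stata
* ib3.location includes location as a set of dummies
* with category 3 (rural) as the omitted base group
probit hivteach age educ female ib3.location hivinfo
```

The estimates should match the xi: output above, although the location coefficients are labeled differently in the table.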

 

Probit Output - Note 2

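Before we get the output table, there are 5 lines of "Iterations". Stata will print these each time a model is 'solved' using maximum likelihood techniques. Each line reports the log likelihood at one step of the numerical search, and the search stops once the value no longer improves (compare Iterations 3 and 4 above).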

 

Probit Output - Note 3

Interpreting coefficients:

(i) SIGN: since all the variables have positive coefficients, we can say that each of these attributes increases the likelihood of an individual reporting YES, relative to the base or reference group.

(ii) SIZE: It is hard to interpret probit coefficients, because of the structure of the model. Each coefficient adds to (or subtracts from, if the sign is negative) the underlying number - or Z-score - which is then fed into the normal CDF to produce a probability number. We care about the probability number for interpretation, but we see the underlying Z-score in the probit output.

If we look back at the graph with the normal CDF, we can see that if there is an increase in Xb of 0.1 at either tail, the impact on the probability is really small. However, if we change Xb by 0.1 towards the middle of the distribution, the impact of this change on the probability will be large. Thus, to interpret the effect of a change in an X variable on the probability of the outcome happening, we have to specify what Xb value we are starting from. The crucial thing to note is that the marginal effect of a unit change in an X variable is not the same, for every value of X. In a moment, we will learn a command which tells Stata to calculate the effects of a change in each of the independent variables, when we start from the mean values of each X variable. Before we get there though, let us do one interpretation of a coefficient from the probit output - the hard way!
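
The point about tails versus the middle of the distribution can be checked numerically. In this sketch, normal() is the standard normal CDF; it goes by the name norm() in older releases of Stata, which is the name used in the commands in this module.

```stata
* Effect on the probability of raising Xb by 0.1 near the middle
di normal(0.1) - normal(0)
* The same 0.1 increase far out in the right tail
di normal(3.1) - normal(3)
```

The first difference is roughly 0.04, while the second is well under 0.001, even though the change in Xb is identical.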

Suppose we have an individual who is 20 years old, has 10 years of education, is male, and has not heard any information about HIV. We want to know what the impact of moving from a rural to an urban area is, for this particular person. Using the model we wrote out above in equation (B) we can generate a predicted probability for this individual, in each of these two states of the world: when he is in the rural area, and when he is in the urban area.

Yhat[for our man in the rural area]

= F(Xbhat)

= F(b_age*20 + b_educ*10 + b_female*0 + b_urban*0 + b_urban_village*0 + b_hivinfo*0 + constant)

= F(.0192975(20)+.1853501(10)+.2409078(0) +.3112355(0) + .0237823(0) +.1952305(0) -2.082644)

= F(.0192975(20)+.1853501(10) -2.082644)

=F(.156807)
 

To evaluate this, we can ask Stata to read this z-score, and give us back the associated probability number from the standard normal distribution:

di norm(.156807)
.56230152

 

This tells us that such an individual has a 56% chance of responding YES to the question. What is the marginal effect of urban area? Using the same approach, we can calculate:

Yhat[for our man in the urban area]

= F(Xbhat)

= F(b_age*20 + b_educ*10 + b_female*0 + b_urban*1 + b_urban_village*0 + b_hivinfo*0 + constant)

= F(.0192975(20)+.1853501(10)+.2409078(0) +.3112355(1) + .0237823(0) +.1952305(0) -2.082644)

= F(.0192975(20)+.1853501(10)+.3112355(1) -2.082644)

= F(.4680425)

 

Again, to get the probability associated with this z-score, we type:

di norm(.4680425)
.6801229

The difference between these two probabilities represents the marginal effect of location for this male individual of age 20, with 10 years of education and no information about HIV:

di .6801229-.56230152
.11782138

 

The marginal effect, for this type of individual, of moving from a rural to an urban area is an increase of about 12 percentage points in the probability of answering YES. This doesn't look ANYTHING like the 0.31 coefficient on _Ilocation_1!

This is a very tedious way to interpret the probit coefficients, but the example is a warning that the probit output is not as straightforward to think about as the linear regression output.
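
The hand calculation above can be made a little less error-prone by pulling the coefficients from the stored probit results instead of retyping them. This is a sketch under two assumptions: it is run immediately after the probit command, while those results are still in memory, and your version of Stata has normal() (older releases call it norm()). The scalar names xb_rural and xb_urban are hypothetical.

```stata
* Index for a 20-year-old male with 10 years of education,
* no HIV information, living in a rural area (the base group)
scalar xb_rural = _b[age]*20 + _b[educ]*10 + _b[_cons]
* Same person in an urban area
scalar xb_urban = xb_rural + _b[_Ilocation_1]
* Predicted probabilities and their difference
di normal(xb_rural)
di normal(xb_urban)
di normal(xb_urban) - normal(xb_rural)
```

The three numbers should reproduce, up to rounding, the probabilities and the difference computed by hand above.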

We can check that this is consistent with results from the LPM model by doing the following:

xi: reg hivteach age female educ i.location
      Source |       SS       df       MS              Number of obs =    3077
-------------+------------------------------           F(  5,  3071) =  149.68
       Model |  144.235258     5  28.8470516           Prob > F      =  0.0000
    Residual |   591.83949  3071  .192718818           R-squared     =  0.1960
-------------+------------------------------           Adj R-squared =  0.1946
       Total |  736.074748  3076  .239296082           Root MSE      =    .439
------------------------------------------------------------------------------
    hivteach |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0059667   .0006419     9.30   0.000     .0047082    .0072252
      female |    .087467   .0160206     5.46   0.000     .0560547    .1188792
        educ |   .0545308   .0024291    22.45   0.000     .0497681    .0592936
_Ilocation_1 |   .1068599   .0191193     5.59   0.000     .0693719    .1443478
_Ilocation_2 |   .0115505   .0196732     0.59   0.557    -.0270235    .0501245
       _cons |  -.0895129   .0283406    -3.16   0.002    -.1450812   -.0339445
------------------------------------------------------------------------------

Notice that the urban effect in the LPM is about 11 percentage points (the coefficient on _Ilocation_1 is 0.107), which is close to our calculation of 12 percentage points.

 

(iii) SIGNIFICANCE: all of the coefficients in the probit, except for _Ilocation_2, are statistically significant at the 1% level. This means that they are all significantly different from zero, except for urban villages.

[Note: if your probit results have disappeared from your screen, just type probit again, and the results from the most recent probit model will re-appear! However, this will only work if you have not run any other regression or model in between.]

 

Probit Output - Note 4

We can see from the predict and sum commands, that the probit has done what we have required: it has not predicted values for teachhat which are outside of the [0,1] range. Before you run the predict command, remember to re-run the probit command, otherwise Stata will predict from the LPM that we have just run!

xi: probit hivteach age educ female i.location hivinfo
predict teachhatprob
summ teachhatprob
 
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
teachhatprob |      3236    .6020995    .2246865   .0560828     .99988
 
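We can use very similar syntax to before to compare the predicted values from the probit and the LPM, and then graph them together with the underlying data. The variable teachhatprob already exists, so we do not run predict again (Stata would complain that the variable is already defined):
corr teachhatprob teachhat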
 
             | teachh~b teachhat
-------------+------------------
teachhatprob |   1.0000
    teachhat |   0.9626   1.0000

 

Here, we can see that the predictions from the LPM (teachhat) and from the probit (teachhatprob) track each other well, with a correlation of 0.96.

scatter hivteach teachhat teachhatprob num, ti("Fig 3: Actual and predicted values of HIVTEACH")

			Figure 3

Notice that the green dots, which are from the probit predictions, all lie within the [0,1] range, while the red dots, which are all from the LPM predictions, move outside of the [0,1] range.
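The same point can be checked numerically rather than by eye. This is a small sketch, assuming that teachhat (from the LPM) and teachhatprob (from the probit) already exist as created earlier in the module.

```stata
* Sketch only: count predictions that fall outside the [0,1] range
* Stata treats a missing value as larger than any number,
* so missing predictions are excluded explicitly
count if (teachhat < 0 | teachhat > 1) & !missing(teachhat)
count if (teachhatprob < 0 | teachhatprob > 1) & !missing(teachhatprob)
```

The second count is zero by construction, because the normal CDF never returns a value outside [0,1]. The first count tells you how many LPM predictions are not valid probabilities.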

One way to see how the marginal effects are different for different values of X (which does not involve calculating Z-scores and probability numbers by hand for every observation!) is to use a graph to investigate. Suppose we want to know the effect of getting HIV information, for those individuals who report they have not heard or seen any information about the disease. So, for those people who currently have hivinfo==0, what happens to their predicted probability of reporting YES to the hivteach question, when we change their hivinfo variable to =1?

We've already generated a prediction for the whole sample; it was called teachhatprob. Now we need to generate another prediction, where everyone has their hivinfo variable set to = 1. First, re-do the probit, and then follow this syntax:
 

char location[omit] 3 
xi: probit hivteach age educ female i.location hivinfo

gen teachhat2=norm(-2.082644 + age*.0192975 + educ*.1853501 + female*.2409078 + _Ilocation_1*.3112355+ _Ilocation_2*.0237823 +.1952305*1)

What we are telling Stata to do is generate a predicted probability for each observation, using each observation's actual values for age, education, female and location, but instead of using the actual value for hivinfo, let EVERYONE in our sample have a value of 1 for hivinfo. This means, do the prediction for everyone, under the assumption that they all report yes to the question "Have you seen or heard any information about HIV?".

What we are going to graph is all the predicted probabilities for those people who report hivinfo==0, at the value of hivinfo==0, and then again the predictions for these individuals when we set their hivinfo==1.
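Typing the coefficients by hand is error-prone, so the counterfactual prediction can also be built from the stored estimates. This is a sketch under the assumption that it is run directly after the probit; the variable names teachhat2b and infogain are hypothetical, and normal() is the documented name for norm() in Stata 9 and later.

```stata
* Sketch only: same prediction as teachhat2, but using the stored coefficients
gen teachhat2b = normal(_b[_cons] + _b[age]*age + _b[educ]*educ + _b[female]*female + _b[_Ilocation_1]*_Ilocation_1 + _b[_Ilocation_2]*_Ilocation_2 + _b[hivinfo]*1)

* Change in predicted probability for those who currently have hivinfo==0
gen infogain = teachhat2b - teachhatprob if hivinfo==0
summ infogain, detail
```

The summary of infogain shows how much the effect of information varies across individuals, which is the same variation that the graph displays.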
 

Now, we can graph each of these predictions against the probability predictions for only those individuals who report hivinfo=0:

line teachhatprob teachhat2 teachhatprob if hivinfo==0, c(l l) sort 
line teachhatprob teachhat2 teachhatprob if hivinfo==0, c(l l) sort xline(0.3 0.7)

 

                                                Figure 4

The blue line in the above graph is a 45 degree line - it plots the predicted probabilities from the probit for those who have hivinfo==0 against the same probabilities. The red line indicates what the probability prediction would be for these same individuals if, instead, they had hivinfo==1. You can clearly see that giving information to those who have no information about HIV has larger effects for those who start out with predicted probabilities between 0.3 and 0.7 (approximately!).
 

 

Probit Output - Note 5

Other output from the probit command: the likelihood ratio test (LR test) is a test for the joint significance of all slope coefficients in the model. The null hypothesis in the test is that all coefficients other than the constant are jointly zero. The test statistic is reported in the LR chi2(6) line of the output, where 6 is the degrees of freedom of the test: one for each restriction under the null. In our case, we have 6 explanatory variables, and so 6 degrees of freedom.

The Prob > chi2 = 0.0000 line in the upper right hand side of the output table is the p-value of the LR test. Because it is essentially zero, we reject the null that the coefficients are all jointly zero. We can be confident that our model is explaining some part of the variation in responses to this question.
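To see where these header numbers come from, they can be rebuilt from the results that Stata stores after estimation. This is a minimal sketch, assuming it is run directly after the probit command.

```stata
* Sketch only: run directly after the probit
* LR statistic = 2 * (log likelihood of the full model
*                     - log likelihood of the constant-only model)
display 2*(e(ll) - e(ll_0))

* p-value: upper tail of a chi-squared with e(df_m) degrees of freedom
display chi2tail(e(df_m), e(chi2))

* Pseudo R-squared reported in the header
display 1 - e(ll)/e(ll_0)
```

The first and third lines should match the LR chi2 and Pseudo R2 entries in the output header, and the second should be indistinguishable from zero.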

 

 

A Simpler Way To Derive Marginal Effects From The Probit

One command which we can use to make interpretation a little easier, is the dprobit command. This command will also fit the probit model, but instead of reporting the raw coefficients as in the table above, Stata reports the change in the probability for an infinitesimal change in each independent, continuous variable and, by default, the discrete change in the probability for dummy variables. These impacts are evaluated at the mean values of X.

Instead of asking Stata to run the probit model, we substitute dprobit in the command line:

xi: dprobit hivteach age educ female i.location hivinfo
Probit estimates                                        Number of obs =   3074
                                                        LR chi2(6)    = 718.84
                                                        Prob > chi2   = 0.0000
Log likelihood = -1704.2015                             Pseudo R2     = 0.1742
------------------------------------------------------------------------------
  hivt~h |      dF/dx   Std. Err.      z    P>|z|     x-bar  [    95% C.I.   ]
---------+--------------------------------------------------------------------
     age |   .0072661   .0007669     9.46   0.000   25.8351   .005763  .008769
    educ |   .0697898   .0034082    20.14   0.000   8.35198    .06311   .07647
  female*|   .0909303   .0190248     4.78   0.000   .557254   .053642  .128218
_Iloca~1*|   .1141349   .0216304     5.11   0.000   .300586    .07174   .15653
_Iloca~2*|   .0089372    .022787     0.39   0.695   .262199  -.035724  .053599
 hivinfo*|   .0744483   .0212671     3.54   0.000   .727716   .032766  .116131
---------+--------------------------------------------------------------------
  obs. P |   .6040989
 pred. P |   .6330931  (at x-bar)
------------------------------------------------------------------------------
(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0

 

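Here, we can interpret the reported dF/dx values directly as marginal effects, as we did before in the linear probability model. However, these are marginal effects on the probability of reporting YES for an individual with the mean value of each X. For example, for such an individual, one more year of education raises the probability of answering YES by about 7 percentage points.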


The downside to using the dprobit command is that there may not actually be an individual, or a group of individuals, with mean values of X. This means that the marginal effects that you report from this output table may not describe any actual observation in your data set. One way to deal with this is to graphically look at how your sample is distributed within each X-characteristic. If much of the sample is clustered around the mean values of each X, then you are probably safe to use dprobit.
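Another way around this problem is to compute the marginal effect for every observation and then average. The sketch below assumes Stata 11 or later, where the margins command and factor-variable notation are available; the module itself uses dprobit and xi:, so treat this as an alternative rather than the module's method. Factor-variable notation is used instead of xi: because margins would otherwise treat the xi-generated dummies as continuous variables.

```stata
* Sketch only: requires Stata 11 or later (factor variables and margins)
* i. marks dummy/categorical regressors; ib3. makes location 3 the base category
probit hivteach age educ i.female ib3.location i.hivinfo

* Marginal effects evaluated at the means of X (comparable to dprobit)
margins, dydx(*) atmeans

* Average marginal effects: computed for every observation, then averaged
margins, dydx(*)
```

If the two sets of results are close, the choice of evaluation point matters little for your conclusions.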

What have we learned from this exercise? Women are more likely than men to say that HIV positive teachers should remain teaching, as are individuals who have prior information and knowledge about HIV. More education makes one more likely to answer YES to the question, whereas someone living in a rural area is much less likely to want to allow infected teachers to stay in school.

Now it's your turn. Answer the following question:

4. Use the probit model to analyze what factors make an individual more likely to answer YES to the question Q514 "Can people get HIV/AIDS because of witchcraft?". What is the marginal effect of hivinfo in this model?
Question 4 Answer


 
 
An Alternative Way To Think About The Probit Model
 
In many situations, the probit model can be given a latent-variable interpretation. We review this here, because it is sometimes clearer to motivate the use of a probit model in terms of a latent dependent variable. Let's continue with our example of what factors are important for explaining whether an individual thinks an HIV positive teacher should be allowed to continue teaching in school.

The latent variable model is set up as follows:

I = Xb + e

Y = 1     if     I >   c
    = 0     if     I <= c       
 

So, the     Pr(Y=1) = Pr(I>c)
                Pr(Y=0) = Pr(I<=c)     
           

where:               I is the latent variable, which we cannot observe
                         X is our usual set of independent variables
                         Y is the outcome we care about; in this case, the YES answer to our question. We can observe this.
                         c is some cut-off point, or hurdle value.


The motivation for the model is straightforward enough: each individual's X-factors make them more or less likely to answer YES to the question. We can't see how likely they are to answer YES {the I variable}, we only observe the outcome, a yes or no response {the Y variable}. But we can model the probability of a YES answer, by thinking about the probability of the latent variable being above or below a cutoff point. In the case we have been considering, if e follows a standard normal distribution and the cut-off c is set to zero (it cannot be distinguished from the constant in Xb), then Pr(Y=1) = Pr(I>c) = F(Xb), where F(.) is the standard normal CDF - the function which translates our underlying latent variable number into a suitable probability between 0 and 1.
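A small simulation can make the latent-variable story concrete. The sketch below uses made-up data that have nothing to do with the BAIS: the coefficients 0.5 and 1, the seed and the sample size are all arbitrary choices for illustration. It assumes Stata 10.1 or later for rnormal(), and uses preserve and restore so that the survey data in memory are not lost.

```stata
* Sketch only: simulated data, unrelated to the BAIS
* Keep the survey data safe while we simulate
preserve
clear
set seed 12345
set obs 5000

* One explanatory variable and a standard normal error
gen x = rnormal()
gen e = rnormal()

* Latent variable (never observed in practice)
gen I = 0.5 + 1*x + e

* Observed outcome, with cut-off c = 0
gen y = (I > 0)

* The probit estimates should be near 0.5 (constant) and 1 (slope)
probit y x

* Bring the survey data back
restore
```

Even though the probit only sees the 0/1 outcome y, it recovers the coefficients of the latent equation, which is exactly what the latent-variable interpretation says.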

More detail on using the latent variable motivation of the probit model is provided in the Greene (2003) and Johnston and DiNardo (1997) references provided at the end. Right now, let's turn to another model for dummy-dependent variables: the logit model. The use of the logit model may also be motivated in terms of this latent-variable approach.

 

 

 

LOGIT MODELS

In this model, the CDF we use [F(.) in equation {B} above] is the logistic distribution function. This distribution function is a little easier to see written down than the standard normal distribution:

Y = F(Xb) + e
 

where       F(Xb) = exp(Xb)/(1+exp(Xb)) = e^(Xb)/(1+e^(Xb))            (C)
                exp(.) = the exponential function, that is, e raised to the power in brackets


 

We'll use the logit to analyze the effects of age, education, gender and urban residence on the probability of an individual responding YES to the question Q309_1: "Did you use a condom the first time you had sex with your most recent partner?"

The variable we need is called part1con1, and it needs some initial cleaning up:

tab part1con1
tab part1con1, nol
recode part1con1 (2=0)
lab val part1con1 yesno

 

We will also generate an interaction variable for education*gender:

gen femeduc=female*educ
char literacy[omit] 3 
char location[omit] 3
xi: logit part1con1 age female i.location femeduc educ i.literacy
i.location        _Ilocation_1-3      (naturally coded; _Ilocation_3 omitted)
i.literacy        _Iliteracy_1-3      (naturally coded; _Iliteracy_3 omitted)
Iteration 0:   log likelihood = -1139.6791
Iteration 1:   log likelihood = -845.30504
Iteration 2:   log likelihood = -832.06356
Iteration 3:   log likelihood = -831.79186
Iteration 4:   log likelihood = -831.79169
Logit estimates                                   Number of obs   =       1739
                                                  LR chi2(8)      =     615.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -831.79169                       Pseudo R2       =     0.2702
------------------------------------------------------------------------------
   part1con1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.1287028   .0071621   -17.97   0.000    -.1427402   -.1146655
      female |   -1.33864   .3525716    -3.80   0.000    -2.029667   -.6476121
_Ilocation_1 |   .1567167   .1425106     1.10   0.271    -.1225989    .4360323
_Ilocation_2 |   .1860074    .158996     1.17   0.242     -.125619    .4976338
     femeduc |   .0934957    .036732     2.55   0.011     .0215024    .1654891
        educ |   .0273958   .0254095     1.08   0.281     -.022406    .0771975
_Iliteracy_1 |   1.271578   .4779464     2.66   0.008     .3348202    2.208336
_Iliteracy_2 |   .7676351   .4840312     1.59   0.113    -.1810485    1.716319
       _cons |   3.422056   .5161311     6.63   0.000     2.410457    4.433654
------------------------------------------------------------------------------

 

Below are some notes related to the logit output:

 

Logit Output - Note 1

Again, we see the iteration steps that Stata goes through in order to get the estimated coefficients. This is because Stata finds the logit coefficients using maximum likelihood techniques.

 

Logit Output - Note 2

Interpreting coefficients:

(i) SIGN: older individuals, and women, are much less likely to report that they used a condom the first time they had sex with their most recent partner. More literacy is particularly associated with an increased probability of reporting use of a condom at first sex with most recent partner. The coefficient on the female*education interaction term is also positive: education raises the probability of reporting YES by more for women than for men, so the gap between women and men narrows as education increases. The coefficients on both of the non-rural variables are positive as well, although neither is statistically significant.

(ii) SIZE: there are a couple of ways that we can report the size of the logit coefficients, but none are as easy as using the dprobit command!

The rule we will use is the following:

dY/dX = b*F(Xb)*(1-F(Xb))

where         F(Xb)       = probability that response was YES
                  (1-F(Xb))  = probability that response was NO
                   b              = the coefficient of interest.


Now, we could use a number of values for F(Xb):

(a) We could calculate the marginal effect of a change in one X variable, where we use the sample proportion of actual YES
answers in place of F(Xb), and the sample proportion of actual NO answers in place of (1-F(Xb)).
 

sum part1con1 
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   part1con1 |      2054    .5754625     .494393          0          1

 

Here, the sample mean is .58. We can do a back-of-the envelope calculation to find the marginal effect of being literate
(_Iliteracy_1==1) on the probability of responding YES:
 

. di 1.271578*.5754625*(1-.5754625)
.31065339

 

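Thus, the approximate increase in the probability of reporting YES, evaluated at the sample proportion, is about 31 percentage points! (Strictly, the rule above is a derivative, so for a dummy variable such as literacy it is only an approximation to the effect of switching from 0 to 1.) Literacy seems to matter a lot for determining whether a condom is used the first time an individual has sex with their most recent partner. We will check more rigorously for statistical significance below.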

(b) Another way to evaluate the marginal effects of a change in one of the X variables is to use the mean of the predicted
values for F(Xb) instead of the actual sample mean.
 

predict part1con1hat
summ part1con1hat
 
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
part1con1hat |      3445    .7289634     .281968   .0039366   .9797294

 

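Notice that the predicted mean is rather different from the sample mean. Part of the reason is that predict generates a value for every observation with non-missing X variables (3445 here), not only for the 1739 observations used in estimation. In this case, the marginal effect of being literate is: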

di 1.271578*.7289634 *(1-.7289634 )
.25123299

 

This prediction is very different to the first, because the values of X at which we are predicting are different. If we
wanted to, we could evaluate F(Xb) at the smallest value of X, or the largest values of X, or the mean values of X, and find
different predictions in each case. Your choice of which marginal effect to report should be guided by the point of your
study. However, many researchers will discuss marginal effects of particular interest for mean values of X in the
distribution.
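One way to keep calculations (a) and (b) on the same set of observations is to restrict both to the estimation sample. This is a sketch, assuming it is run directly after the logit; phat_insample is a hypothetical variable name.

```stata
* Sketch only: run directly after the logit
* (a) sample proportion of YES, restricted to the estimation sample
summ part1con1 if e(sample)
display _b[_Iliteracy_1]*r(mean)*(1-r(mean))

* (b) mean predicted probability over the same observations
predict phat_insample if e(sample)
summ phat_insample
display _b[_Iliteracy_1]*r(mean)*(1-r(mean))
```

For a logit that includes a constant, the mean predicted probability over the estimation sample equals the sample proportion of YES answers, so within the estimation sample the two approaches give the same number. The gap between .31 and .25 in the text comes from using different sets of observations for the two means.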
 


(c) A third way of interpreting the output is in terms of odds and odds ratios. The odds tell us how much more likely it is for an individual to report YES than to report NO. Thus, we can write:
 

Pr(YES)/Pr(NO) 
= p/(1-p) 
= F(Xb)/(1-F(Xb))
= exp(Xb)			(**)

where	p = our shorthand for Pr(YES)
	F(Xb) = probability of reporting YES based on our underlying score, Xb
	exp(Xb) = the expression in (**) evaluated using the logistic CDF given in (C) above.
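In short, exponentiating a coefficient gives an odds ratio: the factor by which the odds of reporting YES are multiplied when that X variable increases by one unit, holding the other variables fixed. If we wanted to know how much higher the odds of reporting YES are for a literate person (_Iliteracy_1==1) than for a person in the omitted literacy category, we could simply exponentiate the beta-coefficient on _Iliteracy_1: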
di exp(1.271578)
3.566476

 

Thus, the odds that a literate person reports YES, they had used a condom the first time they had sex with their most recent partner, are about 3 and 1/2 times the odds for an otherwise identical person in the omitted literacy category. If you are familiar with the log function, there is another interpretation you might appreciate: each logit coefficient is the change in the log of the odds, ln(p/(1-p)), for a one-unit change in that X.

Reporting odds ratios is something that Stata can do pretty easily. If we add the or option (which stands for odds ratio) after a comma at the end of the logit command above, we get the following:
 

xi: logit part1con1 age female i.location femeduc educ i.literacy, or
Logit estimates                                   Number of obs   =       1739
                                                  LR chi2(8)      =     615.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -831.79169                       Pseudo R2       =     0.2702
------------------------------------------------------------------------------
   part1con1 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .8792352   .0062971   -17.97   0.000     .8669793    .8916644
      female |   .2622021    .092445    -3.80   0.000     .1313792    .5232938
_Ilocation_1 |   1.169664   .1666895     1.10   0.271     .8846184    1.546559
_Ilocation_2 |   1.204431   .1914997     1.17   0.242     .8819508    1.644825
     femeduc |   1.098006   .0403319     2.55   0.011     1.021735     1.17997
        educ |   1.027774   .0261153     1.08   0.281     .9778432    1.080255
_Iliteracy_1 |   3.566476   1.704584     2.66   0.008     1.397689    9.100557
_Iliteracy_2 |   2.154665   1.042925     1.59   0.113     .8343949    5.564008
------------------------------------------------------------------------------

 

Notice that the odds ratio on _Iliteracy_1 is the one we have just calculated above!

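So interpreting the rest of these odds ratios should be straightforward, as long as we remember that each one compares the odds of reporting YES across groups, not YES against NO. For example, the odds of reporting YES for women are about .26 times the odds for men (this is at zero years of education, because of the interaction term), whereas the odds of reporting YES for someone living in an urban area are about 1.17 times the odds for someone in a rural area, a difference which is not statistically significant.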
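Because of the interaction term, the odds ratio for women depends on the level of education. The sketch below, which assumes it is run directly after the logit, shows one way to get it at a chosen education level; the value of 10 years is an arbitrary choice for illustration.

```stata
* Sketch only: run directly after the logit (with or without the or option)
* Odds ratio for literacy category 1 relative to the omitted category 3
display exp(_b[_Iliteracy_1])

* Odds ratio for women relative to men at 10 years of education:
* on the log-odds scale the female effect is _b[female] + 10*_b[femeduc]
lincom female + 10*femeduc, or
```

With the coefficients printed above, the second result is exp(-1.33864 + 10*.0934957), or roughly 0.67, noticeably closer to 1 than the .26 that applies at zero years of education. lincom also reports a confidence interval for this combined odds ratio.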
 


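(iii) SIGNIFICANCE: age, female and the first literacy variable (_Iliteracy_1) are all significant at the 1% level, and the female*education interaction is significant at the 5% level. Significance at the 1% level means that, if the true coefficient were zero, we would expect to see an estimate this far from zero less than 1% of the time. The other coefficients are not significantly different from zero, and this might be something to worry about. Are there other variables you can think about which might affect whether someone uses a condom the first time they have sex with a most recent partner? How about their current marital status?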


Try the following exercises to be sure that you are comfortable with interpreting logit output.

5. Using the output from the initial logit regression (the one not in odds ratio form), calculate the marginal effect of being female on the probability of reporting YES to the question. Do this in each of the three ways we have discussed, and be sure to think about how the interaction term affects your calculation.
Question 5 Answer


6. What is the effect of adding a dummy variable into the logit model, where the dummy ==1 if currently married, and ==0 if not?
Do any of the other coefficients change? How can you interpret this coefficient? Is it statistically different from zero?
Question 6 Answer


 

Logit Output - Note 3

GOODNESS OF FIT: As in the probit analysis, the statistic used to test for joint significance of all variables is the likelihood ratio (LR) test. The LR statistic is presented in the upper right hand corner of the logit output table, where you can also see the 8 degrees of freedom (because we have included 8 independent variables in the model). The Prob > chi2 = 0.0000 implies that we can decisively reject the null hypothesis that all of the slope coefficients are jointly equal to zero.

Another way to test how good the logit model is (and for that matter the probit model too) involves calculating the percent of outcomes correctly predicted. Once we have predicted the values from the model, we can count up how many are correctly classified as YES's (prediction>=0.5) and how many are correctly classified as NO's (prediction<0.5).

Try the following syntax:
 

tab part1con1
count if part1con1hat>=0.5&part1con1==1
count if part1con1hat<0.5&part1con1==0

Now we can compute a weighted average of correctly predicted values from this information using the following formula:

Ave correct predictions

= (actual YES/total sample)*(correctly predicted YES/actual YES) + (actual NO/total sample)*(correctly predicted NO/actual NO)

= (correctly predicted YES + correctly predicted NO)/(total sample)

di "the weighted average of correctly predicted values is " 1056/2054 + 377/2054
the weighted average of correctly predicted values is .6976631

 

This means that in almost 70% of cases, the model predicts the outcome correctly. It is generally up to the researcher to decide whether this is a satisfactory prediction result or not.
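One caution when counting by hand: Stata treats a missing value as larger than any number, so the condition part1con1hat>=0.5 is also true when the prediction is missing. That can happen here, since 2054 respondents answered the question but only 1739 were used in the logit. The sketch below guards against this, and also shows the built-in classification table, assuming Stata 9 or later where it is called estat classification.

```stata
* Sketch only: assumes part1con1hat was created by predict after the logit
* Correct YES and correct NO, excluding missing predictions
count if part1con1hat >= 0.5 & part1con1hat < . & part1con1 == 1
count if part1con1hat < 0.5 & part1con1 == 0

* Denominator: observations with both an outcome and a prediction
count if part1con1hat < . & part1con1 < .

* Built-in alternative: classification table for the estimation sample
xi: logit part1con1 age female i.location femeduc educ i.literacy
estat classification
```

The counts from this version may differ from the 1056 and 377 used in the text if some respondents have an answer to the question but a missing X variable.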

 

 

 

 

MODEL SELECTION: LOGIT OR PROBIT?
 

The difference between the two models is minimal; if you consider the graph of the logistic and standard normal distribution above, you can see the shapes of the two are very similar, and nearly identical in the middle. The main difference is that the logistic distribution has slightly fatter tails - there is more probability mass at the end points of the distribution. The choice of which model to use is really one of preference, as both will provide similar predicted probabilities and marginal effects (the raw coefficients are on different scales).
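The two distribution functions can be drawn on the same axes to see how similar they are. This is a sketch with an arbitrary plotting range of -4 to 4; normal() and invlogit() are Stata's functions for the standard normal CDF and the logistic CDF.

```stata
* Sketch only: plot the two CDFs over the same range
twoway (function y = normal(x), range(-4 4)) (function y = invlogit(x), range(-4 4)), legend(order(1 "Standard normal CDF" 2 "Logistic CDF")) ytitle("F(x)") xtitle("x")
```

The logistic curve is more spread out because the standard logistic distribution has a larger variance than the standard normal. This is why logit coefficients are usually larger in absolute value than probit coefficients from the same specification, commonly by a factor of roughly 1.6 to 1.8, even when the predicted probabilities are almost the same.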

To see how similar the predictions from these two models are, we can return to our hivteach example and run the same specification in the logit framework, find the predictions and graph out the actual observed answers to the question (hivteach), and the predictions from the LPM (teachhat), the probit (teachhatprob) and the logit.
 

xi: logit hivteach age educ female i.location hivinfo
predict teachhatlog

lab var teachhatprob "Pr(hivteach) from the probit" /*this labels the variable so it looks good in the graph!*/
lab var teachhatlog "Pr(hivteach) from the logit"
lab var teachhat  "Pr(hivteach) from the LPM"

scatter hivteach teachhat teachhatprob teachhatlog num, ti("Fig 5: Actual and predicted values of HIVTEACH")

 

                                                    Figure 5

 

The LPM, probit and logit predictions fall neatly almost on top of each other!


 

7. Consider the logit model you have just run for the question about hivteach. Interpret the marginal effect of moving to an urban area. Do you get similar results to the probit? What about to the LPM?
Question 7 Answer
 

 

Why linear regression can sometimes be a useful sensitivity check

You have seen that interpreting the logit and probit model output is not always easy. The betas in the LPM are much more
intuitive to think about. Even though the LPM has those problems mentioned above, it is still sometimes useful to start off
using this model to analyze your data. Often, a paper will report coefficients from an LPM and a logit model, or the LPM and
a probit model. If the specification is correct (the right X-variables are included, and no extra unnecessary X's are
included), then the two models should not produce wildly different answers: the betas should not all be switching sign, or
jumping around in magnitude. Thus, comparing the LPM to the logit or probit output (that is, if you compute the marginal
effects in each of the three models at the mean values of X) should serve as a loose specification check, and using the LPM
to start out your research is a good diagnostic tool - if you find ridiculous results with the LPM, most likely your results will
still be ridiculous when you turn to the probit or logit.

 

One more problem with dummy dependent variable models: heteroscedasticity

Without going into too much econometric detail, it is important to raise the subject of heteroscedasticity. When a Y variable is a dummy variable, it can only take on two possible values, and this leads to problems of non-constant variance of the error term (if this sentence is Greek to you, ignore this section and rather read the introductory chapters on dummy dependent variables in a text book like Gujarati, given in the reference list below). The point is that if we run the LPM, logit or probit model without bearing this problem in mind, we will have incorrect standard errors in our output tables. They will be biased, and are often too small. This means that we could interpret coefficients as significant, when in fact they may not be.

Stata can correct the standard errors for us, to deal with this heteroscedasticity, if we add the robust option at the end of the estimation command:

xi: logit hivteach age educ female i.location hivinfo, robust

You can compare your results from this table to the one without the robust option, and you'll notice that the only thing that changes is the standard errors: some of them become larger. While this correction works well in the LPM framework, the heteroscedasticity problem is a little more complicated in the logit and probit models. You are referred to chapter 13 in Johnston and DiNardo, for further details.
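To make the comparison easier, the two sets of results can be stored and displayed side by side. This is a sketch; the names plain and rob are arbitrary labels chosen for this illustration.

```stata
* Sketch only: store the logit with and without robust standard errors
xi: logit hivteach age educ female i.location hivinfo
estimates store plain
xi: logit hivteach age educ female i.location hivinfo, robust
estimates store rob

* Coefficients should match across columns; only the standard errors differ
estimates table plain rob, b(%9.4f) se(%9.4f)
```

Each cell of the table shows a coefficient with its standard error underneath, so any change in the standard errors is easy to spot.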

 

 

Now that you have covered the set of most popular models for dummy dependent variable analysis, use the LPM, the probit and the logit model to answer the following questions.

8. We want to know whether households which have been affected by family illnesses or deaths or an influx of orphans in the past year have access to support networks. In particular, we'd like to find out whether female-headed households are more or less likely to get support, whether the size of the household matters for getting assistance, and whether households in some parts of the country are more likely to be in the vulnerable no-support category.

Construct a measure of household support (using outsidehelp1, outsidehelp2, hhorph1), a measure of household size (using egen), a female head dummy (this is tough!), a proportion of workers in the household variable (using egen, work and hhsize) and a proxy for wealth (using toilettype==1 and transport1==1). Then, restrict your data to one observation per household, and run each of the models we have covered, using hhsupport as the dependent variable.

i) interpret the sign of your coefficients in each model. Are they consistent across models?

ii) interpret the significance of your coefficients. Is there consistency across models? {Note here, that the sample size for this household level analysis gets really small, and so significance becomes an issue. Generally, you want to be careful not to use very complicated models with very few data points, because it becomes more and more difficult to say anything about marginal effects with confidence.}

iii) Consider the female head variable in more detail. Interpret the marginal effect of having a female head in the household, on the probability of getting outside support for the household. Do this in each of the three models. Are your answers somewhat consistent? Is the sign and size of this coefficient something that you would have expected?

iv) Construct the predicted values from each of the models, and calculate the percentage correctly predicted by each model. Which model performs best, in terms of getting the highest proportion of correct predictions?

Question 8 Answer

 

 

 

ADDITIONAL RESOURCES

Greene, W. 1997. Econometric Analysis. (3e) Prentice-Hall. [There are also more recent 4th and 5th editions now available]

Gujarati, Damodar. 2003. Basic Econometrics. McGraw-Hill. [Any edition of this introductory econometrics textbook is useful; some of the material for this module was based on content from the second edition, chapter 15: "Regression on dummy-dependent variables"]

Johnston, J. and DiNardo, J. 1997. Econometric Methods (international edition). McGraw-Hill. [Ch 13: Discrete and limited dependent variable models]

 

 
