TABLE OF CONTENTS
Introduction
Linear Probability Models
Basic Issues in Discrete Outcome Analysis
Probit Models
Logit Models
Model Selection: Logit or Probit
Additional Resources
INTRODUCTION
As you may have noticed from looking at the questionnaires in the BAIS, many
questions capture information which cannot be expressed in the form of a
continuous variable. Let's open the data and the individual questionnaire, and
have a look at some cases.
There are questions like:
Example 1:
Q304: "What is your relationship to [NAME OF MOST RECENT/NEXT MOST RECENT
PARTNER]?"
Answer:
......Husband/Wife
......Girlfriend/boyfriend/not living with you
......Someone whom you paid or who paid you for sex
......Casual acquaintance
......Other
{Variable names: part1, part2, part3}
Example 2:
Q309: "Did you use a condom the first time you had sex with this partner?"
Answer:
......YES/NO/Don't recall
{Variable names: part1con1, part2con1, part3con1}
Example 3:
Q314: "The last time you had sex, did you or this partner do anything to delay
or avoid pregnancy?"
Answer:
......YES/NO/Don't recall
{Variable names: part1preg, part2preg, part3preg}
Example 4:
Q604: "If a teacher has HIV/AIDS but is not sick, should she be allowed to
continue teaching in school?"
Answer:
......YES/NO/Don't know
{Variable name: hivteach}
These variables are coded either as dummy variables, as in the case of
hivteach, or as categorical variables, as in the case of the part1, part2 and
part3 variables. Even though they are coded numerically (check this with
"tab varname, nolabel"), the numbers have no natural meaning. There is, for
example, no way to numerically rank the answers to any of the questions above.
Many of these types of variables are likely to answer interesting questions we
might have. For example, we might care about the factors which are important for
explaining why someone uses a condom the first time they have sex with a partner.
Here, the dependent variable is 0 if the respondent did not use a condom the
first time they had sex with this partner, and 1 if they did. Or, we might care
about the factors affecting whether a person thinks that HIV-positive teachers
should be allowed to continue teaching. In this case, the Y variable equals 0
for a NO response and 1 for a YES response.
Although we can identify X and Y variables which are important for answering
policy-related questions, we will see shortly that the linear regression
framework that we have been using to link the dependent and independent
variables in a meaningful way is not really the right model to use when we
have categorical outcome variables.
Before we continue, try the next question:
1. Can you locate questions at the individual and household level which
generate data in categorical variable format? Which are coded as dummy
variables?
Question 1 Answer
As focus questions for today, we will try to understand the relationship
between X variables including:
(i) observable features of individuals (e.g. gender, age, education)
(ii) observable features of their homes (e.g. access to basic services) and
(iii) features of a community (e.g. access to information services);
and Y variables, or outcome variables, which are connected to:
(i) risky sexual behavior and
(ii) individual treatment of and attitudes towards others with HIV/AIDS.
Most of the outcome variables will be categorical or dummy variables. Finding
out more about what factors influence the decisions to undertake more or less
risky behavior in the face of the AIDS epidemic is certainly useful information
for policy-makers. In addition, the government is concerned about not
ostracizing people with HIV/AIDS from society, and so determinants of attitudes
towards and treatment of others are also of prime importance for policy-makers
to understand. We hope to be able to analyze the relationship between
information that individuals report having about AIDS, and their attitudes
towards and treatment of others with HIV/AIDS.
In this module, we will describe the basic problem with using the linear
regression framework to model categorical outcome variables, and introduce the
two types of models used instead: the logit and probit models. Both of these
models deal better with dummy dependent variables. We can use these models to
answer many of the following questions which we might be interested in:
- Are some districts more likely than others to want HIV-positive teachers to continue teaching?
- Are people who are more educated likely to want to be more open about the incidence of HIV in their families?
- Are individuals who have children of school-going age more or less likely to want HIV positive teachers to continue teaching in schools?
- To what extent is the decision to use a condom for the first sex with a new partner affected by how much information an individual has about risk factors for contracting STDs and HIV?
- Does using alcohol exacerbate risky behavior, in terms of reducing the use of condoms during sex?
- Are pregnant women more likely to be offered information about HIV or an HIV test in clinics in different parts of the country?
- What factors are important for determining whether households received outside support in response to a family illness or death? Are female headed households more or less likely to receive support from outside of the household during these times of financial and emotional hardship?
- Are female headed households more or less likely to take on orphans?
Before we get to the logit and probit models, we look more carefully at why
the linear regression framework doesn't always work well for modeling outcomes
which are binary (either 0 or 1, in the case of dummy variables). To do this, we
will consider the Linear Probability Model.
LINEAR PROBABILITY MODELS (LPM)
Stata will run a linear regression on any variables you give to it: as long
as there is one Y variable and at least one X variable, the regression
coefficients will be generated. However, to get sensible results out of the linear regression, you need to put
sensible things into it!
Let's try this example. Suppose there is a particular group of individuals (characterized
by age group, gender, or district of residence perhaps) that we suspect is not
getting enough information about HIV/AIDS. There is a question in the survey
which asks "Have you heard or seen information about HIV/AIDS?", and the answers
are captured as Yes==1, No==2 and User-missing==7. Open
bais.dta to check this.
First, let's get the variable into shape to serve as a dummy variable:
lab def yesno 0 "NO" 1 "YES"
tab hivinfo, nol
recode hivinfo (2=0) (7=.)
lab val hivinfo yesno
We'll also create a female dummy:
gen female=.
replace female=0 if gender==1
replace female=1 if gender==2
lab def female 0 "Male" 1 "Female"
lab val female female
keep if rec_per==1
Since we are going to focus on individual-level information, we can select only those observations for which the individual questionnaire was answered. Note that, if we need any household level variables in our analysis (like household size), we would need to re-open the data and start with the entire sample again.
As a preliminary investigation, we might ask: do females have more information about HIV than males? Do people in rural areas have less information? Do individuals with more education also have more information? We can investigate some of these bivariate relationships using simple tab, sum commands:
tab educ, sum(hivinfo)
number of | Summary of have you heard or seen
years in | info about HIV
school | Mean Std. Dev. Freq.
------------+------------------------------------
1 | .77142857 .42604296 35
2 | .6025641 .49253502 78
3 | .6173913 .48815098 115
4 | .67379679 .47008123 187
5 | .63829787 .48177621 188
6 | .64705882 .47876551 272
7 | .6916221 .46223568 561
8 | .6984127 .46016604 189
9 | .72508591 .44685504 582
10 | .71717172 .45094341 396
11 | .77027027 .42353041 74
12 | .82716049 .37869332 324
13 | .82352941 .38501337 51
14 | .95454545 .21070705 44
15 | .89583333 .30870928 48
16 | .96296296 .19245009 27
17 | .88235294 .32703497 34
18 | 1 0 8
19 | .83333333 .40824829 6
20 | 1 0 9
21 | 1 0 3
22 | .66666667 .57735027 3
23 | 1 0 1
25 | 1 0 1
------------+------------------------------------
Total | .72002472 .44905616 3236
tab location, sum(hivinfo)
| Summary of have you heard or seen
location of | info about HIV
household | Mean Std. Dev. Freq.
------------+------------------------------------
Urban | .80585366 .39573517 1025
Urban Vil | .708061 .45490223 918
Rural | .67206478 .46959691 1729
------------+------------------------------------
Total | .71840959 .44983593 3672
tab female, sum(hivinfo)
| Summary of have you heard or seen
| info about HIV
female | Mean Std. Dev. Freq.
------------+------------------------------------
Male | .72948328 .44436185 1645
Female | .70942279 .45414077 2027
------------+------------------------------------
Total | .71840959 .44983593 3672
Each of these commands gives us the proportion of people in each category who answered YES to the question.
For example, in the table above, almost 71% of females have heard or seen
information about HIV. This is the beauty of dummy variables: any average which
is generated out of values of 1's (yes's) and 0's (no's) will be the proportion
of 1's in your sample. [Note that we would have generated very different results
if we had not first recoded the hivinfo variable to be a dummy variable.] These
tabs give us some picture of the factors affecting HIV knowledge, but we want
to answer the question more rigorously, controlling for all the other variables
we think might matter at the same time.
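To see why the recoding matters, here is a minimal sketch using five made-up answers rather than the BAIS data (the variable name answer and its values are purely illustrative). With the original 1/2 coding the mean is just an average of codes; once NO is recoded to 0, the mean is the share of YES answers.

```stata
* Toy data: five hypothetical answers coded as in the survey (1 = YES, 2 = NO)
clear
input byte answer
1
1
2
1
2
end

* With the 1/2 coding, the mean is an average of codes, not a proportion
summarize answer
scalar mean_raw = r(mean)

* Recode NO from 2 to 0 so that the variable becomes a 0/1 dummy
recode answer (2=0)

* Now the mean equals the share of YES answers (3 out of 5)
summarize answer
scalar mean_dummy = r(mean)

display "mean with 1/2 coding: " mean_raw
display "mean of the 0/1 dummy: " mean_dummy
```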
Remembering what we know about how to generate dummy variables out of
categorical variables, let's run the following regression. We are not saying
here that being of a certain age or education determines whether you have
information about HIV or not. Rather, you can think of the regression output as
telling us how much more or less likely a person is to have information about
HIV, given that they have a certain characteristic (lives in a certain area, is
of a certain age, etc.).
xi: reg hivinfo age female educ i.location
Source | SS df MS Number of obs = 3236
-------------+------------------------------ F( 5, 3230) = 32.74
Model | 31.4682868 5 6.29365736 Prob > F = 0.0000
Residual | 620.874111 3230 .192221087 R-squared = 0.0482
-------------+------------------------------ Adj R-squared = 0.0468
Total | 652.342398 3235 .201651437 Root MSE = .43843
------------------------------------------------------------------------------
hivinfo | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0044326 .0006194 7.16 0.000 .0032182 .005647
female | -.0156337 .0156182 -1.00 0.317 -.0462562 .0149888
educ | .0171452 .0023696 7.24 0.000 .0124992 .0217913
_Ilocation_2 | -.0772342 .0208552 -3.70 0.000 -.118125 -.0363434
_Ilocation_3 | -.1056572 .018639 -5.67 0.000 -.1422026 -.0691117
_cons | .538162 .031462 17.11 0.000 .4764744 .5998495
------------------------------------------------------------------------------
Before we look at the output, remember how we interpreted coefficients in the
linear regression model with continuous variables? We said, for example, for
each year of additional education that an individual has, the age of first sex
changes by b_ed years, where b_ed is the estimated coefficient on education.
Here, the outcome variable is either YES (1) or NO (0). So we cannot interpret
the coefficients as the change in Y for a given change in X, because Y can only
take on two values. Instead, we will think about how a change in each variable
affects the PROBABILITY that a person reported YES.
For example, the _Ilocation_3 variable has a coefficient of about -0.11. This
means that compared to the base group (location==1, or urban area), someone
living in a rural area is about 11 percentage points less likely to report YES,
that they have heard or seen some HIV information. Consider the age coefficient:
it tells us that for each additional year of age, the probability of a person
having heard some information about HIV rises by 0.44 percentage points. A
twenty-five year gap between two otherwise identical individuals will mean that
the older person is about 25*0.0044=0.11, or 11 percentage points, more likely
to report YES.
This is why linear regression with a dummy dependent variable is known as a
Linear Probability Model - because the predicted outcome is a probability
number, while the form of the relationship between the X and Y variables is
assumed to be linear.
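The arithmetic behind this kind of interpretation can be reproduced with Stata's display command. The sketch below plugs the unrounded coefficients from the hivinfo regression into the fitted line for two hypothetical urban men with 10 years of education, one aged 20 and one aged 45. The scalar names are our own, chosen for illustration.

```stata
* Coefficients copied from the LPM output for hivinfo (urban is the base location)
scalar b_cons = .538162
scalar b_age  = .0044326
scalar b_educ = .0171452

* Fitted probability of YES for an urban male (female = 0) with 10 years of education
scalar p_age20 = b_cons + b_age*20 + b_educ*10
scalar p_age45 = b_cons + b_age*45 + b_educ*10

display "predicted probability at age 20: " p_age20
display "predicted probability at age 45: " p_age45

* In the LPM this gap is 25 times the age coefficient, whatever the other X values are
display "difference: " p_age45 - p_age20
```

Because the model is linear, the 25-year gap adds the same amount to the predicted probability for every type of individual.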
Notice the R-squared here; it's really small. This means that we cannot explain
much of the variation in the YES/NO answer to this question, if we use these
particular X-variables. However, since we are using a dummy dependent variable,
we should not expect the R-squared to be very high. Why is this? Since the
actual values are only ever 0 or 1, and the predicted values can be ANY number,
including non-integers, the fit of the regression line to the actual data will
never be exceptionally good. We'll see a picture further down, which will
clarify this explanation.
Before we move on, are there any other possible X-variables which might play a
role in affecting whether an individual has heard/seen any information about
HIV?
Now, to illustrate the problem with linear regression in the context of
dummy-dependent variables, let's predict the outcome variable for each
individual:
predict infohat
summ infohat
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
infohat | 3448 .7127905 .1016177 .4783424 1.163583
Note that the mean of infohat is 0.713. What information are we getting from this statistic? Simply that the predicted proportion of individuals in the sample who answered yes to the question is about 0.71. The actual mean in the sample (the proportion who actually did answer YES) was also about 0.71.
Now, remember that the outcome variable could only take on values of 0 and 1.
What has happened in the predictions? We have predicted values of PROBABILITY
which are greater than 1! The maximum value for infohat, which is a probability
number, is 1.1635. In fact, with certain values for some of the X-variables, we
might have predicted a negative infohat. Neither of these situations is
plausible (probability greater than 1, or negative), and this is the central
problem with using linear regression for dummy dependent outcome variables.
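The same problem can be reproduced on simulated data, which suggests it is a feature of the model rather than of the BAIS. This is only a sketch: the data-generating process, the seed and the variable names x, y and yhat are assumptions made for illustration, and rnormal() and runiform() require a reasonably recent version of Stata.

```stata
* Simulate a binary outcome whose true probability is an S-shaped function of x
clear
set seed 12345
set obs 1000
gen x = rnormal()
gen y = runiform() < normal(2*x)

* Fit the LPM and store the fitted values
regress y x
predict yhat

* Count fitted "probabilities" that fall outside the [0,1] interval
count if (yhat < 0 | yhat > 1) & !missing(yhat)
```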
Let's try to see this problem with another example, which deals with how
individual characteristics and access to information affect attitudes. Suppose we want to know what
factors affect whether an individual would want an HIV-positive teacher to remain
teaching in school. What kinds of X-variables do you think will matter?
Let's do some cleaning first. The variable we need is hivteach:
tab hivteach, nol
recode hivteach (2=0) (9=.)
lab val hivteach yesno
xi: reg hivteach age female educ i.location
predict teachhat
Source | SS df MS Number of obs = 3077
-------------+------------------------------ F( 5, 3071) = 149.68
Model | 144.235258 5 28.8470516 Prob > F = 0.0000
Residual | 591.83949 3071 .192718818 R-squared = 0.1960
-------------+------------------------------ Adj R-squared = 0.1946
Total | 736.074748 3076 .239296082 Root MSE = .439
------------------------------------------------------------------------------
hivteach | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0059667 .0006419 9.30 0.000 .0047082 .0072252
female | .087467 .0160206 5.46 0.000 .0560547 .1188792
educ | .0545308 .0024291 22.45 0.000 .0497681 .0592936
_Ilocation_2 | -.0953094 .0213145 -4.47 0.000 -.1371015 -.0535172
_Ilocation_3 | -.1068599 .0191193 -5.59 0.000 -.1443478 -.0693719
_cons | .017347 .0322079 0.54 0.590 -.0458043 .0804983
------------------------------------------------------------------------------
How do we interpret these coefficients?
Each coefficient gives us the change in the conditional probability of a YES response for a one-unit change in the explanatory variable. So,
- being female increases the probability that you will answer YES to the question by almost 9 percentage points
- being in an urban village makes you about 10 percentage points less likely to respond YES, while being in a rural area makes you about 11 percentage points less likely to respond YES, relative to individuals living in urban areas
- each additional year of education increases the probability of you responding YES by about 5 percentage points
If we look at the predicted values from our regression:
summ teachhat
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
teachhat | 3448 .5844359 .2233902 .0246846 1.560158
We find the maximum prediction is again above 1. There is nothing about the
structure of the LPM which restricts predicted values to be within suitable
bounds for probabilities: between 0 and 1. Let's see this in graphical form.
sort hivteach teachhat
gen num=_n /*this generates a numerical label for each prediction*/
scatter hivteach teachhat num
scatter hivteach teachhat num, ylabel(0(0.25)1.5) xlabel(0(500)4000) ti("Fig 1: responses, & predicted probabilities from the LPM")
The graph without the frills looks like this:

Figure 1
While if we run the second version, we get this more polished graph:

Figure 2
What are the two flat blue lines in the graph? They represent the actual answers that people gave to the question: "Should HIV positive teachers be allowed to continue working in schools?" These responses could only be coded as 0 or 1. They are two flat lines in the graph, because we sorted the data first, by hivteach and teachhat. The sort command in Stata will sort the observations along whichever variable you specify first: in this case, all the 0's were ordered before all the 1's, and then within each group (the 0's, and 1's), the data was further sorted by the value of teachhat. You can check this is the case by doing
browse hivteach teachhat, nol
Notice that some of the predicted values for Y are outside of the [0,1] range. We could adjust these particular values by hand: that is, for all those observations which have predicted values above 1, set their predictions equal to 1 (and similarly, set the prediction equal to 0 if the prediction is negative). Alternatively, we could choose to use a model which does not allow any predictions to be outside of the [0,1] interval. There are indeed several widely-used models which solve this problem for us. They are what we turn to next. We will come back to how the LPM can still be useful, in conjunction with any of the following alternatives.
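If we did want to adjust out-of-range predictions by hand, one way to do it is sketched below on four illustrative values. Two are taken from the summary of teachhat above and two are invented, including a negative one. The variable name teachhat_clip is our own.

```stata
* Four illustrative predictions; -.05 and .5 are invented for the example
clear
input double teachhat
-.05
.0246846
.5
1.560158
end

* Pull anything below 0 up to 0 and anything above 1 down to 1.
* The if-condition keeps missing predictions missing, because max() ignores missing values.
gen double teachhat_clip = min(max(teachhat, 0), 1) if !missing(teachhat)
list teachhat teachhat_clip
```

This only patches the symptom. The models introduced next avoid the problem by construction.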
Before we continue though, try the following questions.
- 2. Is there reason to expect that individuals would have different attitudes on the hivteach question, depending on which district they live in? Are there some districts you would expect to be more accepting of teachers with HIV? Test your prior expectations, by running a version of the above model which uses the district variable instead of location. Do we get much more information out of running the regression in this way, or are location and district highly correlated?
- Question 2 Answer
- 3. Do you think that students would have particularly strong feelings for or against allowing HIV positive teachers to continue in school? Create a student dummy variable using the information in labor market status (lmstat), and include this as a regressor in your linear probability model. Are students more or less likely to answer the question in the affirmative? Is this relationship between being a student and the response to the question a statistically significant relationship?
- Question 3 Answer
BASIC ISSUES IN DISCRETE OUTCOME ANALYSIS
Instead of modeling the relationship between X and Y as linear, we want a
function that will impose a non-linear relationship between the independent and
dependent variables, and will restrict the predicted values of Y to be between
0 and 1. The functions we will use are specific cases of cumulative distribution
functions, or CDFs. A CDF gives the probability that a random variable takes a
value less than or equal to a given number, so its values always lie between 0
and 1. The logit model is based on the logistic distribution function, while the
probit model is based on the familiar normal distribution function. Both of
these models have attractive properties which the LPM does not:
- as X increases, the predicted probability that Y=1 increases but never moves outside of the [0,1] interval; and
- the relationship between the Y variable and the X variables is assumed to be non-linear. That is, when there is a positive relationship between X and Y, the predicted probability for a particular observation approaches 0 more and more slowly as Xi gets smaller, and approaches 1 more and more slowly as Xi gets larger. In calculus-speak, this means dY/dX is not constant.
It is perhaps easiest to see what is going on with each of these models, in
contrast to the LPM, in the figure below:

[Jump forward to discussion of model selection if you are revisiting this graph.]
Notice that the way our data points are arranged, there is a positive relationship between X and Y: higher values of X are associated with higher values of Y, even though Y can only be 0 or 1. The LPM simply fits a straight line through the points, and so may generate predicted values which are outside of the [0,1] range for probabilities. The logit and probit, on the other hand, 'squash' the relationship between X and Y to fit inside the [0,1] range. Consider the right hand extreme of these lines: as we approach Y=1, any changes in X have almost no impact on the probability of Y =1, since Y is already almost there. In this way, the logit and probit models generate sensible predictions for probability numbers.
The main difference between the logistic CDF and the normal CDF is that the
former has slightly fatter tails (there is more mass in the tails of the
distribution than in the normal distribution). The choice of which to use is
generally a matter of taste or convenience of interpretation. We will return to
a discussion of choice of model below. However, before we go into the probit
model in more detail, it is useful to explicitly write out the form of the
dummy-dependent variable models we have been considering.
The model set up for the LPM was as follows:
Y = bX + e          (A)
Y = 0 if response is NO
= 1 if response is YES
X = independent variable(s)
b = coefficient on the independent variable(s)
e = stochastic error term
Recall that the slope of the LPM line in the diagram is simply b (if we have just
one independent variable).
In contrast, the set-up for both of the non-linear models is the following:
Y = F(Xb) + e          (B)
Y = 0 if response is NO
= 1 if response is YES
X = independent variable(s)
b = coefficient on the independent variable(s)
e = stochastic error term
F(.) = a CDF, which takes the value of Xb and transforms it into a probability number.
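For reference, the two choices of F(.) can be written out explicitly. These are the standard textbook definitions of the normal and logistic CDFs, stated in the notation of equation (B), with z standing for Xb:

```latex
\text{Probit:}\quad F(z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt
\qquad\qquad
\text{Logit:}\quad F(z) = \frac{e^{z}}{1 + e^{z}}
```

Both functions rise from 0 towards 1 as z increases, which is what keeps the predicted probabilities inside the [0,1] interval.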
This is not the place to go into any more detailed theory, but if you have never
seen probit or logit models before, there are several references at the end of
this module which you might want to take a look at. At this point, what we need
to know is that Stata will estimate whichever model you ask it to, that it will
use a technique called maximum likelihood estimation to retrieve the
coefficients you want, and that there will be some difference in the way you
interpret the output, compared to how you did this in the linear regression
framework.
Let's go to some examples.
PROBIT MODELS
In this model, the CDF we use [F(.) in equation B above] is the normal distribution function. We'll use the probit to analyze the effect of age, education, gender, urban residence and knowledge about HIV on the probability of an individual responding YES to the question "Should an HIV positive teacher be allowed to continue teaching?". We will stay with this example for the moment, to compare our results to the results from the LPM.
char location[omit] 3
xi: probit hivteach age educ female i.location hivinfo
i.location _Ilocation_1-3 (naturally coded; _Ilocation_3 omitted)
Iteration 0: log likelihood = -2063.6213
Iteration 1: log likelihood = -1720.5449
Iteration 2: log likelihood = -1704.3112
Iteration 3: log likelihood = -1704.2015
Iteration 4: log likelihood = -1704.2015
Probit estimates Number of obs = 3074
LR chi2(6) = 718.84
Prob > chi2 = 0.0000
Log likelihood = -1704.2015 Pseudo R2 = 0.1742
------------------------------------------------------------------------------
hivteach | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0192975 .0020391 9.46 0.000 .0153008 .0232941
educ | .1853501 .0092052 20.14 0.000 .1673082 .2033919
female | .2409078 .0503665 4.78 0.000 .1421912 .3396244
_Ilocation_1 | .3112355 .0609569 5.11 0.000 .1917622 .4307089
_Ilocation_2 | .0237823 .0607588 0.39 0.695 -.0953028 .1428674
hivinfo | .1952305 .055225 3.54 0.000 .0869916 .3034695
_cons | -2.082644 .1052039 -19.80 0.000 -2.288839 -1.876448
------------------------------------------------------------------------------
Below are some notes related to the probit output:
Probit Output - Note 1
By using the char varname[omit] number command, we ensured that the base (or
omitted) category for area of residence is rural. The individual who represents
the base category for the regression is a male individual living in a rural
area, with no education and zero years of age, who has not heard or seen any
information about HIV.
Probit Output - Note 2
Before we get the output table, there are 5 lines of "Iterations". Stata will
do this each time a model is 'solved' using maximum likelihood techniques. To
read more about maximum likelihood techniques, see the references at the end of
this module.
Probit Output - Note 3
Interpreting coefficients:
(i) SIGN: since all the variables have positive coefficients, we can say that
each of these attributes increases the likelihood of an individual reporting
YES, relative to the base or reference group.
(ii) SIZE: It is hard to interpret probit coefficients, because of the structure
of the model. Each coefficient adds to (or subtracts from, if the sign is
negative) the underlying number - or Z-score - which is then fed into the normal
CDF to produce a probability number. We care about the probability number for
interpretation, but we see the underlying Z-score in the probit output.
If we look back at the graph with the normal CDF, we can see that if there is an
increase in Xb of 0.1 at either tail, the impact on the probability is really
small. However, if we change Xb by 0.1 towards the middle of the distribution,
the impact of this change on the probability will be large. Thus, to interpret
the effect of a change in an X variable on the probability of the outcome
happening, we have to specify what Xb value we are starting from. The crucial
thing to note is that the marginal effect of a unit change in an X variable is not the same, for every value of X. In a moment,
we will learn a command which tells Stata to calculate the effects of a change
in each of the independent variables, when we start from the mean values of each
X variable. Before we get there though, let us do one interpretation of a
coefficient from the probit output - the hard way!
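Before the worked example, the point about the tails versus the middle of the distribution can be checked numerically. This short sketch uses normal(), the current name for the standard normal CDF function that this module writes as norm(). The starting values 0 and 2.5 are arbitrary choices for illustration.

```stata
* Effect on the probability of raising Xb by 0.1, starting from the middle (Xb = 0)
scalar effect_middle = normal(0.1) - normal(0)

* The same 0.1 increase in Xb, starting far out in the right tail (Xb = 2.5)
scalar effect_tail = normal(2.6) - normal(2.5)

display "change in probability near the middle: " effect_middle
display "change in probability in the tail:     " effect_tail
```

The first change is roughly 0.04, while the second is well under 0.01, even though Xb moved by the same amount in both cases.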
Suppose we have an individual who is 20 years old, has 10 years of education, is
male, and has not heard any information about HIV. We want to know what the
impact of moving from a rural to an urban area is, for this particular person.
Using the model we wrote out above in equation (B) we can generate a
predicted probability for this individual, in each of these two states of the
world: when he is in the rural area, and when he is in the urban area.
Yhat[for our man in the rural area]
= F(Xbhat)
= F(b_age*20 + b_educ*10 + b_female*0 + b_urban*0 + b_urbanvillage*0 + b_hivinfo*0 + constant)
= F(.0192975(20) + .1853501(10) + .2409078(0) + .3112355(0) + .0237823(0) + .1952305(0) - 2.082644)
= F(.0192975(20) + .1853501(10) - 2.082644)
= F(.156807)
To evaluate this, we can ask Stata to read this z-score, and give us back the associated probability number from the standard normal distribution:
di norm(.156807)
.56230152
This tells us that such an individual has a 56% chance of responding YES to
the question. What is the marginal effect of urban area? Using the same
approach, we can calculate:
Yhat[for our man in the urban area]
= F(Xbhat)
= F(b_age*20 + b_educ*10 + b_female*0 + b_urban*1 + b_urbanvillage*0 + b_hivinfo*0 + constant)
= F(.0192975(20) + .1853501(10) + .2409078(0) + .3112355(1) + .0237823(0) + .1952305(0) - 2.082644)
= F(.0192975(20) + .1853501(10) + .3112355(1) - 2.082644)
= F(.4680425)
Again, to get the probability associated with this z-score, we type:
di norm(.4680425)
.6801229
The difference between these two probabilities represents the marginal effect of location for this male individual of age 20, with 10 years of education and no information about HIV:
di .6801229-.56230152
.11782138
The marginal effect, for this type of individual, of moving from a rural to an urban area is an increase in the probability of answering YES of about 12 percentage points. This doesn't look ANYTHING like the 0.31 coefficient on _Ilocation_1!
This is a very tedious way to interpret the probit coefficients, but the example is a warning that the probit output is not as straightforward to think about as the linear regression output.
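The hand calculation above can also be scripted, so that the arithmetic is done by Stata rather than by us. The sketch below uses scalars with names of our own choosing, and normal(), the current name for the function written as norm() above. The coefficients are copied from the probit output.

```stata
* Index Xb for a 20-year-old rural male with 10 years of education and hivinfo = 0
scalar xb_rural = .0192975*20 + .1853501*10 - 2.082644

* Same person in an urban area: add the coefficient on _Ilocation_1
scalar xb_urban = xb_rural + .3112355

* Convert each index into a probability with the standard normal CDF
scalar p_rural = normal(xb_rural)
scalar p_urban = normal(xb_urban)

display "P(YES | rural) = " p_rural
display "P(YES | urban) = " p_urban
display "difference     = " p_urban - p_rural
```

Changing the age or education values in the first line shows how the effect of the same move differs for other types of individual.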
We can check that this is consistent with results from the LPM model by doing the following:
xi: reg hivteach age female educ i.location
Source | SS df MS Number of obs = 3077
-------------+------------------------------ F( 5, 3071) = 149.68
Model | 144.235258 5 28.8470516 Prob > F = 0.0000
Residual | 591.83949 3071 .192718818 R-squared = 0.1960
-------------+------------------------------ Adj R-squared = 0.1946
Total | 736.074748 3076 .239296082 Root MSE = .439
------------------------------------------------------------------------------
hivteach | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0059667 .0006419 9.30 0.000 .0047082 .0072252
female | .087467 .0160206 5.46 0.000 .0560547 .1188792
educ | .0545308 .0024291 22.45 0.000 .0497681 .0592936
_Ilocation_1 | .1068599 .0191193 5.59 0.000 .0693719 .1443478
_Ilocation_2 | .0115505 .0196732 0.59 0.557 -.0270235 .0501245
_cons | -.0895129 .0283406 -3.16 0.002 -.1450812 -.0339445
------------------------------------------------------------------------------
Notice that the urban effect is about 11 percentage points, which is close to
our calculation of 12 percentage points.
(iii) SIGNIFICANCE: all of the coefficients in the probit, except for _Ilocation_2, are statistically significant at the 1% level. This means that they are all significantly different from zero, except for urban villages.
[Note: if your probit results have disappeared from your screen, just type probit again, and the results from the most recent probit model will re-appear! However, this will only work if you have not run any other regression or model in between.]
Probit Output - Note 4
We can see from the predict and sum commands, that the probit has done what we have required: it has not predicted values for teachhat which are outside of the [0,1] range. Before you run the predict command, remember to re-run the probit command, otherwise Stata will predict from the LPM that we have just run!
xi: probit hivteach age educ female i.location hivinfo
predict teachhatprob
summ teachhatprob
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
teachhatprob | 3236 .6020995 .2246865 .0560828 .99988
We can use very similar syntax to before, to graph out the predicted values from the probit and the LPM, together with the underlying data.
corr teachhatprob teachhat
| teachh~b teachhat
-------------+------------------
teachhatprob | 1.0000
teachhat | 0.9626 1.0000
Here, we can see that the predictions from the LPM (teachhat) and from the probit (teachhatprob) track each other well, with a correlation of 0.96.
scatter hivteach teachhat teachhatprob num, ti("Fig 3: Actual and predicted values of HIVTEACH")
Figure 3
Notice that the green dots, which are from the probit predictions, all lie within the [0,1] range, while the red dots, which are all from the LPM predictions, move outside of the [0,1] range.
One way to see how the marginal effects are different for different values of
X (which does not involve calculating Z-scores and probability numbers by hand
for every observation!) is to use a graph to investigate. Suppose we want to
know the effect of getting HIV information, for those individuals who report
they have not heard or seen any information about the disease. So, for those
people who currently have hivinfo==0, what happens to their predicted
probability of reporting YES to the hivteach question, when we change
their hivinfo variable to =1?
We've already generated a prediction for the whole sample; it was called
teachhatprob. Now we need to generate another prediction, where everyone has
their hivinfo variable set to = 1. First, re-do the probit, and then
follow this syntax:
char location[omit] 3
xi: probit hivteach age educ female i.location hivinfo
gen teachhat2=norm(-2.082644 + age*.0192975 + educ*.1853501 + female*.2409078 + _Ilocation_1*.3112355+ _Ilocation_2*.0237823 +.1952305*1)
What we are telling Stata to do is generate a predicted probability for each
observation, using each observation's actual values for age, education, female
and location, but instead of using the actual value for
hivinfo, let EVERYONE in our sample have a
value of 1 for hivinfo. This means, do the
prediction for everyone, under the assumption that they all report yes to the
question "Have you seen or heard any information about HIV?".
What we are going to graph is all the predicted probabilities for those people
who report hivinfo==0, at the value of hivinfo==0, and then again the
predictions for these individuals when we set their hivinfo==1.
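To make the counterfactual concrete, here is a minimal sketch in Python (not Stata) that evaluates the same formula for one made-up individual. The coefficients are copied from the gen teachhat2 line above; the function name predicted_prob and the example person (a 20-year-old rural man with 10 years of education) are illustrative assumptions, not part of the BAIS data.

```python
from statistics import NormalDist

# Probit coefficients copied from the gen teachhat2 line above.
COEF = {
    "_cons": -2.082644,
    "age": 0.0192975,
    "educ": 0.1853501,
    "female": 0.2409078,
    "_Ilocation_1": 0.3112355,
    "_Ilocation_2": 0.0237823,
    "hivinfo": 0.1952305,
}

def predicted_prob(person):
    """Return Pr(hivteach = YES) = Phi(Xb) for one person (a dict of X values)."""
    xb = COEF["_cons"] + sum(COEF[name] * value for name, value in person.items())
    return NormalDist().cdf(xb)

# Hypothetical person: 20-year-old rural man, 10 years of education, hivinfo == 0.
person = {"age": 20, "educ": 10, "female": 0,
          "_Ilocation_1": 0, "_Ilocation_2": 0, "hivinfo": 0}
p_actual = predicted_prob(person)
p_informed = predicted_prob({**person, "hivinfo": 1})  # same person, hivinfo set to 1

print(f"hivinfo=0: {p_actual:.3f}  hivinfo=1: {p_informed:.3f}  "
      f"change: {p_informed - p_actual:.3f}")
```

The gen teachhat2 command does this same calculation for every observation at once, with the last term fixed at hivinfo = 1.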
Now, we can graph each of these predictions against the probability predictions for only those individuals who report hivinfo=0:
line teachhatprob teachhat2 teachhatprob if hivinfo==0, c(l l) sort
line teachhatprob teachhat2 teachhatprob if hivinfo==0, c(l l) sort xline(0.3 0.7)

Figure 4
The blue line in the above graph is a 45 degree line - it plots the predicted
probabilities from the probit for those who have hivinfo==0 against the
same probabilities. The red line indicates what the probability prediction would
be for these same individuals if, instead, they had hivinfo==1. You can
clearly see that giving information to those who have no information about HIV
has larger effects for those who start out with predicted probabilities between
0.3 and 0.7 (approximately!).
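The pattern in the graph can also be checked numerically. The sketch below (Python, standard library only) takes a baseline predicted probability, backs out the index Xb that would produce it, adds the hivinfo coefficient, and reports the gain in probability. The function name gain_from_info and the list of baseline probabilities are our own illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
B_HIVINFO = 0.1952305  # probit coefficient on hivinfo from the model above

def gain_from_info(p0):
    """Rise in Pr(YES) when hivinfo goes 0 -> 1, for a person whose
    predicted probability with hivinfo == 0 is p0."""
    z0 = STD_NORMAL.inv_cdf(p0)              # index Xb implied by p0
    return STD_NORMAL.cdf(z0 + B_HIVINFO) - p0

for p0 in (0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95):
    print(f"baseline {p0:.2f} -> gain {gain_from_info(p0):.3f}")
```

The gain is the vertical distance between the two lines in the graph: it is small for people who start out near 0 or 1, and largest for those who start out in the middle of the range.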
Probit Output - Note 5
Other output from the probit command: the likelihood ratio test (LR test) is a test for the joint significance of all coefficients in the model. The null hypothesis in the test is that all coefficients are jointly zero. The test statistic which is generated from the probit is given by the LRchi2(6) line, where 6 represents the degrees of freedom used up in the model. In our case, we have 6 explanatory variables, and so 6 degrees of freedom.
The Prob > chi2 = 0.0000 line in the upper right hand side of the output table is the p-value of the LR test. A p-value this close to zero means we reject the null that the coefficients are all jointly zero. We can be confident that our model is explaining some part of the variation in responses to this question.
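To see why Stata prints 0.0000 for this p-value, we can compute the upper tail of the chi-squared distribution by hand. The sketch below is in Python and uses a closed-form shortcut that is only valid for an even number of degrees of freedom (which happens to be the case here). The function name chi2_sf_even_df is our own, and the LR statistic of 718.84 is the value reported for this specification (it appears in the header of the dprobit output in this section).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability Pr(chi2(df) > x), valid for even df only.
    For df = 2m the tail has the closed form exp(-x/2) * sum_{k<m} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

lr_stat = 718.84   # LR chi2(6) reported for this specification
p_value = chi2_sf_even_df(lr_stat, 6)
print(p_value)     # positive, but far too small to show up in four decimal places
```

For comparison, chi2_sf_even_df(12.592, 6) is roughly 0.05, the familiar 5% critical value for 6 degrees of freedom.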
A Simpler Way To Derive Marginal Effects From The Probit
One command which we can use to make interpretation a little easier is the dprobit command. This command also fits the probit model, but instead of reporting the raw coefficients as in the table above, Stata reports the change in the probability for an infinitesimal change in each continuous independent variable and, by default, the discrete change in the probability for dummy variables. These impacts are evaluated at the mean values of X.
Instead of asking Stata to run the probit model, we substitute dprobit in the command line:
xi: dprobit hivteach age educ female i.location hivinfo
Probit estimates Number of obs = 3074
LR chi2(6) = 718.84
Prob > chi2 = 0.0000
Log likelihood = -1704.2015 Pseudo R2 = 0.1742
------------------------------------------------------------------------------
hivt~h | dF/dx Std. Err. z P>|z| x-bar [ 95% C.I. ]
---------+--------------------------------------------------------------------
age | .0072661 .0007669 9.46 0.000 25.8351 .005763 .008769
educ | .0697898 .0034082 20.14 0.000 8.35198 .06311 .07647
female*| .0909303 .0190248 4.78 0.000 .557254 .053642 .128218
_Iloca~1*| .1141349 .0216304 5.11 0.000 .300586 .07174 .15653
_Iloca~2*| .0089372 .022787 0.39 0.695 .262199 -.035724 .053599
hivinfo*| .0744483 .0212671 3.54 0.000 .727716 .032766 .116131
---------+--------------------------------------------------------------------
obs. P | .6040989
pred. P | .6330931 (at x-bar)
------------------------------------------------------------------------------
(*) dF/dx is for discrete change of dummy variable from 0 to 1
z and P>|z| are the test of the underlying coefficient being 0
Here, we can interpret the coefficients directly as marginal effects, as we did before in the linear probability model. However, these are marginal effects on the probability of reporting YES for the individual with the mean values of each X:
- a female is about 9% more likely to give a YES response, than a man, controlling for all other X's at their means
- an extra year of education increases the chances of you reporting YES by almost 7%, compared to an otherwise identical individual with one year less education
- the largest effect seems to be from place of residence. Compared to someone living in a rural area (the omitted location category), an individual living in an urban area is over 11% more likely to answer YES to the question. This is not quite the same as our earlier prediction, because we were not calculating the impact for the 'average' person, but rather a specific male individual with 20 years of age, 10 years of education, and no knowledge about HIV. However, it is pretty close, and is also consistent with the 11% marginal effect that we found using the LPM.
- again, all of the coefficients are significant, except for the second location variable. This is identical to the probit output, which we should expect, because dprobit merely presents the output from the probit, in a more 'edible' form!
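The dF/dx column can be reproduced by hand from the raw probit coefficients, which may help to show what dprobit is doing. The sketch below is in Python rather than Stata. It assumes the raw coefficients are those in the gen teachhat2 line earlier in this section, and it recovers the index at the means of X from the "pred. P" line of the table. The variable names are our own.

```python
from statistics import NormalDist

N = NormalDist()

# Raw probit coefficients (from the gen teachhat2 line earlier in this section).
B_AGE, B_EDUC, B_HIVINFO = 0.0192975, 0.1853501, 0.1952305
PRED_P = 0.6330931        # "pred. P (at x-bar)" from the dprobit table
HIVINFO_MEAN = 0.727716   # x-bar for hivinfo from the dprobit table

xb_bar = N.inv_cdf(PRED_P)      # index Xb evaluated at the means of X
density = N.pdf(xb_bar)         # standard normal density at that index

# Continuous variables: dF/dx = b * phi(Xb at the means)
me_age = B_AGE * density
me_educ = B_EDUC * density

# Dummy variable: switch hivinfo from 0 to 1, other X's held at their means.
xb_without = xb_bar - B_HIVINFO * HIVINFO_MEAN
me_hivinfo = N.cdf(xb_without + B_HIVINFO) - N.cdf(xb_without)

print(f"age {me_age:.5f}  educ {me_educ:.5f}  hivinfo {me_hivinfo:.5f}")
```

The three numbers should line up with the dF/dx entries for age, educ and hivinfo in the table above, up to rounding.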
The downside to using the dprobit command is that there may not actually be an
individual, or a group of individuals, with mean values of X. This means that
the marginal effects that you report from this output table may not be
representative of the actual observations in your data set. One way to deal with this is to
graphically look at how your sample is distributed within each X-characteristic.
If much of the sample is clustered around the mean values of each X, then you
are probably safe to use dprobit.
What have we learned from this exercise?
Women are more likely than men to say that HIV positive teachers should remain
teaching, as are individuals who have prior information and knowledge about HIV. More education
makes one more likely to answer YES to the question, whereas someone living in a
rural area is much less likely to want to allow infected teachers to stay in
school.
Now it's your turn, answer the following question:
4. Use the probit model to analyze what factors make an individual more likely to answer YES to the question Q514 "Can people get HIV/AIDS because of witchcraft?". What is the marginal effect of hivinfo in this model?
Question 4 Answer
An Alternative Way To Think About The Probit Model
In many situations, the probit model can be given a
latent-variable interpretation. We review this here, because it is sometimes
clearer to motivate the use of a probit model in terms of a latent dependent
variable. Let's continue with our example of what factors are
important for explaining why an individual would want an HIV positive teacher to be
allowed to continue teaching in school.
The latent variable model is set up as follows:
I = Xb + e
Y = 1 if I > c
Y = 0 if I <= c
So, Pr(Y=1) = Pr(I>c)
and Pr(Y=0) = Pr(I<=c)
where:
I is the latent variable, which we cannot observe
X is our usual set of independent variables
Y is the outcome we care about; in this case, the YES answer to our question. We
can observe this.
c is some cut-off point, or hurdle value.
The motivation for the model is straightforward enough: each individual's
X-factors make them more or less likely to answer
YES to the question. We can't see how likely they are to answer YES (the I
variable); we only
observe the outcome, a yes or no response (the Y variable). But we can model
the probability of a YES answer by thinking about the probability of the latent
variable being above or below a cutoff
point. In the case we have been considering, Pr(Y=1) = Pr(I>c) = F(Xb), where
F(.) is the standard normal CDF - the function
which translates our underlying latent variable number into a suitable
probability between 0 and 1. (This last step assumes that e is standard normal,
and that the cut-off c is absorbed into the constant term in Xb, so that in
effect c = 0.)
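A small simulation can make the latent-variable story less abstract. The sketch below (Python, standard library only) draws many values of the error e from a standard normal distribution, records how often I = Xb + e ends up above a cut-off of zero, and compares that share with the standard normal CDF evaluated at Xb. The index value 0.34, the number of draws and the function name simulate_yes_share are all illustrative assumptions.

```python
import random
from statistics import NormalDist

def simulate_yes_share(xb, n_draws=200_000, seed=0):
    """Share of simulated individuals with latent I = xb + e above the cut-off 0,
    where e is drawn from a standard normal distribution."""
    rng = random.Random(seed)
    yes = sum(1 for _ in range(n_draws) if xb + rng.gauss(0.0, 1.0) > 0.0)
    return yes / n_draws

xb = 0.34   # hypothetical value of the index Xb
print(simulate_yes_share(xb), NormalDist().cdf(xb))
```

The two printed numbers should be close to each other: the simulated share of YES outcomes approximates F(Xb), which is exactly what the probit model assumes.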
More detail on using the latent variable motivation of the probit model is
provided in the Greene (1997) and Johnston and DiNardo (1997) references provided at the end.
Right now, let's turn to another model for dummy-dependent variables: the logit
model. The use of the logit model may also be motivated in terms of this
latent-variable approach.
In this model, the CDF we use [F(.) in
equation {B} above] is the logistic distribution function. This
distribution function is a little easier to see written down than the standard
normal distribution:
Y = F(Xb) + e
where
F(Xb) = exp(Xb)/(1+exp(Xb)) = e^(Xb)/(1+e^(Xb))   (C)
and exp(.) is the natural exponential function (e raised to the given power).
We'll use the logit to analyze the effects of age, education, gender and urban
residence on the probability of an individual responding YES to the question
Q309_1: "Did you use a condom the first time you had sex with your most recent
partner?"
The variable we need is called
part1con1, and it needs some
initial cleaning up:
tab part1con1
tab part1con1, nol
recode part1con1 (2=0)
lab val part1con1 yesno
We will also generate an interaction variable for education*gender:
gen femeduc=female*educ
char literacy[omit] 3
char location[omit] 3
xi: logit part1con1 age female i.location femeduc educ i.literacy
i.location _Ilocation_1-3 (naturally coded; _Ilocation_3 omitted)
i.literacy _Iliteracy_1-3 (naturally coded; _Iliteracy_3 omitted)
Iteration 0: log likelihood = -1139.6791
Iteration 1: log likelihood = -845.30504
Iteration 2: log likelihood = -832.06356
Iteration 3: log likelihood = -831.79186
Iteration 4: log likelihood = -831.79169
Logit estimates Number of obs = 1739
LR chi2(8) = 615.77
Prob > chi2 = 0.0000
Log likelihood = -831.79169 Pseudo R2 = 0.2702
------------------------------------------------------------------------------
part1con1 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.1287028 .0071621 -17.97 0.000 -.1427402 -.1146655
female | -1.33864 .3525716 -3.80 0.000 -2.029667 -.6476121
_Ilocation_1 | .1567167 .1425106 1.10 0.271 -.1225989 .4360323
_Ilocation_2 | .1860074 .158996 1.17 0.242 -.125619 .4976338
femeduc | .0934957 .036732 2.55 0.011 .0215024 .1654891
educ | .0273958 .0254095 1.08 0.281 -.022406 .0771975
_Iliteracy_1 | 1.271578 .4779464 2.66 0.008 .3348202 2.208336
_Iliteracy_2 | .7676351 .4840312 1.59 0.113 -.1810485 1.716319
_cons | 3.422056 .5161311 6.63 0.000 2.410457 4.433654
------------------------------------------------------------------------------
Below are some notes related to the logit output:
Logit Output - Note 1
Again, we see the iteration steps that Stata goes through in order to get
the estimated coefficients. This is because
Stata finds the logit coefficients using maximum likelihood techniques.
Logit Output - Note 2
Interpreting coefficients:
(i) SIGN: older individuals, and women, are much less likely to report that they
used a condom the first time they had sex
with their most recent partner. More literacy is particularly associated with an
increased probability of reporting use of a
condom at first sex with most recent partner. The female*education interaction
term is also positive (that is, an extra year of education raises the
probability of reporting YES by more for women than it does for men), as are both
of the non-rural variables, although the location coefficients are not
statistically different from zero.
(ii) SIZE: there are a couple of ways that we can report logit coefficients, but
none are as easy as using the dprobit
command!
The rule we will use is the following:
dY/dX = b*F(Xb)*(1-F(Xb))
where
F(Xb) = probability that response was YES
(1-F(Xb)) = probability that response was NO
b = the coefficient of interest.
Now, we could use a number of values for F(Xb):
(a) We could calculate the marginal effect of a change in one X variable, where
we use the sample proportion of actual YES
answers in place of F(Xb), and the sample proportion of actual NO answers in
place of (1-F(Xb)).
sum part1con1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
part1con1 | 2054 .5754625 .494393 0 1
Here, the sample mean is .58. We can do a back-of-the envelope calculation to
find the marginal effect of being literate
(_Iliteracy_1==1) on the probability of responding YES:
. di 1.271578*.5754625*(1-.5754625)
.31065339
Thus, the average increase in probability of reporting YES for the sample
under consideration is about 31%! Literacy seems to
matter a lot for determining whether a condom is used for the first time an
individual has sex with their most recent partner.
We will check more rigorously for statistical significance below.
(b) Another way to evaluate the marginal effects of a change in one of the X
variables is to use the mean of the predicted
values for F(Xb) instead of the actual sample mean.
predict part1con1hat
summ part1con1hat
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
part1con1hat | 3445 .7289634 .281968 .0039366 .9797294
Notice that the predicted mean is rather different from the sample mean. In this case, the marginal effect of being literate is:
di 1.271578*.7289634*(1-.7289634)
.25123299
This prediction is very different to
the first, because the values of X at which we are predicting are different. If
we
wanted to, we could evaluate F(Xb) at the smallest value of X, or the largest
values of X, or the mean values of X, and find
different predictions in each case. Your choice of which marginal effect to
report should be guided by the point of your
study. However, many researchers will discuss marginal effects of particular
interest for mean values of X in the
distribution.
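The two back-of-the-envelope calculations in (a) and (b) apply the same rule with a different value plugged in for F(Xb). The Python sketch below repeats both, and then checks the rule itself against a numerical slope of the logistic CDF. The function names, the evaluation point x = 0.4 and the step size h are our own illustrative choices.

```python
import math

def logistic_cdf(xb):
    """F(Xb) = exp(Xb) / (1 + exp(Xb))."""
    return math.exp(xb) / (1.0 + math.exp(xb))

def logit_marginal_effect(b, p):
    """dPr(YES)/dX = b * F(Xb) * (1 - F(Xb)), with p standing in for F(Xb)."""
    return b * p * (1.0 - p)

B_LITERACY = 1.271578   # coefficient on _Iliteracy_1 from the logit table

me_sample_mean = logit_marginal_effect(B_LITERACY, 0.5754625)     # method (a)
me_predicted_mean = logit_marginal_effect(B_LITERACY, 0.7289634)  # method (b)
print(round(me_sample_mean, 8), round(me_predicted_mean, 8))

# Numerical check of the rule: slope of F(b*x) with respect to x at x = 0.4.
x, h = 0.4, 1e-6
numeric = (logistic_cdf(B_LITERACY * (x + h)) - logistic_cdf(B_LITERACY * (x - h))) / (2 * h)
analytic = logit_marginal_effect(B_LITERACY, logistic_cdf(B_LITERACY * x))
print(numeric, analytic)
```

Because p*(1-p) is largest at p = 0.5, the marginal effect shrinks as the probability used for F(Xb) moves away from one half, which is why method (b) gives the smaller number here.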
(iii) A third way of interpreting the output is in terms of odds and odds ratios. The odds
tell us how much more likely it is for
an individual to report YES, than to report NO. Thus, we can write:
Pr(YES)/Pr(NO)
= p/(1-p)
= F(Xb)/(1-F(Xb))
= exp(Xb) (**)
where
p = our shorthand for Pr(YES)
F(Xb) = probability of reporting YES based on our underlying score, Xb
exp(Xb) = the expression in (**) evaluated using the logistic CDF given in (C) above.
Because the odds equal exp(Xb), raising one X by one unit multiplies the odds by exp(b); this multiplier is the odds ratio. In short, if we wanted to know how much higher the odds of reporting YES are for a literate person (_Iliteracy_1==1) than for an otherwise identical person in the omitted literacy category, we could simply exponentiate the beta-coefficient on _Iliteracy_1:
di exp(1.271578)
3.566476
Thus, the odds that a literate person reports YES, they had used a condom the
first time they had sex with their most recent partner, are about 3 and 1/2
times the odds for an otherwise identical person in the omitted literacy
category.
Reporting odds ratios is one command that Stata can do pretty easily. If we
append the logit command above with ,or (which stands
for odds ratio), we get the following:
xi: logit part1con1 age female i.location femeduc educ i.literacy, or
Logit estimates Number of obs = 1739
LR chi2(8) = 615.77
Prob > chi2 = 0.0000
Log likelihood = -831.79169 Pseudo R2 = 0.2702
------------------------------------------------------------------------------
part1con1 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .8792352 .0062971 -17.97 0.000 .8669793 .8916644
female | .2622021 .092445 -3.80 0.000 .1313792 .5232938
_Ilocation_1 | 1.169664 .1666895 1.10 0.271 .8846184 1.546559
_Ilocation_2 | 1.204431 .1914997 1.17 0.242 .8819508 1.644825
femeduc | 1.098006 .0403319 2.55 0.011 1.021735 1.17997
educ | 1.027774 .0261153 1.08 0.281 .9778432 1.080255
_Iliteracy_1 | 3.566476 1.704584 2.66 0.008 1.397689 9.100557
_Iliteracy_2 | 2.154665 1.042925 1.59 0.113 .8343949 5.564008
------------------------------------------------------------------------------
Notice that the odds ratio on
_Iliteracy_1 is the one we have just calculated above!
So interpreting the rest of these odds ratios should be straightforward. For
example, the odds of a female reporting YES are about .26 times the odds for a
male (at educ==0, because of the interaction term), whereas living in an urban
area multiplies the odds of reporting YES by about 1.17 relative to living in a
rural area.
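The link between the two logit tables is just exponentiation, which the following Python sketch illustrates for three of the coefficients. It also checks that the ratio of the odds does not depend on where an individual starts out. The function name odds and the three starting values of the index are arbitrary illustrative choices.

```python
import math

# Logit coefficients from the first output table.
coefs = {"age": -0.1287028, "female": -1.33864, "_Iliteracy_1": 1.271578}

odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:14s} {ratio:.6f}")

def odds(xb):
    """Odds of YES implied by the logistic CDF: p / (1 - p) = exp(Xb)."""
    p = math.exp(xb) / (1.0 + math.exp(xb))
    return p / (1.0 - p)

# The ratio of odds is exp(b) whatever the starting index (values here are arbitrary).
for base_xb in (-2.0, 0.0, 1.5):
    print(base_xb, odds(base_xb + coefs["_Iliteracy_1"]) / odds(base_xb))
```

This constancy is what makes odds ratios convenient to report: unlike the marginal effect on the probability, they do not change with the values of X at which we evaluate them.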
(iv) SIGNIFICANCE: age, female and the first literacy variable are all significant at
the 1% level. This means that, if the true coefficient were zero, we would
expect to see an estimate this far from zero less than 1% of the time. The
interaction term is significant at the 5% level, but the location dummies, educ
and _Iliteracy_2 are not significantly different from zero. This
might be something to worry about. Are there
other variables you can think about which might affect whether someone uses a
condom the first time they have sex with a most
recent partner? How about their current marital status?
Try the following exercises to be sure that you are comfortable with
interpreting logit output.
5. Using the output from the initial logit regression (the one not in odds ratio
form), calculate the marginal effect of being female on the probability of
reporting YES to the question. Do this in each of the three ways we have
discussed, and be sure to think about how the
interaction term affects your calculation.
Question 5 Answer
6. What is the effect of adding a dummy variable into the logit model, where
the dummy ==1 if currently married, and ==0 if not?
Do any of the other coefficients change? How can you interpret this coefficient?
Is it statistically different from zero?
Question 6 Answer
Logit Output - Note 3
GOODNESS OF FIT: As in the probit analysis, the statistic used to test for
joint significance of all variables is the likelihood ratio (LR) test. The LR
statistic is presented in the upper right hand corner of the logit output table,
where you can also see the 8 degrees of
freedom (because we have included 8 independent variables in the model). The
Prob > chi2 = 0.0000 implies that we can decisively
reject the null hypothesis that all of the slope coefficients are jointly
equal to zero.
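Both the LR statistic and the Pseudo R2 in the logit header can be recovered from two log likelihood values. The Python sketch below assumes that the "Iteration 0" log likelihood is the one for a model with only a constant, and that the Pseudo R2 is McFadden's measure; the arithmetic is consistent with both assumptions.

```python
ll_null = -1139.6791    # Iteration 0: log likelihood (constant-only model)
ll_full = -831.79169    # final log likelihood of the fitted logit

lr_stat = 2.0 * (ll_full - ll_null)      # likelihood ratio statistic
pseudo_r2 = 1.0 - ll_full / ll_null      # McFadden's pseudo R-squared

print(f"LR chi2(8) = {lr_stat:.2f}, Pseudo R2 = {pseudo_r2:.4f}")
```

The larger the improvement in the log likelihood over the constant-only model, the larger both numbers become.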
Another way to test how good the logit model is (and for that matter the probit
model too) involves calculating the percent
of outcomes correctly predicted. Once we have predicted the values from the
model, we can count up how many are correctly
classified as YES's (prediction>=0.5) and how many are correctly classified as
NO's (prediction<0.5).
Try the following syntax:
tab part1con1
count if part1con1hat>=0.5&part1con1==1
count if part1con1hat<0.5&part1con1==0
Now we can compute a weighted average of correctly predicted values from this information using the following formula:
Ave correct predictions
= (actual YES/total sample)*(predicted YES/actual YES) + (actual NO/total sample)*(predicted NO/actual NO)
= (predicted YES + predicted NO)/(total sample)
di "the weighted average of correctly predicted values is " 1056/2054 + 377/2054
the weighted average of correctly predicted values is .6976631
This means that in almost 70% of cases, the model predicts the outcome correctly. It is generally up to the researcher to decide whether this is a satisfactory prediction result or not.
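The same calculation can be written as a small function that works directly on a list of actual outcomes and a list of predicted probabilities. The Python sketch below first repeats the arithmetic with the counts used in the text, and then applies the function to a tiny made-up data set. The function name share_correct and the toy data are illustrative assumptions.

```python
def share_correct(actual, predicted_prob, cutoff=0.5):
    """Share of observations whose predicted class (prob >= cutoff -> YES)
    matches the actual 0/1 outcome."""
    hits = sum(1 for y, p in zip(actual, predicted_prob) if (p >= cutoff) == (y == 1))
    return hits / len(actual)

# Counts used in the text: 1056 correct YES, 377 correct NO, 2054 observations.
from_counts = (1056 + 377) / 2054
print(round(from_counts, 7))

# Tiny made-up example to show the function on raw data.
toy_actual = [1, 1, 0, 0, 1]
toy_prob = [0.9, 0.4, 0.2, 0.6, 0.7]
print(share_correct(toy_actual, toy_prob))
```

Changing the cutoff argument shows how sensitive this measure is to the (somewhat arbitrary) choice of 0.5 as the dividing line.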
MODEL SELECTION: LOGIT OR PROBIT?
The difference between the two models is minimal; if you consider the graph
of the logistic and standard normal distribution
above, you can see the shapes of the two are very similar, and
nearly identical in the middle. The main difference is that the logistic distribution
has slightly fatter tails - there is more probability mass at the end points of
the distribution. The choice of which model to use is really one of preference,
as both will provide similar estimates.
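A quick numerical comparison of the two distributions can stand in for the graph. In the Python sketch below the logistic distribution is rescaled to have a variance of 1 (its unscaled standard deviation is pi divided by the square root of 3), so that it is comparable with the standard normal. The function names and the grid of z values are our own choices.

```python
import math
from statistics import NormalDist

LOGISTIC_SD = math.pi / math.sqrt(3.0)   # standard deviation of the standard logistic

def logistic_upper_tail(z):
    """Pr(Z > z) for a logistic variable rescaled to have variance 1."""
    return 1.0 / (1.0 + math.exp(z * LOGISTIC_SD))

def normal_upper_tail(z):
    """Pr(Z > z) for a standard normal variable."""
    return 1.0 - NormalDist().cdf(z)

for z in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"z={z:.0f}  logistic {logistic_upper_tail(z):.5f}  normal {normal_upper_tail(z):.5f}")
```

At z = 3 and z = 4 the logistic tail probability is the larger of the two, while closer to the centre the two distributions give similar values. This is why the two models tend to disagree only for observations with predicted probabilities very close to 0 or 1.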
To see how similar the predictions from these two models are, we can return
to our hivteach example and run the same specification in the logit
framework, find the predictions and graph out the actual observed answers to the
question (hivteach), and the predictions from the LPM (teachhat),
the probit (teachhatprob) and the logit.
xi: logit hivteach age educ female i.location hivinfo
predict teachhatlog
lab var teachhatprob "Pr(hivteach) from the probit" /*this labels the variable so it looks good in the graph!*/
lab var teachhatlog "Pr(hivteach) from the logit"
lab var teachhat "Pr(hivteach) from the LPM"
scatter hivteach teachhat teachhatprob teachhatlog num, ti("Fig 5: Actual and predicted values of HIVTEACH")

Figure 5
The LPM, probit and logit predictions fall neatly almost on top of each other!
7. Consider the logit model you have just run for the question about
hivteach.
Interpret the marginal effect of moving to an urban area. Do you get similar
results to the probit? What about to the LPM?
Question 7 Answer
Why linear regression can sometimes be a useful sensitivity check
You have seen that interpreting the logit and probit model output is not always
easy. The betas in the LPM are much more
intuitive to think about. Even though the LPM has those problems mentioned
above, it is still sometimes useful to start off
using this model to analyze your data. Often, a paper will report coefficients
from an LPM and a logit model, or the LPM and
a probit model. If the specification is correct (the right X-variables are
included, and no extra unnecessary X's are
included), then the two models should not produce wildly different answers: the
betas should not all be switching sign, or
jumping around in magnitude. Thus, comparing the LPM to the logit or probit
output ( that is, if you compute the marginal
effects in each of the three models at the mean values of X) should serve as a
loose specification check, and using the LPM
to start out your research is a good diagnostic tool - if you find ridiculous
results with the LPM, most likely your results will
still be ridiculous when you turn to the probit or logit.
One more problem with dummy dependent variable models: heteroscedasticity
Without going in to too much econometric detail, it is important to raise the subject of heteroscedasticity. When a Y variable is a dummy variable, it can only take on two possible values, and this leads to problems of non-constant variance of the error term (if this sentence is Greek to you, ignore this section and rather read the introductory chapters on dummy dependent variables in a text book like Gujarati, given in the reference list below). The point is that if we run the LPM, logit or probit model without bearing this problem in mind, we will have incorrect standard errors in our output tables. They will be systematically underestimated. This means that we could interpret coefficients as significant, when in fact they may not be.
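For the LPM, the source of the non-constant variance is easy to work out by hand. If an observation has Pr(Y=1) = p, then the error is either 1 - p (when Y = 1) or -p (when Y = 0), so its variance depends on p and therefore on X. The Python sketch below simply evaluates that variance at a few illustrative probabilities; the function name lpm_error_variance is our own.

```python
def lpm_error_variance(p):
    """Variance of the LPM error for an observation with Pr(Y=1) = p.
    The error equals 1 - p with probability p and -p with probability 1 - p."""
    return p * (1.0 - p) ** 2 + (1.0 - p) * p ** 2   # simplifies to p * (1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"Pr(Y=1) = {p:.1f}  error variance = {lpm_error_variance(p):.2f}")
```

The variance peaks at p = 0.5 and falls towards zero as p approaches 0 or 1, so it cannot be the same for every observation unless everyone has the same predicted probability.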
Stata can correct the standard errors for us, to deal with this heteroscedasticity, if you add the robust option after your estimation syntax:
xi: logit hivteach age educ female i.location hivinfo, robust
You can compare your results from this table to the one without the robust option, and you'll notice that the only thing that changes are the standard errors: some of them become larger. While this correction works well in the LPM framework, the heteroscedasticity problem is a little more complicated in the logit and probit models. You are referred to chapter 13 in Johnston and DiNardo, for further details.
Now that you have covered the set of most popular models for dummy dependent variable analysis, use the LPM, the probit and the logit model to answer the following questions.
8. We want to know whether households which have been affected by family illnesses or deaths or an influx of orphans in the past year have access to support networks. In particular, we'd like to find out whether female-headed households are more or less likely to get support, whether the size of the household matters for getting assistance, and whether households in some parts of the country are more likely to be in the vulnerable no-support category.
Construct a measure of household support (using outsidehelp1, outsidehelp2, hhorph1), a measure of household size (using egen), a female head dummy (this is tough!), a proportion of workers in the household variable (using egen, work and hhsize) and a proxy for wealth (using toilettype==1 and transport1==1). Then, restrict your data to one observation per household, and run each of the models we have covered, using hhsupport as the dependent variable.
i) interpret the sign of your coefficients in each model. Are they consistent across models?
ii) interpret the significance of your coefficients. Is there consistency across models? {Note here, that the sample size for this household level analysis gets really small, and so significance becomes an issue. Generally, you want to be careful not to use very complicated models with very few data points, because it becomes more and more difficult to say anything about marginal effects with confidence.}
iii) Consider the female head variable in more detail. Interpret the marginal effect of having a female head in the household, on the probability of getting outside support for the household. Do this in each of the three models. Are your answers somewhat consistent? Is the sign and size of this coefficient something that you would have expected?
iv) Construct the predicted values from each of the models, and calculate the percentage correctly predicted by each model. Which model performs best, in terms of getting the highest proportion of correct predictions?
Greene, W. 1997. Econometric Analysis. (3e) Prentice-Hall. [There are also more recent 4th and 5th editions now available]
Gujarati, Damodar. 2003. Basic Econometrics. McGraw-Hill. [Any edition of this introductory econometrics textbook is useful; some of the material for this module was based on content from the second edition, chapter 15: "Regression on dummy-dependent variables"]
Johnston, J. and DiNardo, J. 1997. Econometric Methods (international edition). McGraw-Hill. [Ch 13: Discrete and limited dependent variable models]