
MEASURES OF CENTRAL TENDENCY AND VARIABILITY

 

TABLE OF CONTENTS

Introduction
Understanding Distributions of Continuous Variables
Means of Continuous Variables
Using the Summarize Command
Medians and Modes of Continuous Variables (Using Tabulate Command)
Measures of Dispersion - Variance and Standard Deviation
Understanding the Distributions of Categorical Variables
Combining Tabulate and Summarize
Exercises

 

 

 

 

 

 

 

INTRODUCTION

In Module 3, we learned about the different variable types that exist in the BAIS data set, the commands that enable us to see frequency distribution tables, and some basic graphing commands. Based on that, we are going to start learning how to do some basic statistical analysis, using measures of central tendency and variability.

There are many interesting questions we might want to investigate using the data on clinic visits by pregnant women (pregno and pregmon). For example:

  1. What is the average number of clinic visits while a woman is pregnant?

  2. Is the duration of pregnancy prior to a clinic visit roughly equal across women, or is it much higher for women with lower levels of education? If so, how much higher?

  3. How does the average duration of pregnancy prior to a clinic visit vary by location and does this timing vary within these different locations as well?

We will start this module considering continuous variables. Then we will learn some new commands that make analyzing data easier. Lastly, we will go through some of the key methods that will enable us to analyze categorical variables effectively.

 

UNDERSTANDING DISTRIBUTIONS OF CONTINUOUS VARIABLES

For now, let's focus on the variable pregmon, which represents the number of months a woman is pregnant when she first visits the clinic. Let's start by seeing what the distribution of pregmon looks like.

In the graph below, we find that over 30 percent of the observations are in the fourth bin (which represents women who were four months pregnant at their first clinic visit). Also note the shape of the pregmon distribution: it looks fairly "normal".

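For the graph, run the following from a do-file (the delimit command is recognized only in do-files, not when typed interactively). The last line switches the command delimiter back to the carriage return:

# delimit ;

histogram pregmon, bin(10) percent
title("How Many Months Pregnant When First Visit Clinic")
xtitle("Months")
note("Source: Botswana AIDS Impact Survey")
ylabel(0(5)30, angle(horizontal)) ytick(0(5)30)
xlabel(0(1)10) xtick(0(1)10);

# delimit cr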

 

[Jump forward to discussion of medians and modes if you are revisiting this graph.]

 

MEANS OF CONTINUOUS VARIABLES

Let's consider the first question posed at the top of this module. What is the average number of clinic visits while a woman is pregnant? The average, or mean, value is defined as the sum of all values divided by the number of values. To compute this in Stata, type:

means pregno

    Variable |    Type        Obs        Mean       [95% Conf. Interval]
-------------+----------------------------------------------------------
      pregno | Arithmetic    1177    7.378929        7.182664   7.575195 
             |  Geometric    1177    6.670878        6.496657   6.849772 
             |   Harmonic    1177    5.913148        5.706759   6.135027 
------------------------------------------------------------------------

As we see, Stata reports three different types of means: the arithmetic, geometric, and harmonic mean. We will only be concerned with the arithmetic mean in our modules. In the BAIS data, the mean number of clinic visits was 7.38. If we listed other variables after pregno, Stata would report the means of those variables too. For example, we could have typed means pregno age and we would have also learned the average age of a respondent in the BAIS data.
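The definition of the arithmetic mean can be checked by hand. The sketch below is only an illustration: it uses a small made-up data set (the variable visits and its four values are hypothetical, not BAIS data) and the results that summarize leaves behind in r(sum) and r(N). The preserve and restore lines set aside whatever data are in memory and bring them back afterwards.

```stata
* Toy data: four made-up clinic-visit counts (not BAIS values)
preserve
clear
input visits
4
6
8
10
end

* summarize stores the sum in r(sum), the count in r(N), the mean in r(mean)
quietly summarize visits
scalar toy_sum  = r(sum)
scalar toy_n    = r(N)
scalar toy_mean = r(mean)

* Sum of all values divided by the number of values
display "sum = " toy_sum ", n = " toy_n ", mean = " toy_sum/toy_n
restore
```

With the BAIS data loaded, the same two lines (summarize pregno followed by display r(sum)/r(N)) should reproduce the arithmetic mean reported by means.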

The means command can also be used with the qualifiers introduced in Module 2. For example, to learn the average number of months pregnant at the first clinic visit for respondents who were 20 years old, we would type:

means pregmon if age==20

    Variable |    Type        Obs        Mean       [95% Conf. Interval]
-------------+----------------------------------------------------------
     pregmon | Arithmetic      36    4.777778        4.166541   5.389014 
             |  Geometric      36    4.425478        3.837442   5.103622 
             |   Harmonic      36    3.992606        3.317391   5.012927 
------------------------------------------------------------------------

We see that the average for 20-year-old women is 4.78 months. Now it's your turn.

Try the following quick exercises:

1. What is the average years in school for respondents over the age of 40?
Question 1 Answer
2. What is the average age at first sexual intercourse for women in urban villages?
Question 2 Answer
3. Conditional on being older than 25, what is the average age of married respondents in the sample?
Question 3 Answer

 

 

USING THE SUMMARIZE COMMAND

There are other ways to compute the mean of a variable in Stata. One that is worth learning now is the command summarize. In Stata, type:

summarize pregmon

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |    1175    4.395745   1.816304          1         10

We see that respondents were, on average, 4.40 months pregnant when making their first visit to the clinic. There are two nice features of summarize as opposed to means. First, summarize tells us the range of the variable. For example, when we typed summarize pregmon, we learned that pregmon ranged between 1 and 10. This information was not given when we typed means. Second, summarize works with the by prefix. If we sort the observations by location, educ, or any other distinguishing characteristic, we can compute the mean for each value of that characteristic. First, we have to sort the data by the variable we intend to use with by. To sort by location, type:

sort location

Next, to compute the mean of pregmon by location, type:

by location: summarize pregmon

_______________________________________________________________________________
-> location = Urban
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |     295    4.386441   1.974278          1         10
_______________________________________________________________________________
-> location = Urban Vi
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |     286    4.513986   1.985424          1          9
_______________________________________________________________________________
-> location = Rural
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |     594    4.343434   1.641545          1          9

 

We see that the respondents' average number of months pregnant when making the first visit to the clinic doesn't vary much by location: it is 4.39 in urban areas, 4.51 in urban villages, and 4.34 in rural areas.
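The sort step and the by prefix can also be combined into a single bysort command, and tabstat can place the group statistics in one compact table. The following is a minimal sketch on made-up values; the variable names region and months are hypothetical stand-ins for location and pregmon.

```stata
* Toy data: two made-up groups (not BAIS values)
preserve
clear
input region months
1 3
1 5
2 4
2 6
2 8
end

* bysort sorts by region, then runs summarize within each group
bysort region: summarize months

* tabstat reports the chosen statistics for every group in one table
tabstat months, by(region) statistics(n mean sd min max)

* Keep the two group means for a quick check
quietly summarize months if region == 1
scalar mean_r1 = r(mean)
quietly summarize months if region == 2
scalar mean_r2 = r(mean)
display "group 1 mean = " mean_r1 ", group 2 mean = " mean_r2
restore
```

With the BAIS data in memory, the equivalent commands would be bysort location: summarize pregmon and tabstat pregmon, by(location) statistics(n mean sd min max).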

Sometimes, it will be helpful to compute means (or other statistics) by groups that we are interested in (as was the case when we examined means across the different locations). For example, suppose we wanted to use the summarize (or sum for short) command to compute means for four age groups: 20 and under, 21-30, 31-40, and over 40. We could simply combine the summarize command with a qualifier for each of these groups to get the desired estimates, like this:

sum pregmon if age <= 20
sum pregmon if age > 20 & age <= 30
sum pregmon if age > 30 & age <= 40
sum pregmon if age > 40 & age ~= .
 

However, we will often want to look at these subgroups repeatedly, and we may not want to retype these qualifiers each time. In that case, an alternative is to construct a new categorical variable that distinguishes the groups for us. With this new variable we can simply sort by it and then compute the means of whatever variable we like.

To construct this new variable, there are a couple of possibilities. First, we could do the following:

generate agegroup = .
replace agegroup = 1 if age <= 20
replace agegroup = 2 if age > 20 & age <= 30
replace agegroup = 3 if age > 30 & age <= 40
replace agegroup = 4 if age > 40 & age~=.
label variable agegroup "age group indicator"

sort agegroup
by agegroup: summarize pregmon

 

Running the commands above will produce the following table in Stata:

_______________________________________________________________________________
-> agegroup = 1
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |      72    4.833333   1.678363          1          9
_______________________________________________________________________________
-> agegroup = 2
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |     395    4.574684   1.856903          1          9
_______________________________________________________________________________
-> agegroup = 3
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |     329    4.379939   1.883422          1         10
_______________________________________________________________________________
-> agegroup = 4
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     pregmon |     379    4.139842   1.705569          1          9

 

We see that in the BAIS, the average of the pregmon variable for people aged 20 and under, 21-30, 31-40, and over 40 is 4.83, 4.57, 4.38, and 4.14 respectively.
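Output headed "agegroup = 1" is harder to read than output headed with the age range itself. One option, sketched below on four made-up ages, is to attach value labels to the new variable. The label name agegrp_lbl and the label wording are assumptions chosen for this illustration.

```stata
* Toy data: four made-up ages, one per group (not BAIS values)
preserve
clear
input age
18
25
35
50
end

generate agegroup = .
replace agegroup = 1 if age <= 20
replace agegroup = 2 if age > 20 & age <= 30
replace agegroup = 3 if age > 30 & age <= 40
replace agegroup = 4 if age > 40 & age ~= .

* Define a set of value labels, then attach it to agegroup
label define agegrp_lbl 1 "20 or under" 2 "21-30" 3 "31-40" 4 "Over 40"
label values agegroup agegrp_lbl
tabulate agegroup

* Count observations left without a group (none in this toy data)
quietly count if agegroup == .
scalar n_unassigned = r(N)
restore
```

Once the labels are attached, commands such as by agegroup: summarize pregmon should show the label text in place of the numeric code.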

There is a second way to create the same age grouping, and it takes one command instead of five. By combining the generate command with the recode() function, we can construct new categorical variables very easily. For example, to reproduce the grouping we created above, we could simply type:

generate agegroup2 = recode(age,20,30,40,41)

This command sorts observations into the same groups as agegroup, although the groups are coded 20, 30, 40, and 41 rather than 1 through 4. How does this work? First, for this function to work you must provide at least three arguments between the parentheses. In this example, Stata finds all the observations that have an age value less than or equal to 20 and assigns each of them the value 20 for the new variable agegroup2. Next, Stata finds all the observations that have an age value greater than 20 but less than or equal to 30 and assigns each of them the value 30. And so on: ages 31 through 40 receive the value 40, and every age above 40 receives the last value, 41. We can double check that the two new variables define the same groups by typing:

tab agegroup
tab agegroup2

Looking at the frequency distributions, you can see that the counts in each group are exactly the same, even though the group codes differ.
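Comparing two separate frequency tables by eye works, but a two-way table makes the check more direct. The sketch below rebuilds both variables on eight made-up ages that sit on or next to each cut point, cross-tabulates them, and counts disagreements. The helper variable expected and the scalar n_mismatch are names invented for this illustration.

```stata
* Toy data: made-up ages on and around each cut point (not BAIS values)
preserve
clear
input age
15
20
21
30
31
40
41
75
end

generate agegroup = .
replace agegroup = 1 if age <= 20
replace agegroup = 2 if age > 20 & age <= 30
replace agegroup = 3 if age > 30 & age <= 40
replace agegroup = 4 if age > 40 & age ~= .

generate agegroup2 = recode(age,20,30,40,41)

* Each row should have all of its observations in a single column
tabulate agegroup agegroup2

* Translate the 1-4 codes into the recode() values and count disagreements
generate expected = 20*(agegroup==1) + 30*(agegroup==2) + 40*(agegroup==3) + 41*(agegroup==4)
quietly count if expected != agegroup2
scalar n_mismatch = r(N)
display "observations where the two codings disagree: " n_mismatch
restore
```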

Finally, there is another combination that can be useful in creating these types of categorical measures. By combining the generate command with the autocode() function, we are able to construct new categorical variables as well. This method is slightly different from those we previously used. Let's use the following example to learn this new combination:

generate agegroup3 = autocode(age,3,10,64)

Let's walk through what this command has just done. The value 3 in the parentheses specifies that we would like to create three groups that each span an equally wide range of ages (not groups with equal numbers of observations). This value could have been anything; it is up to the user to set it. The final two values, in our example 10 and 64, define the range that is divided into the equal-width categories. The values of the new variable agegroup3 are the upper cut points of those categories. There is one thing to keep in mind when using this function: all values falling below or above the specified range are put into the lowest and highest categories respectively. This shouldn't be a problem if you are not interested in working with these values, but if you want them set to missing you will need a separate replace command.
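To make the cut points concrete: the range from 10 to 64 is 54 years wide, so three equal-width categories are 18 years each, with upper bounds 28, 46, and 64. The sketch below applies the same autocode() call to eight made-up ages (chosen to avoid the exact boundaries) and then sets the out-of-range ages to missing. The scalar names are invented for this illustration.

```stata
* Toy data: made-up ages, two of them outside the 10-64 range
preserve
clear
input age
5
12
27
35
45
50
60
70
end

* Three equal-width intervals between 10 and 64: upper bounds 28, 46, 64
generate agegroup3 = autocode(age,3,10,64)
list age agegroup3

* Ages outside 10-64 were placed in the end groups; set them to missing.
* Note: age > 64 is also true when age itself is missing.
replace agegroup3 = . if age < 10 | age > 64
tabulate agegroup3, missing

quietly count if agegroup3 == 28
scalar n_low = r(N)
quietly count if agegroup3 == .
scalar n_out = r(N)
restore
```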

 

 

MEDIANS AND MODES OF CONTINUOUS VARIABLES

Up to now, the only measure of central tendency that we have examined is the mean. There are two other measures that we wish to examine now: the median and the mode of a distribution. The median of a distribution is the value for which half the observations are greater and half are less. If observations are symmetrically distributed, the median and the mean will be the same. If the distribution of a variable is quite skewed, however, the median and the mean will be quite different. Medians are often used instead of means when one wants a measure that is robust to outliers, that is, a measure that is not very sensitive to extraordinary values in the distribution.
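A tiny made-up example shows how differently the two measures react to a single extreme value. In the sketch below (the variable x and its values are hypothetical), four of the five values are small and one is very large. The detail option of summarize stores the median in r(p50).

```stata
* Toy data: four small values and one outlier (not BAIS values)
preserve
clear
input x
1
2
3
4
100
end

quietly summarize x, detail
scalar toy_mean   = r(mean)
scalar toy_median = r(p50)

* The outlier pulls the mean far above the typical value;
* the median is unaffected by how large the largest value is
display "mean = " toy_mean ", median = " toy_median
restore
```

Replacing 100 with 5 would change the mean from 22 to 3 while leaving the median at 3.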

Let's consider the pregmon variable again (how many months pregnant when first visit clinic). To compare the mean and the median of this variable, type:

summarize pregmon, detail

      how many months pregnant when first visit clinic
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            2              1
10%            2              1       Obs                1175
25%            3              1       Sum of Wgt.        1175
50%            4                      Mean           4.395745
                        Largest       Std. Dev.      1.816304
75%            5              9
90%            7              9       Variance        3.29896
95%            9              9       Skewness       .7662708
99%            9             10       Kurtosis       3.640288

The detail option tells Stata to give more information. Note that the output reports the mean, 4.40 months, as before, but it now tells us more about different parts of the distribution. In particular, we can now see the pregmon value for which half (50 percent) of the observations are higher and half are lower, in other words the median. Note that the median is 4 months.

The median is slightly less than the mean value. What is going on here? It is informative to review the graph of pregmon above. That graph shows a fairly normal distribution, although there is a slightly larger group of values at the upper end of the distribution. Specifically, this cluster of values represents women coming into the clinic after approximately nine months of pregnancy; most likely these women are coming in to deliver their babies. That abnormally large number of values at the upper end of the distribution inflates the mean, whereas the median treats each of them as just one more value above the "halfway" mark. For many variables with skewed distributions, the median is a very useful statistic.

The mode of a distribution is the value that appears most often in the distribution. The mode is a seldom used measure, but we should be aware of it. Let's consider the education variable -- educ (number of years in school). The mode of this variable represents the schooling level that the most respondents claimed. There is no simple way to compute the mode in Stata. One option for computing the mode of a distribution is to use the tabulate command. Type:

tabulate educ

  number of |
   years in |
     school |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         42        1.22        1.22
          2 |         92        2.67        3.89
          3 |        152        4.41        8.29
          4 |        236        6.84       15.14
          5 |        234        6.79       21.93
          6 |        292        8.47       30.39
          7 |        577       16.73       47.13
          8 |        194        5.63       52.76
          9 |        587       17.02       69.78
         10 |        402       11.66       81.44
         11 |         76        2.20       83.64
         12 |        328        9.51       93.16
         13 |         51        1.48       94.63
         14 |         44        1.28       95.91
         15 |         49        1.42       97.33
         16 |         27        0.78       98.11
         17 |         34        0.99       99.10
         18 |          8        0.23       99.33
         19 |          6        0.17       99.51
         20 |          9        0.26       99.77
         21 |          3        0.09       99.85
         22 |          3        0.09       99.94
         23 |          1        0.03       99.97
         25 |          1        0.03      100.00
------------+-----------------------------------
      Total |       3448      100.00

By examining the entire frequency distribution, we should note that the mode of educ is 9 years in school. A large number of respondents (587) reported 9 for their number of years in school.
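Two small conveniences can make the mode easier to spot. The sketch below uses six made-up schooling values. The sort option of tabulate lists the most frequent value first, and egen's mode() function stores the mode in a new variable. It is an assumption here that your Stata release includes the mode() function; older releases may not. The names educ_mode and toy_mode are invented for this illustration.

```stata
* Toy data: six made-up schooling values (not BAIS values)
preserve
clear
input educ
7
9
9
12
9
7
end

* With sort, the most frequent value appears in the top row
tabulate educ, sort

* mode() fills a new variable with the most frequent value of educ
egen educ_mode = mode(educ)
quietly summarize educ_mode
scalar toy_mode = r(mean)
display "mode = " toy_mode
restore
```

When two values tie for most frequent, mode() needs to be told which one to keep, so the frequency table remains the safer first look.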

Now, just as we looked to see how averages of particular variables varied across other measures, we can also do the same with the median and mode. For example, the variable educ (the number of years a respondent has spent in school) might be influenced by the age of the respondent. It is the case that the older an individual is, the more opportunity they have to spend time in school. Thus you might expect as respondents get older the modal and median values should increase. Is this what you expect to find in the BAIS data? See if you are correct using the commands you have learned so far, as well as any variables we may have created along the way. Were you correct?

Let's try this together. One way to examine the relationship between educ and age is by using our newly constructed variable agegroup. Just as we compared mean values using the agegroup we can similarly use the variable to compare medians and modes. We could use the following to see how the median value for years spent in school varies across the different age groups:

sort agegroup
by agegroup: sum educ, detail

_______________________________________________________________________________
-> agegroup = 1
                  number of years in school
-------------------------------------------------------------
      Percentiles      Smallest
 1%            2              1
 5%            3              1
10%            4              1       Obs                1555
25%            5              1       Sum of Wgt.        1555
50%            7                      Mean           7.297106
                        Largest       Std. Dev.       2.82669
75%           10             15
90%           11             17       Variance       7.990178
95%           12             19       Skewness       .1435847
99%           13             23       Kurtosis       2.927522
_______________________________________________________________________________
-> agegroup = 2
                  number of years in school
-------------------------------------------------------------
      Percentiles      Smallest
 1%            2              1
 5%            5              1
10%            7              1       Obs                 901
25%            9              1       Sum of Wgt.         901
50%            9                      Mean           9.615982
                        Largest       Std. Dev.      2.953177
75%           12             19
90%           13             19       Variance       8.721255
95%           15             20       Skewness       .1978521
99%           17             22       Kurtosis       4.195632
_______________________________________________________________________________
-> agegroup = 3
                  number of years in school
-------------------------------------------------------------
      Percentiles      Smallest
 1%            2              1
 5%            3              1
10%            5              1       Obs                 497
25%            7              2       Sum of Wgt.         497
50%            7                      Mean           8.591549
                        Largest       Std. Dev.      3.537838
75%           10             20
90%           14             20       Variance        12.5163
95%           16             20       Skewness       .7778991
99%           20             21       Kurtosis       3.752933
_______________________________________________________________________________
-> agegroup = 4
                  number of years in school
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            2              1
10%            2              1       Obs                 495
25%            4              1       Sum of Wgt.         495
50%            7                      Mean           7.216162
                        Largest       Std. Dev.      4.273985
75%            9             21
90%           13             22       Variance       18.26694
95%           16             22       Skewness       1.134114
99%           21             25       Kurtosis       4.417567

 

Looking at the output from our previous command, it appears that there is very little variation in the median across the age groups (7, 9, 7, and 7 years). Does this surprise you? Part of this may be due to the age categories we constructed; we might get different results if we changed the boundaries of these categories. It may also be the case that this relationship is influenced by a third variable that we have not accounted for in our analysis.

 

Now it is your turn, as an exercise, try the following question:

4. Compute the mean years in school for adults (21 and older) by gender and of non-adults (under 21) by gender. What do these results say about the future?
Question 4 Answer

 

Let's try one more example before moving on. What if we were interested in examining the variable agedied_1, which provides the age of the person who died most recently in the household? As we have done with many of the previous examples, we could begin by examining the mean, median, or mode of this variable. Following past examples, to compute the mean of the variable agedied_1 we would simply use the following command:

sum agedied_1

The command produces the following:

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
   agedied_1 |     603    40.65174   26.84541          0         98


From the output it appears the mean age of the person who died most recently in the household is 40.65. Is this the correct answer? Have we computed the appropriate mean? For the variables in our earlier examples this approach would have been correct, but unlike those variables agedied_1 is a household-level variable. Because we have not treated it as a household-level measure, our estimate is biased: larger households have contributed more information to the calculation of this mean than smaller households. To correct for this we must specify that agedied_1 should be treated as a household-level measure, meaning each household should contribute only one age value to the calculation of the mean. To do this, we use _n, a built-in Stata variable that holds the number of the current observation. Try entering the following:

sort hhid
sum agedied_1 if hhid~=hhid[_n-1]

 

The command produces the following:

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
   agedied_1 |     107    43.57944   27.42882          0         98

 

Including the qualifier at the end of our sum agedied_1 command tells Stata to use only the first observation from each household, so that agedied_1 is treated as a household-level measure. Looking at the output, you can see that the number of observations used to calculate the mean, as well as the mean value itself, has changed.
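A made-up example with two households shows why the qualifier matters. In the sketch below (the values are hypothetical), household 1 has three respondents and household 2 has one, and the household-level value is repeated on every respondent's record. The last step shows an alternative that reaches the same answer with egen's tag() function; the variable name hh_first is invented for this illustration.

```stata
* Toy data: household 1 has three members, household 2 has one
preserve
clear
input hhid agedied_1
1 30
1 30
1 30
2 60
end

* Person-level mean: household 1 is counted three times
quietly summarize agedied_1
scalar mean_person = r(mean)

* hhid[_n-1] is the hhid of the previous observation, so after sorting
* the condition is true only for the first record of each household
sort hhid
quietly summarize agedied_1 if hhid ~= hhid[_n-1]
scalar mean_hh = r(mean)

* Alternative: tag() marks exactly one record per household with a 1
egen hh_first = tag(hhid)
quietly summarize agedied_1 if hh_first
scalar mean_hh_tag = r(mean)

display "person-level = " mean_person ", household-level = " mean_hh
restore
```

Both approaches assume that agedied_1 takes the same value on every record within a household, so that it does not matter which record is kept.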

As you can see, it is very important to remember what level of variable you are working with. Your final results can change dramatically if you treat a variable incorrectly. As a general rule, any time you are working with a household-level variable, you will need to specify this in your commands (usually this is done with a qualifier that uses _n).

So for example, if you wanted to calculate the mode for the variable agedied_1, you would need to use the following commands:

sort hhid
tab agedied_1 if hhid~=hhid[_n-1]

Without this qualifier your results will be biased, as larger households will be contributing more information than smaller households.

 

 

MEASURES OF DISPERSION -- VARIANCE AND STANDARD DEVIATION

So far, we have learned about the average levels of age, schooling, months prior to visiting the clinic, and so on. While computing the mean can be an extremely informative and important step in analyzing data, we must remember to consider measures of dispersion when interpreting these averages.

For example, higher average incomes or higher average levels of schooling are generally a good thing. From a policy perspective, however, we will often care more about how widely dispersed income or education is. As an extreme, it could be the case that everyone in the country has exactly the same level of education (and that level of education would necessarily be the "mean"). That situation could be considered a perfectly equal distribution of education. Of course, in the BAIS data, individuals and households have different levels of schooling.

In an effort to clarify this notion of variance, consider two theoretical distributions with the same mean, as shown below. Both of these theoretical distributions have a mean income of about 4500 pula. Examine these two distributions.

 

Even though both distributions have the same average level of income, we can see that the income inequality is greater in the top graph. If we only computed the mean of each of these (made-up) income variables, we might conclude that these distributions were essentially the same. The top distribution is more dispersed. The variance is a useful measure of dispersion of a variable. The square root of the variance is termed the standard deviation. A bigger variance always means a larger standard deviation. In the above two distributions, the standard deviation of Income Distribution #1 is 1000 while the standard deviation of Income Distribution #2 is 500. Since the standard deviation of the first distribution is twice as large as that of the second distribution, its variance is four times as large.

One useful way to think about the standard deviation of a distribution is the following. Suppose that the distribution of a variable is bell-shaped (more formally termed a normal distribution). If we picked randomly from the normal distribution, about two thirds of the time (roughly 68 percent) we would pick a value that was within one standard deviation of the mean of the distribution. For the case of Income Distribution #1 above, this means that if we randomly picked an income from this distribution, about two out of three times we would pick an income within 1000 pula of the mean income of 4500 pula.
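The link between the standard deviation and the variance can be verified on a small made-up example. In the sketch below, income1 and income2 are hypothetical variables with the same mean of 4500, and income2 has exactly half the spread of income1. The final line computes the share of a normal distribution that lies within one standard deviation of the mean; it assumes a Stata release in which the cumulative normal function is named normal().

```stata
* Toy data: five made-up incomes centred on 4500 (not BAIS values)
preserve
clear
input income1
3500
4000
4500
5000
5500
end

* Same mean, but every value is half as far from 4500
generate income2 = 4500 + (income1 - 4500)/2

quietly summarize income1
scalar var1 = r(Var)
scalar sd1  = r(sd)
quietly summarize income2
scalar var2 = r(Var)
scalar sd2  = r(sd)

* Twice the standard deviation means four times the variance
display "sd ratio = " sd1/sd2 ", variance ratio = " var1/var2
restore

* Share of a normal distribution within one standard deviation of the mean
* (about 0.68; the function is called norm() in Stata 8 and earlier)
display normal(1) - normal(-1)
```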

In Stata, there are a number of ways to determine the standard deviation of a variable, but the simplest is probably the summarize command. What are the mean and standard deviation of the variable educ by location in Botswana? To find out, type:

sort location
by location: summarize educ

_______________________________________________________________________________
-> location = Urban
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        educ |     996    9.065261   3.318103          1         22
_______________________________________________________________________________
-> location = Urban Vi
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        educ |     898     8.14922   3.369481          1         22
_______________________________________________________________________________
-> location = Rural
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        educ |    1554    7.404118    3.23537          1         25

 

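We discover that the mean and standard deviation differ across the three locations. Mean schooling ranges from 7.40 years in rural areas to 9.07 years in urban areas, while the standard deviations are closer together (between 3.24 and 3.37 years). Put another way, educational inequality appears to vary a bit by location.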

 

 

UNDERSTANDING THE DISTRIBUTIONS OF CATEGORICAL VARIABLES

Many variables in BAIS are categorical variables. Examples that we will work with in this subsection include citizen and occupation. In Module 3 we learned how to examine the frequency distribution of categorical variables. In this section, we ask whether there are statistics analogous to the mean and standard deviation (which we use to describe distributions of continuous variables) that we can use to describe distributions of categorical variables.

We begin with a cautionary note. Put simply, taking the mean of a categorical variable yields nonsense. Consider the citizen variable: Stata will let us compute the mean of it. Try the following:

means citizen

    Variable |    Type        Obs        Mean       [95% Conf. Interval]
-------------+----------------------------------------------------------
     citizen | Arithmetic    7734    1.556892         1.42649   1.687294 
             |  Geometric    7734    1.061038        1.051061   1.071109 
             |   Harmonic    7734    1.020026        1.016963   1.023107 
------------------------------------------------------------------------

We find the mean of the citizen variable is 1.56. What does this mean? As near as we can tell, it means nothing. It certainly does not mean that the average respondent is about halfway between a citizen of Botswana and a citizen of Malawi. The coding of the citizen variable was arbitrary. The BAIS data could just as easily have coded a Botswana citizen as 19, a citizen of Malawi as 20, a citizen of Namibia as 300, a citizen of South Africa as 4, and so on. With a different coding scheme the mean of the variable would change as well, although the distribution of citizens would not. If the mean of a categorical variable is nonsensical, are there other measures that do convey information? There are two: the mode and the range of the distribution.
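The point about arbitrary coding can be seen in a small made-up example. In the sketch below, four hypothetical respondents are coded 1 or 2, and the same respondents are then given the equally arbitrary codes 19 and 20. The variable citizen_alt and the scalar names are invented for this illustration.

```stata
* Toy data: three respondents in category 1, one in category 2
preserve
clear
input citizen
1
1
1
2
end

* An alternative, equally arbitrary coding of the same two categories
generate citizen_alt = 19 if citizen == 1
replace citizen_alt = 20 if citizen == 2

quietly summarize citizen
scalar mean_orig = r(mean)
quietly summarize citizen_alt
scalar mean_alt = r(mean)
display "mean with original codes = " mean_orig ", with new codes = " mean_alt

* The frequencies are the same in both tables: 3 and 1
tabulate citizen
tabulate citizen_alt
restore
```

The mean moves from 1.25 to 19.25 although nothing about the respondents has changed, which is why the frequency table, not the mean, is the informative summary here.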

The mode of the distribution is the value that appears most often in the sample. There is no Stata command that gives the mode without also giving us a bunch of other information. Probably the best way to compute the mode of a distribution is to use the tabulate command. The mode of the citizen variable tells us which category of citizens has the most respondents, while the mode of the occupation variable tells us which occupation was claimed by the most respondents. To find the mode of the citizen variable and occupation variable, type:

tab citizen

  country of |
 citizenship |      Freq.     Percent        Cum.
-------------+-----------------------------------
    Botswana |       7567       97.84       97.84
      Malawi |          9        0.12       97.96
     Namibia |          3        0.04       98.00
South Africa |         12        0.16       98.15
   Swaziland |          4        0.05       98.20
      Zambia |         46        0.59       98.80
    Zimbabwe |         31        0.40       99.20
    Tanzania |         13        0.17       99.37
       India |         17        0.22       99.59
          UK |          5        0.06       99.65
       Other |         27        0.35      100.00
-------------+-----------------------------------
       Total |       7734      100.00

tab occupation

               occupation |      Freq.     Percent        Cum.
--------------------------+-----------------------------------
           Administrators |         69        3.03        3.03
            Professionals |        107        4.70        7.73
              Technicians |        168        7.38       15.11
                   Clerks |        185        8.13       23.24
          Service Workers |        274       12.04       35.28
    Skilled Agric Workers |        446       19.60       54.88
            Craft Workers |        305       13.40       68.28
Plant & Machine Operators |        159        6.99       75.26
               Elementary |        544       23.90       99.17
               Not Stated |         12        0.53       99.69
             User Missing |          7        0.31      100.00
--------------------------+-----------------------------------
                    Total |       2276      100.00

We find that the mode of citizen is "Botswana" while the mode of the occupation variable is "Elementary".
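When a variable has many categories, scanning the Freq. column for the largest number can be tedious. One possible shortcut, sketched here, is the sort option of tabulate, which lists the categories in descending order of frequency so that the mode appears in the first row.

```stata
* Categories are listed from most to least frequent;
* the first row of each table is the mode
tabulate citizen, sort
tabulate occupation, sort
```

The counts are the same as in the unsorted tables; only the order of the rows changes.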

Just as the mean of the distribution of a categorical variable does not make any sense, neither does the standard deviation. Still, it may be useful to know the range of the values of a categorical variable. That is, what values are spanned by the variable codes? To compute the range of a variable, use the codebook command:

codebook occupation

occupation --------------------------------------------------------- occupation
                  type:  numeric (byte)
                 label:  OCCUP
                 range:  [1,11]                       units:  1
         unique values:  11                   coded missing:  5458 / 7734
              examples:  7     Craft Workers
                         .     
                         .     
                         .     

The above result tells us that the occupation variable ranges from 1 to 11. We could also learn this, and more, using the tabulate command with the nolabel option, which shows the numeric codes instead of the value labels.
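As a rough sketch of two other ways to see the range of the codes, we can tabulate without labels, or run summarize and look only at the minimum and maximum (the mean and standard deviation it reports should be ignored for a categorical variable).

```stata
* Show the numeric codes rather than the value labels
tabulate occupation, nolabel

* summarize stores the smallest and largest codes in r(min) and r(max)
summarize occupation
display "occupation codes run from " r(min) " to " r(max)
```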

 

 

COMBINING TABULATE AND SUMMARIZE

In Module 2 we introduced two important options of the tabulate command: missing and nolabel. Now that we have learned how to work with measures of central tendency in Stata, we can introduce a third tabulate option. This third option is both the most complicated and the most useful. Using the summarize option with tabulate allows us to examine how the average of one variable differs across the categories of a second variable. Working through an example is the easiest way to see how useful this option is.

For example, say we were interested in determining the average number of years living continuously in the current residence for each of the citizen groups in the BAIS data set. Up to this point, we would have been forced to use if qualifiers to restrict the group of observations that Stata works from. Thus, to answer our question, we would have entered the following:

summarize resyears if citizen == 1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    resyears |    3859    13.14382   12.45312          0         64

The above syntax provides us with the average number of years living continuously in the current residence for the citizens of Botswana. To obtain the rest of the averages, we would need to repeat the above syntax for each of the other 10 citizen groups. Instead of having to take all these steps, Stata provides us with a quicker way to obtain the exact same information.
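To make the repetition concrete, the separate summarize commands could be written as a loop. This is only a sketch, and it assumes that the 11 citizen groups are coded with the consecutive integers 1 through 11, which codebook citizen would confirm.

```stata
* Assumption: citizen is coded 1, 2, ..., 11
* Each pass summarizes resyears for one citizen group
forvalues c = 1/11 {
    summarize resyears if citizen == `c'
}
```

Even as a loop, this produces 11 separate output tables that we would have to read one by one.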

Let's start by just typing:

tab citizen

This command produces a table we have seen before, the frequency distribution of the variable citizen.

  country of |
 citizenship |      Freq.     Percent        Cum.
-------------+-----------------------------------
    Botswana |       7567       97.84       97.84
      Malawi |          9        0.12       97.96
     Namibia |          3        0.04       98.00
South Africa |         12        0.16       98.15
   Swaziland |          4        0.05       98.20
      Zambia |         46        0.59       98.80
    Zimbabwe |         31        0.40       99.20
    Tanzania |         13        0.17       99.37
       India |         17        0.22       99.59
          UK |          5        0.06       99.65
       Other |         27        0.35      100.00
-------------+-----------------------------------
       Total |       7734      100.00

While the table above provides us with useful information, to answer our question we do not need the actual number of citizens that fall in each category. Instead, we would like the average number of years living continuously in the current residence for each of these groups. This is where the summarize option (which can be abbreviated as sum) comes in. Using the sum option, we are able to obtain the mean values of one variable by the categories of another. So, working from our example, to obtain the average number of years living continuously in the current residence for each citizen group, we would enter the following in Stata:

tab citizen, sum(resyears)

            |       Summary of years living
 country of |      continuously in locality
citizenship |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Botswana |    13.14382   12.453118        3859
     Malawi |           3   1.7320508           7
    Namibia |           2           0           1
  South Afr |       3.875   3.0908852           8
  Swaziland |           0           0           3
     Zambia |        2.72   3.4583233          25
   Zimbabwe |   3.0526316   2.7380789          19
   Tanzania |          13   19.899749           7
      India |   6.4545455    3.266914          11
         UK |           4    6.164414           4
      Other |   8.4210526   4.3883152          19
------------+------------------------------------
      Total |   12.929599   12.403279        3963

Unlike the previous table, which gave us the frequency distribution of the variable citizen, the table above reports the mean, standard deviation, and number of observations that the summarize command would produce, except that we have this information for each of the citizen groups. Note that the Freq. column counts only respondents with a non-missing value of resyears, which is why the total here is 3963 rather than 7734. As we can see, the combination of the tabulate command and the sum option is a very powerful tool for obtaining summary information.
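A few variations on this command may be worth trying. The sketch below uses only standard tabulate and tabstat options, but which form is most convenient is a matter of taste rather than a recommendation from this module.

```stata
* Show only the means, dropping the Std. Dev. and Freq. columns
tabulate citizen, summarize(resyears) nostandard nofreq

* tabstat gives a similar table and lets us choose other statistics,
* for example the median (p50) alongside the mean
tabstat resyears, by(citizen) statistics(mean p50 sd n)

* by runs the full summarize command once for each citizen group
by citizen, sort: summarize resyears
```

The first command is the same table as above with fewer columns, while the tabstat version is useful when we want statistics such as the median that tabulate with sum does not report.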

OK, now that we have been through one example, let's see if we understand how and when to use the sum option. Give the following question a try:

5. How would we compute the average age by location?
Question 5 Answer

 

EXERCISES

Now it is our turn to explore the measures of central tendency and variability using the commands from Module 4. Using Stata and the BAIS data set, answer the following questions.

  1. What is the average age of household heads in the data?
     Exercise 1 Answer
  2. Do people with years of schooling at or above the median level report a higher age at first sexual intercourse than those who report years of schooling below the median level, and if so, by how much?
     Exercise 2 Answer
  3. Is the number of years married more widely distributed in rural areas or in urban areas?
     Exercise 3 Answer
  4. What language is spoken by more respondents than any other? That is, what is the modal language?
     Exercise 4 Answer
  5. Is the percentage of respondents who report "professional" as their occupation higher for people who have been away from home for more than a month or for people who have not been away from home for more than a month?
     Exercise 5 Answer

 
