UNDERSTANDING DISTRIBUTIONS

 

TABLE OF CONTENTS

Introduction
Variable Types
Count Command
Frequency Distribution Tables
Using [_n]
Frequency Distribution Graphs
Exercises

INTRODUCTION

Now that we have acquired some basic Stata skills, we are ready to begin analyzing the data. The immediate problem is: where do we begin? As we have seen, there is an enormous number of variables and observations at our disposal, but none of this information is especially useful in its raw form. How can we summarize this massive amount of information simply and quickly to make it more accessible?

From Module 2, we now know how to open a data file and examine the file's contents in Stata. It turns out that we have more information than we can usefully process. To see this, load the BAIS data set and simply type:

list age

A large number of observations scroll by. After seeing these numbers fly by, do you now know more about the age distribution in Botswana? Probably not. We need a handy way to summarize lots of quantitative raw data, and Stata is very helpful here. For example, while we could, in principle, simply count the thousands of observations as they scroll by, Stata commands essentially do this in a much more efficient and sophisticated fashion. We will start by learning how to answer the following questions:

1. How many South African citizens are in the data?
2. What percentage of the sample is married?
3. How many 30 year old men in the sample reside in rural areas?
4. Are there more men or women over the age of 60?
5. What percentage of the BAIS sample is made up of sons and daughters?
Using Stata, these questions are quickly and easily answered. First, though, we need to learn about different types of variables.

 

VARIABLE TYPES

In the BAIS data, there are several types of variables. If we have loaded bais.dta into Stata, we will see the list of variables in the "Variables" window. Economists, sociologists, and psychologists use different language to describe the multiple types of variables. Here, we will divide variables into two types: continuous variables and categorical variables. In addition, a variable of either type can describe the individual or the entire household, so we must also understand the difference between an individual-level variable and a household-level variable. The statistical and graphical tools used to understand the distributions of the various types of variables are quite different, so it is important to understand the differences between them.

Continuous Variables: Continuous variables have an infinite number of possible values that fall between any two observed values. For example, consider age. In our data, age is recorded in years. But it could have been recorded in months, days, minutes, or even seconds. A continuous variable is ordinal in the sense that its values have an inherent order. In the age example, an age of 16 years is one year older than an age of 15 years, so the unit of measurement between these two values is itself meaningful. (This may seem like common sense, but when we consider categorical variables, this will no longer be true.) Examples of continuous variables in the BAIS data set include age (age), years of education (educ), and the number of times a respondent has given birth (nobirth).

We are actually not being terribly careful with our definitions. Consider, for example, the variable that counts the number of births a respondent has had (nobirth). This variable might be 4 or 5, but it will never be 4.34. Nonetheless, if we were told that the average number of births was 4.34, this would be comprehensible. We would know that, on average, there are more than 4 births and fewer than 5. We are going to treat variables like nobirth as continuous variables. (Some disciplines refer to these as discrete variables.) Taking the average gives an answer that is readily interpreted. Taking the average of a categorical variable, on the other hand, yields nonsense.

Categorical Variables: Categorical variables, also known as nominal variables, are made up of separate and distinct categories that do not have an inherent order. To code these variables, each category is typically assigned a value, but this assignment is arbitrary. Take, for example, the religion variable, religion. Each religious group is assigned an arbitrary value. In the data set, if a person is Christian, the religion variable for that person is set to 1. If the person is Muslim, the value is set to 2, Hindu is 4, and Other is 6. For the gender variable, gender, males are coded as 1 and females as 2. Other examples of what we will consider categorical variables include the household id number (hhid), the relationship to the head of household (relhead), and location (location).

A special type of categorical variable is a dummy variable. A dummy variable typically takes on a value of one if the observation meets specified criteria and a value of zero otherwise. There are many dummy variables in the BAIS data. We will often want to create dummy variables ourselves. For example, if we wanted to create a dummy variable for whether a respondent is the household head, we could use the following Stata commands.

generate head = .
replace head = 1 if relhead == 0
replace head = 0 if relhead ~=0

OR

generate head2 = relhead==0
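One detail worth checking is how these commands treat missing values. The sketch below uses a small made-up data set (the five relhead values are hypothetical, with 0 meaning household head as in the text) to compare the two approaches above and to show a third variant, head3, which is our own illustration and not part of the original module. Stata treats a missing value as not equal to 0, so both original approaches code a missing relhead as 0.

```stata
* Toy data: relhead values are hypothetical (0 = head, as in the text)
clear
input relhead
0
1
2
.
0
end

* The two approaches shown above
generate head = .
replace head = 1 if relhead == 0
replace head = 0 if relhead ~= 0
generate head2 = relhead == 0

* Illustrative variant: leave the dummy missing when relhead is missing
generate head3 = relhead == 0 if !missing(relhead)

list relhead head head2 head3
```

In the listing, head and head2 agree in every row, while head3 is missing in the row where relhead is missing. Which behavior is appropriate depends on the analysis.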

In general, it is important to know the types of variables you are using because some of the tools used to analyze variables differ depending on whether the variable is continuous or categorical. Another point to keep in mind is whether the variable you are using is an individual-level or household-level variable.

Individual-level Variables: Individual-level variables are made up of values that are unique to each individual respondent. An example is the variable for age (age). To see an example of an individual-level variable, type the following:

sort hhid
list hhid age

As we can see, each person in a household has his or her own age value, which can differ from the values of the other household members. Other examples in the BAIS data set would include the following variables: relhead (relationship to household head), educ (educational attainment level), and gender.

Household-level Variables: Household-level variables have the same value for every person in the household. An example would be the variable transport1 (whether a household member owns a car). To see an example of a household-level variable, type the following commands:

sort hhid
list hhid transport1

As we can see, each person in the same household has the same value for the transport1 variable. Other examples in the BAIS data set include the following variables: hhorphans (are there orphans in the household) and numdeaths (number of household members that have died in last 12 months).
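Scrolling through a listing is one way to judge whether a variable is household-level; a rough programmatic check is sketched below. It runs on a made-up six-person data set (the values are hypothetical, not BAIS data) and the names hh_const and age_const are ours. The idea is to sort each household by the variable in question and compare the first and last values, which match only when the variable is constant within the household.

```stata
* Toy data: three hypothetical households
clear
input hhid age transport1
1 34 1
1 31 1
1  5 1
2 60 0
2 58 0
3 22 1
end

* Within each household, sort by the variable being checked and
* compare its first and last values
bysort hhid (transport1): generate byte hh_const = transport1[1] == transport1[_N]
bysort hhid (age): generate byte age_const = age[1] == age[_N]

tabulate hh_const
tabulate age_const
```

Here hh_const equals 1 for every observation, as expected for a household-level variable, while age_const equals 0 in the two multi-person households. A one-person household trivially passes the check.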

 

COUNT COMMAND

COUNT counts the number of observations that satisfy specified conditions. If no conditions are specified, count displays the number of observations in the data set. For example, to count the number of observations in the BAIS data set, we would type:

count

The results should show that there are 7734 observations in this data set. Now try an example using a qualifier. For instance, suppose we want to count the number of females in the data set.

count if gender==2

The results should show that there are 4051 females in the data set.
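Qualifiers can also be combined with & (and) and | (or). The sketch below uses six made-up observations rather than the BAIS data; the codes follow the text (gender 1 = male, 2 = female; location 3 = rural). It also guards against a common surprise: Stata treats a missing value as larger than any number, so a condition such as age > 60 would otherwise count observations with a missing age.

```stata
* Toy data: six hypothetical respondents
clear
input age gender location
30 1 3
30 2 3
30 1 1
45 1 3
30 1 3
62 2 3
end

* 30-year-old men living in rural areas
count if age == 30 & gender == 1 & location == 3

* Respondents over 60, excluding anyone with a missing age
count if age > 60 & !missing(age)
```

On this toy data the first count is 2 and the second is 1.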

 

FREQUENCY DISTRIBUTION TABLES

A frequency distribution table is simply a listing of all observed values for a given variable and the number of observations that fall under each of these values. To create a frequency distribution table in Stata, we use the command tabulate.

For example, to create a frequency distribution table for the categorical variable marstat, you type:

tab marstat

The above command produces the following distribution table in the Stata Results window:

 marital status |      Freq.     Percent        Cum.
----------------+-----------------------------------
        Married |        576       14.66       14.66
Living Together |        552       14.05       28.72
       Divorced |         41        1.04       29.76
        Widowed |         89        2.27       32.03
      Separated |         17        0.43       32.46
  Never Married |       2653       67.54      100.00
----------------+-----------------------------------
          Total |       3928      100.00
 

As we can see from the table, there are six distinct categories or values found within the marstat variable: Married, Living Together, Divorced, Widowed, Separated, and Never Married. Stata gives us three numbers for each observed value. The column with the header "Freq." is the number of observations that fall within each category. Thus, we can now answer the question, "How many married respondents are in the data?" There are 576 married respondents in the BAIS data.

The second column with the header "Percent" represents the percentage of the sample that falls within each observed category. Thus, we can now answer the question "What percentage of the sample is made up of widowed respondents?" Widowed respondents make up 2.27% of the BAIS sample.

The third column with the header "Cum." represents the cumulative percentage: the percentage of observations in that category and in all of the categories listed above it. For example, 29.76 percent of the observations in the table have the values Married, Living Together, or Divorced.

Ok, now it is your turn to answer a few questions:

1. What percentage of the sample is 70 years old and younger?
Question 1 Answer
2. How many resident heads are there in the BAIS sample?
Question 2 Answer

There are two options that are used with the command tabulate that are worth noting. The first is the nolabel option. When we use the nolabel option with the tabulate command, the value labels that sometimes appear in place of the actual recorded numeric value will not be displayed. Instead, the original numeric value will be displayed in the table. This option can best be understood using an example. Let's use the gender variable for our example. Start by displaying a frequency distribution table for gender without any options:

tab gender

Without the nolabel option the tabulate command produces the following table when used with the variable gender:

     sex of |
 respondent |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |       3683       47.62       47.62
     Female |       4051       52.38      100.00
------------+-----------------------------------
      Total |       7734      100.00

From the table above, we see that there are two observed gender values displayed in the table, "Male" (indicating a male respondent) and "Female" (indicating a female respondent). These value labels are used to help the user identify what each numeric value represents. Thus, instead of displaying an arbitrary number, text has been substituted in the numeric value's place. To display the actual numeric values, use the nolabel option with the tabulate command:

tab gender, nolabel

With the nolabel option, the tabulate command produces the following table when used with the variable gender:

     sex of |
 respondent |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       3683       47.62       47.62
          2 |       4051       52.38      100.00
------------+-----------------------------------
      Total |       7734      100.00

As we can see, the two previous tables are identical except that the value labels "Male" and "Female" have been replaced by the actual numeric values. At times it is useful to see the actual numeric values of a variable instead of these value labels. For example, if we were referencing the values for gender in a command, we would need to use the actual numeric values, not the value labels. To emphasize this point, let's try to generate a new variable by recoding the variable gender.

Go ahead and give this exercise a try:

3. How would you create a new variable that is equal to 1 if the respondent is a woman and equal to 0 if the respondent is a man (use the gender variable to identify the gender of the respondents)?
Question 3 Answer

Did you have trouble with question 3? If you did, most likely you were using the wrong values. Although the value label is often what is displayed in the frequency distribution table, it is not the actual value that is stored in the data set. You must reference the original numeric value for the replace command to work correctly. This is why the nolabel option is so helpful: when it is used with the tabulate command, the original numeric value is displayed.

A second option used with the tabulate command is missing. The missing option adds missing values to the table as a category of their own. Thus, to include the missing values of the variable marstat in the table, we would type:

tab marstat, missing

The above syntax displays the following table in the Stata Results Window:

 marital status |      Freq.     Percent        Cum.
----------------+-----------------------------------
        Married |        576        7.45        7.45
Living Together |        552        7.14       14.58
       Divorced |         41        0.53       15.12
        Widowed |         89        1.15       16.27
      Separated |         17        0.22       16.49
  Never Married |       2653       34.30       50.79
              . |       3806       49.21      100.00
----------------+-----------------------------------
          Total |       7734      100.00

From the table above, we see that a new category has been added to the marstat variable distribution. This new category, represented by a dot, reflects the number of system missing observations. It is important to note that the raw percentages and cumulative percentages are different from those presented in the table created without the missing option. This is due to the increase in the number of total observations recognized within the distribution.
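The change in the percentages is just a change of denominator, which can be verified with the display command using the counts from the two tables above. This is a small arithmetic check, not a step from the original module.

```stata
* Married as a share of nonmissing observations (table without missing)
display %5.2f 100 * 576 / 3928

* Married as a share of all observations (table with missing)
display %5.2f 100 * 576 / 7734

* Number of missing observations implied by the two totals
display 7734 - 3928
```

The three results, 14.66, 7.45, and 3806, match the Married rows and the dot row in the tables.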

Now that we have introduced you to some of the basics, let's see how well you can use these new commands.

Try these quick exercises:

4. How many respondents are currently studying in Standard 6?
Question 4 Answer
5. Are there more men or women over the age of 60?
Question 5 Answer
6. How many Muslim men in the sample reside in rural areas?
Question 6 Answer
7. How many respondents in the sample have a missing value for the variable identifying the relationship to the head of household?
Question 7 Answer

 

USING [_n]

The above examples counted the number of observations in certain sub-populations of the BAIS data set. However, suppose we just want to count the number of households in the data set. To accomplish this task, [_n] is very helpful. In Stata, _n is the number of the current observation, so hhid[_n-1] refers to the value of hhid in the previous observation. Using [_n] allows us to pick out a single observation per household, so that each household is counted once rather than once per member. For this example, we would want to type the following:

sort hhid
count if hhid~=hhid[_n-1]

1782

The result, 1782, is the number of households in the data set. For this qualifier to work, we must first sort the data by hhid. The hhid~=hhid[_n-1] qualifier tells Stata to go through the observations and count one only if its hhid is not equal to the hhid of the observation before it. Here is a visual display of what Stata is doing:

   hhid     counted
    1          1  
    1          .  
    1          .  
    1          .  
    1          .  
    2          1  
    2          .  
    2          .  
    2          .  
    2          .  
    3          1  
    3          .  
    3          .  
    4          1  
    5          1  
    5          .  
    6          1  
    6          .  
    6          .  
    6          .  
    6          .  
    6          .  
Total:  22 observations          6 households

This is only the tip of the iceberg of what [_n] can do, and we will use it more in future modules, but let's try one more example. Suppose we want to create a new type of household variable that is recorded only for the first person in the household. Let's look at the household variable for whether there has been a death in the last 12 months (hhdeaths). If we were to sort by hhid and then list hhid and hhdeaths, we would see something like the following:

           hhid  hhdeaths 
   1.         1        No  
   2.         1        No  
   3.         1        No  
   4.         1        No  
   5.         1        No  
   6.         2        No  
   7.         2        No  
   8.         2        No  
   9.         2        No  
  10.         2        No  
  11.         3        No  
  12.         3        No  
  13.         3        No  
  14.         4        No  
  15.         5        No  
  16.         5        No  
  17.         6        No  
  18.         6        No  
  19.         6        No  
  20.         6        No  
  21.         6        No  
  22.         6        No  

As we can see, Stata has produced a list that shows everyone with the same household identification number (hhid) as having the same hhdeaths value. However, we want to produce a list in which Stata only shows the household deaths value for the first person in the household. To do this, we must create a new variable that is slightly different from the hhdeaths variable.

sort hhid
gen hhdeaths2=hhdeaths if hhid~=hhid[_n-1]

We have just created the new variable hhdeaths2. The command above generates the new variable hhdeaths2 and assigns the values based on the original variable (hhdeaths), but it only records the new values if the hhid value for a given observation is different from the hhid value directly "above" it. To better understand this concept, we should do the following:

sort hhid
list hhid hhdeaths hhdeaths2

           hhid  hhdeaths  hhdeaths2 
   1.         1        No          2  
   2.         1        No          .  
   3.         1        No          .  
   4.         1        No          .  
   5.         1        No          .  
   6.         2        No          2  
   7.         2        No          .  
   8.         2        No          .  
   9.         2        No          .  
  10.         2        No          .  
  11.         3        No          2  
  12.         3        No          .  
  13.         3        No          .  
  14.         4        No          2  
  15.         5        No          2  
  16.         5        No          .  
  17.         6        No          2  
  18.         6        No          .  
  19.         6        No          .  
  20.         6        No          .  
  21.         6        No          .  
  22.         6        No          .  

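With the new variable hhdeaths2, only the first person in the household receives the value of the household-level variable, hhdeaths. Why do this? Because hhdeaths2 is nonmissing only once per household, tabulating or counting it summarizes households rather than individuals. Otherwise, every household would be counted once for each of its members, and large households would carry more weight than small ones, which would bias household-level results.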
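The 22-observation illustration above can be reproduced directly. The sketch below builds the six households by hand (the household sizes come from the display above; the numeric code 2 for "No" is an assumption based on the listing) and then applies both [_n] commands. It also shows an equivalent by-group form, using a variable we call first, in which _n restarts at 1 within each household.

```stata
* Rebuild the 22-observation illustration (not the actual BAIS data)
clear
set obs 22
generate hhid = 1 in 1/5
replace hhid = 2 in 6/10
replace hhid = 3 in 11/13
replace hhid = 4 in 14
replace hhid = 5 in 15/16
replace hhid = 6 in 17/22
generate hhdeaths = 2   // assumed numeric code behind the label "No"

sort hhid

* Count households: one observation per change in hhid
count if hhid ~= hhid[_n-1]

* Keep the household value for the first person only
generate hhdeaths2 = hhdeaths if hhid ~= hhid[_n-1]

* Equivalent by-group form: within each household, _n starts again at 1
by hhid: generate byte first = _n == 1
count if first
```

Both counts return 6. Note that generate does not copy value labels, which is why the listing above shows hhdeaths as "No" but hhdeaths2 as the underlying number.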

 

FREQUENCY DISTRIBUTION GRAPHS

Now that we know how to use do files, we can begin to draw good-looking graphs.

While the tabulate command gives us one way to understand the frequency distribution of a variable, graphing is another way. In principle, each can convey the same information. Often, though, graphs are more readily interpreted. If we are trying to convey information to a colleague with limited (or no) quantitative training, a graph sometimes will be more effective than a table. Also, graphs are, even for those with lots of statistical sophistication, a wonderful way to get a quick feel for the information in the data set.

Stata is a very powerful graphing tool, and in this module we introduce the basics. We will learn how to create and interpret three kinds of graphs: histograms, bar graphs, and pie graphs. Let's go!

First things first, in general we can think of the graph command as having the following form:

[graph] [graph type] [plot type] [if exp] [in range] [, graph type_options],

where graph type can be twoway, matrix, bar, dot, box, pie, or other;

plot type is mainly for the twoway graph type and can be scatter, line, bar, or dot, among others.

Type: help graph_twoway to see all the possibilities. For now, let's learn about the basic graphs - histograms, bar charts, and pie charts.

 

HISTOGRAMS

A histogram is a graphical tool that tells us the fraction of observations of a given variable that fall within different ranges. Histograms are used for continuous variables. As a running example, we will consider the variable agemar, which is the age of the respondent at first marriage. To draw a histogram, we can leave almost everything up to Stata. To tell Stata to draw a histogram using the agemar variable, we type the following:

graph twoway histogram agemar, fraction

We will see Stata draw the following graph.

 

NOTE: In the past (prior to Stata 8), this initial graphing command required additional specifications to be useful. The newer version of Stata makes graphing a bit easier and more robust. As mentioned above, this module keeps instruction fairly simple by focusing on the basic graphing commands. For a more thorough and more complex graphing tutorial, visit our Graphing Module.

This first graph is a good start, but this command allows for many more specifications. Instead of using "fraction," we could have specified density, frequency, or percent. Each would produce a similar looking histogram but each would have a different y-axis. For now, let's continue with fraction and later when we show you how to combine graphs, we'll show you what those other options produce.

To improve this first histogram, let's type:

#delimit ;

histogram agemar, frac
title("Graphing Example - Age at First Marriage")
xtitle("Respondent's Age at First Marriage")
note("Source: Botswana AIDS Impact Survey 1")
ylabel(0(.05).15, angle(horizontal)) ytick(0(.025).15)
xlabel(0(10)70) xtick(0(5)70);

Now we should see the following histogram:

 

This is much better!

Let's go over the syntax that created this good-looking histogram. First, given the length of the Stata command, we used #delimit ; in our do file to tell Stata that a command ends with a semicolon rather than with a carriage return (Stata's default). This allows us to enter commands on multiple lines, as we did above. Next, note that we did not need to include graph twoway to tell Stata what we wanted; in this case it is only necessary to type histogram. Similarly, we did not need to spell out fraction completely; frac works just as well.

Next, we told Stata to title the graph "Graphing Example - Age at First Marriage". Then we changed the default x-axis title, which is based on the variable label, to "Respondent's Age at First Marriage". We also included a note at the bottom of the graph that informs the reader where the data came from; in our case we are using BAIS data. The remaining lines of syntax tell Stata how to reformat the x- and y-axes. We told it to label the y-axis from 0 to .15 in increments of .05. Then we asked for tick marks to be placed between the labeled units, every .025. The angle(horizontal) option told Stata to print the y-axis labels horizontally rather than in their default orientation. Similarly, we asked Stata to label the x-axis from 0 to 70 in increments of 10 and to include tick marks in increments of 5.

We can see that we have come a long way. This graph tells us, among other things, that a majority of respondents are getting married between the ages of 20 and 35. We could be more specific if we counted the number of bins (bars) and estimated the fraction of the observations in each bin.
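As a side note on syntax, the semicolon delimiter stays in effect until it is switched off with #delimit cr, and Stata also accepts /// at the end of a line as a continuation marker. The sketch below illustrates both styles in a do file. It uses simulated ages rather than the BAIS data, and runiform() assumes a reasonably recent version of Stata (older versions used uniform()).

```stata
* Simulated ages, purely for illustration (not BAIS data)
clear
set seed 12345
set obs 200
generate agemar = 15 + floor(40 * runiform())

* Style 1: semicolon delimiter, then restore the default with #delimit cr
#delimit ;
histogram agemar, frac
    title("Age at First Marriage (simulated)")
    xtitle("Age in years");
#delimit cr

* Style 2: keep the default delimiter and continue lines with ///
histogram agemar, frac ///
    title("Age at First Marriage (simulated)") ///
    xtitle("Age in years")
```

Both commands draw the same histogram. Which style to use is a matter of taste; the examples in this module use the semicolon delimiter.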

While these first two graphs are very informative, we could also look at the distribution separately by gender. We can accomplish this with the "by" option. To graph age at first marriage by gender, we type:

sort gender;
histogram agemar, frac by(gender);

 

 

We now see how the distribution of agemar varies by gender. Stata placed each of the gender categories in one graph to make comparison easy. Surely, however, we can make this graph look better - to do so, we can type:

#delimit ;

histogram agemar, frac by(gender)
title("Graphing Example - Age at First Marriage")
xtitle("Respondent's Age at First Marriage")
note("Source: Botswana AIDS Impact Survey 1")
ylabel(0(.05).15, angle(horizontal))
ytick(0(.025).15)
xlabel(0(10)70)
xtick(0(5)70);

We should get the following graph:

 

Looking at the graph above, there is still some room for improvement. The first thing to realize is that the by(group) option is treated as a "repeating" option, meaning that Stata plotted each gender category individually and then merged the plots into a single graph, which is what we see. As a result, each of the gender categories has its own title, x-axis, and note attached to it, but that is not what we want. Instead we should type the following set of commands:

#delimit ;

histogram agemar, frac
by(gender, title("Graphing Example - Age at First Marriage")
note("Source: Botswana AIDS Impact Survey 1"))
xtitle("Respondent's Age at First Marriage")
ylabel(0(.05).15, angle(horizontal)) ytick(0(.025).15)
xlabel(0(10)70) xtick(0(5)70);

 

 

Now that's much better.

Notice how we had to include both our title and our note inside the by(group) option. The remaining instructions are similar to the ones for the second graph above.

For now, study the following graph. Type:

#delimit ;

histogram agemar, frac
by(gender, title("Graphing Example - Age at First Marriage")
subtitle("(in Years)")
note("Source: Botswana AIDS Impact Survey 1")
caption("Botswana Distance Learning Project") row(1))
xtitle("")
ylabel(0(.05).15, angle(horizontal)) ytick(0(.025).15)
xlabel(0(10)70, angle(vertical)) xtick(0(5)70);

 

 

The final graph in this age at first marriage example demonstrates how versatile a graphing tool Stata is. Remember that this is merely the tip of the graphing iceberg. For more details, see our Graphing Module, consult the Stata GRAPHING manual, or use the online graphing help (type help graph).

 

BAR & PIE GRAPHS

Like histograms, bar and pie graphs are graphical tools, visual aids if you will, that inform the reader of the distribution of a particular variable. Unlike histograms, however, bar and pie graphs are used for categorical variables only. As an example, we will consider the variable location, which identifies the location of a given respondent's residence. As with any other analysis, we must know what the variable we're working with "looks" like. In our case, we can do this by simply tabulating location.

Type:

tab location, missing

   location of |
     household |      Freq.     Percent        Cum.
---------------+-----------------------------------
         Urban |       1808       23.38       23.38
Urban Villages |       2019       26.11       49.48
         Rural |       3907       50.52      100.00
---------------+-----------------------------------
         Total |       7734      100.00

This table gives us a sense of what we should expect. It gives us a frame of reference against which to compare our resulting graphs. It also tells us that there are no missing cases in our variable. Now, to construct a bar graph for a categorical variable, we first construct a dummy variable for each category of that variable (the same dummy variables can also be used for a pie graph). Thus, we begin by creating dummy variables for each of the location categories.

As we can see, the numeric values for the variable location are not displayed. Instead, the value labels are listed in the distribution table (Urban, Urban Villages, and Rural). To determine the numeric values, which we need to construct the new dummy variables, we will need to use the "nolabel" option.

Type:

tab location, missing nolabel

location of |
  household |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1808       23.38       23.38
          2 |       2019       26.11       49.48
          3 |       3907       50.52      100.00
------------+-----------------------------------
      Total |       7734      100.00

 

A frequency distribution table is displayed. This time, instead of the value labels "Rural", "Urban", and "Urban Villages", the numeric values are shown. From the table we see that there are three numeric values found in the distribution of the location variable: 1, 2, and 3.

Now that we have identified these values, we must create dummy variables for each. We do this using the generate command.

Type:

gen loc1 = location==1

gen loc2 = location==2

gen loc3 = location==3
 

We created three new dummy variables (loc1, loc2, and loc3), one for each value of location. From here we are ready to graph the distribution. To create a bar graph for the categorical variable location type:

#delimit ;
graph bar loc1 loc2 loc3;

The command above creates the following bar graph:

 

The graph contains three distinct columns, one for each of the location values. The height of each column reflects the fraction of observations that fall into that category: by default, graph bar plots the mean of each variable, and the mean of a 0/1 dummy variable is the fraction of observations coded 1. From the graph we see that the number of respondents living in rural areas is much larger than the number of urban respondents or urban village respondents. This initial bar graph is informative, of course, but if we left it as is we would be wasting Stata's great graphing capabilities. Let's try to make it better. Type:

graph bar loc1 loc2 loc3,
title("Location of Respondent's Household")
note("Source: Botswana AIDS Impact Survey 1")
caption("Botswana Distance Learning Project")
ylabel(0(.10).50, angle(horizontal)) ytick(0(.05).50) ymtick(0(.025).50)
blabel(bar, position(inside) format(%9.2f) color(white))
legend(label(1 "Urban") label(2 "Urban Village") label(3 "Rural"))
bargap(25);
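The y-axis of this bar graph runs from 0 to .50 because graph bar plots the mean of each variable by default, and the mean of a 0/1 dummy is the share of observations coded 1. A minimal sketch on made-up data shows the link to the tabulate percentages; the eight location codes below are hypothetical, with 1 = Urban, 2 = Urban Villages, and 3 = Rural as in the table above.

```stata
* Toy data: eight hypothetical respondents
clear
input location
1
1
2
3
3
3
3
3
end

generate loc1 = location == 1
generate loc2 = location == 2
generate loc3 = location == 3

* Percentages from tabulate ...
tabulate location

* ... match the means of the dummy variables (as fractions)
summarize loc1 loc2 loc3
```

Here tabulate reports 25.00, 12.50, and 62.50 percent, and summarize reports means of .25, .125, and .625, which are the heights graph bar would draw for loc1, loc2, and loc3.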

 

 

Unlike a bar graph, a pie graph does not require us to construct indicator variables for each value of the location variable. If you have already created these indicator variables, however, you can use them just as before to create a pie graph. All we have to do is replace "bar" with "pie" in the command (i.e., graph pie loc1 loc2 loc3).

Now, to create a pie graph for the categorical variable location without using the newly created dummy variables type:

graph pie, over(location)
title("Location of Respondent's Household")
note("Source: Botswana AIDS Impact Survey 1")
caption("Botswana Distance Learning Project")
pie(1, color(green))
pie(2, color(orange))
pie(3, explode color(red))
plabel(_all percent, color(white))
legend(label(1 "Urban") label(2 "Urban Village") label(3 "Rural") row(1) order(3 2 1));

 

The command above creates the following pie graph:

 

The command that created this pie graph should be fairly intuitive, but let's take a closer look to make sure. First off, the actual pie command can take three different forms:

  1. It can take the form used above, where we told Stata to draw one slice for each category of the variable location. Type:
    graph pie, over(location);
  2. It can use the same dummy variables we used for the bar graph (loc1, loc2, and loc3). Type:
    graph pie loc1 loc2 loc3;
  3. It can draw a separate pie graph of location for each category of another variable, say gender. Type:
    graph pie loc1 loc2 loc3, by(gender);

The title, note, and caption specifications should be clear - just remember to enclose your text in quotes for each of these. The pie specification allows you to format each of the pie slices if you like. In the example above, our variable location has three categories, thus we have three pie slices to specify if we choose to. In our example, we told Stata to color slice number 1 green and to color the second slice orange. For the third slice, we told Stata to not only color it red, but also to "explode" it from the other 2 slices, which emphasizes the Rural slice.

We also specified the plabel (pie label) option, which is analogous to the blabel (bar label) option we used in the bar graph above. With it we told Stata to label _all of the pie slices with their corresponding percent and to display that number in white. Finally, using the legend option, we replaced the default label of each category in our variable. We also instructed Stata to present the labels in a single row, using row(1), and to arrange them in the order we listed, using order(3 2 1). The default order is based on the values of the categories. In our case, category 1 is Urban, category 2 is Urban Village, and category 3 is Rural. The default ordering is (1 2 3), which would be fine, but in this example we decided to reverse the order of the legend labels to present Rural first, then Urban Village, and finally Urban.

NOTE: the suboptions for the pie slices, pie labels, and the legend must be specified within parentheses (also known as round brackets), as shown in the examples above.
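The pie graph command above ends with a semicolon, so it relies on the semicolon delimiter being active. As a minimal sketch, assuming you are instead working in a do-file with Stata's default delimiter, the same command can be written with /// line continuations. Every option is copied from the command above; only the layout differs.

```stata
* Same pie graph, written for a do-file that uses the default delimiter.
* Each /// tells Stata that the command continues on the next line.
graph pie, over(location)                                  ///
    title("Location of Respondent's Household")            ///
    note("Source: Botswana AIDS Impact Survey 1")          ///
    caption("Botswana Distance Learning Project")          ///
    pie(1, color(green))                                   ///
    pie(2, color(orange))                                  ///
    pie(3, explode color(red))                             ///
    plabel(_all percent, color(white))                     ///
    legend(label(1 "Urban") label(2 "Urban Village")       ///
           label(3 "Rural") row(1) order(3 2 1))
```

Note that /// works in do-files but not when typing directly into the Command window.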

Overall, we see that a majority of the BAIS households are located in rural settings. The distribution displayed in the pie graph is identical to the distribution displayed in the bar graph, although the presentation is a bit different. In this case, the pie graph appears to be better at calling attention to a particular category (Rural in this example), because we separated that slice from the others.

Now that we have learned some of the basics of graphing categorical variables, let's turn to some of the more helpful advanced commands. Creating dummy variables can be quite tedious when a categorical variable has a large number of categories. Luckily, Stata allows us to create dummy variables much more quickly and easily. Instead of using the generate and replace commands to create a dummy variable for each category of a given variable, we can combine the tabulate command with its generate option.

For example, if we wanted to graph the categorical variable cause_1 (which represents "the cause of the most recent death in the household" and has 12 categories), it would take quite a bit of time to create the necessary dummy variables one at a time. However, by using tabulate together with its generate option, these dummy variables can be created in one step.

First, however, since cause_1 is a household-level variable, we need to recode it into a new variable that accounts for household size. In other words, we want to construct a new variable that has a non-missing value for only one member of each household instead of for all members. This is necessary because Stata cannot determine the level at which a variable was measured or the level of analysis we are interested in; it treats every variable as an individual-level variable unless we tell it otherwise. We can do this using the [_n] subscript from above by simply typing:

sort hhid
generate hhcause1=cause_1 if hhid~=hhid[_n-1]

Now when we create a graph for this new categorical variable, each household will only contribute a single piece of information. If we did not take this initial step, each household would contribute X pieces of information, where X is equal to the size of a given household.  So if we did forget to take this first step, our graph would be biased. Larger households in the BAIS data file would contribute more information than smaller households. The bigger the household, the more times that household's cause_1 value would be used in constructing our new graph.
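To see the size of this bias, here is a small sketch using a made-up data set of five people in three households. The numbers are invented for illustration and are not BAIS data; only the variable names hhid, cause_1, and hhcause1 come from the text above. It is meant to be run from a do-file with the default delimiter.

```stata
* Toy illustration with invented numbers (NOT BAIS data).
* preserve/restore puts back whatever data set is currently in memory.
preserve
clear
input hhid cause_1
1 3
1 3
1 3
2 1
3 1
end

sort hhid
* Keep cause_1 only on the first row of each household
generate hhcause1 = cause_1 if hhid != hhid[_n-1]

* Person-level table: household 1 is counted three times
tabulate cause_1
* Household-level table: each household is counted once
tabulate hhcause1
restore
```

In the toy data, cause 3 appears on three of the five person-level rows but in only one of the three households, so the two tables give quite different pictures.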

So, after creating the new hhcause1 variable, we can now create the multiple dummy variables that we need by typing the following into Stata:

tab hhcause1, gen(temp)

All the necessary dummy variables will be constructed. The 12 new dummy variables will be named temp1, temp2, temp3, and so on. The name "temp" (which is placed in parentheses, or round brackets, after the generate option) was arbitrarily chosen and could be replaced by any other valid variable name prefix. Each new dummy variable takes the value 0 or 1 wherever hhcause1 is non-missing, and is missing wherever hhcause1 is missing. The dummies are numbered in the order of the categories. Thus, with the categories coded 1 through 12, temp3 is set equal to 1 for all observations in which hhcause1 is equal to 3, and temp10 is set equal to 1 for all observations in which hhcause1 is equal to 10. After using this quick method of creating dummy variables, we simply type the following to graph the hhcause1 variable:

graph pie temp1-temp12
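As a quick check on what tabulate with the generate option produced, the following sketch may help. It assumes hhcause1 and the dummies temp1 through temp12 exist exactly as created above; none of it is required for the graph itself.

```stata
* List the new dummies; each variable label records the category it stands for
describe temp*

* Cross-check one dummy against the original variable, showing missing values too
tabulate hhcause1 temp3, missing

* Rows with no household-level value; the dummies are missing on these rows as well
count if missing(hhcause1)
```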

 

 

As you can see, graphing the temp variables this way creates a very basic pie graph for the categorical variable hhcause1. Let's make it look better! Type:

graph pie, over(hhcause1)
title("Percent of the Cause of Most Recent Death in Household")
note("Source: Botswana AIDS Impact Survey 1")
caption("Botswana Distance Learning Project")
plabel(_all percent, size(*0.75) format(%9.0f) color(white))
legend(label(1 "AIDS") label(2 "TB") label(3 "Malaria")
label(4 "Maternal Death") label(5 "Heart Disease") label(6 "Stroke")
label(7 "Violence") label(8 "Road Accident") label(9 "Malnutrition")
label(10 "Other") label(11 "User-missing") label(12 "Don't Know")
col(3));

 

 

Now this is a lot better. The specification of the pie graph should be fairly intuitive to you by now. If not, please review the notes above. There are a couple of new options that we specified, however, that you should be aware of. First, unlike the first hhcause1 graph, which used the dummy variables (temp1 through temp12), we specified this second pie graph using the over(hhcause1) option. This instructs Stata to use the categories of hhcause1 as the slices of the pie. Therefore, it is not necessary to create dummy variables to create a good-looking pie graph!

Also note that we specified size(*0.75) and format(%9.0f) within plabel, which tell Stata to scale the labels to 75% of their default size and to display the values with no decimal places. Finally, notice that we told Stata to relabel the legend and to display the names in 3 columns.

The by option is another helpful feature when creating graphs. The by option can be used to create separate pie and bar graphs for each category of an additional variable. For example, if we were interested in safe sex practices and how they vary by gender, we could use a pie graph along with the by option.  As a first step, we could examine the distribution of the variable sexcond, which represents whether the respondent or the respondent's sexual partner used a condom the last time they had sex in exchange for gifts or money. To do this, we could create a basic pie graph using sexcond. Type:

graph pie, over(sexcond)
legend(label(1 "Yes") label(2 "No"));

 

 

From the pie graph it is clear that a majority of these pairings (respondent and sexual partner) did not use condoms during sex where gifts or money were exchanged. Although this is certainly useful information, it does not allow us to answer our original question: how do safe sex practices vary by gender? To get at this question, what we really want is two separate graphs, one for men and one for women.

If we want a separate pie graph of the variable sexcond for men and for women, one approach is to once again create a set of dummy variables for sexcond and combine them with the by option. We begin by creating the new dummy variables:

tab sexcond, gen(tmp)

Then we type:

sort gender
graph pie tmp1 tmp2, by(gender)
legend(label(1 "Yes") label(2 "No"));

 

 

The commands above produce two separate pie graphs, one representing the distribution of sexcond for women and another representing the distribution for men. Once again, however, the basic command produces a graph that needs some enhancing. Let's try the following:

graph pie tmp1 tmp2,
by(gender,
title("Percent Usage of Condom by Sexual Partner")
note("Source: Botswana AIDS Impact Survey 1")
caption("Botswana Distance Learning Project"))
legend(label(1 "YES") label(2 "NO") col(3))
plabel(_all percent, color(white));

 

 

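The syntax that we used for this last graph should be fairly easy to understand. Note, however, that the title, note, and caption are all specified inside the parentheses of the by() option, so that they apply to the combined graph as a whole rather than being repeated on each individual pie.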
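The dummy variables are one route to this graph, but not the only one. As a sketch, the over() option introduced earlier can be combined with by() to request the same two-panel layout directly from sexcond. This assumes a do-file with the default delimiter (hence the /// continuations). The legend here shows the value labels of sexcond, or its raw values if no labels are defined.

```stata
* Two pies (one per gender) drawn straight from sexcond, without tmp1 and tmp2
graph pie, over(sexcond)                                        ///
    by(gender,                                                  ///
       title("Percent Usage of Condom by Sexual Partner")       ///
       note("Source: Botswana AIDS Impact Survey 1")            ///
       caption("Botswana Distance Learning Project"))           ///
    plabel(_all percent, color(white))
```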

So now that we have our graph cleaned up, we can answer our original question: how do safe sex practices vary by gender? Looking at the newest graph, it is clear that a difference does exist by gender. Men in the BAIS sample were much more likely to have used a condom than were the partners of women in the sample, when having sex in exchange for gifts or money. The next step would be to explore additional variables that may affect this relationship, which would give us a better understanding of this finding. In the next module, we will begin to learn how to do exactly that.

Another thing to keep in mind as we progress through the remaining modules is the sample size we are dealing with. Sample size can affect your findings, so you should always consider it before drawing any conclusions. For example, in the comparison by gender above, our sample sizes were quite small. Given the size of the sample used, we would want to be very cautious about drawing any definite conclusions.
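One simple way to see the sample sizes behind the two pies is a two-way table. The sketch below uses only the variables already introduced (gender and sexcond). The row option adds the percentage within each gender beneath each count, so the counts and the pie percentages can be read side by side.

```stata
* Counts and within-gender percentages behind the two pie graphs
tabulate gender sexcond, row
```

If any cell holds only a handful of respondents, the corresponding pie slice rests on very little information.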

 

 

EXERCISES

Now it is your turn to explore the distributions of variables using the commands from this module.

Using Stata and the BAIS data set, answer the following questions.

  1. Is the variable "sex" categorical or continuous? Considering this, how would you graph this variable in Stata?
     Exercise 1 Answer
  2. What percentage of the sample is made up of sons and daughters?
     Exercise 2 Answer
  3. Which of the religious groups makes up the largest proportion of urban residents?
     Exercise 3 Answer
  4. How many household heads are female in the BAIS data?
     Exercise 4 Answer

 
