TABLE OF CONTENTS
Introduction
Column, Row and Cell
Summarize
Egen
Example: Exploring STD Symptoms Variables
Chi-Squared: Testing for Independence
Exercises
INTRODUCTION
Up to this point we have restricted our analysis to one variable at a time. This is certainly useful, although restricting our analysis to a single variable can be misleading. For example, we have learned that to produce the average number of visits to the clinic while pregnant using Stata, we type:
means pregno
The above command produces a table of the arithmetic, geometric, and harmonic means for the entire sample. From this table we see that the average number of visits to the clinic is approximately 7.4. When thinking about clinic visits, one can think of several factors that may affect when and how often a woman decides to visit a clinic while pregnant.
Is it likely that an individual with a college degree will visit a clinic more while pregnant than an individual with no formal education?
Hence, while reporting the mean number of clinic visits for the entire sample is useful, examining how the number of visits varies by a second variable can be even more helpful in discovering trends in the data.
In this module, we will examine the relationship between two variables (bivariate analysis) using crosstabs. A crosstab is a technique for analyzing the relationship between two variables that have been organized in a bivariate table. Using such a table, we can examine the presence and strength of the relationship between two variables.
What is bivariate analysis? Bivariate Analysis is the examination of two variables at the same time, hence the name bivariate. It is used frequently by social scientists and mathematicians to compare how two variables correspond with one another. While sophisticated equations can be written to model how one variable changes with respect to another (regression, the subject of Module 6), we are only concerned here with any two variables, whether they are mathematically related or not.
When would we use bivariate analysis? Although it can be used any time we have two variables that we want to examine at the same time, bivariate analysis is a good tool to use when we have a hunch that two variables "go together." If this is the case then we can compare them numerically.
For example, if we were interested in understanding why pregnant women visit a clinic, there are several variables that we might want to examine in relation to pregno. We should begin with the fact that women with health complications are certainly more likely to visit a clinic; this is something we definitely want to keep in mind. In addition, it would be informative to know how old the women are. Is it the case that older women make it to the clinic more often than younger women? We could also examine how education is related to the number of visits we have observed. One might hypothesize that women with more education would be more likely to visit the clinic. Additionally, we may want to look at the relationship between the number of visits and the location of the woman's residence. It may be the case that the clinic is quite far away from the respondent's home, decreasing the likelihood of a visit. These are just a few of the possible variables we may want to examine in relation to pregno; there are numerous other possible relationships to examine, such as labor force status, type of occupation, religion, and health status.
As a first step, we would want to find these variables of interest using the BAIS questionnaire, as well as the Stata command lookfor. For example, we might be interested in examining the bivariate relationship between number of visits to the clinic and the respondent's employment. To find variables related to employment we could use some of the following commands:
lookfor employment
lookfor work
lookfor labour
These commands will provide us with a list of variables that hopefully will be related to our area of interest. After finding the variables we are interested in, we would want to examine each variable alone to get a sense of the measure's frequency distribution, mean, maximum and minimum values, etc.
Finally, once we have a general understanding of all the variables we are working with, we can examine the relationship between the variables by doing bivariate analysis. Later we will expand these techniques to include three (trivariate) and four (quadivariate) variables.
How do we do bivariate analysis in Stata? While the list command is very simple, it is not the most informative when we are trying to look at more than one variable at a time. Thank goodness there is another option: cross-tabulations. Suppose we want to look at the relationship between location and the incidence of HIV. We may be interested in examining whether respondents in rural areas are more likely to know someone who has HIV than respondents in urban areas.
To examine this relationship, we should start by finding the variables we want to use in the analysis. Using the lookfor command in Stata is a good place to start (for example: lookfor region and lookfor hiv). We'll find two key variables: location and hivknow. We can tabulate each variable individually, but there is only so much we can learn that way about how the two variables are related.
tab location
location of |
household | Freq. Percent Cum.
---------------+-----------------------------------
Urban | 1808 23.38 23.38
Urban Villages | 2019 26.11 49.48
Rural | 3907 50.52 100.00
---------------+-----------------------------------
Total | 7734 100.00
tab hivknow
do you know |
anyone who |
has hiv | Freq. Percent Cum.
-------------+-----------------------------------
Yes | 1123 30.58 30.58
No | 2507 68.27 98.86
Don't Know | 42 1.14 100.00
-------------+-----------------------------------
Total | 3672 100.00
That's a good start, but we still haven't tabulated the two variables together. A bivariate table (or crosstab) is simply a table that displays the distribution of one variable "across" the categories of a second variable. To create a bivariate table in Stata, we use the tabulate command, and instead of specifying a single variable we specify two.
The command is very simple: tab variable1 variable2.
The first variable is treated as the row variable and the second is the column variable.
tab location hivknow
location of | do you know anyone who has hiv
household | Yes No Don't Kno | Total
---------------+---------------------------------+----------
Urban | 421 594 12 | 1027
Urban Villages | 285 622 11 | 918
Rural | 417 1291 19 | 1727
---------------+---------------------------------+----------
Total | 1123 2507 42 | 3672
Looking at the cross tabulation above, there are several things worth noting:
First, we see that on the margins of the table, Stata provides us with each variable's original frequency distribution (we can find these values in the "Total" row and the "Total" column).
Second, looking at the cell frequencies it is clear that in each of the locations (urban, urban villages, and rural areas) more respondents don't know anyone with HIV than do know someone with HIV.
Third, while the previous statement is true, the ratio of Yes/No is not the same across the different locations. This suggests that there may be a relationship between these two variables, although it is hard to tell at this point.
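The third point can be checked with a little arithmetic. The sketch below is only an illustration: it re-enters the Yes and No counts from the crosstab above by hand (the variable names n_yes, n_no and ratio are made up for this example and are not BAIS variables) and computes the Yes/No ratio for each location.

```stata
* Cell counts copied from the crosstab above; the "Don't Know" column is left out.
* n_yes, n_no and ratio are hypothetical names used only for this illustration.
clear
input str14 location n_yes n_no
"Urban"          421  594
"Urban Villages" 285  622
"Rural"          417 1291
end

* Ratio of "Yes" answers to "No" answers within each location
generate ratio = n_yes / n_no
list location n_yes n_no ratio, noobs
```

If the two variables were unrelated, we would expect the three ratios to be roughly equal. Here the ratio is highest for Urban and lowest for Rural, which is what hints at a relationship.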
Let's see if we can't clean our bivariate table up a bit, and possibly add some more descriptive information, to help us better understand the relationship between these two variables.
As a first step, we might want to eliminate the third column in our table which consists of the "don't know" values. It's hard to know how to interpret these values and thus they aren't adding any information to our crosstab. There are two methods we can use to deal with these values. We can drop this set of values by creating a new variable using the original variable hivknow.
gen hivknow2=hivknow
replace hivknow2 = . if hivknow2 == 9
The first command above creates an exact copy of the variable hivknow called hivknow2. Once we have created this exact copy, we are able to recode the "don't know" values to missing. Another method for dropping the "don't know" values from our analysis is to simply replace the values of the original hivknow variable. We can do this using the following command:
replace hivknow=. if hivknow==9
This is certainly quicker than the method we used previously, but there is one item to keep in mind when replacing any values of an original variable. If we were to save our bais.dta data file, we would no longer be able to distinguish the "don't know" values from the other missing values in the distribution. Because of this, it is common practice to create a new variable rather than alter the original data permanently. For this example, we can use this second method, although you should be sure not to save your data file unless you will no longer need to work with the "don't know" values.
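To see why the copy-first method is the safer one, here is a minimal sketch on a five-row toy data set (not the BAIS file). It assumes, as in the commands above, that "don't know" is stored as the code 9; the other values are invented for the illustration.

```stata
* Toy data: 1 = Yes, 2 = No, 9 = "don't know" (coding assumed), . = truly missing
clear
input hivknow
1
2
9
2
.
end

* Method 1: recode a copy and leave the original untouched
generate hivknow2 = hivknow
replace hivknow2 = . if hivknow2 == 9

* The original still separates "don't know" from truly missing values
count if hivknow == 9
count if missing(hivknow)

* In the copy the two kinds of non-response are merged
count if missing(hivknow2)
```

The first count finds the single "don't know" answer and the second finds the single truly missing answer, while the third finds two missing values in hivknow2. Once the original variable is overwritten, only that last, merged count can be recovered.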
Now that we have recoded the "don't know" values in the hivknow distribution, we can recreate our crosstab.
tab location hivknow
| do you know anyone
location of | who has hiv
household | Yes No | Total
---------------+----------------------+----------
Urban | 421 594 | 1015
Urban Villages | 285 622 | 907
Rural | 417 1291 | 1708
---------------+----------------------+----------
Total | 1123 2507 | 3630
Our new table is a good start, but there is a long list of factors that could be influencing the relationship we are observing. For example, it may be the case that gender differences exist in the likelihood of knowing someone who is HIV positive. Luckily, Stata provides a way to restrict our bivariate table to specific populations. As is the case with almost all Stata commands, we are able to use qualifiers to limit the respondents included in our analysis. To examine the impact of gender on the current bivariate relationship, we can use the following commands:
tab location hivknow if gender == 1
| do you know anyone
location of | who has hiv
household | Yes No | Total
---------------+----------------------+----------
Urban | 207 288 | 495
Urban Villages | 106 260 | 366
Rural | 191 577 | 768
---------------+----------------------+----------
Total | 504 1125 | 1629
tab location hivknow if gender == 2
| do you know anyone
location of | who has hiv
household | Yes No | Total
---------------+----------------------+----------
Urban | 214 306 | 520
Urban Villages | 179 362 | 541
Rural | 226 714 | 940
---------------+----------------------+----------
Total | 619 1382 | 2001
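The two restricted tables can also be produced in one step with the by prefix. The sketch below rebuilds the cell counts from the two tables above as a small data set so that it runs without the BAIS file; the numeric codes for location and hivknow and the count variable n are assumptions made for this illustration.

```stata
* One row per cell of the two tables above; n is the cell count.
* Codes assumed here: location 1 = Urban, 2 = Urban Villages, 3 = Rural
*                     hivknow  1 = Yes, 2 = No
clear
input gender location hivknow n
1 1 1 207
1 1 2 288
1 2 1 106
1 2 2 260
1 3 1 191
1 3 2 577
2 1 1 214
2 1 2 306
2 2 1 179
2 2 2 362
2 3 1 226
2 3 2 714
end

* fweight tells Stata that each row stands for n respondents
tabulate location hivknow [fweight = n] if gender == 1

* Same idea for every gender code at once
bysort gender: tabulate location hivknow [fweight = n]
```

With the real data no weights are needed: after loading the BAIS file, bysort gender: tabulate location hivknow should give one table per gender code.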
Once again we have expanded and improved our bivariate tables, although it is still a bit difficult to draw any conclusions from the frequencies alone. In the next several sections of the module, we will cover options that will help to make your crosstabs much easier to interpret. After we have mastered these options, we will return to this question to see if we can draw more definitive conclusions concerning the relationship between location and knowledge of an HIV positive individual.
The tab ... , missing Command
Sometimes, we may want to include missing values in our calculations. While the tabulate command is used to produce one- and two-way tables of frequency counts, the missing option can be included to request that missing values be treated like other values in calculations of counts, percentages, and other statistics. By default, Stata will generate tables without the missing values unless we specify otherwise. The basic syntax is:
tab variable1 variable2, missing
Say we wanted to examine the relationship between the use of condoms during sex (part1con1) and the type of relationship the individuals are in (part1), and wanted to include the missing values in our tabulation. We would enter the following command:
tab part1 part1con1, missing
| did you use condom the first
| time had sex with partner1
most recent partner | Yes No . | Total
----------------------+---------------------------------+----------
Husband/Wife | 105 403 6 | 514
Live-in partner | 244 234 4 | 482
Girl/Boy friend not l | 810 230 8 | 1048
Casual acquaintance | 22 5 0 | 27
Other | 1 0 0 | 1
. | 0 0 5662 | 5662
----------------------+---------------------------------+----------
Total | 1182 872 5680 | 7734
Notice that the last row and the last column of the table, each marked by a period ".", contain the counts for the missing values. Without the missing option, this row and column will not appear. By including the missing option, we have included the entire BAIS sample in the crosstab. Looking at the frequencies, it seems that as the seriousness of the relationship increases, the likelihood of using a condom goes down. As in previous examples, however, it is difficult to draw this conclusion with confidence because the group sizes vary (there are more Girl/Boy friends than Husbands/Wives, which makes comparing raw counts difficult). In the next section we will learn how to deal with this issue to make interpretation easier.
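If you do not have the BAIS file open, the effect of the missing option can be tried on the auto data set that ships with Stata. This is only a stand-in: rep78 and foreign are variables of that bundled data set and have nothing to do with BAIS.

```stata
* Stand-in example using Stata's bundled auto data (74 cars; rep78 is missing for 5)
sysuse auto, clear

* Default: the 5 cars with a missing rep78 are silently dropped (Total = 69)
tabulate rep78 foreign

* With the option: an extra "." row appears and the Total is all 74 cars
tabulate rep78 foreign, missing
```

Comparing the two Total cells is a quick way to see how many observations a crosstab is leaving out.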
But first, now that you've learned these new methods, try answering the following questions using crosstabs:
- 1. How many respondents over the age of 20 would want to be tested for HIV and know of a place to get tested?
- 2. How many households reside in a detached house with a piped indoor water source that is located in a rural area?
COLUMN, ROW AND CELL
Column. Thus far, the tabulate command has been very useful in learning about our data. Suppose, however, we want to re-examine the relationship between the use of condoms during sex (part1con1) and the type of relationship the individuals are in (part1). We might hypothesize that, among those who used a condom, most were with their spouse. More specifically, what percentage of those who used a condom during sex were with their spouse? Using our Stata command knowledge up to now, we could enter:
tab part1 part1con1
That yields the following table:
| did you use condom
| the first time had
| sex with partner1
most recent partner | Yes No | Total
----------------------+----------------------+----------
Husband/Wife | 105 403 | 508
Live-in partner | 244 234 | 478
Girl/Boy friend not l | 810 230 | 1040
Casual acquaintance | 22 5 | 27
Other | 1 0 | 1
----------------------+----------------------+----------
Total | 1182 872 | 2054
Looking at the table, we know that of the respondents who did use a condom during sex, 105 were with their spouse. But is this a relatively large or small number compared to the other partner types? We could carefully check through all the values and compare and contrast, but a more useful and efficient way to find this answer would be to run the same table with percentages added to the frequencies. To do this in Stata, enter:
tab part1 part1con1, column
| did you use condom
| the first time had
| sex with partner1
most recent partner | Yes No | Total
----------------------+----------------------+----------
Husband/Wife | 105 403 | 508
| 8.88 46.22 | 24.73
----------------------+----------------------+----------
Live-in partner | 244 234 | 478
| 20.64 26.83 | 23.27
----------------------+----------------------+----------
Girl/Boy friend not l | 810 230 | 1040
| 68.53 26.38 | 50.63
----------------------+----------------------+----------
Casual acquaintance | 22 5 | 27
| 1.86 0.57 | 1.31
----------------------+----------------------+----------
Other | 1 0 | 1
| 0.08 0.00 | 0.05
----------------------+----------------------+----------
Total | 1182 872 | 2054
| 100.00 100.00 | 100.00
Here, Stata has calculated the percentages based on the total of each column. We see that of all the people using condoms during sex, 8.88% were with their husband or wife. This is not all we can take away from the table, however: the column option provides these percentages for each relationship type in both columns. Overall, this is a very powerful option, as it allowed us to answer our question definitively, without any additional calculations on our part.
Row. Ok, now let's try a slightly different question: Of all the respondents having sex with casual acquaintances, what percentage did not use condoms? From the table above, we could find that answer, but we would have to use our calculator! Luckily for us, Stata can do the work for us - if we know how to ask it! Here's how:
tab part1 part1con1, row
| did you use condom
| the first time had
| sex with partner1
most recent partner | Yes No | Total
----------------------+----------------------+----------
Husband/Wife | 105 403 | 508
| 20.67 79.33 | 100.00
----------------------+----------------------+----------
Live-in partner | 244 234 | 478
| 51.05 48.95 | 100.00
----------------------+----------------------+----------
Girl/Boy friend not l | 810 230 | 1040
| 77.88 22.12 | 100.00
----------------------+----------------------+----------
Casual acquaintance | 22 5 | 27
| 81.48 18.52 | 100.00
----------------------+----------------------+----------
Other | 1 0 | 1
| 100.00 0.00 | 100.00
----------------------+----------------------+----------
Total | 1182 872 | 2054
| 57.55 42.45 | 100.00
From these new results, we learn that 18.52% of respondents having sex with a casual acquaintance did not use a condom. The table provides additional information as well. For example, from the Total row at the bottom of the table, we see that most of the sample did use condoms when having sex with partner1 (57.55%).
Cell. Another question: Excluding missing cases, what percentage of the sample used a condom during sex and was with a girl or boyfriend? To find the answer to this question, type the following:
tab part1 part1con1, cell
| did you use condom
| the first time had
| sex with partner1
most recent partner | Yes No | Total
----------------------+----------------------+----------
Husband/Wife | 105 403 | 508
| 5.11 19.62 | 24.73
----------------------+----------------------+----------
Live-in partner | 244 234 | 478
| 11.88 11.39 | 23.27
----------------------+----------------------+----------
Girl/Boy friend not l | 810 230 | 1040
| 39.44 11.20 | 50.63
----------------------+----------------------+----------
Casual acquaintance | 22 5 | 27
| 1.07 0.24 | 1.31
----------------------+----------------------+----------
Other | 1 0 | 1
| 0.05 0.00 | 0.05
----------------------+----------------------+----------
Total | 1182 872 | 2054
| 57.55 42.45 | 100.00
This time, Stata gives us a table showing both the frequencies and the percentages by cell. From this we learn that 39.44% of the entire sample (not including missing cases) was having sex with a boyfriend or girlfriend and was using a condom.
Each of these options can be very helpful; which one to use depends on the question we want to answer. Remember to consider what denominator best gets at the desired answer.
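The only thing that changes between the three options is the denominator. As a rough check that needs no data file, the counts from the tables above can be typed straight into tabi, the immediate form of tabulate, and the three percentages discussed in this section can be recomputed by hand with display.

```stata
* Immediate crosstab from the cell counts above: rows are partner types
* (Husband/Wife ... Other), columns are Yes / No.
* Asking for all three options prints row, column and cell percentages together.
tabi 105 403 \ 244 234 \ 810 230 \ 22 5 \ 1 0, row column cell

* The same three answers by hand; only the denominator differs
display %5.2f 100 * 105 / 1182   // column %: spouse among condom users
display %5.2f 100 * 5 / 27       // row %: no condom among casual acquaintances
display %5.2f 100 * 810 / 2054   // cell %: girl/boyfriend and condom, of everyone
```

The three display lines should agree with the 8.88, 18.52 and 39.44 reported in the tables above. Note that tabi labels rows and columns by number only, since it never sees the value labels.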
- 3. What percentage of men work as an employee (paid in cash)?
- 4. What percentage of employees (paid in cash) are men?
SUMMARIZE
Summarize Option within a Cross-tabulation Analysis. In this section, we will discuss an option that enables us to bring a third variable into the table (a trivariate analysis), as opposed to a plain two-way cross-tabulation, which only gives frequencies or percentages for two variables.
For example, if we want to see a distribution of male and female respondents in rural, urban village, and urban areas, as we did in previous lessons, we can find this by typing:
tab location gender
location of | sex of respondent
household | Male Female | Total
---------------+----------------------+----------
Urban | 906 902 | 1808
Urban Villages | 880 1139 | 2019
Rural | 1897 2010 | 3907
---------------+----------------------+----------
Total | 3683 4051 | 7734
Now, let's go a step further. What is the average age at first marriage for men and women living in different areas? What would be the best table to create?
By combining commands that we have learned from past modules, we can already answer this question. One way would be:
sort location
by location: tab gender, sum(agemar)
-------------------------------------------------------------------------------
-> location = Urban
sex of | Summary of age at first marriage
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Male | 30.117647 8.1158983 187
Female | 24.059406 5.5392256 202
------------+------------------------------------
Total | 26.971722 7.5270456 389
-------------------------------------------------------------------------------
-> location = Urban Vi
sex of | Summary of age at first marriage
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Male | 29.485437 7.8363958 103
Female | 24.588235 6.879855 170
------------+------------------------------------
Total | 26.435897 7.6218359 273
-------------------------------------------------------------------------------
-> location = Rural
sex of | Summary of age at first marriage
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Male | 28.872038 7.4352907 211
Female | 24.645062 8.3083348 324
------------+------------------------------------
Total | 26.31215 8.2322185 535
These results suggest that female respondents' average age at first marriage is lower than men's in every location. The output also provides the standard deviations associated with each of the means, along with the raw frequencies. While these results are informative, they are spread over three separate tables. It would be better to create a single table that shows all the necessary statistics. To do so, we can type:
tab location gender, sum(agemar)
Means, Standard Deviations and Frequencies of age at first marriage
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 30.117647 24.059406 | 26.971722
| 8.1158983 5.5392256 | 7.5270456
| 187 202 | 389
-----------+----------------------+----------
Urban Vil | 29.485437 24.588235 | 26.435897
| 7.8363958 6.879855 | 7.6218359
| 103 170 | 273
-----------+----------------------+----------
Rural | 28.872038 24.645062 | 26.31215
| 7.4352907 8.3083348 | 8.2322185
| 211 324 | 535
-----------+----------------------+----------
Total | 29.463074 24.461207 | 26.55472
| 7.7818464 7.2478181 | 7.869954
| 501 696 | 1197
Now that's much better. The results can now be easily compared. Note that we simply specified the summarize option to tell Stata to summarize agemar within the table. These results show us that while women's average age at first marriage is fairly constant across locations, men's average age at first marriage decreases as the level of development decreases. If, for example, we were solely interested in the mean values and not the standard deviations or frequencies (counts), we can also specify the mean option, which will create a simpler table. Try it, type:
tab location gender, sum(agemar) mean
Means of age at first marriage
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 30.117647 24.059406 | 26.971722
Urban Vil | 29.485437 24.588235 | 26.435897
Rural | 28.872038 24.645062 | 26.31215
-----------+----------------------+----------
Total | 29.463074 24.461207 | 26.55472
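The same pattern can be practiced on Stata's bundled auto data set if the BAIS file is not at hand. In this sketch rep78, foreign and mpg are variables of that stand-in data set; they simply play the roles of location, gender and agemar.

```stata
* Stand-in example using Stata's bundled auto data
sysuse auto, clear

* Full table: mean, standard deviation and frequency of mpg in every cell
tabulate rep78 foreign, summarize(mpg)

* Means only, as in the command above
tabulate rep78 foreign, summarize(mpg) means

* An equivalent request: switch off the pieces we do not want
tabulate rep78 foreign, summarize(mpg) nostandard nofreq
```

The last two commands should print the same table. Naming what to keep (means) is shorter when only one statistic is wanted, while naming what to drop is handier when only one is unwanted.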
To get more familiar with these new options, try the following exercise:
- 5. How does the average level of drinking (days did drink in last 4 weeks) among men and women vary by work status (are you working for pay)?
EGEN
When using Stata to analyze this data set or other data sets, there will be many times when you will want to create a variable that combines data in the original data set in more expansive ways. The egen command in Stata is extremely handy in assisting with this type of variable construction. In this module, we will cover only a few of the more important uses of this command. It would certainly be to your advantage to get acquainted with more of the available options for egen in the help section of Stata.
The egen command allows variables to be created using functions that combine different variables in many different and important ways. This command is also an excellent complement to the system variable _n, as it allows us to create household level variables.
Max Function. The best way to learn how to use the egen command is through examples, so let's start with an example using the egen function max. Suppose we are interested in learning about the elderly in households. We may be interested in constructing a variable that is equal to the age of the oldest individual in a given household. Up to this point all the variables we have constructed have been individual level measures. Constructing household level measures with our current Stata skills would be incredibly difficult, and would involve several generate and _n commands. Fortunately, Stata provides us with an advanced or "extended" generate command that makes constructing advanced variable types quick and easy.
Let's begin with a more general version of our desired variable, and then we can move on from there. As a first pass, let's create a variable that is equal to the age of the oldest individual in the BAIS data file. To construct such a measure, we can use the following egen syntax:
egen maxage = max(age)
Let's examine more closely what this command tells Stata to do. It tells Stata to create a variable called maxage that is set equal to the age of the oldest person in the data file. To verify this command, type the following commands:
sum age
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
age | 7734 24.44375 19.7285 0 98
list maxage
     maxage
 1.      98
 2.      98
 3.      98
 4.      98
 5.      98
 6.      98
 7.      98
 8.      98
 9.      98
10.      98
Looking at the output of the first command we see that the oldest individual in the entire BAIS data file is 98 years old. So given this, our second command should have created a variable (maxage) that is equal to 98 for everyone in the data file. Looking at the list output we can verify that this is the case.
Now, in practice you most likely won't want a variable that is set equal to the age of the oldest individual in the entire data set. You might, however, want a variable that is set equal to the age of the oldest individual in a given household. To restrict the variable maxage in this way, we need to add the by() option to our previous egen command. Let's add this option and create a second variable called maxage2 and then examine what the distribution of this new variable looks like. We can do this using the following syntax:
egen maxage2 = max(age), by(hhid)
sort hhid
list hhid age maxage2
The commands above create the following output:
     hhid   age   maxage2
 1.     1    45        45
 2.     1    36        45
 3.     1     8        45
 4.     1     6        45
 5.     1     3        45
 6.     2    56        56
 7.     2    46        56
 8.     2    18        56
 9.     2    16        56
10.     2    13        56
11.     3    26        26
12.     3    26        26
13.     3    24        26
14.     4    27        27
15.     5    29        29
16.     5    26        29
17.     6    36        36
18.     6     7        36
19.     6     5        36
20.     6    19        36
21.     6    25        36
22.     6     0        36
Looking at the output above we see that adding the by(hhid) option did in fact make the necessary restrictions. The new variable maxage2 is a household level variable that is equal to the age of the oldest individual in a given household.
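To experiment with the by() option without the full data file, the first three households from the listing above can be typed in by hand. The sketch also shows the by prefix form, which should give the same result; maxage3 is a name made up for this comparison.

```stata
* First three households copied from the listing above
clear
input hhid age
1 45
1 36
1 8
1 6
1 3
2 56
2 46
2 18
2 16
2 13
3 26
3 26
3 24
end

* Household maximum using the by() option, as in the text
egen maxage2 = max(age), by(hhid)

* Same calculation written with the by prefix (maxage3 is a hypothetical name)
bysort hhid: egen maxage3 = max(age)

* Stata stops with an error here if the two versions ever disagree
assert maxage2 == maxage3

list hhid age maxage2, sepby(hhid)
```

Every member of household 1 gets 45, every member of household 2 gets 56, and every member of household 3 gets 26, matching the listing above.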
Now that we have our new variable maxage2, we are free to use it in additional analysis. For example, we may want to examine how the distribution of the maxage2 variable varies by location. To do this we could use the tab, sum() command, although we must remember that the variable maxage2 is a household level variable and thus we must restrict our analysis to households. To do this, we can use the system variable _n. To produce our table we would use the following:
sort hhid
tab location if hhid~=hhid[_n-1], sum(maxage2)
location of | Summary of maxage2
household | Mean Std. Dev. Freq.
------------+------------------------------------
Urban | 39.126984 13.017031 504
Urban Vil | 47.568675 18.177453 415
Rural | 51.630359 18.669089 863
------------+------------------------------------
Total | 47.148148 17.925839 1782
We can see that in terms of the oldest individuals in a given household, urban households are much younger than households found in urban villages or rural areas. Hopefully after working through this example it is a bit clearer how to use the egen command, as well as the max function. The max function simply finds the maximum value of the variable specified and sets the new variable equal to that value. In our example, we used the max function to find the largest or maximum age value in a given household.
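The restriction hhid~=hhid[_n-1] keeps exactly one person per household, so that each household is counted once. The sketch below shows this on a six-row toy data set and compares it with the egen function tag(), which is another way to flag one observation per group. The variable names first and onehh are made up for this example.

```stata
* Toy data: three households of different sizes
clear
input hhid age
1 45
1 36
1 8
2 56
2 46
3 26
end

egen maxage2 = max(age), by(hhid)
sort hhid

* Flag the first person listed in each household, as in the text.
* For the very first row, hhid[_n-1] is missing, so that row is flagged too.
generate byte first = hhid != hhid[_n-1]

* Alternative: let egen flag one observation per household
egen byte onehh = tag(hhid)

* Both flags pick out one row per household, so both counts are 3
count if first
count if onehh

* Household level summary: each household contributes a single value
summarize maxage2 if onehh
```

Without the restriction, household 1 would contribute its value of 45 three times and pull the average toward larger households.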
Min Function. It is these functions (like max) that make the egen command one of the most powerful data management tools in Stata. It is also worth noting that there are numerous functions that one can use with the egen command. For example, there is a min function which is the exact opposite of the max function. The min function is used to find the smallest value in a given distribution. So we could use the min function to create a variable that is equal to the age of the youngest individual in a given household. To create such a variable, we could type:
egen minage = min(age), by(hhid)
Mean Function. Let's try another egen function to create a new household level variable. For example, suppose we were interested in the variable nightaway, which represents the number of nights spent away from the residence. Using the previous egen functions max and min we could create new variables set to the maximum number of nights away or the minimum number of nights away. It may be the case, however, that we are interested in constructing a variable that is equal to the average number of nights away from the residence by household. To construct this variable we would need to use the mean function. Using the mean function, we could type the following:
egen avgaway = mean(nightaway), by(hhid)
Again, let's review what this command tells Stata to do. The egen command tells Stata to create a variable called avgaway that is set equal to the average number of nights away for that individual's household. As before, it is instructive to ponder what would happen if by(hhid) were omitted from the command. In that case, the command would tell Stata to create a variable called avgaway that is set equal to the average number of nights away for the entire data set.
Now that we have our new variable, let's check the values to ensure Stata did what we expected it to do. To check the new variable, type:
sort hhid
list hhid nightaway avgaway
     hhid   nighta~y   avgaway
 1.     1         14      14.5
 2.     1         15      14.5
 3.     1          .      14.5
 4.     1          .      14.5
 5.     1          .      14.5
 6.     2          0         0
 7.     2          .         0
 8.     2          0         0
 9.     2          0         0
10.     2          0         0
11.     3          8         8
12.     3          .         8
13.     3          .         8
14.     4          0         0
15.     5          0         0
16.     5          .         0
17.     6         26        19
18.     6          .        19
19.     6          .        19
20.     6         12        19
21.     6          .        19
22.     6          .        19
Looking at the output above we see that the new variable avgaway is equal to the average number of nights away for each household. It is worth noting that this average only includes the non-missing responses found in each household; missing nightaway observations are not included in the average. Finally, the new variable avgaway is a household level variable. Therefore, it is imperative to once again use the system variable _n to keep one observation per household when analyzing this variable.
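Because missing values are skipped, a household average may rest on only one or two answers. The sketch below re-enters households 1 and 6 from the listing above and uses the egen function count() to record how many non-missing answers went into each average; nreport is a name made up for this illustration.

```stata
* Households 1 and 6 copied from the listing above
clear
input hhid nightaway
1 14
1 15
1 .
1 .
1 .
6 26
6 .
6 .
6 12
6 .
6 .
end

* Household average; missing values are ignored by mean()
egen avgaway = mean(nightaway), by(hhid)

* Number of non-missing answers behind each average (nreport is hypothetical)
egen nreport = count(nightaway), by(hhid)

list hhid nightaway avgaway nreport, sepby(hhid)
```

Household 1 averages (14 + 15) / 2 = 14.5 and household 6 averages (26 + 12) / 2 = 19, each based on just two of the household's members.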
Group Function. Yet another use of the egen command is to create a variable that is the combination of two variable distributions using the group function. Again, to really understand how to use this new function it is easiest to use an example. Let's begin by exploring the relationship between two variables: part1alc (last time had sex, did you or partner1 drink alcohol) and part1con2 (did you use condom last time had sex with partner1).
We can begin by examining the crosstab of these two variables. To do so, we can type:
tab part1alc part1con2
last time |
had sex, did |
you or |
partner1 | did you use condom last time had
drink | sex with partner1
alcohol | Yes No User-miss | Total
-------------+---------------------------------+----------
Yes | 122 141 0 | 263
No | 995 790 1 | 1786
Don't know | 9 8 0 | 17
-------------+---------------------------------+----------
Total | 1126 939 1 | 2066
The table above provides us with the crosstabulation of the two variables of interest. As we have discussed previously, though, this type of output is easier to examine when percentages are included with the frequencies. For our purposes, let's look at what percentage of those who did and did not use condoms drank alcohol. To produce these percentages we use the following:
tab part1alc part1con2, column
last time |
had sex, did |
you or |
partner1 | did you use condom last time had
drink | sex with partner1
alcohol | Yes No User-miss | Total
-------------+---------------------------------+----------
Yes | 122 141 0 | 263
| 10.83 15.02 0.00 | 12.73
-------------+---------------------------------+----------
No | 995 790 1 | 1786
| 88.37 84.13 100.00 | 86.45
-------------+---------------------------------+----------
Don't know | 9 8 0 | 17
| 0.80 0.85 0.00 | 0.82
-------------+---------------------------------+----------
Total | 1126 939 1 | 2066
| 100.00 100.00 100.00 | 100.00
From the table we see that for both groups, those who did and those who didn't use condoms with their partners, neither was very likely to have drunk alcohol the same night. As researchers interested in the question, we may want to examine this relationship further to see if other variables interact with these findings. One method of assisting this pursuit is combining the distributions of the two variables to allow for more complex comparisons. The quickest way of doing this is by using the egen command with the group function. Using the group function, we are able to create a new categorical variable that contains the information from both the part1alc and part1con2 variables. To do this in Stata we use the following:
egen temp1 = group(part1alc part1con2)
Let's make sure we understand what we just told Stata to do. The command egen varname = group(varlist) tells Stata to create a variable that takes on the values 1, 2, ... for the groups formed by the variables specified within group( ), part1alc and part1con2 in our case. More specifically, it tells Stata to create one variable with a value for each combination of responses that occurs in the data: those who drank alcohol and used a condom, those who drank and didn't use a condom, those who did not drink and did use a condom, those who did not drink and did not use a condom, and likewise for the combinations involving the "Don't know" and "User-missing" responses. Let's again look at the crosstab of part1alc and part1con2 first:
tab part1alc part1con2
last time |
had sex, did |
you or |
partner1 | did you use condom last time had
drink | sex with partner1
alcohol | Yes No User-miss | Total
-------------+---------------------------------+----------
Yes | 122 141 0 | 263
No | 995 790 1 | 1786
Don't know | 9 8 0 | 17
-------------+---------------------------------+----------
Total | 1126 939 1 | 2066
This simple crosstab tells us the frequency count of each combination of part1alc and part1con2. Now, let's look at our newly created variable, temp1:
tab temp1
group(part1 |
alc |
part1con2) | Freq. Percent Cum.
------------+-----------------------------------
1 | 122 5.91 5.91
2 | 141 6.82 12.73
3 | 995 48.16 60.89
4 | 790 38.24 99.13
5 | 1 0.05 99.18
6 | 9 0.44 99.61
7 | 8 0.39 100.00
------------+-----------------------------------
Total | 2066 100.00
Comparing the two tables, we see that our egen command has combined the distributions into a single variable. To connect the values across the two tables, compare the frequencies in both. In the first crosstab, there were 122 respondents who both drank alcohol and used a condom during sexual intercourse. Looking at the second frequency distribution, value 1 has 122 observations, which means that for the variable temp1 the value 1 represents respondents who both drank alcohol and used a condom during sexual intercourse. Try this with the other categories and you will see that every non-empty cell in the first crosstab is now represented by a single category in the temp1 distribution. When creating the values of the new variable, Stata orders the responses by taking the combinations of the first row from left to right, then the second row from left to right, and so on, skipping any cell that contains no observations. This is why value 5 corresponds to the single respondent in the No / User-miss cell: the Yes / User-miss cell is empty and so receives no value.
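The numbering rule can be checked on a small made-up data set. In the sketch below, alc and con are hypothetical stand-ins for part1alc and part1con2, and the combination (1, 3) is deliberately left out of the data.

```stata
* Toy data: hypothetical codes, not the BAIS file
clear
input alc con
1 1
1 2
2 1
2 2
2 3
3 1
end

* One value per combination that actually occurs in the data
egen grp = group(alc con)

* The absent combination (alc = 1, con = 3) gets no value,
* so (alc = 2, con = 1) is numbered 3, not 4
list alc con grp
```

Because only combinations present in the data are numbered, the meaning of each value of a grouped variable should always be confirmed against the crosstab, as we did above.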
Now that we have our new variable, there are a few minor steps we may or may not want to take. First, depending on the particular question you are attempting to answer, it may not be necessary to include the observations with either of the responses "Don't Know" or "User Missing". If this is the case, it is easy to remove these categories from the temp1 distribution. To recode these values to missing, we would simply use the following:
replace temp1 = . if temp1 > 4
tab temp1
group(part1 |
alc |
part1con2) | Freq. Percent Cum.
------------+-----------------------------------
1 | 122 5.96 5.96
2 | 141 6.88 12.84
3 | 995 48.58 61.43
4 | 790 38.57 100.00
------------+-----------------------------------
Total | 2048 100.00
As you can see, by recoding the values 5, 6, and 7 to missing, we removed the "Don't Know" and "User Missing" categories from our new variable. An alternative method would have been to recode these categories to missing in each of the individual variables (part1alc and part1con2) prior to executing our egen command.
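The alternative method might look like the sketch below. The data are made up, and the numeric codes 8 ("Don't know") and 9 ("User-missing") are assumptions chosen for illustration; check the actual codes in your data before recoding.

```stata
* Toy data: hypothetical codes, not the BAIS file
clear
input alc con
1 1
1 2
2 1
2 2
2 9
8 1
end

* Assumed codes: 8 = "Don't know", 9 = "User-missing"
replace alc = . if alc == 8
replace con = . if con == 9

* group() assigns missing to any observation with a missing input,
* so only the four substantive combinations are numbered
egen grp = group(alc con)
tab grp, missing
```

One advantage of recoding first is that the group values 1 through 4 no longer depend on which "Don't know" or "User-missing" cells happen to be empty.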
Finally, while we have certainly created a useful variable that we can now use in further analysis, it would be helpful to label the categories of this new variable. To create such labels, we need to use two commands in Stata.
The first step is to define the scheme we would like to apply to our variable temp1. Once we have established this label scheme we can then apply it to the new variable, or any other variable we like. To define a value label scheme, use:
#delimit ;
label define xyz
1 "drink, condom"
2 "drink, no condom"
3 "no drink, condom"
4 "no drink, no condom";
#delimit cr
Executing the syntax above defines the value label scheme "xyz". We could have used any name here ("zyx", "g123456", or even "temp1"); we arbitrarily selected "xyz" for this example. Note that #delimit ; works only in do-files, and the delimiter should be set back with #delimit cr before running any command that does not end in a semicolon. Now that we have defined this label scheme, we are free to apply it to any variable in our data set, although we most likely only want to apply it to the variable temp1. To do this, we use the following command:
label values temp1 xyz
The syntax above tells Stata to apply our new label scheme "xyz" to the variable temp1. Now let's see if it works:
tab temp1
group(part1alc |
part1con2) | Freq. Percent Cum.
--------------------+-----------------------------------
drink, condom | 122 5.96 5.96
drink, no condom | 141 6.88 12.84
no drink, condom | 995 48.58 61.43
no drink, no condom | 790 38.57 100.00
--------------------+-----------------------------------
Total | 2048 100.00
It seems that the previous command worked: instead of the values 1, 2, 3, and 4, our new value labels are in place. So after a few minor steps, we have created an extremely useful variable that will allow us to create more complex tables that include more variables of interest.
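As a possible shortcut, the group function also accepts a label option that builds the value labels from the labels already attached to the source variables. The sketch below uses made-up data; the variables alc and con and the label name yn are hypothetical stand-ins.

```stata
* Toy data: hypothetical values, not the BAIS file
clear
input alc con
1 1
1 2
2 1
2 2
end

* Attach value labels to the source variables
label define yn 1 "Yes" 2 "No"
label values alc yn
label values con yn

* The label option combines the source labels for each group
egen grp = group(alc con), label
tab grp
```

The automatic labels are only as informative as the labels on the source variables, so defining a scheme by hand, as we did above, remains useful when you want wording such as "drink, no condom".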
Sum Function. Another egen function that could prove useful is the sum function. First off, it is important to note that the sum function is different from the command sum (short for summarize), which is used to calculate the mean of a given variable. The sum function is used to sum or add the values of a given variable. This function can be extremely useful in situations where the researcher would like a total of some variable for a given household. For example, if we had a variable that represented each individual respondent's income, we could use the sum function to add these incomes up within a household to create a household level income variable.
As a practical example, we may find that it is useful to construct a household size variable. There are numerous policy issues that deal with the size of a given household, and to this point we do not have this measure in our BAIS data file. To create this variable we will need to use two different commands. First, we need to create a variable that is equal to 1 for every respondent in the data file. To do this we can use the following:
gen all = 1
This command creates the new variable all, which is equal to 1 for every respondent in the data file. Now that we have this new variable, we can sum or add these values up by household. By doing this we will create a household size variable. To add these values by household we can use the egen command with the sum function:
egen hhsize = sum(all), by(hhid)
To ensure that this command has done what we wanted let's check it. To check our new variable hhsize, use the following:
sort hhid
list hhid all hhsize
      hhid   all   hhsize
  1.     1     1        5
  2.     1     1        5
  3.     1     1        5
  4.     1     1        5
  5.     1     1        5
  6.     2     1        5
  7.     2     1        5
  8.     2     1        5
  9.     2     1        5
 10.     2     1        5
 11.     3     1        3
 12.     3     1        3
 13.     3     1        3
 14.     4     1        1
 15.     5     1        2
 16.     5     1        2
 17.     6     1        6
 18.     6     1        6
 19.     6     1        6
 20.     6     1        6
 21.     6     1        6
 22.     6     1        6
Looking at the output above, it appears our command has worked. The egen command simply added or summed the variable all for each of the households, and then set the variable hhsize equal to that value. Now that we have our household size variable we are free to use it throughout the rest of the modules.
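For readers on Stata 9 or later, note that the sum function is documented under the name total(). The sketch below, on made-up household identifiers, builds the household size with total() and cross-checks it against a second approach that uses the system variable _N under a by prefix. The variable name hhsize2 is hypothetical.

```stata
* Toy data: hypothetical household identifiers, not the BAIS file
clear
input hhid
1
1
1
2
3
3
end

* Approach used in the text, written with the newer function name
gen all = 1
egen hhsize = total(all), by(hhid)

* Cross-check: _N is the number of observations in each by-group
bysort hhid: gen hhsize2 = _N
assert hhsize == hhsize2

list, sepby(hhid)
```

The assert command stops with an error if the two versions ever disagree, which makes it a convenient check to leave in a do-file.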
Count Function. The final egen function we will cover in this section of the module is the count function. The count function does exactly what it sounds like: it counts. Specifically, the count function counts the number of non-missing observations in a given distribution. So unlike the sum function, which adds values, the count function simply determines whether an observation is missing or not. If the observation is non-missing, Stata counts it and adds 1 to the running total; if it is missing, Stata adds nothing to the total. To get a feel for how this works, let's work through another example.
Suppose we are interested in creating a variable that represents the number of household members under the age of 21. To create this variable we will need to use two different commands. First, we need to create a variable that is equal to 1 if the respondent is less than 21 years of age, and equal to missing if the respondent is 21 or older. To do this we can use the following:
gen flag20 = 1 if age < 21
This command creates the new variable flag20 which is equal to 1 if the respondent is less than 21 years of age, and equal to missing if the respondent is 21 or older. To check this for sure we could use:
tab age flag20, mis
From the output of the command above, we can see that it has done what we wanted. Now that we have this new variable flag20, we can now use egen with the count function to create our final variable. To create our desired variable, we use:
egen age20 = count(flag20), by(hhid)
The above command tells Stata to work through the data, household by household, and count all the non-missing observations for the variable flag20. Whatever the number of non-missing values is, Stata sets the variable age20 equal to that total. To see how this worked we can use the following syntax:
sort hhid
list hhid age flag20 age20
      hhid   age   flag20   age20
  1.     1    45        .       3
  2.     1    36        .       3
  3.     1     8        1       3
  4.     1     6        1       3
  5.     1     3        1       3
  6.     2    56        .       3
  7.     2    46        .       3
  8.     2    18        1       3
  9.     2    16        1       3
 10.     2    13        1       3
 11.     3    26        .       0
 12.     3    26        .       0
 13.     3    24        .       0
 14.     4    27        .       0
 15.     5    29        .       0
 16.     5    26        .       0
From the output, we see that Stata has done what we expected. For each household, the number of non-missing values is counted, and the age20 variable is set equal to that number. It is really important to point out that the actual number we used for the flag20 variable is irrelevant. Instead of the value 1, we could have used 94 and still gotten the same values for the age20 variable. When using the count function, all that Stata checks is whether the value is missing or not. Stata isn't adding these values up, only counting them.
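The claim that the flag value is irrelevant can be verified directly. The sketch below uses made-up ages; flag94, age20b, and age20c are hypothetical names introduced only for this check. It also shows a one-step alternative that totals a logical expression.

```stata
* Toy data: hypothetical ages, not the BAIS file
clear
input hhid age
1 45
1 8
1 3
2 30
2 .
end

* Two flags that differ only in the value stored
gen flag20 = 1 if age < 21
gen flag94 = 94 if age < 21

egen age20 = count(flag20), by(hhid)
egen age20b = count(flag94), by(hhid)
assert age20 == age20b

* One-step alternative: (age < 21) evaluates to 1 or 0.
* A missing age is treated as larger than any number, so it adds 0.
egen age20c = total(age < 21), by(hhid)
assert age20 == age20c

list, sepby(hhid)
```

The one-step version avoids the intermediate flag variable, while the two-step version in the text makes each stage easier to inspect with tab.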
An Extended Example: Exploring STD Symptoms Variables
Now that we are empowered with the extended generate command - egen - let's use this new command to analyze a set of variables in the BAIS data.
Given the prevalence of STDs, and specifically HIV/AIDS, in the country of Botswana it is important from a policy perspective to evaluate the level of knowledge citizens have in relation to these issues. In the BAIS survey, respondents are asked a series of questions that should help us in understanding how knowledgeable the citizens of Botswana are concerning STDs. Specifically, respondents are asked the following two questions:
Q404. In a woman, what signs and symptoms would lead you to think that she has such a disease or infection?
Q405. In a man, what signs and symptoms would lead you to think that he has such an infection?
From a research perspective, it would be interesting to examine how many symptoms or signs respondents were able to produce. Further, it would be interesting to examine this number in relation to some of the other variables in the BAIS data to see if any relationships exist. Some questions we could ask are:
- Are men more likely to be able to produce symptoms for men? Are women more likely to be able to produce symptoms for women?
- Does the number of symptoms that respondents are able to produce vary across the different locations within Botswana?
- Is the number of symptoms that respondents are able to produce related to their level of schooling?
All of these questions are very relevant and could contribute significantly to our understanding of how to control the spread of STDs in Botswana. Now the task is to construct these measures and then analyze the data to see what the answers to these questions are.
The first step in this process is to create a continuous variable for each of the questions presented above (Q404 and Q405). The problem is that the data is currently in a categorical form. For each of the possible symptoms respondents could have identified, a categorical measure has been created. For example, if a respondent responded with "abdominal pain" to question Q404, they were given a value of 1 for the variable stdsign1w. Let's look at the distributions of a few of these variables to get a better understanding of the data. To do this we can use the tabulate command:
tab1 stdsign1w stdsign2w stdsign3w
Before examining the output of the command above, let's learn why the number 1 follows the tab command. Instead of entering several tab commands one after the other to produce each variable's frequency distribution, like this:
tab stdsign1w
tab stdsign2w
tab stdsign3w
We can include the value "1" directly after the tab command and then simply list the variables afterward. So the tab1 command simply produces the frequency distribution of each variable listed. Even though multiple variables are listed it will not construct a crosstabulation of any kind. This command can be very useful in cases like this where we want to see the distribution of several variables. So we can either use the three tabulate commands above or the following to create the output we want, both will produce the exact same output:
tab1 stdsign1w stdsign2w stdsign3w
-> tabulation of stdsign1w
sign of STD |
in a woman |
- abdominal |
pain | Freq. Percent Cum.
------------+-----------------------------------
1 | 227 100.00 100.00
------------+-----------------------------------
Total | 227 100.00
-> tabulation of stdsign2w
sign of STD |
in a woman |
- vaginal |
discharge | Freq. Percent Cum.
------------+-----------------------------------
2 | 1,073 100.00 100.00
------------+-----------------------------------
Total | 1,073 100.00
-> tabulation of stdsign3w
sign of STD |
in a woman |
- itching | Freq. Percent Cum.
------------+-----------------------------------
3 | 347 100.00 100.00
------------+-----------------------------------
Total | 347 100.00
From the output above, we can see that if a respondent identified a specific symptom they were given a non-missing value for the symptom variable. Thus, if a respondent did respond with the symptom "abdominal pain" they were given a value of 1 for the variable stdsign1w. If the respondent did not produce this symptom, they were given a value of missing for the variable stdsign1w. This same pattern was followed for the remaining symptoms identified in the survey.
Now the question facing us is how to construct a continuous measure representing the number of symptoms identified by the respondent, using the categorical variables in the BAIS data. For starters, we know that if a respondent was given a non-missing value for a given symptom indicator, then that respondent named the corresponding symptom. If we were able to count the number of non-missing values across all the symptom indicators, we would be able to construct our continuous variable. This definitely sounds like a job for the egen command; the question is which function we should use. Considering we want to count up the number of non-missing observations, it sounds like a job for the count function. The problem is that we need to work with several variables at the same time, and the count function really isn't suited for this type of job. What we want is something very similar to the count function: the robs function (for more information on this and other functions, use the command: help egen). While the count function counts non-missing values of one variable across observations (usually within households), the robs function counts non-missing values across several variables within a given observation. So to construct a variable equal to the number of symptoms identified by a respondent, we would want to use the following syntax:
egen wscore = robs(stdsign1w-stdsign11w)
egen mscore = robs(stdsign1m-stdsign11m)
So let's walk through what these two commands have done. The first command creates a new variable called wscore that is equal to the number of non-missing observations found across the variables stdsign1w, stdsign2w, stdsign3w, stdsign4w, stdsign5w, stdsign6w, stdsign7w, stdsign8w, stdsign9w, stdsign10w, and stdsign11w. This is exactly the variable we needed, something representing the number of symptoms identified by the respondent. So our new variable is a continuous variable ranging from 0 to 11 (as there are 11 variables - stdsign1w to stdsign11w). The second egen command does the exact same thing as the previous command, except it is using the variables stdsign1m - stdsign11m.
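In Stata 9 and later, the row-wise functions are documented under new names, with rownonmiss() taking the place of robs(). The sketch below applies it to three made-up indicator variables (s1, s2, and s3 are hypothetical stand-ins for the stdsign variables) and checks the result against the companion function rowmiss().

```stata
* Toy data: hypothetical symptom indicators, not the BAIS file
clear
input s1 s2 s3
1 2 3
1 . .
. . .
. 2 .
end

* Number of non-missing values across s1, s2, s3 in each row
egen score = rownonmiss(s1-s3)

* Number of missing values across the same variables
egen nmiss = rowmiss(s1-s3)

* Every row has exactly three indicators
assert score + nmiss == 3

list
```

Note that a range such as s1-s3 refers to the variables as they are ordered in the data set, so it is worth confirming the order with describe before relying on a range like stdsign1w-stdsign11w.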
Let's take a look at our new variables to see what we have:
tab1 wscore mscore
-> tabulation of wscore
wscore | Freq. Percent Cum.
------------+-----------------------------------
0 | 5,494 71.04 71.04
1 | 612 7.91 78.95
2 | 808 10.45 89.40
3 | 506 6.54 95.94
4 | 165 2.13 98.07
5 | 54 0.70 98.77
6 | 18 0.23 99.00
7 | 16 0.21 99.21
8 | 23 0.30 99.51
9 | 19 0.25 99.75
10 | 17 0.22 99.97
11 | 2 0.03 100.00
------------+-----------------------------------
Total | 7,734 100.00
-> tabulation of mscore
mscore | Freq. Percent Cum.
------------+-----------------------------------
0 | 5,451 70.48 70.48
1 | 580 7.50 77.98
2 | 763 9.87 87.85
3 | 568 7.34 95.19
4 | 198 2.56 97.75
5 | 73 0.94 98.69
6 | 17 0.22 98.91
7 | 17 0.22 99.13
8 | 19 0.25 99.38
9 | 16 0.21 99.59
10 | 21 0.27 99.86
11 | 11 0.14 100.00
------------+-----------------------------------
Total | 7,734 100.00
Excellent, now we have the two variables we need. The first, wscore, is an individual level continuous variable that is equal to the number of symptoms a respondent provided that would lead them to think a woman had a sexually transmitted disease. The second variable, mscore, is an individual level continuous variable that is equal to the number of symptoms a respondent provided that would lead them to think a man had a sexually transmitted disease.
The one thing we must keep in mind when working with these variables is that not everyone in the BAIS data file answered questions Q404 and Q405. As a result, many of the 0 values in the mscore and wscore distributions are simply respondents who didn't respond to the individual questionnaire at all. To account for this, when running our analysis we can use the variable rec_per (which is an indicator variable for whether a respondent answered the individual questionnaire) in a qualifying statement.
As a final step prior to running any analysis, let's ensure things are clear by labeling our new variables:
label variable wscore "number std symptoms identified for women"
label variable mscore "number std symptoms identified for men"
Now that we have these two variables constructed, we can begin to answer some of the questions we posed earlier in the section. For example, are men more likely to be able to produce symptoms for men? Are women more likely to be able to produce symptoms for women? To answer these questions, we can use the following syntax:
tab gender if rec_per==1, sum(wscore)
tab gender if rec_per==1, sum(mscore)
. tab gender if rec_per==1, sum(wscore)
sex of | Summary of wscore
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Male | 1.2368973 1.8666789 1908
Female | 1.3981859 1.4908056 2205
------------+------------------------------------
Total | 1.3233649 1.677408 4113
. tab gender if rec_per==1, sum(mscore)
sex of | Summary of mscore
respondent | Mean Std. Dev. Freq.
------------+------------------------------------
Male | 1.6194969 2.0091238 1908
Female | 1.2358277 1.5145048 2205
------------+------------------------------------
Total | 1.4138099 1.7714565 4113
From the output above, we see that while the differences aren't huge, female respondents are able to produce more STD symptoms for women, while male respondents produce more STD symptoms for men. While we still need to investigate things further, one policy implication from these findings is that it may be necessary to educate respondents concerning the STD symptoms of the opposite sex. Although, before we make any final conclusions we should certainly examine things in more depth.
Let's return to the questions we posed earlier in the section. Another possible variable that may impact the relationship observed above is location. Specifically, does the number of symptoms that respondents are able to produce vary across the different locations within Botswana? To answer this question let's first look at the relationship between our score variables (mscore and wscore) and location, and then the three-way relationship between location, gender, and the STD symptom scores. To examine the score/location relationship, we can use the following syntax:
tab location if rec_per==1, sum(mscore)
tab location if rec_per==1, sum(wscore)
. tab location if rec_per==1, sum(mscore)
location of | Summary of mscore
household | Mean Std. Dev. Freq.
------------+------------------------------------
Urban | 1.6446429 1.7509267 1120
Urban Vil | 1.2750716 1.5937196 1047
Rural | 1.3556012 1.8601385 1946
------------+------------------------------------
Total | 1.4138099 1.7714565 4113
. tab location if rec_per==1, sum(wscore)
location of | Summary of wscore
household | Mean Std. Dev. Freq.
------------+------------------------------------
Urban | 1.5401786 1.682638 1120
Urban Vil | 1.2053486 1.518166 1047
Rural | 1.2620761 1.7440816 1946
------------+------------------------------------
Total | 1.3233649 1.677408 4113
From our results we see that respondents in urban areas provided more STD symptoms, on average, than respondents in either urban villages or rural areas. Is this what you expected to find? Why or why not? How do we expect the findings above to change once we introduce the variable gender? Well, let's find out. To answer this question we enter:
tab location gender if rec_per==1, sum(mscore) mean
tab location gender if rec_per==1, sum(wscore) mean
. tab location gender if rec_per==1, sum(mscore) mean
Means of mscore
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 1.820922 1.4658273 | 1.6446429
Urban Vil | 1.3589165 1.2135762 | 1.2750716
Rural | 1.6215316 1.1263158 | 1.3556012
-----------+----------------------+----------
Total | 1.6194969 1.2358277 | 1.4138099
. tab location gender if rec_per==1, sum(wscore) mean
Means of wscore
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 1.4308511 1.6510791 | 1.5401786
Urban Vil | 1.0022573 1.3543046 | 1.2053486
Rural | 1.2308546 1.2889952 | 1.2620761
-----------+----------------------+----------
Total | 1.2368973 1.3981859 | 1.3233649
So using the combination of a crosstabulation and the summarize option, we are able to examine the three-way relationship between location, gender, and wscore or mscore. Looking at the results, it seems that all the previous relationships still hold. For both male and female respondents, the urban residents provided a larger number of symptoms than did urban village and rural residents. Additionally, regardless of location, male respondents were able to provide more male symptoms than female respondents, and female respondents were able to provide more female symptoms than male respondents.
While we have extended our analysis quite a bit from our first table, we still could do more. What other factors may impact the relationships we are observing in the tables above? There are several possibilities, for example education. We might hypothesize that respondents with higher levels of education are more likely to provide symptoms than respondents with lower levels of education. Certainly this isn't the only factor that could be influencing the relationship observed above, but we have to start somewhere. First, let's examine the impact of education on the wscore and mscore variables, and then as a final step look to see if education will impact the findings of our three-way tables above.
To examine the relationship between education and the score variables, we need to decide on an education measure to use. We could use the variable educlev (highest level of education obtained), although the specific distinctions the variable provides may not be necessary. Let's instead create a new variable that is equal to 1 if a respondent has an education level of "secondary" or higher, and is equal to 0 if they do not have an education level of "secondary" or higher. Using this variable we can examine the general effects of education on the score variables.
So let's first create our new education variable:
gen ed = educlev>2 if educlev~=.
The syntax above creates our new dummy variable ed and sets it equal to 1 if the variable educlev is greater than 2 (which includes category 3 "secondary" and category 4 "higher"), and is set to 0 if educlev is less than or equal to 2 (which includes category 2 "primary" and category 1 "non-formal"). Prior to analyzing the data, let's completely label our new variable so our tables are easier to interpret.
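The if educlev~=. qualifier matters because Stata treats a missing value as larger than any number. The sketch below, on made-up education codes, compares the command from the text with a version that omits the qualifier; ed_wrong is a hypothetical name used only to show the problem.

```stata
* Toy data: hypothetical education codes, not the BAIS file
clear
input educlev
1
2
3
4
.
end

* Version from the text: missing educlev stays missing in ed
gen ed = educlev > 2 if educlev ~= .

* Without the qualifier, the missing case is coded 1,
* because missing counts as greater than 2
gen ed_wrong = educlev > 2

list
```

In the last row ed is missing while ed_wrong equals 1, which would silently place respondents with unknown education in the "higher educ" group.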
First let's label the variable ed:
label variable ed "high/low education indicator"
Next, let's create value labels for the new variable:
label define ed 0 "lower educ" 1 "higher educ"
label values ed ed
Now that our variable is fully labeled, let's examine the relationship between education and the score variables. To do this we can use the following syntax:
tab ed, sum(wscore)
tab ed, sum(mscore)
. tab ed, sum(wscore)
high/low | Summary of number std symptoms
education | identified for women
indicator | Mean Std. Dev. Freq.
------------+------------------------------------
lower edu | .97715736 1.5229832 1576
higher ed | 1.8154255 1.7470347 1880
------------+------------------------------------
Total | 1.4331597 1.7004775 3456
. tab ed, sum(mscore)
high/low | Summary of number std symptoms
education | identified for men
indicator | Mean Std. Dev. Freq.
------------+------------------------------------
lower edu | 1.0412437 1.6458354 1576
higher ed | 1.9090426 1.8263412 1880
------------+------------------------------------
Total | 1.5133102 1.7988087 3456
From the tables above we see that the level of a respondent's education matters quite a bit. Respondents with higher levels of education provide more STD symptoms, on average, than respondents with lower levels of education. The final question is how education impacts our previous findings, which examined the relationship between gender, location, and the score variables. To find the answer to this question type:
sort ed
by ed: tab location gender if rec_per==1, sum(mscore) mean
by ed: tab location gender if rec_per==1, sum(wscore) mean
. by ed: tab location gender if rec_per==1, sum(mscore) mean
---------------------------------------------------------------------------------------------------------- -> ed = lower educ
Means of number std symptoms identified for men
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 1.6443299 1.0243902 | 1.3603352
Urban Vil | .7962963 .89519651 | .85421995
Rural | 1.1918159 .81192661 | .99153567
-----------+----------------------+----------
Total | 1.2235609 .87696019 | 1.0412437
---------------------------------------------------------------------------------------------------------- -> ed = higher educ
Means of number std symptoms identified for men
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 2.2226027 1.7925072 | 1.9890454
Urban Vil | 2.030303 1.5865385 | 1.7588235
Rural | 2.3417722 1.6409639 | 1.9439124
-----------+----------------------+----------
Total | 2.2220844 1.6741155 | 1.9090426
. by ed: tab location gender if rec_per==1, sum(wscore) mean
---------------------------------------------------------------------------------------------------------- -> ed = lower educ
Means of number std symptoms identified for women
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 1.2164948 1.2560976 | 1.2346369
Urban Vil | .5617284 1.0436681 | .84398977
Rural | .85421995 .99541284 | .9286578
-----------+----------------------+----------
Total | .88487282 1.0603136 | .97715736
---------------------------------------------------------------------------------------------------------- -> ed = higher educ
Means of number std symptoms identified for women
location |
of | sex of respondent
household | Male Female | Total
-----------+----------------------+----------
Urban | 1.8219178 1.9740634 | 1.9045383
Urban Vil | 1.510101 1.7339744 | 1.6470588
Rural | 1.9113924 1.8120482 | 1.8549932
-----------+----------------------+----------
Total | 1.780397 1.8417132 | 1.8154255
Looking at the tables above, there are several very interesting findings. First, it is clear that the level of education does matter, and has a large impact on the number of symptoms a respondent provides. For both wscore and mscore, respondents with higher education provide more symptoms regardless of their gender. Second, the effect of location has changed slightly with the introduction of education. For the most part, respondents in urban areas provide the most STD symptoms, although unlike in previous examples this does not always hold true. For male respondents with higher education levels, individuals in rural areas provide just as many, if not more, STD symptoms compared to those in urban areas. Finally, in all but one instance (higher-education respondents in rural areas), female respondents provide more STD symptoms for women than do male respondents. Similarly, in all but one instance (lower-education respondents in urban villages), male respondents provide more STD symptoms for men than do female respondents. So overall, it appears that the gender differences we observed early on largely hold even after accounting for location and education.
Are there any other variables that may play a role in the number of symptoms provided by respondents? Try to think of another variable we could use in this analysis, and see whether it changes any of the findings we have produced so far.
In the remaining lesson modules, we will learn more advanced techniques to "account for" or "control for" several variables, which should enable us to produce even more interesting and robust findings.
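One way to try out another variable is to repeat the grouped table of means with a different by-variable. The sketch below shows the general pattern, assuming the tables above were produced by a by-group tabulate with the summarize() option. It runs on nlsw88, an example dataset that ships with Stata and stands in here for the survey data used in this module. The variables collgrad, union, race, married, and wage belong to that example dataset, not to ours; substitute your own grouping, row, column, and outcome variables.

```stata
* Load an example dataset that ships with Stata (a stand-in for the module's data)
sysuse nlsw88, clear

* One table per level of collgrad:
* rows = race, columns = married, cells = mean of wage
bysort collgrad: tabulate race married, summarize(wage) means

* Swap in a different grouping variable to see whether the pattern changes
bysort union: tabulate race married, summarize(wage) means
```

Comparing the two sets of tables side by side is the same exercise as comparing the lower and higher education tables above: if the row and column differences look similar within every group, the grouping variable has not changed the story.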
Chi-Squared: Testing for Independence
So far, we have examined tables of variables. You may have noticed that in a few examples, as one variable increased or decreased, the other variable in the crosstab changed as well. While the naked eye is good at noticing such patterns, it is unclear how reliable they are until we examine them statistically. Before analyzing the variables any further, it is a good idea to test whether the variables in the crosstab are independent. Two variables X and Y are independent when knowing the value of X tells us nothing about Y; that is, the distribution of Y is the same at every value of X. This test for independence detects any kind of association between the two variables, not just a linear one. In the next module, we will be working only with linear relationships, so this is a good test to run now.
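For reference, the Pearson chi-squared statistic behind this test can be written out as follows. The notation is ours rather than Stata's: O_ij is the observed count in row i and column j of the crosstab, R_i and C_j are the row and column totals, N is the total number of observations, and r and c are the numbers of rows and columns.

```latex
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\mathrm{df} = (r - 1)(c - 1)
\]
```

Here E_ij is the count we would expect in each cell if the two variables were independent. The larger the gaps between the observed and expected counts, the larger the statistic and the smaller the p-value.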
Let's try this simple example with the variables educlev and literacy.
tabulate educlev literacy, row col chi2
   highest |
  level of |             literacy
 education | Reads eas  Reads wit   Does not |     Total
-----------+---------------------------------+----------
Non-Formal |        15         21          7 |        43
           |     34.88      48.84      16.28 |    100.00
           |      0.52       4.27      10.45 |      1.25
-----------+---------------------------------+----------
   Primary |     1,049        424         60 |     1,533
           |     68.43      27.66       3.91 |    100.00
           |     36.25      86.18      89.55 |     44.40
-----------+---------------------------------+----------
 Secondary |     1,555         47          0 |     1,602
           |     97.07       2.93       0.00 |    100.00
           |     53.73       9.55       0.00 |     46.39
-----------+---------------------------------+----------
    Higher |       275          0          0 |       275
           |    100.00       0.00       0.00 |    100.00
           |      9.50       0.00       0.00 |      7.96
-----------+---------------------------------+----------
     Total |     2,894        492         67 |     3,453
           |     83.81      14.25       1.94 |    100.00
           |    100.00     100.00     100.00 |    100.00

          Pearson chi2(6) = 623.2008   Pr = 0.000
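In each cell, the first number is the frequency, the second is the row percentage, and the third is the column percentage. Pr = 0.000 is the p-value of the test, rounded to three decimal places: if educlev and literacy were independent, a chi-squared statistic this large would occur less than 0.05% of the time. We therefore reject the hypothesis of independence and conclude that the two variables are related. Logically, this makes sense. You would expect that those with higher levels of education could read easily, while those with lower levels of education would be more likely to read with difficulty or not read at all.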
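To see where the statistic comes from, the same test can be rerun directly from the cell counts printed above, without the dataset. This is a minimal sketch using Stata's immediate command tabi, with the frequencies typed in row by row. The expected option asks Stata to print, next to each observed count, the count that cell would have if the two variables were independent.

```stata
* Rows:    Non-Formal, Primary, Secondary, Higher
* Columns: reads easily, reads with difficulty, does not read
tabi 15 21 7 \ 1049 424 60 \ 1555 47 0 \ 275 0 0, chi2 expected
```

If the counts are typed in correctly, the Pearson chi2(6) line should match the one above. The expected frequencies also show which cells drive the result: far fewer Secondary and Higher respondents fall in the "Reads wit" and "Does not" columns than independence would predict.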
EXERCISES
- For each level of literacy, what is the average number of years of schooling for urban, rural, and urban village respondents?
- Exercise 1 Answer
- What percentage of households residing in a Town home own a donkey/horse?
- Exercise 2 Answer
- What is the average age at first marriage for men and women in each of the different religious groups?
- Exercise 3 Answer
- Create a variable that is equal to the largest age difference in the household.
- Exercise 4 Answer