TABLE OF CONTENTS
Introduction
Launching Stata
Getting Ready
Loading Data into Stata
Exploring the Data
Managing the Data
Creating and Using Stata Do Files
Exercises
INTRODUCTION
In this module, you will be introduced to the statistical program Stata. By the time you are finished with this module, you will be able to answer questions like:
- How many households are in the BAIS data set?
- What are some variables in the data set that are related to pregnancy?
- How old is the oldest person in the data set?
- How many 30 year old men are in the data set?
LAUNCHING STATA
The first thing we need to do is start the statistical software program, Stata. So let's go! Launch Stata now if you have not already done so.
We will need to go back and forth between the window that Stata uses and the window used by your web browser (with which you are now viewing this text).
When Stata is running, there are a number of "windows" within Stata. The "Stata Command" window is where we will enter all Stata commands. Note that every attempt will be made to display the relevant Stata commands used in these modules in black Courier font (as shown here). The "Review" window lists all commands run by Stata. We will be able to repeat any command listed there by simply clicking on it instead of re-typing it. The "Stata Results" window is where all the output from our commands appears. The "Variables" window lists the variables that are in your data set. When we first open Stata, all these windows will be blank except for the "Stata Results" window.
GETTING READY
We'll need to do a couple things before actually loading the data set. First, you will want to get in the habit of opening what is called a log file before you start your work. This file will record all the input that you type as well as all the output produced by Stata. It is a useful file to have for a number of reasons. It will let us re-create our work if we later decide we want to redo something. It also allows others to replicate our work. The log file can also be used to cut results from and paste them into another file for editing purposes. For all these reasons, always open a log file. We can easily delete it later if we decide not to keep it. To open a log file, in the Stata command window, type:
log using "C:\Workshop\filename.log", replace
If the log file does not already exist, you will see a Stata message warning you that the file you created is new. That's fine. The replace option in the command line above tells Stata to write over any existing log file with the same name. If we wanted to add this log file on to an existing file without erasing the contents of the existing file, we would use:
log using "C:\Workshop\filename.log", append
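Putting the pieces together, a logged session might look like the sketch below. It assumes the same hypothetical file name used above; the display line merely stands in for whatever commands you run, and log close ends the recording.

```stata
* Start a log, overwriting any earlier file of the same name
log using "C:\Workshop\filename.log", replace

* Everything typed from here on, and its output, is recorded in the log
display "Log file opened"

* Close the log when the session is finished
log close
```

Closing the log before opening another one avoids the error Stata reports when a log file is already open.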
AN IMPORTANT HINT: For every Stata command that we will use, there is on-line help within Stata. For example, above we used replace and append as options to the log command. There are other options and these can be investigated by simply typing:
help log
This help command works for almost every command in Stata. If there are no Stata manuals handy, the help command is invaluable. The help command will tell us how to use a command, what that command does, its options, and even some examples. Use it often!
The second step we will need to take before opening the data set is to tell Stata how much memory the data set will require. In this module, we will begin by using the BAIS data set. This data set is:
C:\Workshop\bais.dta
Other data sets have slightly different names. If you are using a folder other than "Workshop" (which would be a subdirectory you would create on your 'C:\' drive), that name too would be different. This data set requires 5 megabytes of memory. To tell Stata how much memory to set aside for our data, type:
set mem 5m
If you were using a larger BAIS data file, or any other larger data set, you would replace the "5m" in the set mem command above with a value slightly larger than the size of the data file in megabytes. Thus, if a data file is 10 megabytes in size, you would want to set the memory to 11 megabytes or more. (In Stata 12 and later, memory is managed automatically and this step is no longer needed.)
LOADING THE DATA INTO STATA
Now we are ready to open the data. There are several ways to do this. The easiest way while in Stata is to click on the "File" menu at the top left and then click on "Open." Then navigate your way to the folder to which you downloaded the data. We have called that folder Workshop. Finally, click on the data set. It will be named "bais.dta". After selecting the data file, click on "Open" and the file will open in Stata. Alternatively, if you know the full path and file name of the data file, you could enter this directly in the Stata "Command" window using:
use "C:\Workshop\bais.dta", clear
The clear option at the end of the above command will remove from memory any data currently there. Using this option does not pose a problem when you are getting started, but you should be aware that if you load a new data set using the clear option, you will lose all the changes you might have made to the data loaded in memory, unless you save the changes before using the command. Now we are ready to begin exploring the data!
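If you would rather not type the full path each time, one option is to change Stata's working directory first. This is a minimal sketch that assumes the data sit in C:\Workshop as described above; describe, short then prints only the number of observations and variables as a quick check that the load worked.

```stata
* Make C:\Workshop the working directory for this session
cd "C:\Workshop"

* Load the data by file name alone, discarding any data already in memory
use bais.dta, clear

* Brief summary: number of observations and variables
describe, short
```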
EXPLORING THE DATA
Now that the data is loaded into Stata you will notice that the "Variables" window now has two new columns of information. The first column is the list of variables in the data we just loaded and the second column displays the attached variable labels. A variable label is a brief description of the variable's content. For example, the first variable in this window is hhid and the label for this variable reads "household id number". To learn more about what this variable really is, we would need to go back to the BAIS Survey. We'll do this later. In this case, hhid is a unique number given to each household in the survey that allows us to identify the household without compromising the household's true identity. Take a few minutes to scan the list of variables using the scroll bar on the "Variables" window. There are several ways within Stata to further explore the contents of your data set. One example of this is the command ds. In the Stata command window type:
ds
In the "Stata Results" window, we now see a list of variable names. These are the same variable names that appear in the "Variables" window. The ds command is advantageous in that it lists a large number of variable names at once, although variable labels associated with these variable names are not shown. Nevertheless, this is a handy way to see many variable names at once. It turns out that at times not all the variable names will fit in the window. At the bottom of the window, it may say "--more--." This is a common occurrence in Stata. By simply tapping the space bar on your keyboard, you can scroll through the information in the "Stata Results" window. If you don't want the output to stop after every full screen, just type:
set more off
Here are some other useful commands for exploring the data set:
describe will tell you how many observations are in the data set, how much memory the data set is using, what the variables are, how much memory each variable is using, and how many variables are in the data set. There are other details that don't really concern us at this point.
codebook will provide very detailed information about every variable in your data set. It will tell you, for each variable, the number of missing observations, the largest and smallest values, the number of unique values, and some information regarding the means and standard deviation of the variable.
list will simply print the data on the screen. It will provide you with more information than you probably want unless you use it with some of the qualifiers described immediately below.
lookfor is like a search engine. You can specify what you are looking for and this command will list all variable names or labels that contain the list of letters (or string) that you give it. Some examples are listed after the introduction of the qualifiers and operators.
As you use these commands, you will often want to use qualifiers and operators. By using these options, you can restrict the specified Stata command to a specific subset of the data.
| Qualifiers: | | Comparison Operators: | | Logical Operators: | |
|---|---|---|---|---|---|
| if | qualify when a command is executed | == | equal to | \| | or |
| in | specify which observations to examine | != | not equal to | & | and |
| | | ~= | not equal to | | |
| | | > | greater than | | |
| | | < | less than | | |
| | | >= | greater than or equal to | | |
| | | <= | less than or equal to | | |
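As a rough illustration of how these pieces combine, the sketch below uses variables that appear elsewhere in this module (hhid, age, gender). The range 1/10 and the age cut-offs are arbitrary choices made only for the example.

```stata
* "in" with a range: show the first ten observations
list hhid age gender in 1/10

* "if" with a logical "and": women (gender == 2) aged 50 or older
list hhid age if gender == 2 & age >= 50

* Parentheses control how "or" and "and" are grouped
list hhid age if (age < 18 | age > 65) & gender == 2
```

Without the parentheses in the last command, Stata would evaluate the "and" before the "or", which selects a different set of observations.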
Qualifiers and operators add more detail to these data exploration commands. For instance, try some of the following examples:
describe
This command alone lists all the variables and their corresponding labels in the data set.
Now if we type the following:
describe relhead educ gender
A list of the variables relhead, educ, and gender, along with their labels, is displayed in the Stata Results window.
Now what if we wanted to explore information specific to a particular group or person only? This is where the qualifier and operators are used. Using qualifiers and operators allows us to apply Stata commands to specific observations in the data. To make things a bit more clear, here are some examples:
list in 200
Allows us to examine the data associated with the 200th observation.
list if gender == 2
Allows us to examine the contents of all observations where gender equals 2; the number 2 in this case refers to women, thus only information for female respondents will be displayed.
list educ if age > 50
This command will print all the observations for the educ variable where age of the individual is greater than 50. When you try this command, you will note that many of the observations for educ are recorded as a missing value, "." If you wanted to list the educ of all individuals older than 50 years old and did not want to list those for whom educ was missing, you could type:
list educ if age > 50 & educ ~= .
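The same exclusion can be written in other ways. The two commands below are sketches of alternatives: missing() is a built-in Stata function, and the comparison educ < . relies on Stata treating missing values as larger than any number. Unlike educ ~= ., both forms would also exclude extended missing codes such as .a, should the data contain any (an assumption; the module does not say whether they do).

```stata
* Exclude missing education using the missing() function
list educ if age > 50 & !missing(educ)

* Equivalent: every non-missing number is smaller than "."
list educ if age > 50 & educ < .
```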
codebook resyears
This provides you with descriptive output for the variable resyears. The output from this command tells us that the number of years a respondent has continuously lived in the respective locality varies from 0 to 64 in the BAIS data, and that the average number of years living in the locality is 12.92.
lookfor birth
Gives us all the variables that have "birth" in their name or their label. Entering this command, we can see that in addition to information related to the number of times a respondent gave birth, there are several other variables related to birth in the BAIS data set.
Let's see if you have gotten the hang of this. Try these quick exercises:
- 1. What is the hhid value for the 1000th observation?
- Question 1 Answer
- 2. What are the ages of the respondents with an hhid equal to 629?
- Question 2 Answer
- 3. How many 50 year old female respondents are there in the data? (Be careful: this one is a bit tougher than the others.)
- Question 3 Answer
MANAGING THE DATA
In this section, we'll learn some data management commands. In many circumstances, we will want to amend the original BAIS data set. We might want to add new variables that we create from the existing variables, we might want to drop variables that we will never use to free up memory, we might want to recode missing values, and/or we might want to create our own variable labels. While we will not deal with some of the trickier data management issues (such as merging data sets) in this section, you will learn enough to get started. We'll start by creating a new variable. To create a new variable in Stata you use the command generate. When using generate (or gen for short) you must specify two values, a name for the new variable, and what the new variable is equal to. Let's try an example. Suppose we want to create a new variable called "temp" and we would like this variable to be equal to 1 for every observation in the data set.
To create this new variable you need to type:
generate temp = 1
Go ahead and enter this command into Stata. You will notice that our new variable temp has been added to the end of the variable list in the Stata Variables window. How do we know for sure what this new variable is equal to? Did the command work correctly?
See if you can figure this out:
- 4. How can we check to make sure our new variable is equal to 1?
- Question 4 Answer
- 5. How would you create a new variable called "temp2" that is equal to 50 for all of the observations in the data set?
- Question 5 Answer
Now that we have created our new variable, it would be useful to create a label for it to help us identify what the variable is. To create a label for a variable you use the Stata command label variable. To label the new variable temp type:
label variable temp "This is a temporary variable equal to 1"
This command will add the label "This is a temporary variable equal to 1" to the new variable temp. Typing the following Stata command you will see that the new label has been added:
describe temp
Now you may be asking yourself, why would I want to create a variable like temp that has the same value for each observation in the data set? Variables like temp can actually be quite useful at times, although the majority of the variables you create will not have the same value for all observations. Let's try another example. Using the existing variable literacy, let's create a new variable that has only two values: respondents who can read and respondents who can not read. The variable literacy in the BAIS data set has three values: reads easily, reads with difficulty, and can not read at all. To construct our new variable, we will want to use the commands generate and replace. We'll call the new variable read. In the Stata Command window, type:
generate read = .
This command will create a new variable called read and all values of this variable are set to "missing". Now, if the individual is able to read (either easily or with difficulty), we want to set the variable read equal to 1. If the individual is unable to read, we want to set the variable equal to 0. We do this by using the replace command. The replace command allows us to change the values of an existing variable. By typing the following two commands we recode the new variable read according to our desired scheme:
replace read = 0 if literacy==3
replace read = 1 if literacy==1 | literacy==2
In the above commands, we needed to know how the original variable literacy was coded. All original coding can be found in the BAIS survey. Next we will want to put a label on this variable so we know what it is. This requires the label variable command.
Type:
label variable read "0 can not read, 1 can read"
You will now notice at the bottom of the Variables window that our new variable read is listed with its new label. All right, now it's your turn. Try to answer the following questions:
- 6. How would you create a new variable called "head" that is equal to 1 if the individual is the resident head, and equal to 0 if the individual is not?
- Question 6 Answer
- 7. How would you label this new variable with the following label - "Resident head indicator"?
- Question 7 Answer
As you will soon learn, there are always several ways to accomplish any task within Stata. This is certainly true of the indicator variable we just created. While the syntax we used above creates the desired variable, there are several alternate methods to creating the same variable.
For example, instead of using three commands to create the indicator variable, there is a way to create the variable with a single command (actually there are a few, but we won't learn them all here). To recreate our read variable, we could use the following:
generate read2= literacy~=3 if literacy~=.
Using this syntax, Stata will set the read2 variable equal to 1 if the statement after the equals sign is "true". If the statement is "false", the read2 variable is set equal to zero. The if qualifier leaves read2 missing wherever literacy is missing. It is worth noting that if we had excluded the qualifier from this command we would not have constructed the desired measure. Using the following syntax:
generate read2= literacy~=3
would set the read2 variable equal to 1 if the statement is true, and 0 if it is false, for every observation. Do you see the difference between the two commands? Stata treats a missing value as larger than any number, so for an observation where literacy is missing the statement literacy~=3 is true. Without the qualifier, the observations coded as missing in the literacy distribution would therefore be coded as 1, as if those respondents could read, rather than being left as missing. Thus, when using this method, you must be very careful to consider missing values in the data you are using.
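One way to see this behaviour for yourself is sketched below. The variable names check1 and check2 are hypothetical and used only for this illustration. The missing option on tabulate adds a row and column for missing values, so any observations with missing literacy show up in the table.

```stata
* A missing value is not equal to 3, so this expression is true and displays 1
display (. ~= 3)

* Hypothetical variables: the same expression without and with the qualifier
generate check1 = literacy ~= 3
generate check2 = literacy ~= 3 if literacy ~= .

* Cross-tabulate each against literacy, showing missing values as well
tabulate literacy check1, missing
tabulate literacy check2, missing
```

In the first table, any observations with missing literacy fall under check1 equal to 1. In the second, they remain missing.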
We can also use this method of variable construction when converting a continuous variable to an indicator variable as well. For example, what if we wanted to create an indicator variable that was equal to 1 for respondents under the age of 50, and 0 for those respondents who are 50 or older. In this case, we are taking the continuous variable age and converting it to an indicator variable. To create this new variable we could use the following command:
generate age50= age<50
The command above will set the variable age50 equal to 1 if a respondent's age is less than 50. Conversely, age50 will be set to 0 if a respondent's age is 50 or greater. Here is a case where we didn't need to include a qualifier to deal with missing observations, as the variable age has no missing values. What would have happened if it did?
Given this new method of creating variables, it should be easier and quicker to construct indicator variables from here on out.
Other useful data management commands are:
drop will remove from memory the variables that are listed after the command. For example, if we no longer needed the variable read, we could type drop read. If we want to drop lots of variables, it is usually easier to use the keep command instead.
keep will retain only the listed variables and drop all the others. Be careful when using this command since it eliminates from memory everything that is not listed.
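A short sketch of both commands follows. It assumes the temp and read variables created earlier in this section still exist, and the file name bais_small.dta is hypothetical. Saving under a new name leaves the original bais.dta untouched.

```stata
* Remove two variables that are no longer needed
drop temp read

* Alternatively, retain only a handful of variables and discard all others
keep hhid age gender educ

* Save the reduced data under a new (hypothetical) file name
save "C:\Workshop\bais_small.dta", replace
```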
CREATING AND USING STATA "DO" FILES
Up to this point, we have been entering all Stata commands using the command window. In the process of recoding the more complex variables or in the process of creating more sophisticated graphs, you will find it cumbersome to enter long lines of syntax (commands) line by line. We will now learn how to be much more efficient by using Stata .do files. As the name implies, ".do" files are files that help you "do" commands with Stata.
Normally, do files are created using a simple text editor, like Notepad or another program that can save plain text; Stata itself also has a Do-File Editor. In general, you can use any text editor as long as you save the document with a .do extension. After correctly typing your commands into a do file, you save it and then run it by telling Stata where it is located. Stata will then execute the commands contained within the do file.
Like many other things in Stata, there are several ways to find and run your do files. One way is to use FILE on the toolbar, click on DO..., and then find the do file in the directory where it was saved. A second method is to use an explorer window to find the do file and then double-click it to have Stata execute it. A third, and the trickiest, is to locate and run it from the command window. This method is a bit harder because it requires basic knowledge of DOS/UNIX-style commands to navigate the various directories and subdirectories.
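For the command-window route, a minimal sketch is shown below. It assumes the do file was saved as example.do in C:\Workshop; both the file name and the location are hypothetical.

```stata
* Move to the folder that holds the do file
cd "C:\Workshop"

* List the do files in that folder to confirm the name
dir *.do

* Run the do file
do example.do
```

Giving the full path in one step, as in do "C:\Workshop\example.do", avoids changing directories at all.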
Let's find out what a do file looks like. Because the Stata Do-File Editor is immediately available, we will use it to type in and save our Stata syntax (commands). We can open the Do-File Editor either by clicking on its icon (it looks like a right hand holding a pen over a white pad of paper) or by using the keyboard shortcut: press the Ctrl and 8 keys together. Once it is open, you can either type in the syntax below or highlight it, copy it, and paste it into the Do-File Editor. Either way, after entering the commands you will want to save the file and note the directory in which you saved it. (Note: if you type it manually, you do not need to include the comments; however, if you copy and paste it, the comments will not interfere with the commands.) After entering the syntax and saving the file, run the do file as outlined above and watch Stata do its magic. Let's try it. Either type, or copy and paste, the syntax below into the Do-File Editor:
***********************************************************
set mem 5M /*Sets the memory to 5M*/
set mat 800 /*Sets the number of variables allowed in any given model estimation*/
set more 1 /*Allows the output to scroll by without requiring user assistance*/
/*The next command tells Stata that every command line below ends with a ";"*/
#delimit ;
log using example.log, replace; /*Tells Stata to log output and to replace the old log*/
use C:\Workshop\bais.dta; /*Tells Stata to load and use the named data file*/
generate read = .;
replace read = 0 if literacy==3;
replace read = 1 if literacy==1 | literacy==2;
label variable read "0 can not read, 1 can read";
log close; /*Closes any opened log file*/
***********************************************************
Now, let's review what the syntax above is telling Stata.
The first three commands set the environment for Stata to work in. In particular, the set more 1 command is important to include. It is similar to the set more off command; however, when set more off is typed in, it remains in effect for that session of Stata until you type in set more on. If instead you use set more 1, Stata will temporarily set more off for that do file and then reinstate it after the do file is done.
The fourth command is also very important: it allows us to break up long commands into multiple lines. Unless otherwise told, Stata will automatically assume that a command ends with a carriage return (i.e., the enter key). The command #delimit ; tells Stata to execute whatever comes before the next semicolon as one command, which means that every command after it must end with a semicolon. This option can be reset with #delimit cr, which tells Stata to execute everything before the next carriage return as one command (the default setting). The #delimit ; command can be quite useful, especially when we start writing longer lines of code.
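To make the effect concrete, here is a small sketch meant to be saved and run as a do file (#delimit is intended for do files, not for the command window). It reuses variables from earlier in this module; the particular command is only an illustration.

```stata
* Switch to the semicolon as the end-of-command marker
#delimit ;

* One command spread over three lines; it runs when the ";" is reached
list hhid age educ
    if age > 50
    & educ ~= . ;

* Return to the default: each line is one command
#delimit cr
describe hhid age educ
```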
You should already be well acquainted with the remaining commands. It is important to realize, though, that virtually every command you enter in the command window can also be entered in a do file.
Including Comments with your Syntax:
Note the use of comments. Every good programmer will include more than enough comments to make their syntax completely understandable to anyone else interested in the coding, recoding, or creating of new variables. In general, we recommend and embrace an active and prolific use of comments. In the example above, our comments are meant to document the purpose of specific command lines; in a typical do file, however, it is likely that we would only extensively comment the newly created variables and the rationale behind them. As you begin to develop your own do files and new variables, we encourage you to comment your new creations.
Using Log Files to Create Do Files:
While it can be a bit cumbersome to edit a log file, it is definitely a viable alternative to creating a do file from scratch. As discussed earlier, every command entered in the command window is recorded in your log file. In the log, however, Stata prefaces each command line with a "." (dot) and follows it with the output from that command. Thus, after entering Stata commands interactively and closing your log file, you can edit the log file by removing the dots, the output, and any other Stata messages, and then save the edited file as a new do file with a .do extension. After that, you are ready to rerun all your saved commands.
EXERCISES
Using Stata and the data set named bais.dta, answer the following questions. After you think you have the answer, you can click on the "Answer" link to see if you have the correct answer. Remember, use the help command if you need to. We'll start with the questions presented at the beginning of this module.
- How many households are in the BAIS data file?
- Exercise 1 Answer
- What are some variables in the data set that are related to pregnancy?
- Exercise 2 Answer
- How old is the oldest person in the data set?
- Exercise 3 Answer
- How many 30 year old men are in the data set?
- Exercise 4 Answer
- How would you create a variable that is equal to 1 for respondents who are citizens of India and 0 for those who are not?
- Exercise 5 Answer
- How would you drop every variable from the data set except the new variable you just created?
- Exercise 6 Answer