e-TA 1: Brief Introduction to Stata
Welcome to e-Tutorial, your online help for Econ 508. The introductory material presented below is the first in a series of handouts that will be distributed throughout the course, designed to enhance your understanding of the topics and your performance on the homework. This first issue focuses on the basic operations of the main software used in the course (Stata). The core material was extracted from Gregory Kordas' "Computing in Econ 472" (1999) and "A Tutorial in Stata" (1999). The usual disclaimers apply.
Accessing Stata
The statistical package Stata can be found at the OCSS (Office of
Computing and Communications for Social Sciences) lab, located at
212 Lincoln Hall. You can also find Stata in the Foreign Languages
Building (FLB), room G8. It is also available at the Econometrics Lab,
DKH, for students enrolled in the Econometrics field or in other classes
that require lab experiments.
Stata is probably the most widely used statistical software for applied econometrics. It comes with an extensive library of functions, and user-written functions are easy to download. Stata also has an extraordinary set of reference books, and for this reason some students may be interested in purchasing the package. In that case, the best strategy is to form a group of students and place a special order with STATA Inc. For Econ 508 purposes, however, the weekly edition of e-Tutorial will bring all the information necessary to solve the homework. A very important resource to keep in mind whenever you encounter a problem not described in the manuals is the official Stata forum at statalist.org.
First steps in Stata
Once Stata is installed, the next step is to learn the syntax of the language, that is, its rules. When you open Stata you will see the Stata console, which displays the results of your analysis and any messages produced by the commands you enter in the command line (the box at the bottom of the screen).
For example, we can use Stata as a calculator. In the command box you can type:
display 2 + 2
4
or
display log(1)
0
To quit the program just type:
exit, clear
Note: most Stata commands have several options, which are invoked after a comma (","). In the example above, if data were loaded in memory, typing only "exit" would give the following error:
no; data in memory would be lost
r(4);
Therefore, it is necessary to clear the data from memory before exiting Stata.
This can be done with two commands, by typing "clear" first and
then "exit", or with the clear option of the "exit" command (as
above). To learn more about what a command does and its options, you
can always refer to its help file by typing:
help exit
Scripting your work
Rather than saving the workspace, it is highly recommended that you
keep a record of the commands entered, so that you can reproduce your
work at a later date. The easiest way to do this is to record all your
commands in a do-file, available from the File menu. Commands are
executed by highlighting them and hitting Run (or Ctrl+Shift+D for PC
and Command+Shift+D for Mac). At the end of a session, save the final
script for a permanent record of your work.
A do-file is a text file that contains lines of Stata code that can be saved and used over and over again. This is the preferred method to save your work and guarantee reproducibility. To learn more about reproducible research you should read Professor Koenker's Reproducibility in Econometrics Research webpage.
A useful tip to keep in mind is that any line that begins with
a * sign is treated as a comment and is ignored by Stata.
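As an illustration of the comment rule, here is a minimal sketch of what a short do-file might look like. The file name mysession.do is hypothetical, and the // and /* */ comment styles and the /// line continuation are standard Stata do-file syntax that the text above does not cover.

```stata
* mysession.do: a hypothetical do-file name used only for illustration
* A line that starts with an asterisk is a comment.

display 2 + 2        // text after a double slash is also ignored

/* A block comment
   can span several lines. */

* Three slashes continue a long command on the next line:
display "two plus two is " ///
    2 + 2
```

You could run the whole file from the command line by typing do mysession, provided the file sits in your working directory. Note that the /// continuation only works inside a do-file, not in the command box.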
Working in Stata
The best way to learn Stata is to dive right in and work through a simple example.
Example - The U.S. Economy in the 1990s
Let's start with an analysis of the performance of the U.S. economy during the 1990s. We have annual data on GDP growth, GDP per capita growth, private consumption growth, investment growth, manufacturing labor productivity growth, unemployment rate, and inflation rate. (The data is publicly available in the statistical appendixes of the World Economic Outlook, May 2001, IMF).
The first step is to tell Stata the location of your working
directory, that is, where all the files related to your project are.
You should always do this at the beginning of your session, using the
cd path command, where path is the path to the folder in which you
want to read and write files. For example:
cd "C:\Econ508\eTA\"
This command tells Stata to read and write everything in
the Econ508\eTA folder (which I assume you created beforehand).
The next step is to download the data. The data is available in text format here. To load the data, type:
insheet using "US90.txt"
In the command above, insheet is the action executed by Stata. It is
followed by using, after which you must give the name of the file you
want to load. insheet can be used when the data is delimited, and Stata
will automatically detect the type of delimiter (e.g. commas or tabs).
For ASCII data in *.raw format, you must use the infile command.
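If you are working with a recent version of Stata, note that insheet has been superseded by import delimited (this assumes Stata 13 or later; insheet still works in those versions). A minimal sketch of the equivalent command, assuming US90.txt is in your working directory:

```stata
* Assumes Stata 13 or later and that US90.txt is in the working directory.
* The clear option drops any data currently in memory before loading.
import delimited using "US90.txt", clear
```

As with insheet, the delimiter is detected automatically, and the first row is read as variable names when it looks like a header.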
After that, you can view the data in a spreadsheet format by typing browse.
To view small data sets you can also type:
list
     +-------------------------------------------------------------------+
     | year   gdpgr   consgr   invgr   unemp   gdpcapgr   inf   producgr |
     |-------------------------------------------------------------------|
  1. | 1992     3.1      2.9     5.2     7.5        1.9     3        5.1 |
  2. | 1993     2.7      3.4     5.7     6.9        1.5     3        1.9 |
  3. | 1994       4      3.8     7.3     6.1          3   2.6          3 |
  4. | 1995     2.7        3     5.4     5.6        1.7   2.8        3.9 |
  5. | 1996     3.6      3.2     8.4     5.4        2.6   2.9        3.4 |
     |-------------------------------------------------------------------|
  6. | 1997     4.4      3.6     8.8       5        3.4   2.3        3.8 |
  7. | 1998     4.4      4.7    10.7     4.5        3.4   1.5        6.2 |
  8. | 1999     4.2      5.3     9.1     4.2        3.2   2.2        5.8 |
  9. | 2000       5      5.3     8.8       4        4.2   3.4        7.2 |
 10. | 2001     1.5      2.5     3.3     4.4         .7   2.6        4.1 |
     |-------------------------------------------------------------------|
 11. | 2002     2.5      2.4     3.8       5        1.8   2.2          3 |
     +-------------------------------------------------------------------+
You can also find out how many observations (rows of data) you have by typing:
count
11
The final step is to save your data in Stata format (*.dta). This can be done by typing:
save US90
file US90.dta saved
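Once the data set is saved, later sessions can start from the .dta file instead of the text file. The following is a minimal sketch of that workflow, assuming US90.dta is in your working directory:

```stata
* Reload the saved data set, replacing whatever is in memory
use US90, clear
* List the variables with their storage types and labels
describe
* After making changes, overwrite the existing file
save US90, replace
```

Without the replace option, save refuses to overwrite an existing file, which protects you from accidentally losing an earlier version.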
Basic Operations
A useful way to explore your data is checking the main statistics of each variable. For example, in the Stata Command window you can obtain the minimum, maximum, arithmetic mean, and standard deviation of each variable in your data set by typing:
summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
year | 11 1997 3.316625 1992 2002
gdpgr | 11 3.463636 1.050974 1.5 5
consgr | 11 3.645455 1.03476 2.4 5.3
invgr | 11 6.954545 2.408885 3.3 10.7
unemp | 11 5.327273 1.125247 4 7.5
-------------+--------------------------------------------------------
gdpcapgr | 11 2.490909 1.048289 .7 4.2
inf | 11 2.590909 .5204893 1.5 3.4
producgr | 11 4.309091 1.590883 1.9 7.2
If you are only interested in a single variable, just include its name after the command:
summarize gdpgr
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
gdpgr | 11 3.463636 1.050974 1.5 5
If you also wish to know the behavior along the percentiles for the variable gdpgr, type
summarize gdpgr, detail
                            gdpgr
-------------------------------------------------------------
      Percentiles      Smallest
 1%          1.5            1.5
 5%          1.5            2.5
10%          2.5            2.7       Obs                  11
25%          2.7            2.7       Sum of Wgt.          11
50%          3.6                      Mean           3.463636
                        Largest       Std. Dev.      1.050974
75%          4.4            4.2
90%          4.4            4.4       Variance       1.104545
95%            5            4.4       Skewness      -.3234939
99%            5              5       Kurtosis       2.156121
If you are only interested in a subset of your data, you can inspect
it using filters. E.g., if you are only interested in GDP growth in the
years of the Clinton administration, you type
summarize gdpgr if year>=1993 & year<=2000
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
gdpgr | 8 3.875 .8259194 2.7 5
You can then contrast that period with the years of the two Bush administrations:
summarize gdpgr if year<1993 | year>2000
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
gdpgr | 3 2.366667 .8082903 1.5 3.1
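The two comparisons above can also be obtained in one step by creating an indicator variable and tabulating by it. This is only a sketch of one possible approach; the variable name clinton is hypothetical.

```stata
* clinton is a hypothetical variable name: 1 for 1993-2000, 0 otherwise
generate clinton = (year>=1993 & year<=2000)
* Mean, standard deviation and number of observations of gdpgr by group
tabstat gdpgr, by(clinton) statistics(mean sd n)
```

The logical expression in parentheses evaluates to 1 when true and 0 when false, so no separate recoding step is needed.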
You may also check all years except the election years, to avoid political cycles:
summarize gdpgr if year~=2000 & year~=1996 & year~=1992
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
gdpgr | 8 3.3 1.090216 1.5 4.4
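At this point you have already noticed the main relational and logical operators in Stata:
>= means "greater than or equal to",
<= means "less than or equal to",
> and < mean "greater than" and "less than",
~= means "not equal to",
& means "and",
| means "or".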
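A few more conditions built from these operators, as a short sketch using the variables in the data set. The == (equality) and ! (negation) operators are standard Stata syntax that does not appear in the examples above.

```stata
* Years in which unemployment was below 5 and inflation above 2.5
list year unemp inf if unemp < 5 & inf > 2.5
* Number of years outside the 1993-2000 window
count if !(year>=1993 & year<=2000)
* == tests equality (a single = is reserved for assignment)
list year gdpgr if year == 2000
```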
The arithmetic operators are as usual (+, -, *, /). You can use them
to create new variables. Suppose you wish to know how close GDP growth
is to GDP per capita growth. You create the ratio of those two
variables and check it:
generate gdpratio = gdpgr / gdpcapgr
summarize gdpratio
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
gdpratio | 11 1.487338 .2820946 1.190476 2.142857
The same procedure can be done to obtain traditional transformations, such as
squares: gen produc2=producgr^2
square roots: gen infroot=sqrt(inf)
exponential: gen expgdpgr=exp(gdpgr)
natural logs: gen logunemp=log(unemp) or simply gen lnunemp=ln(unemp)
base 10 logs: gen log10inf=log10(inf)
A final remark is that the name of a generated variable can have at
most 32 characters (or fewer, depending on the version of Stata you are
using); otherwise the system will give an error message. Nevertheless,
you can always (and, for convenience, probably should) describe your
variables in more detail using the commands label or notes.
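As a brief sketch of how label and notes might be used, assuming the variable gdpratio was generated as shown earlier (the label and note texts are just examples):

```stata
* Attach a descriptive label to the ratio created above
label variable gdpratio "Ratio of GDP growth to GDP per capita growth"
* Add a free-form note to the same variable
notes gdpratio: computed as gdpgr / gdpcapgr
* Check the result
describe gdpratio
notes list gdpratio
```

Labels appear in the output of describe and on graph axes, so a short variable name does not have to carry the full description.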
Exploring Graphical Resources
Suppose now you want to check the relationship among variables. For example, you want to see how much consumption and investment are correlated with GDP (all variables in growth rates). The command for that is:
graph twoway (scatter gdpgr year) (scatter consgr year) (scatter invgr year)
The output is a scatter of points for each series, with the
investment series being relatively higher than the other two series. If
you are interested in a line graph, the command is as follows:
graph twoway (line gdpgr year) (line consgr year) (line invgr year)
To plot specific ranges, add a title and label the axes:
graph twoway (line gdpgr year if year>=1993 & year<=2000) (line consgr year if year>=1993 & year<=2000) (line invgr year if year>=1993 & year<=2000), title("Figure 1. GDP, Cons, and Inv for Selected Years") xtitle("Year") ytitle("Growth Rate")
Finally, you can combine graphs in a single figure. For example,
suppose you would like to obtain a graphical diagnostic on the
relationship between GDP and consumption growth rates, GDP and
investment growth rates, GDP and productivity growth rates, and revisit
the relation between unemployment and inflation rates. The commands to
do that are as follows:
scatter gdpgr consgr, saving(part1)
scatter gdpgr invgr, saving(part2)
scatter gdpgr producgr, saving(part3)
scatter unemp inf, saving(part4)
graph combine part1.gph part2.gph part3.gph part4.gph
Through the commands above, you generated and saved four individual
graphs and combined them into a single figure. This is a very useful
tool for checking pairwise correlations among variables before
you run a regression.
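If you keep these commands in a do-file and run it more than once, the saving() option will stop with an error because the .gph files already exist. A minimal sketch of a rerunnable version follows. The output file name diagnostics.png is hypothetical, and PNG export is assumed to be available in your Stata installation.

```stata
* replace lets the do-file be rerun without a "file already exists" error
scatter gdpgr consgr, saving(part1, replace)
scatter gdpgr invgr, saving(part2, replace)
scatter gdpgr producgr, saving(part3, replace)
scatter unemp inf, saving(part4, replace)
graph combine part1.gph part2.gph part3.gph part4.gph
* Export the combined figure; diagnostics.png is a hypothetical file name
graph export "diagnostics.png", replace
```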
Linear Regression
As remarked above, before running a regression, it is recommended to check the cross correlation among covariates. You can do that graphically (see above) or using the following simple command:
correlate gdpgr consgr invgr unemp gdpcapgr inf
| gdpgr consgr invgr unemp gdpcapgr inf
-------------+---------------------------------------------------------------
gdpgr | 1.0000
consgr | 0.8394 1.0000
invgr | 0.9097 0.8270 1.0000
unemp | -0.3035 -0.4761 -0.3684 1.0000
gdpcapgr | 0.9890 0.8347 0.8841 -0.4143 1.0000
inf | -0.1012 -0.1198 -0.3090 0.3590 -0.1230 1.0000
From the matrix above you can see, for example, that GDP and GDP per capita growth rates are closely related, but each of them has a different degree of connection with unemployment rates (in fact, GDP per capita growth is more strongly correlated with unemployment than total GDP growth is: -0.41 versus -0.30). Inflation and unemployment show a moderate positive correlation (about 0.36).
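With only 11 observations, it may also be worth checking whether these correlations are statistically distinguishable from zero. One way to do this, sketched here with the same variable list, is the pwcorr command with its sig option:

```stata
* sig prints the p-value under each correlation coefficient
pwcorr gdpgr consgr invgr unemp gdpcapgr inf, sig
```

Unlike correlate, pwcorr uses all available observations for each pair of variables, which makes no difference here because the data set has no missing values.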
Now you can start with simple linear regressions. For example, let's
check the individual regressions of GDP growth on consumption and
investment growth rates:
regress gdpgr consgr
Source | SS df MS Number of obs = 11
-------------+------------------------------ F( 1, 9) = 21.46
Model | 7.78197201 1 7.78197201 Prob > F = 0.0012
Residual | 3.26348251 9 .362609168 R-squared = 0.7045
-------------+------------------------------ Adj R-squared = 0.6717
Total | 11.0454545 10 1.10454545 Root MSE = .60217
------------------------------------------------------------------------------
gdpgr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
consgr | .8525216 .1840263 4.63 0.001 .4362251 1.268818
_cons | .3558076 .6949943 0.51 0.621 -1.216379 1.927994
------------------------------------------------------------------------------
regress gdpgr invgr
Source | SS df MS Number of obs = 11
-------------+------------------------------ F( 1, 9) = 43.22
Model | 9.14164404 1 9.14164404 Prob > F = 0.0001
Residual | 1.90381048 9 .211534498 R-squared = 0.8276
-------------+------------------------------ Adj R-squared = 0.8085
Total | 11.0454545 10 1.10454545 Root MSE = .45993
------------------------------------------------------------------------------
gdpgr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
invgr | .3969137 .0603774 6.57 0.000 .2603305 .5334969
_cons | .7032821 .4422039 1.59 0.146 -.2970526 1.703617
------------------------------------------------------------------------------
Please note that you don't need to include the intercept explicitly,
because Stata includes it automatically. The output above contains the
main regression diagnostics (ANOVA table, adjusted R-squared,
t-statistics, sample size, etc.). The same rule applies to multiple
linear regressions. For example, suppose you want to find the main
sources of GDP growth. You type:
regress gdpgr consgr invgr producgr unemp inf
Source | SS df MS Number of obs = 11
-------------+------------------------------ F( 5, 5) = 7.27
Model | 9.70924721 5 1.94184944 Prob > F = 0.0242
Residual | 1.33620731 5 .267241462 R-squared = 0.8790
-------------+------------------------------ Adj R-squared = 0.7581
Total | 11.0454545 10 1.10454545 Root MSE = .51695
------------------------------------------------------------------------------
gdpgr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
consgr | .1822094 .3605194 0.51 0.635 -.7445351 1.108954
invgr | .3448859 .1338048 2.58 0.050 .0009296 .6888422
producgr | .0490201 .1547288 0.32 0.764 -.3487228 .4467631
unemp | .0551669 .1897954 0.29 0.783 -.4327176 .5430514
inf | .3019558 .372596 0.81 0.455 -.6558326 1.259744
_cons | -.8865854 1.492931 -0.59 0.578 -4.724287 2.951116
------------------------------------------------------------------------------
In the example above, although we have a high adjusted R-squared, most of the covariates are not significant at the 5% level (only the investment coefficient is, and only marginally, with a p-value of 0.050). There may be many problems in the regression above. In the Econ 508 classes you will learn how to solve most of those problems, including how to select the best specification for a model.
You can also run a log-linear regression after transforming each variable into a natural log scale. To do so, you type:
gen lngdpgr=ln(gdpgr)
gen lnconsgr=ln(consgr)
gen lninvgr=ln(invgr)
gen lnproduc=ln(producgr)
gen lnunemp=ln(unemp)
gen lninf=ln(inf)
regress lngdpgr lnconsgr lninvgr lnproduc lnunemp lninf
Source | SS df MS Number of obs = 11
-------------+------------------------------ F( 5, 5) = 7.19
Model | 1.07467131 5 .214934262 Prob > F = 0.0247
Residual | .149400242 5 .029880048 R-squared = 0.8779
-------------+------------------------------ Adj R-squared = 0.7559
Total | 1.22407155 10 .122407155 Root MSE = .17286
------------------------------------------------------------------------------
lngdpgr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lnconsgr | .114882 .4666926 0.25 0.815 -1.08479 1.314554
lninvgr | .779761 .3081229 2.53 0.052 -.0122942 1.571816
lnproduc | .0950277 .1935535 0.49 0.644 -.4025174 .5925728
lnunemp | .2009322 .3716735 0.54 0.612 -.7544849 1.156349
lninf | .1184624 .2785439 0.43 0.688 -.5975574 .8344822
_cons | -.9912522 .787582 -1.26 0.264 -3.015796 1.033292
------------------------------------------------------------------------------
Finally, you can generate predicted values of the dependent variable and of the residuals, and plot them:
predict lngdpfit
scatter lngdpfit year
predict lngdpres, resid
scatter lngdpres year
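Two small variations on these plots can make them easier to read. The sketch below assumes the log-linear regression is still the most recent estimation command and that lngdpfit and lngdpres were created with predict as above.

```stata
* Residuals over time with a horizontal reference line at zero
scatter lngdpres year, yline(0)
* Fitted against actual values; the line plots lngdpgr against itself,
* so it is the 45-degree line, and points close to it are well fitted
twoway (scatter lngdpfit lngdpgr) (line lngdpgr lngdpgr, sort)
```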
Linear Hypothesis Testing
After running the regressions above, we can proceed with tests of
linear hypotheses on the coefficients. For example, suppose you would
like to be sure that investment growth "matters" to GDP growth.
Thus, you proceed with:
test lninvgr
( 1) lninvgr = 0
F( 1, 5) = 6.40
Prob > F = 0.0525
You just performed an F-test of the null hypothesis lninvgr=0 against the alternative lninvgr ~= 0. With a single restriction, the computed F-statistic is the square of the familiar t-statistic (2.53 squared is about 6.40). The p-value of 0.0525 means that the coefficient on investment growth (in logs) is significantly different from zero at the 10% level but, narrowly, not at the 5% level. This gives some evidence that investment growth helps explain the variation in GDP growth rates (in logs).
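The link between the F-statistic and the t-statistic can be checked directly from the stored estimation results. This is a small sketch, assuming the log-linear regression is still the most recent estimation command; _b[] and _se[] hold the coefficient and its standard error, and Ftail() returns the upper-tail probability of the F distribution.

```stata
* Square of the t-statistic on lninvgr; should match the F-statistic above
display (_b[lninvgr] / _se[lninvgr])^2
* Upper-tail probability of an F(1, 5) distribution at the reported value
display Ftail(1, 5, 6.40)
```

Both numbers should agree with the test output up to rounding.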
To test the joint significance of two or more covariates, you type:
test lninvgr lnconsgr lnproduc
( 1) lninvgr = 0
( 2) lnconsgr = 0
( 3) lnproduc = 0
F( 3, 5) = 11.40
Prob > F = 0.0113
Here you are testing the null hypothesis that the coefficients on all three covariates are zero against the alternative that at least one of them is different from zero. With a p-value of 0.0113, you reject the null at the 5% level (though not at the 1% level), i.e., at least one of the coefficients is significantly different from zero. So, at least one of these covariates "matters" in explaining the variation in GDP growth rates (in logs) over the years.
You could also extend your tests and check the equality of
covariates. For example, suppose you would like to know if investments
and consumption have similar coefficients:
test lninvgr=lnconsgr
( 1) - lnconsgr + lninvgr = 0
F( 1, 5) = 0.79
Prob > F = 0.4143
This is equivalent to testing whether their difference is zero (null
hypothesis) or different from zero (alternative). The conclusion is
that, at the 5% significance level, we cannot reject the null hypothesis
that the two coefficients are equal.
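If you also want an estimate of the difference itself, with its standard error and confidence interval, the lincom command offers one way to get it. A minimal sketch, again assuming the log-linear regression is the most recent estimation command:

```stata
* Point estimate, standard error, t-test and confidence interval
* for the difference between the two coefficients
lincom lninvgr - lnconsgr
```

The square of the t-statistic reported by lincom should equal the F-statistic from the test command above.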
Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu