e-TA 11: Panel Data: Hausman-Taylor Approach
Welcome to a new issue of e-Tutorial. Here we will apply the Hausman-Taylor (1981) instrumental variables approach to the phuzics data of Problem Set 4. The estimation strategy is explained in Prof. Koenker’s Lecture Note 17.
Data
The first thing you need is to download the phuzics panel data set, called phuzics10.txt from the Econ 508 web site. Save it in your preferred directory.
The next step is loading the data into R. If you saved it on your hard drive, you can load it by typing:
phuzics<-read.table("C:/econ508/phuzics10.txt", header=T, sep="")
or you can retrieve it from the web
phuzics<-read.table("http://www.econ.uiuc.edu/~econ508/data/phuzics10.txt", header=T, sep="")
Next you need to extract each variable from the data set:
id<-phuzics$id # id : person identifier
yr<-phuzics$yr # yr : current year - 1900
phd<-phuzics$phd # phd : year of PhD - 1900
sex<-phuzics$sex # sex : gender (female==1)
rphd<-phuzics$rphd # rphd: rank of PhD
ru<-phuzics$ru # ru : dummy for research university(res.pos.==1)
y<-phuzics$y # y : page equivalent in current year
Y<-phuzics$Y # Y : discounted cumulative page equivalent
s<-phuzics$s # s : current annual salary
The first step towards the panel data estimation is to transform your data into group means and deviations from group means. To do so we use the `PQ` function introduced in the previous e-TA, available at the Econ 508 webpage (Routines, panel.R):
"PQ" <-function(h, id){
if(is.vector(h)) h <- matrix(h, ncol = 1)
Ph <- unique(id)
Ph <- cbind(Ph, table(id))
for(i in 1:ncol(h)) Ph <- cbind(Ph, tapply(h[, i], id, mean))
is <- tapply(id, id)
Ph <- Ph[is, - (1:2)]
Qh <- h - Ph
list(Ph=as.matrix(Ph), Qh=as.matrix(Qh), is=is)
}
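To see what the between/within split computed by `PQ` looks like, here is a minimal toy check using only base R (the numbers and names are illustrative, not from the phuzics data; `ave` returns each group's mean repeated to the full length of the vector):

```r
# Toy illustration of the decomposition PQ performs:
# Ph holds each observation's group mean, Qh its deviation from that mean
id <- c(1, 1, 1, 2, 2)
h  <- c(2, 4, 6, 10, 20)
Ph <- ave(h, id)   # group means, repeated to full length
Qh <- h - Ph       # within-group deviations
Ph  # 4 4 4 15 15
Qh  # -2 0 2 -5 5
```

By construction the deviations `Qh` average to zero within each group, which is exactly the property the within (fixed-effects) transformation exploits.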
You should apply this function to all variables used in your estimations. For example, you will see that the `PQ` function is used inside the function `htiv`, which runs the Hausman-Taylor instrumental variables estimator.
The second step is to write your own program in order to compute the HTIV estimators. The Econ 508 webpage (Routines, panel.R) provides a function called htiv
for this. You can download the file in the same way you did above.
"htiv" <-function(x, y, id, d, z = NULL){
#input:
# x design matrix partitioned as given in d
# y response vector
# id strata indicator
# d list of column numbers indicating partitioning of x
# x[,d[[1]]] is x1 -- exogonous time varying vars
# x[,d[[2]]] is x2 -- endogonous time varying vars
# x[,d[[3]]] is z1 -- exogonous time invariant vars
# x[,d[[4]]] is z2 -- endogonous time invariant vars
# z may contain excluded exogonous variables if there are any
# NB. intercept is automatically included
x <- as.matrix(cbind(x, 1))
Tx <- PQ(x, id)
d[[3]] <- c(d[[3]], dim(x)[2])
Z <- cbind(z, Tx$Ph[, d[[1]]], Tx$Qh[, d[[1]]], Tx$Qh[, d[[2]]],
x[, d[[ 3]]])
r <- tsls(x, Z, y, int = F)$resid
Ti <- table(id)
Ti.inv <- 1/table(id)
rdot2 <- tapply(r, id, mean)^2
v <- lm(rdot2~Ti.inv)
v <- v$coef
theta <- as.vector(sqrt(v[2]/(v[2] + v[1] * Ti[Tx$is])))
x <- x - (1 - theta) * Tx$Ph
y <- y - (1 - theta) * PQ(y, id)$Ph
fit <- tsls(x, Z, y, int = F)
list(fit=fit,v=v)
}
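Inside `htiv`, the quasi-demeaning weight \(\theta_i = \sqrt{v_2/(v_2 + v_1 T_i)}\) is built from the two coefficients of `lm(rdot2 ~ Ti.inv)`. A small sketch with made-up variance components (`v1` and `v2` below are illustrative numbers, not estimates from the phuzics data) shows how the weight behaves as group size changes:

```r
# Illustrative variance components standing in for the intercept (v1) and
# slope (v2) of lm(rdot2 ~ Ti.inv) inside htiv
v1 <- 0.037
v2 <- 0.134
Ti <- c(2, 5, 10, 20)              # hypothetical group sizes T_i
theta <- sqrt(v2 / (v2 + v1 * Ti))
round(theta, 3)
# theta lies strictly between 0 and 1 and falls as T_i grows: longer panels
# are demeaned more heavily (closer to the within/fixed-effects transform),
# while theta near 1 leaves the data almost untouched (closer to pooled OLS)
```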
Some details must be explained, though:
- If you haven’t run `PQ` and `tsls` yet, please do so now; otherwise `htiv` will not work.
"tsls" <-function(x, z, y, int = T){
# two stage least squares
if(int){
x <- cbind(1, x)
z <- cbind(1, z)
}
xhat <- lm(x~z-1)$fitted.values
R <- lm(y ~ xhat -1)
R$residuals <- c(y - x %*% R$coef)
return(R)
}
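As a quick sanity check on `tsls` (reproduced below so the snippet is self-contained), note that when every regressor serves as its own instrument the two-stage estimator collapses to OLS. The simulated data are purely illustrative:

```r
# When z = x, the first stage reproduces x exactly, so 2SLS equals OLS
tsls <- function(x, z, y, int = TRUE){
  if(int){
    x <- cbind(1, x)
    z <- cbind(1, z)
  }
  xhat <- lm(x ~ z - 1)$fitted.values
  R <- lm(y ~ xhat - 1)
  R$residuals <- c(y - x %*% R$coef)
  return(R)
}
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
b_tsls <- coef(tsls(x, x, y))   # (intercept, slope) from 2SLS
b_ols  <- coef(lm(y ~ x))       # (intercept, slope) from OLS
all.equal(unname(b_tsls), unname(b_ols))  # TRUE
```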
- The most important detail: the user must specify the model, create the new variables, and decide which variables will be included in the regression and/or treated as instruments.
Thus, it is essential to read Koenker’s Lecture 17 and Hausman-Taylor (1981), together with a careful reading of PS4 and the auxiliary papers, in order to understand what the function is doing and how you need to adjust it.
Estimating Phuzicists Productivity
In Problem Set 4 you are asked to explore “the phuzical revolution”. We will use this setting to see Hausman and Taylor’s approach at work. The model suggested in the Hints of the problem set is:
\[ \log y_{it} = \sum_{s=1}^{q} \rho_s \log y_{it-s} + f(t, t_{0i}, t - t_{0i}, r_{i}) + u_{it} \]
so a working model may take the form:
\[ \log y_{it} = \beta_0 + \sum_{s=1}^{2} \rho_s \log y_{it-s} + \beta_1 e_{it} + \beta_2 e_{it}^2 + \beta_3 \frac{1}{e_{it} \times r_{i}} + \beta_4 gender_{i} + \beta_5 d80_i + \alpha_{i} + u_{it} \]
To estimate this model in R we must first arrange and “create” the variables needed. The first and most cumbersome part is creating the lagged variables. One way to do it is:
n <- length(phuzics[,1])
h <- as.matrix(phuzics)
h <- cbind(h[3:n,],h[3:n,]-h[2:(n-1),],h[2:(n-1),]-h[1:(n-2),]) #difference
h <- h[h[,10]==0,] #ignore obs whose first diff confounds people
h <- h[h[,19]==0,] #ignore obs whose second diff confounds people
h <- cbind(h[,1:9],h[,7:9]-h[,16:18],h[,7:9]-h[,16:18]-h[,25:27])
h <- cbind(h[,1:4],h[,2]-h[,3],(h[,2]-h[,3])^2,h[,5:15])
colnames(h)<- c("id","yr","phd","sex","exp","exp^2","rphd","ru", "y","Y","s","y_1","Y_1","s_1","y_2","Y_2","s_2")
To understand what this code is doing, we can analyze it line by line. First suppose you have the following data set, sorted by id and, within each id, in chronological order (earliest year first):
id | yr | y |
---|---|---|
1 | t-4 | \(y_{1t-4}\) |
1 | t-3 | \(y_{1t-3}\) |
1 | t-2 | \(y_{1t-2}\) |
1 | t-1 | \(y_{1t-1}\) |
1 | t | \(y_{1t}\) |
2 | t-3 | \(y_{2t-3}\) |
2 | t-2 | \(y_{2t-2}\) |
2 | t-1 | \(y_{2t-1}\) |
2 | t | \(y_{2t}\) |
First we declare the data frame to be a matrix, to make some calculations easier. The next line:
h <- cbind(h[3:n,],h[3:n,]-h[2:(n-1),],h[2:(n-1),]-h[1:(n-2),])
takes differences; the data now looks like:
id | yr | y | Δid | Δyr | Δy (first diff, \(h[3:n,]-h[2:(n-1),]\)) | Δid | Δyr | Δy (second diff, \(h[2:(n-1),]-h[1:(n-2),]\)) |
---|---|---|---|---|---|---|---|---|
1 | t-2 | \(y_{1t-2}\) | 0 | 1 | \(y_{1t-2}-y_{1t-3}\) | 0 | 1 | \(y_{1t-3}-y_{1t-4}\) |
1 | t-1 | \(y_{1t-1}\) | 0 | 1 | \(y_{1t-1}-y_{1t-2}\) | 0 | 1 | \(y_{1t-2}-y_{1t-3}\) |
1 | t | \(y_{1t}\) | 0 | 1 | \(y_{1t}-y_{1t-1}\) | 0 | 1 | \(y_{1t-1}-y_{1t-2}\) |
2 | t-3 | \(y_{2t-3}\) | 1 | 3 | \(y_{2t-3}-y_{1t}\) | 0 | 1 | \(y_{1t}-y_{1t-1}\) |
2 | t-2 | \(y_{2t-2}\) | 0 | 1 | \(y_{2t-2}-y_{2t-1}\) | 1 | 3 | \(y_{2t-3}-y_{1t}\) |
2 | t-1 | \(y_{2t-1}\) | 0 | 1 | \(y_{2t-1}-y_{2t-2}\) | 0 | 1 | \(y_{2t-2}-y_{2t-3}\) |
2 | t | \(y_{2t}\) | 0 | 1 | \(y_{2t}-y_{2t-1}\) | 0 | 1 | \(y_{2t-1}-y_{2t-2}\) |
Next we have to get rid of those rows where we are confounding people:
h <- h[h[,10]==0,] #ignore obs whose first diff confounds people
h <- h[h[,19]==0,] #ignore obs whose second diff confounds people
So now the data looks as follows; note that we are discarding those rows where the difference in id is not zero:
id | yr | y | Δid | Δyr | Δy (first diff, \(h[3:n,]-h[2:(n-1),]\)) | Δid | Δyr | Δy (second diff, \(h[2:(n-1),]-h[1:(n-2),]\)) |
---|---|---|---|---|---|---|---|---|
1 | t-2 | \(y_{1t-2}\) | 0 | 1 | \(y_{1t-2}-y_{1t-3}\) | 0 | 1 | \(y_{1t-3}-y_{1t-4}\) |
1 | t-1 | \(y_{1t-1}\) | 0 | 1 | \(y_{1t-1}-y_{1t-2}\) | 0 | 1 | \(y_{1t-2}-y_{1t-3}\) |
1 | t | \(y_{1t}\) | 0 | 1 | \(y_{1t}-y_{1t-1}\) | 0 | 1 | \(y_{1t-1}-y_{1t-2}\) |
2 | t-1 | \(y_{2t-1}\) | 0 | 1 | \(y_{2t-1}-y_{2t-2}\) | 0 | 1 | \(y_{2t-2}-y_{2t-3}\) |
2 | t | \(y_{2t}\) | 0 | 1 | \(y_{2t}-y_{2t-1}\) | 0 | 1 | \(y_{2t-1}-y_{2t-2}\) |
The next line of code differences once more, but now between columns, to obtain the desired lags:
h <- cbind(h[,1:9],h[,7:9]-h[,16:18],h[,7:9]-h[,16:18]-h[,25:27])
The data now looks like:
id | yr | y | Δid | Δyr | \(h[,7:9]-h[,16:18]\) | Δid | Δyr | \(h[,7:9]-h[,16:18]-h[,25:27]\) |
---|---|---|---|---|---|---|---|---|
1 | t-2 | \(y_{1t-2}\) | 0 | 1 | \(y_{1t-2}-\left(y_{1t-2}-y_{1t-3}\right)=y_{1t-3}\) | 0 | 1 | \(y_{1t-2}-\left(y_{1t-2}-y_{1t-3}\right)-\left(y_{1t-3}-y_{1t-4}\right)=y_{1t-4}\) |
1 | t-1 | \(y_{1t-1}\) | 0 | 1 | \(y_{1t-2}\) | 0 | 1 | \(y_{1t-3}\) |
1 | t | \(y_{1t}\) | 0 | 1 | \(y_{1t-1}\) | 0 | 1 | \(y_{1t-2}\) |
2 | t-1 | \(y_{2t-1}\) | 0 | 1 | \(y_{2t-2}\) | 0 | 1 | \(y_{2t-3}\) |
2 | t | \(y_{2t}\) | 0 | 1 | \(y_{2t-1}\) | 0 | 1 | \(y_{2t-2}\) |
So we have created the lags we wanted. The last two lines are then straightforward.
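The diffing trick can be verified on a tiny hand-made panel (the object `th` and its numbers are toy values, not the phuzics data): after the two filters, subtracting the stored differences from the current value recovers the first and second lags exactly.

```r
# Toy verification of the lag-construction trick on a small sorted panel
th <- cbind(id = c(1, 1, 1, 1, 2, 2, 2),
            yr = c(1, 2, 3, 4, 1, 2, 3),
            y  = c(10, 11, 13, 16, 20, 22, 25))
n  <- nrow(th)
th <- cbind(th[3:n, ], th[3:n, ] - th[2:(n-1), ], th[2:(n-1), ] - th[1:(n-2), ])
th <- th[th[, 4] == 0, ]   # first difference must stay within one person
th <- th[th[, 7] == 0, ]   # and so must the second
y_1 <- th[, 3] - th[, 6]             # y minus its first difference = first lag
y_2 <- th[, 3] - th[, 6] - th[, 9]   # also subtract the second difference = second lag
cbind(th[, 1:3], y_1, y_2)
```

Here columns 4 and 7 hold the within-row differences of id, playing the role of columns 10 and 19 in the full 9-variable phuzics matrix.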
A second, somewhat simpler approach is to take advantage of the package DataCombine, which is basically a wrapper for dplyr, a very powerful package for data management:
require(DataCombine)
h2<-phuzics
h2<-h2[order(id,-yr),] #have to order by group and time
h2<-slide(data=h2,Var="y", GroupVar="id", NewVar="y_1", slideBy=1) #first lag
h2<-h2[!(h2$id==209 | h2$id==212 | h2$id==378),] #get rid of the groups with one observation
h2<-slide(data=h2,Var="y", GroupVar="id", NewVar="y_2", slideBy=2)
h2<-na.omit(h2) #get rid of NA
h2$exp<-h2$yr-h2$phd
h2$exp.2<-(h2$yr-h2$phd)^2
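If you prefer to avoid extra packages, the same group-wise lags can be built in base R with `ave` and a small shift helper (shown on a toy data frame; `shift1` is a helper name of our own, not a base R function):

```r
# Base-R alternative: lag a variable within groups using ave
toy <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  yr = rep(1:3, 2),
                  y  = c(10, 11, 13, 20, 22, 25))
toy <- toy[order(toy$id, toy$yr), ]         # sort by group and time
shift1 <- function(v) c(NA, v[-length(v)])  # one-period lag of a vector
toy$y_1 <- ave(toy$y, toy$id, FUN = shift1)
toy$y_2 <- ave(toy$y, toy$id, FUN = function(v) shift1(shift1(v)))
toy <- na.omit(toy)                         # keep rows where both lags exist
toy
```

`ave` applies the shift within each id separately, so lags never leak across people, and `na.omit` plays the role of the id filters above.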
You can inspect and compare both approaches. Next we need to create a couple of variables before running our estimation. First we take the logarithm of our dependent variable:
y <- log(h[,"y"])
Next we form our \(X\) matrix.
x <- h[,c(12,15,5:7,3,4)]
x[,1:2] <- log(x[,1:2])
x[,6] <- as.numeric(x[,6]>80) #vintage effect of phd
x[,5] <- 1/(x[,3]*x[,5]) #interaction experience and rank
id <- h[,1]
colnames(x) <- c("y_1","y_2","exp","exp2","ier","phd","sex")
head(x)
y_1 y_2 exp exp2 ier phd sex
[1,] 2.708050 1.098612 5 25 0.10000000 1 0
[2,] 2.995732 2.708050 6 36 0.08333333 1 0
[3,] 3.218876 2.995732 7 49 0.07142857 1 0
[4,] 2.833213 3.218876 8 64 0.06250000 1 0
[5,] 2.302585 2.833213 9 81 0.05555556 1 0
[6,] 2.197225 2.302585 10 100 0.05000000 1 0
Now we are ready to apply the HTIV approach. Note that the `htiv` function takes 5 arguments:
- x, the design matrix
- y, the dependent variable
- id, which indicates the group
- d, a list indicating how \(X\) is partitioned:
  - the first element gives the column positions of the exogenous time-varying variables
  - the second gives the positions of the endogenous time-varying variables
  - the third gives the positions of the exogenous time-invariant variables
  - the fourth gives the positions of the endogenous time-invariant variables
- z, which may contain excluded exogenous variables; the default is NULL
vl <- list(3:4,c(1:2,5),6:7,NULL)
v <- htiv(x,y,id,vl)
print(v$v)
(Intercept) Ti.inv
0.03745804 0.13370602
print(summary(v$fit))
Call:
lm(formula = y ~ xhat - 1)
Residuals:
Min 1Q Median 3Q Max
-1.49494 -0.30204 -0.01099 0.29573 1.76926
Coefficients:
Estimate Std. Error t value Pr(>|t|)
xhaty_1 0.5432496 0.0124239 43.726 < 2e-16 ***
xhaty_2 -0.4260759 0.0118581 -35.931 < 2e-16 ***
xhatexp 0.1549161 0.0058075 26.675 < 2e-16 ***
xhatexp2 -0.0031500 0.0001298 -24.263 < 2e-16 ***
xhatier 2.1740321 0.4390081 4.952 7.56e-07 ***
xhatphd 0.0096239 0.0311961 0.308 0.758
xhatsex -0.0134985 0.0403353 -0.335 0.738
xhat 1.2470580 0.0734263 16.984 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4425 on 5440 degrees of freedom
Multiple R-squared: 0.9158, Adjusted R-squared: 0.9157
F-statistic: 7401 on 8 and 5440 DF, p-value: < 2.2e-16
References:
Hausman, Jerry, 1978, “Specification Tests in Econometrics,” Econometrica, 46, pp. 1251-1271.
Hausman, Jerry, and William Taylor, 1981, “Panel Data and Unobservable Individual Effects,” Econometrica, 49, No. 6, pp. 1377-1398.
Koenker, Roger, 2014, “Panel Data,” Lecture 17, mimeo, University of Illinois at Urbana-Champaign.
Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu