e-TA 13: Cubic B-Splines and Quantile Regression
Welcome to a new issue of e-Tutorial. This e-TA will focus on Cubic B-Splines and Quantile Regression.1
Data
You can download the data set, called weco14.txt, from the Econ 508 web site. Save it in your preferred directory.
Then you can load it into R. If you saved it on your hard drive in a folder called econ508, you can load it by typing:
weco<-read.table("C:/econ508/weco14.txt", header=T, sep="")
or you can retrieve it from the web
weco<-read.table("http://www.econ.uiuc.edu/~econ508/data/weco14.txt", header=T, sep="")
head(weco)
y sex dex lex kwit job_tenure status
1 13.72579 0 38 10 FALSE 277 TRUE
2 17.15373 1 55 11 TRUE 173 TRUE
3 13.62992 1 45 12 FALSE 410 TRUE
4 13.03998 1 41 11 FALSE 247 TRUE
5 13.19510 1 42 10 FALSE 340 TRUE
6 16.43010 0 38 12 FALSE 517 TRUE
Cubic B-Splines
First we begin by estimating the model proposed in question 1 of PS5
\[ y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + \alpha_{3} lex + \alpha_{4} lex^2 + u \]
To estimate this model first we need to create \(lex^2\)
weco$lex2<-weco$lex^2
and then we are ready to estimate the model.
fit <- lm(y ~ sex + dex + lex + lex2, data = weco)
summary(fit)
Call:
lm(formula = y ~ sex + dex + lex + lex2, data = weco)
Residuals:
Min 1Q Median 3Q Max
-3.9094 -0.7227 0.0357 0.7647 3.1867
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.527936 2.031914 2.721 0.00668 **
sex -0.900641 0.087489 -10.294 < 2e-16 ***
dex 0.112088 0.006003 18.671 < 2e-16 ***
lex 0.821391 0.321307 2.556 0.01079 *
lex2 -0.036032 0.012808 -2.813 0.00505 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.124 on 678 degrees of freedom
Multiple R-squared: 0.3883, Adjusted R-squared: 0.3847
F-statistic: 107.6 on 4 and 678 DF, p-value: < 2.2e-16
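Since the coefficient on lex2 is negative, the fitted quadratic in lex has an interior maximum at \(-\hat{\alpha}_3/(2\hat{\alpha}_4)\). A quick check using the coefficient estimates reported above:

```r
# Turning point of the quadratic in lex: the fitted effect of lex peaks where
# d/d(lex) [a3*lex + a4*lex^2] = a3 + 2*a4*lex = 0, i.e. lex = -a3/(2*a4).
a3 <- 0.821391   # coefficient on lex (from the summary above)
a4 <- -0.036032  # coefficient on lex2
turning_point <- -a3 / (2 * a4)
turning_point  # about 11.4 years of education
```

So productivity is estimated to increase in lex up to roughly 11.4 years and decline thereafter.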
Next we estimate a “(more) nonparametric version” using Cubic B-Splines. To do so we first need to load the splines package:
library(splines)
Then we are ready to estimate a model of the form
\[ y = \alpha_{0} + \alpha_{1} sex + \alpha_{2} dex + g(lex, \alpha) + u \]
where \(g(\cdot)\) is a spline. In R we write:
spline <- lm(y ~ sex+dex+bs(lex), data=weco)
summary(spline)
Call:
lm(formula = y ~ sex + dex + bs(lex), data = weco)
Residuals:
Min 1Q Median 3Q Max
-3.9038 -0.7319 0.0386 0.7451 3.2079
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.938776 0.450386 22.067 <2e-16 ***
sex -0.897106 0.087714 -10.228 <2e-16 ***
dex 0.112310 0.006017 18.667 <2e-16 ***
bs(lex)1 0.367654 0.961526 0.382 0.702
bs(lex)2 0.740328 0.767352 0.965 0.335
bs(lex)3 -2.289159 1.123632 -2.037 0.042 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.124 on 677 degrees of freedom
Multiple R-squared: 0.3886, Adjusted R-squared: 0.3841
F-statistic: 86.06 on 5 and 677 DF, p-value: < 2.2e-16
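Behind the scenes, bs(lex) expands lex into a small basis matrix whose columns enter the regression as ordinary covariates; the coefficients bs(lex)1 through bs(lex)3 above correspond to those columns. A quick sketch (the grid of x-values here is just illustrative):

```r
library(splines)
# With no interior knots and the default degree (3), bs() returns a cubic
# polynomial basis with df = 3 columns; the intercept is handled by lm().
x <- seq(8, 18, length.out = 100)  # a grid spanning typical lex values
B <- bs(x)
dim(B)  # 100 rows, 3 columns -- one column per bs(lex) coefficient above
```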
Note that we have specified neither the number of knots, nor their locations, nor the degree. However, this can easily be done:
number.knots <- 4
sp <- seq(from=min(weco$lex),to=max(weco$lex),length=number.knots+2)[2:(number.knots+1)]
spline2 <- lm(y ~ sex+dex+ bs(lex, knots=sp, degree = 2), data=weco)
summary(spline2)
Call:
lm(formula = y ~ sex + dex + bs(lex, knots = sp, degree = 2),
data = weco)
Residuals:
Min 1Q Median 3Q Max
-3.9089 -0.7290 0.0341 0.7624 3.2233
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.196385 0.582351 17.509 <2e-16 ***
sex -0.894068 0.088057 -10.153 <2e-16 ***
dex 0.112072 0.006036 18.568 <2e-16 ***
bs(lex, knots = sp, degree = 2)1 -0.296163 0.671473 -0.441 0.659
bs(lex, knots = sp, degree = 2)2 0.050044 0.528230 0.095 0.925
bs(lex, knots = sp, degree = 2)3 -0.082151 0.573339 -0.143 0.886
bs(lex, knots = sp, degree = 2)4 -0.594566 0.700694 -0.849 0.396
bs(lex, knots = sp, degree = 2)5 -1.789575 1.836043 -0.975 0.330
bs(lex, knots = sp, degree = 2)6 -2.244919 1.245665 -1.802 0.072 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.126 on 674 degrees of freedom
Multiple R-squared: 0.3891, Adjusted R-squared: 0.3818
F-statistic: 53.65 on 8 and 674 DF, p-value: < 2.2e-16
To learn more about the bs() function, read the help file (?bs) carefully.
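The seq(...)[2:(number.knots+1)] construction above places the knots at equally spaced interior points of the range of lex, dropping the two endpoints. On a simple grid its effect is easy to verify:

```r
# Equally spaced interior knots: take number.knots + 2 equally spaced points
# over the range of x, then drop the first and last (the boundary points).
number.knots <- 4
x <- 0:10
sp <- seq(from = min(x), to = max(x), length = number.knots + 2)[2:(number.knots + 1)]
sp  # 2, 4, 6, 8 -- four interior knots, endpoints excluded
```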
You can also plot the data with the regression spline overlaid:
lex_val <- seq(from=min(weco$lex), to=max(weco$lex), length=1000)
weco_m<-data.frame(lex=lex_val, sex=1, dex=mean(weco$dex))
yhat_m<-predict(spline, newdata=weco_m)
weco_w<-data.frame(lex=lex_val, sex=0, dex=mean(weco$dex))
yhat_w<-predict(spline, newdata=weco_w)
plot(weco$lex, weco$y)
lines(weco_m$lex, yhat_m, col="blue")
lines(weco_w$lex, yhat_w, col="red")
Note that we have defined new data at which to evaluate our estimates, and used those values for the plot.
Smoothing Splines
Previously we used cubic B-splines to check the quadratic specification of the problem. In doing so we defined a set of knots, which produced a sequence of basis functions, and then used least squares to estimate the spline coefficients. In this subsection we look at a different approach. What we are doing, basically, is fitting a smooth curve: we want to find a function, say \(g(x)\), that fits the observed data well. One way to do so is to use smoothing splines, which find the function \(g\) that minimizes
\[ \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt \]
where \(\lambda\) is a nonnegative tuning parameter. The function \(g\) that minimizes this criterion is known as a smoothing spline. (You can read more about smoothing splines in “An Introduction to Statistical Learning” (2013) by James, Witten, Hastie and Tibshirani.)
In R the smooth.spline function is readily available. We can estimate the smoothing spline and plot it with the following lines:
plot(weco$lex,weco$y)
lines(smooth.spline(weco$lex,weco$y, tol=0.0001), col= "red")
Note that we added the tol option. By default, smooth.spline sets the tolerance to the IQR of x times a small number. In our case, however, the lex variable has an IQR of zero, so if we do not specify the tol option we get an error message.
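A minimal sketch of how smooth.spline behaves on simulated data (the data-generating process here is just an assumption for illustration): unless you fix the smoothing parameter yourself, the penalty \(\lambda\) is chosen automatically by (generalized) cross-validation.

```r
# Simulated example: smooth.spline picks the smoothing parameter lambda
# by generalized cross-validation unless spar or lambda is supplied.
set.seed(1)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)  # smooth signal plus noise
fit <- smooth.spline(x, y)   # GCV selects the penalty automatically
fit$lambda                   # the tuning parameter that was chosen
ghat <- predict(fit, x)$y    # fitted values of the smooth function g
```

Larger values of \(\lambda\) force \(g\) toward a straight line; smaller values let it interpolate the data more closely.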
Quantile Regression
In Question 2 of PS5 we are asked to consider a quantile regression model that relates productivity, sex, dex and lex. For example, we can think of a model of the form
\[ Q_{y_i}(\tau|sex,dex,lex) = \alpha_0(\tau) + \alpha_1(\tau)sex_i + \alpha_2(\tau)dex_i + \alpha_3(\tau)lex_i + \alpha_4(\tau)lex_i^2 \]
where \(Q_{y_i}(\tau|sex,dex,lex)\) is the \(\tau\)th conditional quantile. To estimate this model we first need to load the quantreg package:
require(quantreg)
Loading required package: quantreg
Loading required package: SparseM
Attaching package: 'SparseM'
The following object is masked from 'package:base':
backsolve
And then simply invoke the rq function:
quantiles<- rq(y~sex+dex+lex+lex2,data=weco, tau=1:9/10)
Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
summary(quantiles)
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.1
Coefficients:
coefficients lower bd upper bd
(Intercept) 5.62075 -4.07301 11.97694
sex -0.64677 -0.94562 -0.25659
dex 0.09684 0.08817 0.11135
lex 0.55009 -0.35755 2.17286
lex2 -0.02078 -0.08892 0.01371
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.2
Coefficients:
coefficients lower bd upper bd
(Intercept) 6.34850 0.03011 8.68471
sex -0.77493 -0.98150 -0.53937
dex 0.09881 0.08479 0.11955
lex 0.57540 0.23176 1.47909
lex2 -0.02409 -0.06233 -0.01049
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.3
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.97274 3.06279 8.39897
sex -0.76118 -0.99908 -0.57830
dex 0.10680 0.09359 0.12035
lex 0.79339 0.26424 1.02846
lex2 -0.03305 -0.04247 -0.01898
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.4
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.92704 0.36657 7.17517
sex -0.82228 -1.07392 -0.64663
dex 0.10970 0.09634 0.12242
lex 0.86859 0.51090 1.68904
lex2 -0.03717 -0.06287 -0.02413
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.5
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.92366 0.73782 9.04266
sex -0.89662 -1.17195 -0.75291
dex 0.11193 0.10428 0.12261
lex 0.92807 0.37136 1.52974
lex2 -0.04043 -0.06286 -0.02524
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.6
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.97463 0.52162 8.32821
sex -1.06129 -1.24953 -0.90259
dex 0.11337 0.10259 0.12758
lex 0.99550 0.49204 1.63279
lex2 -0.04390 -0.06810 -0.02558
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.7
Coefficients:
coefficients lower bd upper bd
(Intercept) 5.93979 3.18574 10.40087
sex -1.01875 -1.19764 -0.92246
dex 0.11995 0.10213 0.13310
lex 0.88328 0.17983 1.28561
lex2 -0.04162 -0.05794 -0.00579
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.8
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.93541 3.81895 7.93074
sex -0.98152 -1.15848 -0.86910
dex 0.11849 0.10980 0.13014
lex 1.11375 0.59723 1.28604
lex2 -0.05104 -0.05793 -0.02883
Call: rq(formula = y ~ sex + dex + lex + lex2, tau = 1:9/10, data = weco)
tau: [1] 0.9
Coefficients:
coefficients lower bd upper bd
(Intercept) 4.24133 2.36447 9.97680
sex -1.19893 -1.44073 -0.93744
dex 0.13548 0.10668 0.14862
lex 1.22332 -0.02308 1.43841
lex2 -0.05696 -0.06712 0.00823
Note that we have run the regression for \(\tau=0.1, 0.2, \ldots, 0.9\) simply by adding the option tau=1:9/10. You can obtain the usual plot by calling the plot function and adjusting the margins:
plot(summary(quantiles), mar=c(0.5,2,2,2))
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
Warning in rq.fit.br(x, y, tau = tau, ci = TRUE, ...): Solution may be
nonunique
You may have noticed some warning messages in the output; there is nothing to worry about. Prof. Koenker explains this in a very clear way: “When computing the median from a sample with an even number of distinct values there is inherently some ambiguity about its value: any value between the middle order statistics is ‘a’ median. Similarly, in regression settings the optimization problem solved by the ‘br’ version of the simplex algorithm, modified to do general quantile regression, identifies cases where there may be non uniqueness of this type. When there are continuous covariates this is quite rare, when covariates are discrete then it is relatively common, at least when tau is chosen from the rationals. For univariate quantiles R provides several methods of resolving this sort of ambiguity by interpolation, ‘br’ doesn’t try to do this, instead returning the first vertex solution that it comes to. Should we worry about this? My answer would be no…”
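The intuition behind rq is that the \(\tau\)th quantile minimizes the “check” loss \(\rho_\tau(u) = u(\tau - \mathbb{1}(u<0))\). A quick sketch on simulated data for an intercept-only model (this illustrates the objective function, not the simplex algorithm rq actually uses):

```r
# Quantile check loss: rho_tau(u) = u * (tau - 1(u < 0)).
rho <- function(u, tau) u * (tau - (u < 0))
set.seed(42)
y <- rnorm(501)
tau <- 0.5
# Minimizing the check loss over a constant recovers the sample quantile.
obj <- function(b) sum(rho(y - b, tau))
bhat <- optimize(obj, range(y))$minimum
c(bhat, quantile(y, tau))  # the two agree up to numerical tolerance
```

With tau = 0.5 this reproduces the median; other values of tau trace out the rest of the conditional distribution, which is exactly what the tau=1:9/10 fits above do with covariates.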
Please send comments to bottan2@illinois.edu or srmntbr2@illinois.edu