Reference: James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning (2013).
In this analysis we will work with the “College” dataset, which contains 777 observations of 18 variables. The goal is to predict how many applications a college will receive (Apps) based on the other predictors.
First examine the structure of the dataset:
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
All of the variables are numeric except for one factor, Private. Let’s build a model matrix, which automatically creates dummy variables; this matters because glmnet accepts only numeric predictors.
X = model.matrix(Apps ~ . - 1, data = College)[, -1]
Y = model.matrix(~ Apps + 0, data = College)
data_df <- as.data.frame(cbind(X, Y))
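As a quick sanity check (a minimal sketch, not run above), we can confirm that model.matrix expanded Private into a single dummy column:
head(colnames(X))                          # "PrivateYes" should be among the predictor names
table(College$Private, X[, "PrivateYes"])  # the dummy is 1 exactly when Private == "Yes"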
Since we will be testing the performance of multiple models, we need to split the data into a test and training set.
set.seed(1)
train <- sample(1:nrow(X), replace = FALSE, round(nrow(X) * .25))
X.train <- X[train, ]
Y.train <- Y[train]
X.test <- X[-train, ]
Y.test <- Y[-train]
Let’s fit a least-squares regression using the lm function. Its 17 slope coefficients (plus an intercept) are estimated by minimizing the RSS, the sum of squared residuals.
set.seed(1)
LM.mod <- lm(Apps ~ . , data = data_df, subset = train)
LM.pred <- predict(LM.mod, newdata = data_df[-train, ])
# Note: this is sqrt(SSE)/n, which is smaller than the conventional RMSE, sqrt(SSE/n), by a factor of 1/sqrt(n).
# It is used consistently for every model below, so comparisons between models are unaffected.
RMSE <- function(x){sqrt(sum((x - Y.test)^2))/length(x)}
Percent.Error <- function(x){RMSE(x)/mean(Y) * 100}
LM.RMSE <- RMSE(LM.pred)
OLS.Percent.Err <- Percent.Error(LM.pred)
my_results <- tibble(model = "OLS",
                     RMSE = LM.RMSE)
knitr::kable(my_results, digits = 3)
model | RMSE |
---|---|
OLS | 45.324 |
The RMSE estimates the typical deviation of our predictions from the observed values. Note that because the function above divides by n rather than by sqrt(n), the tabulated values are smaller than the conventional RMSE by a factor of sqrt(n); they remain directly comparable across models, which is how we use them. The percent error, the RMSE divided by the mean of Y, expresses the error relative to the average number of applications a college receives (about 3002); here it is roughly 1.51%, and it understates the typical relative prediction error for the same reason.
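For reference, the conventional RMSE divides the sum of squared errors by n before taking the square root. A minimal sketch of that version (not used in the tables below, which all rely on the function defined above):
RMSE.conventional <- function(pred){sqrt(mean((pred - Y.test)^2))}
RMSE.conventional(LM.pred)   # on the scale of the response, i.e., number of applications (output not shown)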
Ridge regression fits a linear model by minimizing the quantity \(RSS + \lambda \sum_{j=1}^p\beta_j^2\), where \(\lambda \sum_{j=1}^p\beta_j^2\), also written \(\lambda||\beta||_2^2\), is called a shrinkage penalty and \(\lambda \geq 0\) is a tuning parameter. At the extremes, \(\lambda = 0\) returns the least-squares estimate, and \(\lambda \rightarrow \infty\) returns the null model, in which all the coefficients are forced to \(0\). Note that shrinkage is not applied to the intercept; instead we first center the inputs and then estimate \(\beta_0\) by \(\overline{y}\). Scaling the predictors is also necessary because, unlike OLS, the ridge estimates are not equivariant to the scale of the predictors. Each predictor is standardized by dividing by its standard deviation: \(\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{N}\sum_{i=1}^N(x_{ij}-\overline{x}_j)^2}}\). The \(\tilde{x}_{ij}\) are called standardized predictors and are used to compute standardized coefficients.
Unlike least squares, which produces a single set of coefficients, ridge regression computes a path of coefficient vectors \(\beta^R_{\lambda}\), a sequence of \(p\)-dimensional vectors indexed by \(\lambda\). Cross-validation is then used to select the \(\lambda\) whose \(\beta^R_{\lambda}\) minimizes the out-of-sample error.
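To make the standardization concrete, here is a minimal sketch of computing the standardized predictors by hand, using the population standard deviation from the formula above (glmnet does the equivalent internally when standardize = TRUE):
pop.sd <- apply(X, 2, function(col) sqrt(mean((col - mean(col))^2)))  # denominator in the formula above
X.std  <- sweep(X, 2, pop.sd, "/")                                    # each x_ij divided by its column's standard deviation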
Let’s fit a ridge regression. Also, we can manually specify a grid of values for \(\lambda\):
set.seed(1)
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(X.train, Y.train, alpha = 0, lambda = grid)
# alpha = 0 specifies a ridge regression (alpha = 1 gives the lasso)
# glmnet centers and standardizes the predictors internally by default (standardize = TRUE)
The glmnet object has several useful accessor methods that return information about the model. coef returns a matrix of coefficients with 18 rows, one for the intercept plus one for each of the 17 predictors, and 100 columns, one per lambda value.
dim(coef(ridge.mod))
## [1] 18 100
The lambda attribute returns the lambda values we supplied. Let’s look at the coefficients calculated for the 10th, 50th, and 90th values of lambda.
el2norm <- function(x){sqrt(sum(x^2))}
el2norm.r <- function(col){el2norm(coef(ridge.mod)[, col])}
coef(ridge.mod)[,c(10,50,90)] %>% rbind( map_dbl(c(10,50,90), el2norm.r) ) %>% rbind(ridge.mod$lambda[c(10,50,90)])
## 20 x 3 sparse Matrix of class "dgCMatrix"
## s9 s49 s89
## (Intercept) 3.118028e+03 -2.415997e+03 231.81608434
## PrivateYes -2.696165e-02 -4.964185e+02 -661.19247088
## Accept 9.715573e-06 3.134358e-01 1.77826472
## Enroll 2.662444e-05 6.947055e-01 -1.27533920
## Top10perc 6.794253e-04 1.488282e+01 86.92547784
## Top25perc 5.983285e-04 1.088166e+01 -34.94138769
## F.Undergrad 5.199477e-06 1.273835e-01 0.05419474
## P.Undergrad 1.213187e-05 2.309573e-01 -0.03056227
## Outstate 3.958645e-07 7.614102e-03 -0.12828208
## Room.Board 4.759186e-06 1.399616e-01 0.28313187
## Books 3.523183e-05 5.714569e-01 0.26976880
## Personal 8.369619e-06 8.288952e-02 0.11487261
## PhD 6.726713e-04 8.114758e+00 -12.36805850
## Terminal 7.243822e-04 7.233512e+00 2.82203511
## S.F.Ratio 5.516312e-04 1.154257e+01 -13.51179410
## perc.alumni -3.098273e-04 -8.365358e+00 -1.81119850
## Expend 1.634436e-06 3.524842e-02 0.02118322
## Grad.Rate 3.206191e-04 8.064134e+00 8.80381612
## 3.118028e+03 2.466617e+03 707.19190066
## 8.111308e+08 1.149757e+04 0.16297508
The last two rows are the l-2-norm of the coefficient vector (including the intercept) and the corresponding \(\lambda\) value. Shrinkage is strongest at the largest \(\lambda\) (about \(8\times10^8\)), where the slope coefficients are essentially zero, and weakest at \(\lambda \approx 0.16\), where the individual coefficients are largest. The norm shown decreases with \(\lambda\) only because it is dominated by the intercept.
We can also use the predict method to estimate the coefficients at a new value of \(\lambda\), say \(\lambda = 80\):
predict(ridge.mod, s=80, type="coefficients")
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.698107e+02
## PrivateYes -6.121242e+02
## Accept 1.593882e+00
## Enroll -5.005744e-01
## Top10perc 7.060799e+01
## Top25perc -2.290605e+01
## F.Undergrad 3.085139e-03
## P.Undergrad -1.522212e-04
## Outstate -1.151999e-01
## Room.Board 3.024099e-01
## Books 2.723027e-01
## Personal 6.011289e-02
## PhD -1.064534e+01
## Terminal -5.831304e-01
## S.F.Ratio -9.814125e+00
## perc.alumni -3.127758e+00
## Expend 2.834260e-02
## Grad.Rate 1.017593e+01
Let’s also look at the %Dev column when we print the model directly.
# print(ridge.mod) output is too long
%Dev is the percentage of the null deviance explained by the model. A value near zero means the fitted model does little better than the null model. When lambda is very large, %Dev is almost 0; it increases as lambda decreases and the penalty on the coefficients is relaxed.
Rather than supplying a pre-defined grid, we can use cross-validation to select the best lambda. The function cv.glmnet not only fits a whole sequence of models, it also tells us which lambda performs best.
set.seed(1)
ridge.cv <- cv.glmnet(X.train, Y.train, alpha = 0)
ridge.pred <- predict(ridge.cv, s="lambda.min", newx = X.test)
ridge.RMSE <- RMSE(ridge.pred)
bestlam <- ridge.cv$lambda.min
Here cv.glmnet uses 10-fold cross-validation (by default) and returns a cv.glmnet object containing a sequence of models. The model that minimized the cross-validated MSE can be found using lambda.min. Here the best lambda is 523.658 and its corresponding estimate of the test error is 44.828. We can access the coefficients of this model by calling coef(ridge.cv, s = "lambda.min") and calculate their l-2-norm, 1428.74.
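For completeness, a sketch of that calculation, reusing the el2norm helper defined earlier (note that including or excluding the intercept changes the value):
ridge.coef <- coef(ridge.cv, s = "lambda.min")[, 1]   # named numeric vector, intercept first
el2norm(ridge.coef[-1])                               # l-2-norm of the slope coefficients, excluding the intercept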
The lasso is similar to ridge regression, except that the penalty is applied to the l-1-norm of the coefficients, \(||\beta||_1\). That is, the lasso selects the best model by minimizing the quantity \(RSS + \lambda\sum_{j=1}^p|\beta_j|\). Unlike ridge regression, however, the lasso shrinks some of the coefficients exactly to zero, and thus performs variable selection. The advantage of the lasso is that it yields sparse models with fewer coefficients, which are easier to interpret.
set.seed(1)
lasso.cv <- cv.glmnet(X.train, Y.train, alpha = 1)
lasso.lam <- lasso.cv$lambda.min
coef(lasso.cv, s = "lambda.min")
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -103.66573558
## PrivateYes -580.85348775
## Accept 1.70406630
## Enroll -0.78559016
## Top10perc 74.87113588
## Top25perc -24.96415815
## F.Undergrad .
## P.Undergrad .
## Outstate -0.11343325
## Room.Board 0.26498802
## Books 0.22454134
## Personal 0.07655156
## PhD -9.13154723
## Terminal .
## S.F.Ratio -7.85770983
## perc.alumni -0.62191812
## Expend 0.01889981
## Grad.Rate 7.33869822
lasso.mod <- glmnet(X.train, Y.train, alpha = 1, lambda = grid)
lasso.pred <- predict(lasso.mod, s=lasso.lam, newx = X.test)
We can see that some of the variables have been dropped.
lasso.RMSE <- RMSE(lasso.pred)
my_results <- rbind(my_results, list("Ridge", ridge.RMSE), list("Lasso", lasso.RMSE))
knitr::kable(my_results, digits = 3)
model | RMSE |
---|---|
OLS | 45.324 |
Ridge | 44.828 |
Lasso | 43.967 |
Both ridge and the lasso perform slightly better than the ordinary least squares model on this test set. Next, let’s try principal components regression (PCR) using the pls package.
library(pls)
set.seed(1)
PCR.mod <- pcr(Apps ~ ., subset = train, data = data_df, validation = "CV")
validationplot(PCR.mod, val.type = "MSEP")
The validation plot compares the performance of PCR for different numbers of components. With val.type = "MSEP" the y-axis shows the cross-validated mean squared error of prediction. The error drops sharply over the first few components and stabilizes around 5 or 6. Using all 17 components would defeat the purpose of PCR: it simply reproduces the OLS fit.
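Rather than eyeballing the plot, we can also read the cross-validated errors off the fitted object; a sketch using accessors from the pls package:
RMSEP(PCR.mod)                              # cross-validated RMSEP for 0 to 17 components
selectNcomp(PCR.mod, method = "onesigma")   # one-standard-error heuristic for choosing ncomp (available in recent versions of pls)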
summary(PCR.mod)
## Data: X dimension: 194 17
## Y dimension: 194 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 5009 4484 2545 2515 2144 1938 1914
## adjCV 5009 4649 2535 2504 2085 1901 1880
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1908 1988 1992 1915 1899 1906 1959
## adjCV 1875 1947 1950 1874 1857 1864 1914
## 14 comps 15 comps 16 comps 17 comps
## CV 1924 1818 1820 1819
## adjCV 1885 1775 1777 1775
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 46.99 87.98 95.51 97.64 98.71 99.53 99.93
## Apps 30.24 78.36 79.67 92.75 92.79 92.83 92.85
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 99.98 100.00 100.00 100.00 100.00 100.00 100.00
## Apps 93.06 93.06 93.78 94.18 94.18 94.19 94.24
## 15 comps 16 comps 17 comps
## X 100.00 100.00 100.0
## Apps 95.09 95.09 95.2
The summary, which reports the cross-validated error and the variance explained for each number of components, further shows that including more than 6 components gives little benefit. It may be better to use even fewer components to obtain a simpler model. We will use three components, which capture about 96% of the variance in the predictors, though only about 80% of the variance in Apps.
PCR.pred <- predict(PCR.mod, X.test, ncomp = 3)
PCR.RMSE <- RMSE(PCR.pred)
my_results <- rbind(my_results, list("PCR", PCR.RMSE))
knitr::kable(my_results, digits = 3)
model | RMSE |
---|---|
OLS | 45.324 |
Ridge | 44.828 |
Lasso | 43.967 |
PCR | 73.931 |
Partial least squares (PLS) is a supervised alternative to principal components regression. PLS attempts to choose components that explain variance in both the predictors and the response.
set.seed(1)
PLS.mod <- plsr(Apps ~ ., data = data_df, subset = train, validation = "CV")
summary(PLS.mod)
## Data: X dimension: 194 17
## Y dimension: 194 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 5009 2540 2604 2246 1901 1916 1921
## adjCV 5009 2518 2624 2206 1868 1882 1886
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1980 2007 1976 1895 1900 1875 1832
## adjCV 1941 1964 1912 1853 1856 1826 1788
## 14 comps 15 comps 16 comps 17 comps
## CV 1818 1822 1824 1819
## adjCV 1775 1778 1779 1775
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 43.67 82.78 92.34 97.64 98.49 99.02 99.56
## Apps 79.68 82.55 90.51 92.83 92.89 92.97 93.05
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 99.98 99.98 100.0 100.00 100.00 100.00 100.00
## Apps 93.09 94.17 94.2 94.54 95.09 95.09 95.09
## 15 comps 16 comps 17 comps
## X 100.00 100.00 100.0
## Apps 95.09 95.12 95.2
validationplot(PLS.mod)
We can see that 3 components explain about 90% of the variance in Apps and give a low cross-validated error. It’s better to keep the number of components low to decrease the variance of the model, even if more components give a slightly lower error.
PLS.pred <- predict(PLS.mod, X.test, ncomp = 3)
PLS.RMSE <- RMSE(PLS.pred)
my_results <- rbind(my_results, list("PLS", PLS.RMSE))
knitr::kable(my_results, digits = 3)
model | RMSE |
---|---|
OLS | 45.324 |
Ridge | 44.828 |
Lasso | 43.967 |
PCR | 73.931 |
PLS | 53.542 |
Since PLS supervises the choice of components, it can be more effective than PCR when using few components. Notice how the cross-validated error with one component is almost twice as large for PCR as for PLS, but the two become roughly equal as more components are included.
In this section, I write a function using caret to train all the models at once. Each model’s parameters are optimized by cross-validation automatically, as specified in the trainControl function.
set.seed(1)
library(caret)
caret_RMSE <- function(x){sqrt(sum((x-Y.test)^2))/length(x)}
trainControl <- trainControl(method="cv", number=5)
my_caret <- function(method_name){
#this function accepts the name of the method and returns its RMSE from testing it on our specified College dataset
method_fit <- train(Apps~., data=data_df, method=method_name, metric="RMSE", preProc=c("center","scale"), trControl=trainControl)
method_predictions <- predict(method_fit, X.test)
method_RMSE <- caret_RMSE(method_predictions)
list(method_name, method_RMSE)
}
caret_names <- list("lm", "lasso", "ridge", "glmnet", "pcr", "pls", "lars")
Now I write a short pipeline to run the my_caret function on the list of method names and display the results in a nice table.
map(caret_names, my_caret) %>%
transpose() %>%
map(unlist) %>%
set_names(c("models","RMSE")) %>%
as_tibble() %>%
arrange(RMSE)->
caret_table
knitr::kable(caret_table, digits = 3, booktabs = TRUE, caption = "Result produced by Caret. Models arranged by RMSE.")
models | RMSE |
---|---|
lasso | 40.485 |
glmnet | 40.490 |
ridge | 40.574 |
lm | 40.579 |
lars | 40.579 |
pls | 48.381 |
pcr | 69.426 |
The results are quite similar to running each model individually. The models seem to perform slightly better under caret, probably because of the preprocessing and parameter tuning that caret performs.
Subset selection is a technique for choosing a subset of the original variables to serve as predictors in the model. For example, suppose we want to fit a least-squares model to the training data, but we suspect it will generalize better to new data if we keep fewer variables; perhaps some are redundant or lack predictive power. In the best subset selection procedure we consider all possible models. Since the College data has 17 predictor variables, this requires fitting \[\binom{17}{0} + \binom{17}{1} + \binom{17}{2} + \dots + \binom{17}{17} = \sum_{k = 0}^{17}\binom{17}{k} = 2^{17}\] models. A quick aside on the combinatorial identity \(\sum_{k = 0}^{p}\binom{p}{k} = 2^{p}\): it answers the question of how many distinct subsets can be formed from a set of \(p\) objects, i.e., the cardinality of its power set. We can count two ways: iterate over each element and decide whether or not to include it in the subset, giving \(2^p\) choices; or count the subsets of each size \(k\) for \(0\leq k \leq p\) using binomial coefficients and sum the counts. The two approaches count the same thing, hence the identity.
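A quick numeric check of the identity for \(p = 17\):
sum(choose(17, 0:17))   # 131072
2^17                    # 131072
Now let’s apply best subset selection with the leaps package.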
library(leaps)
BEST.mod_9 = regsubsets(Apps ~ ., data_df)
summary(BEST.mod_9)
## Subset selection object
## Call: regsubsets.formula(Apps ~ ., data_df)
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## PrivateYes Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " "
## 3 ( 1 ) " " "*" " " "*" " " " "
## 4 ( 1 ) " " "*" " " "*" " " " "
## 5 ( 1 ) " " "*" "*" "*" " " " "
## 6 ( 1 ) " " "*" "*" "*" " " " "
## 7 ( 1 ) " " "*" "*" "*" "*" " "
## 8 ( 1 ) "*" "*" "*" "*" " " " "
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" "*" " " " " " " " "
## 7 ( 1 ) " " "*" "*" " " " " " " " "
## 8 ( 1 ) " " "*" "*" " " " " "*" " "
## S.F.Ratio perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " "*" " "
## 4 ( 1 ) " " " " "*" " "
## 5 ( 1 ) " " " " "*" " "
## 6 ( 1 ) " " " " "*" " "
## 7 ( 1 ) " " " " "*" " "
## 8 ( 1 ) " " " " "*" " "
The asterisk indicates whether a variable is included in the model. Each row is a separate model containing the number of variables given by the row index, one more than in the previous row. The algorithm considers models of up to 8 variables by default. Let’s set nvmax = 17 to include all model sizes, and then extract some statistics.
BEST.mod_17 <- regsubsets(Apps ~ ., data_df, nvmax = 17)
bmod.summary <- summary(BEST.mod_17)
show_metrics <- function(my_summary){
  # pull the four model-selection metrics out of the regsubsets summary
  metrics <- c("adjr2", "rsq", "bic", "cp")
  best_df <- as.data.frame(`[`(my_summary, metrics))
  # reshape to long format: one row per (model size, metric) pair
  best_df_melt <- best_df %>% gather(key = "metric", value = "value") %>% mutate(model = rep(1:17, 4))
  (ggplot(data = best_df_melt, aes(x = model, y = value, color = metric)) +
      geom_line() +
      facet_grid(metric ~ ., scales = "free_y")) %>% print()
  # report the model size that optimizes each metric (max for adjr2/rsq, min for bic/cp)
  c(map_dbl(best_df[c("adjr2", "rsq")], which.max), map_dbl(best_df[c("bic", "cp")], which.min))
}
show_metrics(bmod.summary)
## adjr2 rsq bic cp
## 13 17 10 12
The plots show how each metric changes as we add variables to the model. Each metric changes rapidly at first and then levels off, reaching its maximum (adjr2, rsq) or minimum (bic, cp) at the model sizes reported above.
These statistics are intended to estimate the test error of the model by making an adjustment to the training error to account for the bias due to overfitting in the training process.
Let’s interpret each of these statistics (the exact formulas are collected after this list):
\(R^2_{adj}\) replaces \(RSS\) in the \(R^2\) formula with \(RSS/(N - p - 1)\), where \(p\) is the number of predictors in the model, thereby inflating the error as we add more predictors. Unlike \(R^2\), which increases monotonically with additional variables, \(R^2_{adj}\) penalizes the addition of noise variables. Both statistics measure goodness of fit on a scale of roughly \(0\) to \(1\), so we choose the model that maximizes them.
\(C_p\) adds a penalty of \(2p\hat{\sigma}^2\) to the RSS, where \(\hat{\sigma}^2\) estimates the variance of the response error. \(C_p\) grows as predictors are added, so we want to minimize it.
\(AIC\) is proportional to \(C_p\) for least-squares models, so it should also be minimized.
\(BIC\) adds a term of \(\log(N) \, p \, \hat{\sigma}^2\) to the \(RSS\), where \(N\) is the number of observations, so it penalizes additional variables more heavily when \(N\) is large. Thus we want to minimize it. This more severe penalty also explains why the \(BIC\) chose the sparsest model.
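For reference, the versions of these criteria given in ISLR, for a model with \(d\) predictors, \(N\) observations, and error-variance estimate \(\hat{\sigma}^2\) (constant factors vary between presentations):
\[C_p = \frac{1}{N}\left(RSS + 2d\hat{\sigma}^2\right), \qquad BIC = \frac{1}{N}\left(RSS + \log(N)\,d\,\hat{\sigma}^2\right), \qquad R^2_{adj} = 1 - \frac{RSS/(N - d - 1)}{TSS/(N - 1)}\]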
Clearly best subset selection is computationally intensive, so faster alternatives have been developed. Forward selection works by adding the best variable at each stage: it includes the variable that most reduces the model’s RSS, keeps it, and then chooses a new one from the remaining variables, until all are used. This results in fitting \(1 + \sum_{k=0}^{p-1}(p - k) = 1 + \frac{p(p+1)}{2}\) models. In our College dataset \(p = 17\), so the total number of models fit is \[1 + 17 + 16 + \dots + 2 + 1 = 1 + \frac{(17)(18)}{2} = 154,\] which is substantially less computation (the additional one comes from fitting the null model). At each stage the best variable is chosen by comparing \(R^2\) among the candidate models. At the end we have \(p+1\) models: \(M_0, \dots, M_{17}\). Since these models do not have the same number of variables, we cannot compare them directly using \(R^2\) (it increases monotonically); instead we use one of the adjusted metrics listed earlier.
FWD.mod <- regsubsets(Apps ~ ., data = data_df, nvmax = 17, method = "forward")
sum.FWD <- summary(FWD.mod)
show_metrics(sum.FWD)
## adjr2 rsq bic cp
## 13 17 10 12
The selected model sizes are the same for both methods.
Let’s compare the speed of these algorithms.
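The timing objects below appear to come from the microbenchmark package (their print format matches), produced in a chunk that is not echoed here; a minimal sketch of how they could be generated:
library(microbenchmark)
BEST_time <- microbenchmark(regsubsets(Apps ~ ., data = data_df, nvmax = 17), times = 100)
FWD_time  <- microbenchmark(regsubsets(Apps ~ ., data = data_df, nvmax = 17, method = "forward"), times = 100)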
BEST_time
## Unit: milliseconds
## expr min lq
## regsubsets(Apps ~ ., data = data_df, nvmax = 17) 3.297352 3.640682
## mean median uq max neval
## 4.174848 3.793609 3.981699 10.00558 100
FWD_time
## Unit: milliseconds
## expr
## regsubsets(Apps ~ ., data = data_df, nvmax = 17, method = "forward")
## min lq mean median uq max neval
## 2.980297 3.354433 4.710344 3.648824 4.159254 31.93668 100
By the median timing, forward selection is slightly faster than exhaustive best subset selection, although with only 17 predictors the difference is a fraction of a millisecond; the mean for forward selection is inflated by one slow run (its maximum is about 32 ms).
Let’s compare the models they selected as determined by the BIC. Since the BIC was minimized by the model with 10 variables, we can see if they are the same variables for the Best subset and Forward selection.
sum.FWD$which[10, ] == bmod.summary$which[10, ]
## (Intercept) PrivateYes Accept Enroll Top10perc Top25perc
## TRUE TRUE TRUE TRUE TRUE TRUE
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## TRUE TRUE TRUE TRUE TRUE TRUE
## PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## TRUE TRUE TRUE TRUE TRUE TRUE
It looks like both methods select the same variables, so we can opt for the faster forward selection method. However, this is a small data set; the results may differ for large, high-dimensional data.
In this section we will fit a classification tree to the College data set to predict whether or not a college is private. Here we use the original dataset, not the data_df that encodes the Private factor as a dummy variable.
library(tree)
tree.College <- tree(Private ~ ., data = College)
summary(tree.College)
##
## Classification tree:
## tree(formula = Private ~ ., data = College)
## Variables actually used in tree construction:
## [1] "F.Undergrad" "Outstate" "P.Undergrad" "Top10perc" "perc.alumni"
## [6] "Enroll"
## Number of terminal nodes: 10
## Residual mean deviance: 0.2563 = 196.6 / 767
## Misclassification error rate: 0.04891 = 38 / 777
The misclassification error rate, which is the training error rate, is about \(5\%\), which is quite good. The reported deviance is given by \[-2\sum_m\sum_k n_{mk}\log\hat{p}_{mk}\] where \(n_{mk}\) is the number of observations in the \(m\)th terminal node that belong to the \(k\)th class. The index \(m\) runs over the terminal nodes of the tree (10 in this case) and \(k\) over the two class labels (public or private). \(\hat{p}_{mk}\) is the class proportion within a node, here \(\frac{n_{mk}}{n_{m1} + n_{m2}}\) for \(k = 1\) or \(k = 2\). This formula for the deviance is the cross-entropy, so it can be interpreted as a cumulative measure of node purity. To elaborate, let’s discuss how decision trees work. They divide the predictor space into \(m\) regions, which can be visualized as a rectangular box cut into pieces, or as an upside-down tree whose terminal leaves are the regions.
For a classification tree, the class label assigned to a data point in a given region of the predictor space is the majority vote of the training labels in that region. Since we only have 2 class labels, it is whichever label is dominant. Misclassification error therefore comes from training points whose label differs from the majority vote in their region. That’s why we want to measure a region’s (node’s) purity: the degree to which its training points belong to a single label. By maximizing the purity, we minimize the training error. An easily interpretable measure of node purity is the Gini index, \[G = \sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk}),\] which gives the impurity of the \(m\)th node. If almost all the training labels in the node belong to a single class \(k\), then \(G\) is almost 0 and the node is very pure. The cross-entropy, and hence the deviance, have a similar interpretation to the Gini index, and we want to minimize them to get the best classifier.
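To make these purity measures concrete, here is a small sketch that computes the Gini index and cross-entropy for a hypothetical node’s class proportions:
node_purity <- function(p){
  # p: vector of class proportions within a terminal node (sums to 1)
  c(gini = sum(p * (1 - p)),
    cross.entropy = -sum(ifelse(p > 0, p * log(p), 0)))
}
node_purity(c(0.9, 0.1))   # nearly pure node: both measures are small
node_purity(c(0.5, 0.5))   # maximally impure two-class node: Gini = 0.5, cross-entropy = log(2)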
The residual mean deviance statistic reported is simply the deviance divided by \(n - |T|\) where \(n = 777\) is the number of observations, and \(|T| = 10\) is the number of terminal nodes. A small RMD indicates that the tree provides a good fit to the data.
We can print out a visual representation of the tree.
plot(tree.College)
text(tree.College)
We can see the most important splitting variable is the number of full-time undergraduates. From the leftmost branch we can see that a second split on the F.Undergrad predictor was significant: smaller schools tend to be private. Out-of-state tuition is also an important factor: schools with smaller enrollments but higher tuition tend to be private, and if the tuition is very high, the school is probably private. Private schools also have a larger percentage of alumni who donate back.
To get an estimate of the test error we need to split the data into a training and validation set.
College.test <- College[-train, ]
Private.test <- College$Private[-train]
tree.college <- tree(Private ~ ., data = College, subset = train)
tree.pred <- predict(tree.college, College.test, type = "class")
tree.table <- table(tree.pred, Private.test)
The table function gives a confusion matrix to explain the performance of the classifier:
tree.table
## Private.test
## tree.pred No Yes
## No 133 31
## Yes 31 388
By dividing the counts in the off-diagonal by the total observations, we get the misclassification error:
error_rate = function(myTable) {(myTable[1,2] + myTable[2, 1]) / (sum(myTable))}
tree_error = error_rate(tree.table)
tree_error
## [1] 0.1063465
Fitting a deep classification tree may produce good predictions on the training set but is likely to overfit, leading to poor performance on the test set. It may be better to build a less complex tree that has more bias but lower variance on new data. To accomplish this we minimize a criterion that penalizes the complexity, i.e., the number of terminal nodes of the tree: \[\sum_{m = 1}^{|T|}\sum_{i:x_i\in R_m}(y_i - \hat{y}_{R_m})^2+\alpha |T|\] Here \(\alpha\) is a tuning parameter that penalizes trees with more leaves. When \(\alpha = 0\) we get the original tree; as it increases we get a sequence of subtrees, each with no more nodes than the previous one. Cross-validation on the training set can be used to select the optimal \(\alpha\); once it is chosen, we evaluate that subtree’s performance on the test set.
set.seed(1)
cv.college = cv.tree(tree.college, FUN = prune.misclass)
cv.college
## $size
## [1] 8 6 3 2 1
##
## $dev
## [1] 19 19 21 32 51
##
## $k
## [1] -Inf 0.000000 1.666667 14.000000 22.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
# return the tree size whose cross-validated deviance is smallest
best_size <- function(x){x$dev %>% which.min() %>% `[`(x$size, .)}
best_size_1 <- best_size(cv.college)
Explanation of the output: size refers to the number of terminal nodes in each tree, dev is the CV error, and k is the tuning parameter \(\alpha\). We set FUN = prune.misclass to indicate that classification error should guide the tuning process, as opposed to the default metric, deviance.
plot_cv <- function(cv_object){
  # gather size, CV deviance, and the cost-complexity parameter k into long format and plot them side by side
  as.data.frame(cv_object[c('size', 'dev', 'k')]) %>%
    gather(key = "metric", value = "value") %>%
    mutate(index = rep(1:length(cv_object$size), 3)) %>%
    ggplot(data = ., aes(x = index, y = value, col = metric)) +
    geom_line()
}
plot_cv(cv.college)
It looks like the CV error is minimized for a tree of size 8 (tied with size 6; which.min picks the first).
Now we can apply the prune.misclass function to obtain this tree:
set.seed(1)
prune.college <- prune.misclass(tree.college, best = best_size_1)
plot(prune.college)
text(prune.college, pretty = 0)
We can see it incorporates the 2 most important variables previously identified, out-of-state tuition and number of full-time undergraduates.
prune.pred = predict(prune.college, College.test, type = "class")
prune.table <- table(prune.pred, Private.test)
prune.table
## Private.test
## prune.pred No Yes
## No 133 30
## Yes 31 389
error_rate(prune.table)
## [1] 0.1046312
The pruned tree actually has a slightly lower test error: comparing the two confusion matrices, it misclassifies one fewer private college (61 versus 62 errors, 0.105 versus 0.106). So we now have a much more interpretable tree with essentially the same performance.
Regression trees work much like classification trees, except that the prediction for a point is the average of the training responses in the region it falls into, whereas in classification we took the majority vote. Let’s see how a regression tree performs on our College dataset.
set.seed(1)
rtree.college <- tree(Apps ~ ., subset = train, data = College)
summary(rtree.college)
##
## Regression tree:
## tree(formula = Apps ~ ., data = College, subset = train)
## Variables actually used in tree construction:
## [1] "Accept" "Top25perc"
## Number of terminal nodes: 6
## Residual mean deviance: 5284000 = 993400000 / 188
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8766.0 -471.8 -106.9 0.0 266.9 23620.0
plot(rtree.college)
text(rtree.college, pretty = 0)
Notice that only two predictors were used in the construction of this tree, which has six terminal nodes (five splits). The residual mean deviance is just the mean squared error of the tree. Accept is the number of applications accepted, so it makes perfect sense that it is strongly related to Apps, the number of applications received. I did not expect Top25perc, the percent of new students from the top 25% of their high school class, to be the other variable used; it appears in only one split.
rtree.yhat <- predict(rtree.college, College[-train, ])
TREE.RMSE <- RMSE(rtree.yhat)
TREE.RMSE
## [1] 63.64753
set.seed(1)
rcv.college = cv.tree(rtree.college)
plot_cv(rcv.college)
rcv.college
## $size
## [1] 6 5 4 3 2 1
##
## $dev
## [1] 2103518473 2137107171 2275506070 2892084499 3482195433 5160836679
##
## $k
## [1] -Inf 50248845 60525208 333091597 589316171 2790697793
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
best_size_2 <- ifelse(best_size(rcv.college) != 1, best_size(rcv.college), 5)
It looks like the best tree is of size 6. Note that the predict function throws an error for a single-node tree, which is why best_size_2 falls back to 5 in that case.
set.seed(1)
rprune.college <- prune.tree(rtree.college, best = best_size_2)
plot(rprune.college)
text(rprune.college, pretty = 0)
rcollege.pred = predict(rprune.college, newdata = College[-train, ])
RMSE(rcollege.pred)
## [1] 63.64753
Notice that this performs exactly the same as the unpruned tree: the cross-validated optimum of six terminal nodes is the size of the original tree, so pruning leaves it unchanged.
my_results <- rbind(my_results, list("Tree", TREE.RMSE)) %>% arrange(RMSE)
knitr::kable(my_results, digits = 3)
model | RMSE |
---|---|
Lasso | 43.967 |
Ridge | 44.828 |
OLS | 45.324 |
PLS | 53.542 |
Tree | 63.648 |
PCR | 73.931 |
We can see that the regression tree performs worse than the optimized linear regression models. Perhaps we can infer that a linear decision boundary is a better fit for this data set. The tree will perform better when the predictor space can be divided up into rectangular regions.
Bagging is a technique to reduce the variance of a statistical learning procedure. If we fit a deep tree to the data, we may have overfitting, so the model will not generalize well. However, if we fit many trees and average the results we will have lower variance when generalizing. Bagging stands for bootstrap aggregation and works as follows: Take \(B\) bootstrap samples and fit a full tree to each sample. Our prediction for an observation \(x\) is \[\hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)\]
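To make the idea concrete, here is a minimal by-hand sketch of bagging regression trees on our split (the randomForest call below does this, and more, internally):
set.seed(1)
B <- 25
boot.preds <- replicate(B, {
  idx <- sample(train, replace = TRUE)            # bootstrap sample of the training rows
  fit <- tree(Apps ~ ., data = College[idx, ])    # grow a tree on the bootstrap sample
  predict(fit, newdata = College[-train, ])       # predict the held-out test rows
})
yhat.bag.manual <- rowMeans(boot.preds)           # average the B sets of predictions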
library(randomForest)
set.seed(1)
bag.college <- randomForest(Apps ~ ., data = College, subset = train, mtry = 17, importance = TRUE)
bag.college
##
## Call:
## randomForest(formula = Apps ~ ., data = College, mtry = 17, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 17
##
## Mean of squared residuals: 8663121
## % Var explained: 65.11
varImpPlot(bag.college)
To discuss these error metrics we need out-of-bag (OOB) observations. Each observation is left out of roughly one third of the \(B\) bootstrap samples; for those samples it is said to be out of bag. We can estimate the error for a single observation by averaging its residuals over the trees for which it was out of bag, and the root mean square of these residuals estimates the test error. %IncMSE reports the mean increase in OOB prediction error when the values of a given variable are randomly permuted, so larger values indicate more important variables. IncNodePurity is the total decrease in node impurity from splits on that variable, averaged over all trees; for regression trees, node impurity is measured by the RSS. Note that bagging is a special case of random forests in which all the predictors are considered at each split, so we can simply call the randomForest function with mtry = 17.
yhat.bag <- predict(bag.college, newdata = College[-train, ])
RMSE(yhat.bag)
## [1] 57.13291
This is an improvement over the plain regression tree.
Random forests improve on bagging by decorrelating the trees: only a random subset of predictors may be considered at each split. In our bagging model, every split considered all 17 predictors and chose the one yielding the greatest decrease in residual error. With a random forest we consider a smaller number, usually \(\sqrt{p}\) for classification and about \(p/3\) for regression, here \(6\) predictors: at each split a fresh random sample of \(6\) of the 17 predictors is drawn and only the best split among those is considered. This decorrelates the trees, which matters when a few very strong predictors would otherwise drive the splits; without it, the 500 trees end up looking very similar even though the bootstrap samples differ. Random forests thus ensure the contributions of all predictors are taken into account.
set.seed(1)
rf.college <- randomForest(Apps ~ ., data = College, subset = train, mtry = 6, importance = TRUE)
yhat.rf = predict(rf.college, newdata = College[-train, ])
RF.RMSE <- RMSE(yhat.rf)
RF.RMSE
## [1] 55.09523
The random forest gives a modest improvement over bagging (55.10 versus 57.13).
importance(rf.college)
## %IncMSE IncNodePurity
## Private 2.3399770 13631854
## Accept 18.7177910 1763876068
## Enroll 12.4804402 1118401360
## Top10perc 2.7414120 108871243
## Top25perc 1.4729585 139928982
## F.Undergrad 9.3128040 818290221
## P.Undergrad 2.6593772 226790494
## Outstate -2.1680527 61126749
## Room.Board 0.3275687 28617912
## Books -2.9301220 65301093
## Personal -2.7154225 57527289
## PhD -0.2005949 59326427
## Terminal -1.9788116 48049791
## S.F.Ratio 1.1426781 126809177
## perc.alumni 0.7150447 34342168
## Expend 3.6025646 41975290
## Grad.Rate 3.6260142 82016293
varImpPlot(rf.college)
The three clearly most important variables are the number of applicants accepted, the number of new students enrolled, and the number of full-time undergraduates. Even though the performance on this test set was not much better than bagging, I believe this model would generalize better. The bagging model was dominated by the Accept predictor and did not properly weight the contributions of Enroll and F.Undergrad, even though they were still identified as important; instead Top25perc found its way near the top, just as in the single regression tree.
Boosting is a “slow learning” method that builds an additive model by fitting trees sequentially. Unlike bagging, it does not rely on bootstrap samples: the algorithm begins by setting \(\hat{f}(x) = 0\), so the residuals are simply the responses. At each step a small tree is fit to the current residuals, a shrunken version of that tree is added to \(\hat{f}\), and the residuals are updated. Each final prediction is therefore a sum of many small contributions, each chosen to reduce the remaining residuals a little further. The output is the boosted additive model.
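The algorithm, as presented in ISLR (Algorithm 8.2, boosting for regression trees; the book’s figure is not reproduced here), with \(r_i\) the current residuals, \(\lambda\) the shrinkage parameter, and \(d\) the number of splits per tree: \[\begin{aligned} &1.\ \text{Set } \hat{f}(x) = 0 \text{ and } r_i = y_i \text{ for all } i \text{ in the training set.}\\ &2.\ \text{For } b = 1, 2, \dots, B:\\ &\qquad (a)\ \text{fit a tree } \hat{f}^b \text{ with } d \text{ splits to the data } (X, r);\\ &\qquad (b)\ \text{update } \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x);\\ &\qquad (c)\ \text{update the residuals } r_i \leftarrow r_i - \lambda \hat{f}^b(x_i).\\ &3.\ \text{Output the boosted model } \hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x). \end{aligned}\]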
\(\lambda\) is the shrinkage parameter applied to each tree’s contribution. The parameter \(d\), called the interaction depth, can be quite small, which results in fitting small trees. By fitting small trees to the residuals we slowly improve \(\hat{f}\) in the areas where it does not perform well. Since each tree takes into account the trees that have already been grown, small trees are sufficient, which makes this approach very different from bagging.
library(gbm)
set.seed(1)
# note: this fits on College[-train, ], the same rows used for evaluation below (see the discussion after the results table)
boost.college <- gbm(Apps ~ ., data = College[-train, ], distribution = "gaussian", n.trees = 5000, interaction.depth = 4)
summary(boost.college)
## var rel.inf
## Accept Accept 71.03106826
## Enroll Enroll 13.47969809
## Top25perc Top25perc 4.06415091
## F.Undergrad F.Undergrad 2.92793274
## Top10perc Top10perc 2.26276392
## Outstate Outstate 1.24563265
## Terminal Terminal 1.19311365
## Grad.Rate Grad.Rate 1.10654086
## Expend Expend 0.85114060
## PhD PhD 0.64147680
## Room.Board Room.Board 0.45554150
## S.F.Ratio S.F.Ratio 0.21998081
## P.Undergrad P.Undergrad 0.16033008
## Books Books 0.11381049
## perc.alumni perc.alumni 0.11320335
## Personal Personal 0.10219142
## Private Private 0.03142387
Note that the distribution assumed for the response is gaussian, since this is a regression problem (bernoulli would be used for binary classification), and the shrinkage parameter is left at its default of .001. The relative influence output confirms that Accept and Enroll are by far the most important variables.
yhat.boost <- predict(boost.college, newdata = College[-train, ], n.trees = 5000)
BOOST.RMSE <- RMSE(yhat.boost)
BOOST.RMSE
## [1] 25.14018
Wow, boosting emerges as the clear winner amongst all the models tested so far.
my_results <- rbind(my_results, list("Boost", BOOST.RMSE), list("RF", RF.RMSE)) %>% arrange(RMSE)
knitr::kable(my_results, digits = 3)
model | RMSE |
---|---|
Boost | 25.140 |
Lasso | 43.967 |
Ridge | 44.828 |
OLS | 45.324 |
PLS | 53.542 |
RF | 55.095 |
Tree | 63.648 |
PCR | 73.931 |
Actually, this result should be interpreted with caution: the boosted model was fit with data = College[-train, ], the same observations later used to compute its RMSE (and 583 observations rather than the 194 used to train the other models), so the number above is closer to a training error than a test error. This, rather than the particular seed, is the most likely reason boosting appears anomalously dominant here; in other runs boosting performed similarly to ridge and the lasso.
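A quick way to check this (a sketch, not run here) is to refit the boosted model on the training rows only and evaluate it on the test rows:
set.seed(1)
boost.college.train <- gbm(Apps ~ ., data = College[train, ], distribution = "gaussian",
                           n.trees = 5000, interaction.depth = 4)
RMSE(predict(boost.college.train, newdata = College[-train, ], n.trees = 5000))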
Support vector machines can be used for binary classification problems. They partition the predictor space with a separating hyperplane (a flat, affine subspace of dimension \(p-1\)), aiming to classify all points on either side into a single label when the data are linearly separable. Here we apply one to predict whether a college is public or private.
set.seed(1)
library(e1071)
svm.fit <- svm(Private ~ ., data = College, kernel = "linear", cost = 10, scale = FALSE)
summary(svm.fit)
##
## Call:
## svm(formula = Private ~ ., data = College, kernel = "linear",
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.05882353
##
## Number of Support Vectors: 178
##
## ( 54 124 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
The summary tells us there are 178 support vectors: 54 for the class ‘No’ and 124 for the class ‘Yes’. Support vectors are the observations that lie on the margin or violate it, i.e., fall on the wrong side of the margin for their class. The margin is the minimal distance from the training observations to the hyperplane, and it is the quantity we maximize to determine the optimal separating hyperplane.
svm.fit$index %>% head(n = 15)
## [1] 7 27 40 43 58 60 67 70 82 111 143 173 179 200 222
length(svm.fit$index)
## [1] 178
The ‘index’ attribute gives the indices of the observations that were used as support vectors.
The cost parameter determines how heavily margin violations are penalized. Note that it is effectively the inverse of the “error budget” \(C\) described in ISLR: a small cost tolerates many points on the wrong side of the margin (a wider margin, more support vectors, higher bias, lower variance), while a large cost tolerates few violations (a narrower margin, fewer support vectors, lower bias, higher variance).
Let’s use cross-validation to tune the cost parameter over a range of values:
set.seed(1)
tune.out <- tune(svm, Private ~ ., data = College[train, ], kernel = "linear",
ranges = list(cost = c(.001, .01, .1, 1, 5, 10, 100, 200, 300, 400, 500 ,600, 1000)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 100
##
## - best performance: 0.03105263
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.24157895 0.07490427
## 2 1e-02 0.11815789 0.08518669
## 3 1e-01 0.09289474 0.08412871
## 4 1e+00 0.10342105 0.07427758
## 5 5e+00 0.07736842 0.05089534
## 6 1e+01 0.06710526 0.05510629
## 7 1e+02 0.03105263 0.03648111
## 8 2e+02 0.05657895 0.05702766
## 9 3e+02 0.05684211 0.05752407
## 10 4e+02 0.06184211 0.05901690
## 11 5e+02 0.06184211 0.05901690
## 12 6e+02 0.06184211 0.05901690
## 13 1e+03 0.06184211 0.05901690
Cross-validation shows that the classifier performs best when the cost is set to 100.
The tune.out object stores the best model:
svm.clf = tune.out$best.model
summary(svm.clf)
##
## Call:
## best.tune(method = svm, train.x = Private ~ ., data = College[train,
## ], ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100,
## 200, 300, 400, 500, 600, 1000)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 100
## gamma: 0.05882353
##
## Number of Support Vectors: 22
##
## ( 11 11 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
The best model has a much larger cost parameter than our initial choice (100 versus 10).
yhat_svm = predict(svm.clf, College[-train, ])
svm.table = table(yhat_svm, College$Private[-train])
svm.table
##
## yhat_svm No Yes
## No 131 15
## Yes 33 404
svm_error = error_rate(svm.table)
clf_results = tibble(classifier = list('tree', 'svm'),
error = c(tree_error, svm_error)) %>% arrange(error)
knitr::kable(clf_results, digits = 3)
classifier | error |
---|---|
svm | 0.082 |
tree | 0.106 |
The support vector machine classifier outperforms the decision tree.