Multiple Linear Regression
Numerical vs. multiple numerical variables
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
- Carry out multiple linear regression with R functions.
- Visualize the multiple linear regression model.
- Simplify the model step by step.
2 Principle
2.1 Definition
- Linear Regression:
- A statistical method for determining the quantitative relationship between one or more predictor variables and one dependent variable.
- Simple Linear Regression:
- Only one predictor variable.
- Multiple Linear Regression:
- Two or more predictor variables.
Selection of predictor variables:
- The predictor variables must have a significant influence on the dependent variable and show a close linear correlation with it;
- The predictor variables should not be strongly correlated with one another: the correlation between any two predictor variables should be lower than the correlation between each predictor variable and the dependent variable;
- The predictor variables should have complete statistical data, and their values should be easy to determine for prediction.
2.2 Model
\[y_i= \beta_0 + \beta_1X_{i1} + ... + \beta_kX_{ik} + \varepsilon_i \]
- \(X_{ij}\): the \(j\)-th predictor variable for observation \(i\)
- \(y_i\): dependent variable
- \(\beta_0, \beta_1, \ldots, \beta_k\): regression coefficients
- \(\varepsilon_i\): effect of other random factors (the error term)
Example: house price (\(H\)) as a function of location (\(L\)), area (\(A\)), and surrounding facilities (\(S\)):
\[H= \beta_0 + \beta_1 L + \beta_2A + \beta_3S + \varepsilon \]
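As a toy illustration, such a model can be fitted with lm(). The following sketch simulates house prices from made-up coefficients and recovers them (all names and values are hypothetical):
# Simulate house prices from a known model, then refit it
set.seed(1)
n <- 100
L <- runif(n, 0, 10)      # location score
A <- runif(n, 50, 200)    # area
S <- runif(n, 0, 5)       # surrounding-facilities score
H <- 50 + 8 * L + 3 * A + 12 * S + rnorm(n, sd = 20)
summary(lm(H ~ L + A + S))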
2.3 Workflow
- Fit the model
- Diagnose the model
- Simplify the model
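A minimal sketch of this workflow, assuming a generic data frame dat with response y (placeholder names):
m <- lm(y ~ ., data = dat)  # 1. fit: regress y on all other columns
plot(m)                     # 2. diagnose: residual and Q-Q plots
m2 <- step(m)               # 3. simplify: stepwise selection by AIC
summary(m2)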
3 Example
3.1 Objective
Use 13 explanatory variables to predict the crime rate¹:
¹ A Course in Statistics with R, Chapter 12.
- DV: Crime rate
- IV: 13 explanatory variables
3.2 Visualization
data(usc, package = "ACSWR")
pairs(usc)
GGally::ggpairs(usc)
ucor <- cor(usc)
ucor
corrplot::corrplot(ucor, order = "AOE", type = "upper", method = "number")
corrplot::corrplot(ucor, order = "AOE", type = "lower", method = "ellipse",
                   diag = FALSE, tl.pos = "n", cl.pos = "n", add = TRUE)
- A strong relationship between Ex0 and Ex1: multicollinearity.
3.3 Fit the model
crime_rate_lm <- lm(R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 +
W + X, data = usc)
# or
crime_rate_lm <- lm(R ~ ., data = usc)
summary(crime_rate_lm)
confint(crime_rate_lm)
anova(crime_rate_lm)
- The intercept term, Age, Ed, U2, and X are significant variables for explaining the crime rate. The 95% confidence intervals confirm this.
- The model is significant: \(p < 0.05\), adjusted \(R^2 = 0.6783\).
The difference between \(R^2\) and Adj-\(R^2\):
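For reference, with \(n\) observations and \(k\) predictors, the adjusted \(R^2\) penalizes model size:
\[R^2_{\mathrm{adj}} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\]
Adding a predictor never decreases \(R^2\), but it increases \(R^2_{\mathrm{adj}}\) only if the gain in fit outweighs the lost degree of freedom, as the following demo shows.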
get_R2 <- function(x, method){
round(summary(lm(usc$R ~ as.matrix(usc[, 2:x])))[[method]], 3)
}
dtf_R2 <- data.frame(n = 1:13,
R2 = sapply(2:14, get_R2, method = 'r.squared'),
AdjR2 = sapply(2:14, get_R2, method = 'adj.r.squared'))
library(ggplot2)
library(tidyr)
dtf_R2 |>
pivot_longer(-n) |>
ggplot() +
geom_point(aes(n, value, col = name)) +
  geom_line(aes(n, value, col = name))
3.4 Diagnose the model
Multicollinearity
- multi-: many (predictor variables)
- col-: together, jointly
- linearity: in a linear relationship
Problems:
- Imprecise estimates of \(\beta\)
- The t-tests may fail to reveal significant factors
- The importance of some predictors can be masked
# SLR
data(Euphorbiaceae, package = 'gpk')
dtf <- data.frame(x1 = Euphorbiaceae$GBH,
y = Euphorbiaceae$Height)
plot(dtf)
lm(y ~ x1, data = dtf) |> summary()
# MLR: add a second predictor that is almost identical to x1
dtf$x2 <- jitter(dtf$x1)
pairs(dtf)
lm(y ~ x1 + x2, data = dtf) |> summary()
# or
lm(y ~ ., data = dtf) |> summary()
# Back to the crime data: Ex0 and Ex1 are almost collinear
cor(usc$Ex0, usc$Ex1)
Solution: variance inflation factor (VIF).
\[\mathrm{VIF}_j = \frac{1}{1-R_j^2}\]
- \(\mathrm{VIF}_j\): VIF for the \(j\)-th variable
- \(R^2_j\): \(R^2\) from the regression of the \(j\)-th explanatory variable on the remaining explanatory variables.
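As a sanity check on this definition, the VIF of a single variable can be computed by hand; a sketch for Ex0, which should match the faraway::vif() output below:
# Regress Ex0 on the other explanatory variables (the response R is excluded)
r2_ex0 <- summary(lm(Ex0 ~ . - R, data = usc))$r.squared
1 / (1 - r2_ex0)  # VIF = 1 / (1 - R_j^2)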
uscrimewor <- usc[, -1]
faraway::vif(uscrimewor)
Rule of thumb: a VIF > 10 (i.e. \(R_j^2 > 0.9\)) signals serious multicollinearity.
Drop the variable with the highest VIF (here Ex1):
uscrimewor2 <- uscrimewor[, c('Age','S','Ed','Ex0','LF','M','N','NW','U1','U2','W','X')]
Update the VIFs and the model:
faraway::vif(uscrimewor2)
crime_rate_lm2 <- lm(R ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 + U2 + W + X, usc)
summary(crime_rate_lm2)
- All updated VIFs < 10. No multicollinearity.
- The 12 explanatory variables account for 76.7% of the variability in the crime rates of the 47 states.
3.5 Simplify the model
3.5.1 Backward selection
1. Start with all the predictors in the model.
2. Remove the predictor with the highest p-value greater than \(\alpha\) (usually 0.05).
3. Refit the model and go back to step 2.
4. Stop when all p-values are less than \(\alpha\).
# List the coefficients, sorted by p-value in decreasing order
get_p <- function(model){
  smry2 <- data.frame(summary(model)$coefficients)
  # data.frame() renames the 'Pr(>|t|)' column to 'Pr...t..'
  smry2[order(-smry2$Pr...t..), ]
}
get_p(crime_rate_lm2)
crime_rate_lm3 <- update(crime_rate_lm2, . ~ . - NW)
get_p(crime_rate_lm3)
crime_rate_lm4 <- update(crime_rate_lm3, . ~ . - LF)
get_p(crime_rate_lm4)
crime_rate_lm5 <- update(crime_rate_lm4, . ~ . - N)
get_p(crime_rate_lm5)
crime_rate_lm6 <- update(crime_rate_lm5, . ~ . - S)
get_p(crime_rate_lm6)
crime_rate_lm7 <- update(crime_rate_lm6, . ~ . - M)
get_p(crime_rate_lm7)
crime_rate_lm8 <- update(crime_rate_lm7, . ~ . - U1)
get_p(crime_rate_lm8)
crime_rate_lm9 <- update(crime_rate_lm8, . ~ . - W)
get_p(crime_rate_lm9)
summary(crime_rate_lm9)
3.5.2 Forward selection
- Start with no variables in the model.
- For each predictor not in the model, check the p-value it would have if added. Add the one with the lowest p-value, provided it is below \(\alpha\).
- Continue until no new predictor can be added (see the sketch below).
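R has no dedicated function for p-value-based forward selection, but a minimal sketch with add1() (assuming usc is loaded and \(\alpha = 0.05\)) could look like this:
fwd <- lm(R ~ 1, data = usc)  # start with the intercept only
full_scope <- ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 + W + X
repeat {
  cand <- add1(fwd, scope = full_scope, test = "F")  # F-test for each candidate
  p <- cand[["Pr(>F)"]][-1]                          # drop the <none> row
  if (all(is.na(p)) || min(p, na.rm = TRUE) >= 0.05) break
  best <- rownames(cand)[-1][which.min(p)]           # lowest p-value wins
  fwd <- update(fwd, as.formula(paste(". ~ . +", best)))
}
summary(fwd)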
3.5.3 AIC selection
Akaike Information Criterion (AIC): the smaller, the better.
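For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\),
\[\mathrm{AIC} = 2k - 2\ln(\hat{L}),\]
so a lower AIC indicates a better trade-off between goodness of fit and model complexity. step() adds and drops terms to reduce the AIC.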
crime_rate_aic <- step(crime_rate_lm, direction = "both")
summary(crime_rate_aic)
3.6 Results
According to the backward selection:
\[ \operatorname{\widehat{R}} = -524.37 + 1.02(\operatorname{Age}) + 2.03(\operatorname{Ed}) + 1.23(\operatorname{Ex0}) + 0.91(\operatorname{U2}) + 0.63(\operatorname{X}) \]
|                     | Dependent variable: R       |
|---------------------|-----------------------------|
| Age                 | 1.020*** (0.353)            |
| Ed                  | 2.031*** (0.474)            |
| Ex0                 | 1.233*** (0.142)            |
| U2                  | 0.914** (0.434)             |
| X                   | 0.635*** (0.147)            |
| Constant            | -524.374*** (95.116)        |
| Observations        | 47                          |
| R2                  | 0.730                       |
| Adjusted R2         | 0.697                       |
| Residual Std. Error | 21.301 (df = 41)            |
| F Statistic         | 22.129*** (df = 5; 41)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01 |
The model shows that the variables Age, Ed, Ex0, U2, and X explain 73% of the variability in the crime rates.
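The fitted model can then be used for prediction. A hypothetical example (the predictor values below are made up for illustration):
# Predict the crime rate for a made-up state profile
new_state <- data.frame(Age = 130, Ed = 105, Ex0 = 80, U2 = 34, X = 170)
predict(crime_rate_lm9, newdata = new_state, interval = "prediction")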
According to the AIC selection:
\[ \operatorname{\widehat{R}} = -618.5 + 1.13(\operatorname{Age}) + 1.82(\operatorname{Ed}) + 1.05(\operatorname{Ex0}) + 0.83(\operatorname{U2}) + 0.16(\operatorname{W}) + 0.82(\operatorname{X}) \]
|                     | Dependent variable: R       |
|---------------------|-----------------------------|
| Age                 | 1.125*** (0.351)            |
| Ed                  | 1.818*** (0.480)            |
| Ex0                 | 1.051*** (0.175)            |
| U2                  | 0.828* (0.427)              |
| W                   | 0.160* (0.094)              |
| X                   | 0.824*** (0.181)            |
| Constant            | -618.503*** (108.246)       |
| Observations        | 47                          |
| R2                  | 0.748                       |
| Adjusted R2         | 0.710                       |
| Residual Std. Error | 20.827 (df = 40)            |
| F Statistic         | 19.771*** (df = 6; 40)      |
| Note:               | *p<0.1; **p<0.05; ***p<0.01 |
The model shows that the variables Age, Ed, Ex0, U2, W, and X explain 74.8% of the variability in the crime rates.
4 Further readings
- STAT 501, Pennsylvania State University