Multiple Linear Regression
Numerical vs. multiple numerical
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
- Carry out multiple linear regression with R functions.
- Visualize the multiple linear regression model.
- Simplify the model step by step.
2 Principle
2.1 Definition
- Linear regression:
- A statistical method for determining the quantitative relationship between one or more predictor variables and one dependent variable.
- Simple linear regression:
- Only one predictor variable.
- Multiple linear regression:
- Two or more predictor variables.
Selection of predictor variables:
- The predictor variables must have a significant influence on the dependent variable and be closely and linearly correlated with it;
- The predictor variables should be mutually independent: the correlation between any two predictor variables should not be stronger than the correlation between each predictor variable and the dependent variable;
- The predictor variables should have complete statistical data, so that their values can be determined easily for prediction.
2.2 Model
\[y_i= \beta_0 + \beta_1X_{i1} + ... + \beta_kX_{ik} + \varepsilon_i \]
- \(X\): predictor variables
- \(y\): dependent variable
- \(\beta_0, \beta_1, \ldots, \beta_k\): regression coefficients
- \(\varepsilon\): effect of other random factors (the error term)
Example: the house price (H) as a function of the location (L), area (A), and surrounding facilities (S):
\[H= \beta_0 + \beta_1 L + \beta_2A + \beta_3S + \varepsilon \]
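As a toy illustration (a sketch with made-up numbers; the variable names follow the equation above), such a model can be simulated and fitted in R:

set.seed(1)
L <- runif(100, 0, 10)    # location score (made up)
A <- runif(100, 50, 200)  # area in square metres (made up)
S <- runif(100, 0, 5)     # surrounding-facilities score (made up)
H <- 50 + 3 * L + 2 * A + 10 * S + rnorm(100, sd = 20)  # simulated price
house_lm <- lm(H ~ L + A + S)  # fit the multiple linear regression
coef(house_lm)                 # estimates of beta_0, ..., beta_3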
2.3 Workflow
- Fit the model
- Diagnose the model
- Simplify the model
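In R, these three steps map onto a few functions demonstrated in the next section (a sketch only, assuming the usc data introduced below):

fit <- lm(R ~ ., data = usc)  # 1. fit by least squares
faraway::vif(usc[, -1])       # 2. diagnose: multicollinearity
summary(fit)                  # 2. diagnose: coefficient tests and R^2
step(fit)                     # 3. simplify: AIC-based selection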
3 Example
3.1 Objective
Use 13 explanatory variables to predict the crime rate (source: A Course in Statistics with R, Chapter 12):
- DV (dependent variable): crime rate R
- IV (independent variables): 13 explanatory variables
3.2 Visualization
data(usc, package = "ACSWR")  # US crime data: R and 13 explanatory variables
pairs(usc)                    # scatterplot matrix
GGally::ggpairs(usc)          # scatterplot matrix with correlation coefficients
ucor <- cor(usc)              # correlation matrix
corrplot::corrplot(ucor, order = "AOE", type = "upper", method = "number")
corrplot::corrplot(ucor, order = "AOE", type = "lower", method = "ellipse",
                   diag = FALSE, tl.pos = "n", cl.pos = "n", add = TRUE)
- A strong correlation between Ex0 and Ex1: a sign of multicollinearity.
3.3 Fit the model
crime_rate_lm <- lm(R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 +
                      W + X, data = usc)
# or, equivalently, regress R on all the other columns:
crime_rate_lm <- lm(R ~ ., data = usc)
summary(crime_rate_lm)  # coefficient estimates and t-tests
confint(crime_rate_lm)  # 95% confidence intervals
anova(crime_rate_lm)    # analysis-of-variance table
- The intercept term, Age, Ed, U2, and X are significant in explaining the crime rate. The 95% confidence intervals confirm this.
- The model is significant: \(p < 0.05\), adjusted \(R^2 = 0.6783\).
The difference between \(R^2\) and adjusted \(R^2\): \(R^2\) never decreases when a predictor is added, while adjusted \(R^2\) penalizes additional predictors:
# R^2 and adjusted R^2 of models with an increasing number of predictors
get_R2 <- function(x, method){
  round(summary(lm(usc$R ~ as.matrix(usc[, 2:x])))[[method]], 3)
}
dtf_R2 <- data.frame(n = 1:13,
                     R2 = sapply(2:14, get_R2, method = 'r.squared'),
                     AdjR2 = sapply(2:14, get_R2, method = 'adj.r.squared'))
library(ggplot2)
library(tidyr)
dtf_R2 |>
  pivot_longer(-n) |>
  ggplot() +
  geom_point(aes(n, value, col = name)) +
  geom_line(aes(n, value, col = name))
3.4 Diagnose the model
Multicollinearity
- multi-: multiple columns
- col-: relationship
- linearity: linear
Problems:
- Imprecise estimates of \(\beta\)
- The t-tests may fail to reveal significant factors
- The importance of some predictors may be masked.
# SLR: regress tree height on girth at breast height (GBH)
data(Euphorbiaceae, package = 'gpk')
dtf <- data.frame(x1 = Euphorbiaceae$GBH,
                  y = Euphorbiaceae$Height)
plot(dtf)
lm(y ~ x1, data = dtf) |> summary()
# MLR: add a second predictor that is nearly identical to x1
dtf$x2 <- jitter(dtf$x1)
pairs(dtf)
lm(y ~ x1 + x2, data = dtf) |> summary()
# or
lm(y ~ ., data = dtf) |> summary()
# back to the crime data: Ex0 and Ex1 are highly correlated
cor(usc$Ex0, usc$Ex1)
Solution: Variance inflation factor (VIF)
\[\mathrm{VIF}_j = \frac{1}{1-R_j^2}\]
- \(\mathrm{VIF}_j\): VIF for the \(j\)-th variable
- \(R^2_j\): \(R^2\) from the regression of the \(j\)-th explanatory variable on the remaining explanatory variables.
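As a sanity check of this definition (a sketch, assuming usc is loaded), the VIF of Ex0 can be computed by hand from the \(R^2\) of regressing Ex0 on the remaining explanatory variables:

r2_ex0 <- summary(lm(Ex0 ~ . - R, data = usc))$r.squared  # R_j^2 for Ex0
1 / (1 - r2_ex0)  # VIF of Ex0, matching faraway::vif() below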
uscrimewor <- usc[, -1]   # drop the dependent variable R
faraway::vif(uscrimewor)  # VIF of each explanatory variable
Rule of thumb: a VIF greater than 10 (i.e. \(R_j^2 > 0.9\)) indicates serious multicollinearity.
Drop the variable with the highest VIF (Ex1):
uscrimewor2 <- uscrimewor[, c('Age', 'S', 'Ed', 'Ex0', 'LF', 'M', 'N', 'NW',
                              'U1', 'U2', 'W', 'X')]
Update the VIFs and the model:
faraway::vif(uscrimewor2)
crime_rate_lm2 <- lm(R ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 + U2 + W + X,
                     data = usc)
summary(crime_rate_lm2)
- All updated VIFs are below 10: no serious multicollinearity remains.
- The 12 explanatory variables account for 76.7% of the variability in the crime rates of the 47 states.
3.5 Simplify the model
3.5.1 Backward selection
- Start with all the predictors in the model.
- Remove the predictor with the highest p-value greater than \(\alpha\) (usually 0.05).
- Refit the model and repeat the previous step.
- Stop when all remaining p-values are less than \(\alpha\).
# Coefficient table sorted by decreasing p-value
get_p <- function(model){
  smry2 <- data.frame(summary(model)$coefficients)
  smry2[order(-smry2$Pr...t..), ]
}
get_p(crime_rate_lm2)
# Remove the least significant predictor, refit, and check again
crime_rate_lm3 <- update(crime_rate_lm2, . ~ . - NW)
get_p(crime_rate_lm3)
crime_rate_lm4 <- update(crime_rate_lm3, . ~ . - LF)
get_p(crime_rate_lm4)
crime_rate_lm5 <- update(crime_rate_lm4, . ~ . - N)
get_p(crime_rate_lm5)
crime_rate_lm6 <- update(crime_rate_lm5, . ~ . - S)
get_p(crime_rate_lm6)
crime_rate_lm7 <- update(crime_rate_lm6, . ~ . - M)
get_p(crime_rate_lm7)
crime_rate_lm8 <- update(crime_rate_lm7, . ~ . - U1)
get_p(crime_rate_lm8)
crime_rate_lm9 <- update(crime_rate_lm8, . ~ . - W)
get_p(crime_rate_lm9)
summary(crime_rate_lm9)
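Alternatively, the per-step p-values can be obtained in one call with drop1(), which tests the removal of every term from the current model:

drop1(crime_rate_lm2, test = "F")  # F-test p-value for dropping each term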
3.5.2 Forward selection
- Start with no predictors in the model.
- For each predictor not in the model, compute the p-value it would have if it were added to the model. Add the one with the lowest p-value below \(\alpha\) (see the sketch after this list).
- Continue until no further predictor can be added.
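Forward selection is not coded in this handout; a minimal sketch with the base-R function add1() could look like this:

null_lm <- lm(R ~ 1, data = usc)  # start from the intercept-only model
add1(null_lm, scope = ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW +
       U1 + U2 + W + X, test = "F")
# Add the predictor with the lowest p-value below alpha, refit, and repeat.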
3.5.3 AIC selection
Akaike information criterion (AIC): big = bad, i.e. the smaller, the better.
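For reference, for a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\),
\[\mathrm{AIC} = 2k - 2\ln\hat{L}\]
The \(2k\) term penalizes model complexity, and step() searches for the model with the lowest AIC: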
crime_rate_aic <- step(crime_rate_lm, direction = "both")  # stepwise search in both directions
summary(crime_rate_aic)
3.6 Results
According to the backward selection:
\[ \operatorname{\widehat{R}} = -524.37 + 1.02(\operatorname{Age}) + 2.03(\operatorname{Ed}) + 1.23(\operatorname{Ex0}) + 0.91(\operatorname{U2}) + 0.63(\operatorname{X}) \]
|                     | Dependent variable: R  |
|---------------------|------------------------|
| Age                 | 1.020*** (0.353)       |
| Ed                  | 2.031*** (0.474)       |
| Ex0                 | 1.233*** (0.142)       |
| U2                  | 0.914** (0.434)        |
| X                   | 0.635*** (0.147)       |
| Constant            | -524.374*** (95.116)   |
| Observations        | 47                     |
| R2                  | 0.730                  |
| Adjusted R2         | 0.697                  |
| Residual Std. Error | 21.301 (df = 41)       |
| F Statistic         | 22.129*** (df = 5; 41) |

Note: *p<0.1; **p<0.05; ***p<0.01; standard errors in parentheses.
The model shows that the variables Age, Ed, Ex0, U2, and X explain 73% of the variability in the crime rates.
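The coefficients in the equation above can be reproduced from the final model object of the backward selection:

round(coef(crime_rate_lm9), 2)  # intercept and slopes of the final model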
According to the AIC selection:
\[ \operatorname{\widehat{R}} = -618.5 + 1.13(\operatorname{Age}) + 1.82(\operatorname{Ed}) + 1.05(\operatorname{Ex0}) + 0.83(\operatorname{U2}) + 0.16(\operatorname{W}) + 0.82(\operatorname{X}) \]
|                     | Dependent variable: R  |
|---------------------|------------------------|
| Age                 | 1.125*** (0.351)       |
| Ed                  | 1.818*** (0.480)       |
| Ex0                 | 1.051*** (0.175)       |
| U2                  | 0.828* (0.427)         |
| W                   | 0.160* (0.094)         |
| X                   | 0.824*** (0.181)       |
| Constant            | -618.503*** (108.246)  |
| Observations        | 47                     |
| R2                  | 0.748                  |
| Adjusted R2         | 0.710                  |
| Residual Std. Error | 20.827 (df = 40)       |
| F Statistic         | 19.771*** (df = 6; 40) |

Note: *p<0.1; **p<0.05; ***p<0.01; standard errors in parentheses.
The model shows that the variables Age, Ed, Ex0, U2, W, and X explain 74.8% of the variability in the crime rates.
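The two final models can also be compared directly by their AIC values (lower is better; a sketch, assuming both model objects are still in the workspace):

AIC(crime_rate_lm9, crime_rate_aic)  # backward-selection vs. stepwise model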
4 Further readings
STAT 501, Pennsylvania State University