Multiple Linear Regression

Numerical vs. multiple numerical

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

  1. Carry out multiple linear regression with R functions.
  2. Visualize the multiple linear regression model.
  3. Simplify the model step by step.

2 Principle

2.1 Definition

Linear regression:
A statistical method that models the quantitative relationship between one or more predictor variables and one dependent variable.
Simple linear regression:
only one predictor variable.
Multiple linear regression:
two or more predictor variables.

Selection of predictor variables:

  1. The predictor variables must have a significant influence on the dependent variable and show a close linear correlation with it;
  2. The predictor variables should be mutually independent: the correlation among the predictor variables should not be stronger than the correlation between the predictor variables and the dependent variable (see the sketch after this list);
  3. The predictor variables should have complete statistical data, and their values should be easy to determine for prediction.
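
A minimal sketch of criterion 2, using the built-in mtcars data as an illustrative stand-in (not the course data set): compare the correlation among candidate predictors with their correlations with the response.

cor(mtcars$disp, mtcars$wt)   # correlation between two predictors: ~ 0.89
cor(mtcars$disp, mtcars$mpg)  # each predictor vs. the response: ~ -0.85
cor(mtcars$wt, mtcars$mpg)    # ~ -0.87
# disp and wt correlate more strongly with each other than either does
# with mpg, so using both as predictors would violate criterion 2.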

2.2 Model

\[y_i= \beta_0 + \beta_1X_{i1} + ... + \beta_kX_{ik} + \varepsilon_i \]

  • \(X\): predictor variables
  • \(y\): dependent variable
  • \(\beta_0, \beta_1, ..., \beta_k\): regression coefficients
  • \(\varepsilon\): effect of other random factors

Example: house price (H) as a function of location (L), area (A), and surrounding facilities (S):

\[H= \beta_0 + \beta_1 L + \beta_2A + \beta_3S + \varepsilon \]
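
A minimal sketch of this model on simulated data (all variable values and coefficients below are made up for illustration):

# simulate a house-price data set from the model above (hypothetical numbers)
set.seed(1)
n <- 200
L <- runif(n, 0, 10)      # location score
A <- runif(n, 50, 200)    # area
S <- runif(n, 0, 5)       # surrounding-facility score
H <- 100 + 20 * L + 3 * A + 15 * S + rnorm(n, sd = 30)
house_lm <- lm(H ~ L + A + S)
summary(house_lm)         # estimates should be close to 20, 3, and 15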

2.3 Workflow

  1. Fit the model
  2. Diagnose the model
  3. Simplify the model

3 Example

3.1 Objective

Use 13 explanatory variables to predict the crime rate (A Course in Statistics with R, Chapter 12):

  • DV: crime rate
  • IV: 13 explanatory variables

data(usc, package = "ACSWR")

3.2 Visualization

# scatterplot matrix of all pairwise relationships
pairs(usc)
GGally::ggpairs(usc)
# correlation matrix, printed and visualized
ucor <- cor(usc)
ucor
corrplot::corrplot(ucor, order = "AOE", type = "upper", method = "number")
corrplot::corrplot(ucor, order = "AOE", type = "lower", method = "ellipse",
                   diag = FALSE, tl.pos = "n", cl.pos = "n", add = TRUE)

  • A strong relationship between Ex0 and Ex1: a sign of multicollinearity.

3.3 Fit the model

# full model with all 13 explanatory variables
crime_rate_lm <- lm(R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 +
                      W + X, data = usc)
# or, equivalently, regress R on all remaining columns
crime_rate_lm <- lm(R ~ ., data = usc)
summary(crime_rate_lm)
confint(crime_rate_lm)  # 95% confidence intervals of the coefficients
anova(crime_rate_lm)

  • The intercept term, Age, Ed, U2, and X are significant in explaining the crime rate. The 95% confidence intervals confirm this.
  • The model is significant: \(p < 0.05\), adjusted \(R^2 = 0.6783\).

The difference between \(R^2\) and the adjusted \(R^2\): \(R^2\) never decreases as predictors are added, while the adjusted \(R^2\) penalizes the number of predictors.

# fit R against the first (x - 1) predictor columns and extract an R^2 statistic
get_R2 <- function(x, method){
  round(summary(lm(usc$R ~ as.matrix(usc[, 2:x])))[[method]], 3)
}
# R^2 and adjusted R^2 for models with 1 to 13 predictors
dtf_R2 <- data.frame(n = 1:13,
                     R2 = sapply(2:14, get_R2, method = 'r.squared'),
                     AdjR2 = sapply(2:14, get_R2, method = 'adj.r.squared'))

library(ggplot2)
library(tidyr)
dtf_R2 |>
  pivot_longer(-n) |>
  ggplot() +
  geom_point(aes(n, value, col = name)) +
  geom_line(aes(n, value, col = name))
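
The penalty can be made explicit. For \(n\) observations and \(k\) predictors:

\[R^2_{\mathrm{adj}} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\]

Adding a predictor raises \(R^2_{\mathrm{adj}}\) only if the gain in \(R^2\) outweighs the loss of one degree of freedom.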

3.4 Diagnose the model

Multicollinearity

  • multi-: many
  • col-: together
  • linearity: linear relationships

i.e. several predictor variables that are strongly linearly related to one another.

Problems:

  1. Imprecise estimates of \(\beta\)
  2. The t-tests may fail to reveal significant factors
  3. The importance of predictors can be masked
# SLR: tree height explained by girth at breast height (GBH)
data(Euphorbiaceae, package = 'gpk')
dtf <- data.frame(x1 = Euphorbiaceae$GBH,
                  y = Euphorbiaceae$Height)
plot(dtf)
lm(y ~ x1, data = dtf) |> summary()
# MLR: add a second predictor that is nearly identical to x1
dtf$x2 <- jitter(dtf$x1)
pairs(dtf)
lm(y ~ x1 + x2, data = dtf) |> summary()
# or
lm(y ~ ., data = dtf) |> summary()
# back to the crime data: Ex0 and Ex1 are highly correlated
cor(usc$Ex0, usc$Ex1)

Solution: the variance inflation factor (VIF)

\[\mathrm{VIF}_j = \frac{1}{1-R_j^2}\]

  • \(\mathrm{VIF}_j\): VIF for the \(j\)-th explanatory variable
  • \(R^2_j\): \(R^2\) from the regression of the \(j\)-th explanatory variable on the remaining explanatory variables

uscrimewor <- usc[, -1]  # drop the response R; keep the 13 predictors
faraway::vif(uscrimewor)

Rule of thumb: a predictor with VIF > 10 (i.e. \(R_j^2 > 0.9\)) indicates serious multicollinearity.
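
As a sanity check, a VIF can be reproduced directly from the definition above. A minimal sketch for Ex0 (any other predictor works the same way):

# regress Ex0 on the remaining predictors and plug its R^2 into 1 / (1 - R^2)
r2_ex0 <- summary(lm(Ex0 ~ ., data = uscrimewor))$r.squared
1 / (1 - r2_ex0)  # should match faraway::vif(uscrimewor)["Ex0"]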

Drop the variable with the highest VIF (Ex1):

uscrimewor2 <- uscrimewor[, c('Age','S','Ed','Ex0','LF','M','N','NW','U1','U2','W','X')]

Update the VIFs and the model:

faraway::vif(uscrimewor2)
crime_rate_lm2 <- lm(R ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 + U2 + W + X,
                     data = usc)
summary(crime_rate_lm2)

  • All updated VIFs < 10: no serious multicollinearity remains.
  • The 12 explanatory variables account for 76.7% of the variability in the crime rates of the 47 states.

3.5 Simplify the model

3.5.1 Backward selection

  1. Start with all the predictors in the model.
  2. Remove the predictor with the highest p-value greater than \(\alpha\) (usually 0.05).
  3. Refit the model and go to step 2.
  4. Stop when all p-values are less than \(\alpha\).

# sort the coefficient table by p-value, largest first
# ('Pr...t..' is data.frame()'s munged name for the 'Pr(>|t|)' column)
get_p <- function(model){
  smry2 <- data.frame(summary(model)$coefficients)
  smry2[order(-smry2$Pr...t..), ]
}
get_p(crime_rate_lm2)
# drop one predictor at a time, refit, and re-check the p-values
crime_rate_lm3 <- update(crime_rate_lm2, . ~ . - NW)
get_p(crime_rate_lm3)
crime_rate_lm4 <- update(crime_rate_lm3, . ~ . - LF)
get_p(crime_rate_lm4)
crime_rate_lm5 <- update(crime_rate_lm4, . ~ . - N)
get_p(crime_rate_lm5)
crime_rate_lm6 <- update(crime_rate_lm5, . ~ . - S)
get_p(crime_rate_lm6)
crime_rate_lm7 <- update(crime_rate_lm6, . ~ . - M)
get_p(crime_rate_lm7)
crime_rate_lm8 <- update(crime_rate_lm7, . ~ . - U1)
get_p(crime_rate_lm8)
crime_rate_lm9 <- update(crime_rate_lm8, . ~ . - W)
get_p(crime_rate_lm9)
summary(crime_rate_lm9)

3.5.2 Forward selection

  1. Start with no predictors in the model.
  2. For each predictor not yet in the model, compute the p-value it would have if added. Add the one with the lowest p-value, provided that p-value is less than \(\alpha\).
  3. Repeat until no new predictor can be added (see the sketch below).
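
The code in this handout only implements backward selection; a minimal sketch of one forward step with base R's add1() could look like this (which variable enters first depends on the actual p-values):

# start from the intercept-only model and test every candidate predictor
fwd0 <- lm(R ~ 1, data = usc)
add1(fwd0, scope = ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 + U2 + W + X,
     test = "F")  # one F-test p-value per candidate
# suppose Ex0 gives the lowest p-value: add it, then rerun add1() on fwd1
fwd1 <- update(fwd0, . ~ . + Ex0)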

3.5.3 AIC selection

Akaike Information Criterion (AIC): the smaller, the better.
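
For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\):

\[\mathrm{AIC} = 2k - 2\ln\hat{L}\]

AIC rewards goodness of fit but penalizes complexity. For linear models, step() uses the equivalent form \(n\ln(\mathrm{RSS}/n) + 2k\) (up to an additive constant).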

crime_rate_aic <- step(crime_rate_lm, direction = "both")  # stepwise search minimizing AIC
summary(crime_rate_aic)

3.6 Results

According to the backward selection:

\[ \operatorname{\widehat{R}} = -524.37 + 1.02(\operatorname{Age}) + 2.03(\operatorname{Ed}) + 1.23(\operatorname{Ex0}) + 0.91(\operatorname{U2}) + 0.63(\operatorname{X}) \]

                       Dependent variable: R
                    ------------------------------
Age                     1.020***    (0.353)
Ed                      2.031***    (0.474)
Ex0                     1.233***    (0.142)
U2                      0.914**     (0.434)
X                       0.635***    (0.147)
Constant             -524.374***   (95.116)
                    ------------------------------
Observations           47
R2                      0.730
Adjusted R2             0.697
Residual Std. Error    21.301 (df = 41)
F Statistic            22.129*** (df = 5; 41)
                    ------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

The model shows that the variables Age, Ed, Ex0, U2, and X explain 73% of the variability in the crime rates.
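
A fitted model like this can also be used for prediction. A minimal sketch, comparing the fitted and observed crime rates for the first state:

predict(crime_rate_lm9, newdata = usc[1, ])  # fitted crime rate
usc$R[1]                                     # observed crime rate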

According to the AIC selection:

\[ \operatorname{\widehat{R}} = -618.5 + 1.13(\operatorname{Age}) + 1.82(\operatorname{Ed}) + 1.05(\operatorname{Ex0}) + 0.83(\operatorname{U2}) + 0.16(\operatorname{W}) + 0.82(\operatorname{X}) \]

                       Dependent variable: R
                    ------------------------------
Age                     1.125***    (0.351)
Ed                      1.818***    (0.480)
Ex0                     1.051***    (0.175)
U2                      0.828*      (0.427)
W                       0.160*      (0.094)
X                       0.824***    (0.181)
Constant             -618.503***  (108.246)
                    ------------------------------
Observations           47
R2                      0.748
Adjusted R2             0.710
Residual Std. Error    20.827 (df = 40)
F Statistic            19.771*** (df = 6; 40)
                    ------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

The model shows that the variables Age, Ed, Ex0, U2, W, and X explain 75% of the variability in the crime rates.

4 Further readings