Multiple Linear Regression

Numerical vs. multiple numerical

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

  1. Carry out multiple linear regression with R functions.
  2. Visualize the multiple linear regression model.
  3. Simplify the model step by step.

2 Principle

2.1 Definition

Linear regression:
A statistical method that models the quantitative relationship between one or more predictor variables and one dependent variable.
Simple linear regression:
only one predictor variable.
Multiple linear regression:
two or more predictor variables.

Selection of predictor variables:

  1. The predictor variables must have a significant influence on the dependent variable and show a close linear correlation with it;
  2. The predictor variables should be mutually independent: the correlation among the predictor variables should not be stronger than the correlation between the predictor variables and the dependent variable (see the sketch after this list);
  3. The predictor variables should have complete statistical data, and their values should be easy to determine for prediction.
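
A minimal sketch of criterion 2, using the built-in mtcars data as an illustrative stand-in (not the course data set): compare the correlation among candidate predictors with their correlations with the response.

cor(mtcars$disp, mtcars$wt)   # correlation between two predictors: ~ 0.89
cor(mtcars$disp, mtcars$mpg)  # each predictor vs. the response: ~ -0.85
cor(mtcars$wt, mtcars$mpg)    # ~ -0.87
# disp and wt correlate more strongly with each other than either does
# with mpg, so using both as predictors would violate criterion 2.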

2.2 Model

\[y_i= \beta_0 + \beta_1X_{i1} + ... + \beta_kX_{ik} + \varepsilon_i \]

  • \(X\): predictor variables
  • \(y\): dependent variable
  • \(\beta_0, \beta_1, ..., \beta_k\): regression coefficients
  • \(\varepsilon\): effect of other random factors

Example: house price (H) as a function of location (L), area (A), and surrounding facilities (S):

\[H= \beta_0 + \beta_1 L + \beta_2A + \beta_3S + \varepsilon \]
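
A minimal sketch of this model on simulated data (all variable values and coefficients below are made up for illustration):

# simulate a house-price data set from the model above (hypothetical numbers)
set.seed(1)
n <- 200
L <- runif(n, 0, 10)      # location score
A <- runif(n, 50, 200)    # area
S <- runif(n, 0, 5)       # surrounding-facility score
H <- 100 + 20 * L + 3 * A + 15 * S + rnorm(n, sd = 30)
house_lm <- lm(H ~ L + A + S)
summary(house_lm)         # estimates should be close to 20, 3, and 15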

2.3 Workflow

  1. Fit the model
  2. Diagnose the model
  3. Simplify the model

3 Example

3.1 Objective

Use 13 explanatory variables to predict the crime rate (A Course in Statistics with R, Chapter 12):

  • DV: crime rate
  • IV: 13 explanatory variables

data(usc, package = "ACSWR")

3.2 Visualization

# scatterplot matrix of all pairwise relationships
pairs(usc)
GGally::ggpairs(usc)
# correlation matrix, printed and visualized
ucor <- cor(usc)
ucor
corrplot::corrplot(ucor, order = "AOE", type = "upper", method = "number")
corrplot::corrplot(ucor, order = "AOE", type = "lower", method = "ellipse",
                   diag = FALSE, tl.pos = "n", cl.pos = "n", add = TRUE)

  • A strong relationship between Ex0 and Ex1: a sign of multicollinearity.

3.3 Fit the model

# full model with all 13 explanatory variables
crime_rate_lm <- lm(R ~ Age + S + Ed + Ex0 + Ex1 + LF + M + N + NW + U1 + U2 +
                      W + X, data = usc)
# or, equivalently, regress R on all remaining columns
crime_rate_lm <- lm(R ~ ., data = usc)
summary(crime_rate_lm)
confint(crime_rate_lm)  # 95% confidence intervals of the coefficients
anova(crime_rate_lm)

  • The intercept term, Age, Ed, U2, and X are significant in explaining the crime rate. The 95% confidence intervals confirm this.
  • The model is significant: \(p < 0.05\), adjusted \(R^2 = 0.6783\).

The difference between \(R^2\) and the adjusted \(R^2\): \(R^2\) never decreases as predictors are added, while the adjusted \(R^2\) penalizes the number of predictors.

# fit R against the first (x - 1) predictor columns and extract an R^2 statistic
get_R2 <- function(x, method){
  round(summary(lm(usc$R ~ as.matrix(usc[, 2:x])))[[method]], 3)
}
# R^2 and adjusted R^2 for models with 1 to 13 predictors
dtf_R2 <- data.frame(n = 1:13,
                     R2 = sapply(2:14, get_R2, method = 'r.squared'),
                     AdjR2 = sapply(2:14, get_R2, method = 'adj.r.squared'))

library(ggplot2)
library(tidyr)
dtf_R2 |>
  pivot_longer(-n) |>
  ggplot() +
  geom_point(aes(n, value, col = name)) +
  geom_line(aes(n, value, col = name))
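
The penalty can be made explicit. For \(n\) observations and \(k\) predictors:

\[R^2_{\mathrm{adj}} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\]

Adding a predictor raises \(R^2_{\mathrm{adj}}\) only if the gain in \(R^2\) outweighs the loss of one degree of freedom.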

3.4 Diagnose the model

Multicollinearity

  • multi-: many
  • col-: together
  • linearity: linear relationships

i.e. several predictor variables that are strongly linearly related to one another.

Problems:

  1. Imprecise estimates of \(\beta\)
  2. The t-tests may fail to reveal significant factors
  3. The importance of predictors can be masked
# SLR: tree height explained by girth at breast height (GBH)
data(Euphorbiaceae, package = 'gpk')
dtf <- data.frame(x1 = Euphorbiaceae$GBH,
                  y = Euphorbiaceae$Height)
plot(dtf)
lm(y ~ x1, data = dtf) |> summary()
# MLR: add a second predictor that is nearly identical to x1
dtf$x2 <- jitter(dtf$x1)
pairs(dtf)
lm(y ~ x1 + x2, data = dtf) |> summary()
# or
lm(y ~ ., data = dtf) |> summary()
# back to the crime data: Ex0 and Ex1 are highly correlated
cor(usc$Ex0, usc$Ex1)

Solution: the variance inflation factor (VIF)

\[\mathrm{VIF}_j = \frac{1}{1-R_j^2}\]

  • \(\mathrm{VIF}_j\): VIF for the \(j\)-th explanatory variable
  • \(R^2_j\): \(R^2\) from the regression of the \(j\)-th explanatory variable on the remaining explanatory variables

uscrimewor <- usc[, -1]  # drop the response R; keep the 13 predictors
faraway::vif(uscrimewor)

Rule of thumb: a predictor with VIF > 10 (i.e. \(R_j^2 > 0.9\)) indicates serious multicollinearity.
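
As a sanity check, a VIF can be reproduced directly from the definition above. A minimal sketch for Ex0 (any other predictor works the same way):

# regress Ex0 on the remaining predictors and plug its R^2 into 1 / (1 - R^2)
r2_ex0 <- summary(lm(Ex0 ~ ., data = uscrimewor))$r.squared
1 / (1 - r2_ex0)  # should match faraway::vif(uscrimewor)["Ex0"]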

Drop the variable with the highest VIF (Ex1):

uscrimewor2 <- uscrimewor[, c('Age','S','Ed','Ex0','LF','M','N','NW','U1','U2','W','X')]

Update the VIFs and the model:

faraway::vif(uscrimewor2)
crime_rate_lm2 <- lm(R ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 + U2 + W + X,
                     data = usc)
summary(crime_rate_lm2)

  • All updated VIFs < 10: no serious multicollinearity remains.
  • The 12 explanatory variables account for 76.7% of the variability in the crime rates of the 47 states.

3.5 Simplify the model

3.5.1 Backward selection

  1. Start with all the predictors in the model.
  2. Remove the predictor with the highest p-value greater than \(\alpha\) (usually 0.05).
  3. Refit the model and go to step 2.
  4. Stop when all p-values are less than \(\alpha\).

# sort the coefficient table by p-value, largest first
# ('Pr...t..' is data.frame()'s munged name for the 'Pr(>|t|)' column)
get_p <- function(model){
  smry2 <- data.frame(summary(model)$coefficients)
  smry2[order(-smry2$Pr...t..), ]
}
get_p(crime_rate_lm2)
# drop one predictor at a time, refit, and re-check the p-values
crime_rate_lm3 <- update(crime_rate_lm2, . ~ . - NW)
get_p(crime_rate_lm3)
crime_rate_lm4 <- update(crime_rate_lm3, . ~ . - LF)
get_p(crime_rate_lm4)
crime_rate_lm5 <- update(crime_rate_lm4, . ~ . - N)
get_p(crime_rate_lm5)
crime_rate_lm6 <- update(crime_rate_lm5, . ~ . - S)
get_p(crime_rate_lm6)
crime_rate_lm7 <- update(crime_rate_lm6, . ~ . - M)
get_p(crime_rate_lm7)
crime_rate_lm8 <- update(crime_rate_lm7, . ~ . - U1)
get_p(crime_rate_lm8)
crime_rate_lm9 <- update(crime_rate_lm8, . ~ . - W)
get_p(crime_rate_lm9)
summary(crime_rate_lm9)

3.5.2 Forward selection

  1. Start with no predictors in the model.
  2. For each predictor not yet in the model, compute the p-value it would have if added. Add the one with the lowest p-value, provided that p-value is less than \(\alpha\).
  3. Repeat until no new predictor can be added (see the sketch below).
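
The code in this handout only implements backward selection; a minimal sketch of one forward step with base R's add1() could look like this (which variable enters first depends on the actual p-values):

# start from the intercept-only model and test every candidate predictor
fwd0 <- lm(R ~ 1, data = usc)
add1(fwd0, scope = ~ Age + S + Ed + Ex0 + LF + M + N + NW + U1 + U2 + W + X,
     test = "F")  # one F-test p-value per candidate
# suppose Ex0 gives the lowest p-value: add it, then rerun add1() on fwd1
fwd1 <- update(fwd0, . ~ . + Ex0)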

3.5.3 AIC selection

Akaike Information Criterion (AIC): the smaller, the better.
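
For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\):

\[\mathrm{AIC} = 2k - 2\ln\hat{L}\]

AIC rewards goodness of fit but penalizes complexity. For linear models, step() uses the equivalent form \(n\ln(\mathrm{RSS}/n) + 2k\) (up to an additive constant).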

crime_rate_aic <- step(crime_rate_lm, direction = "both")  # stepwise search minimizing AIC
summary(crime_rate_aic)

3.6 Results

According to the backward selection:

\[ \operatorname{\widehat{R}} = -524.37 + 1.02(\operatorname{Age}) + 2.03(\operatorname{Ed}) + 1.23(\operatorname{Ex0}) + 0.91(\operatorname{U2}) + 0.63(\operatorname{X}) \]

                       Dependent variable: R
                    ------------------------------
Age                     1.020***    (0.353)
Ed                      2.031***    (0.474)
Ex0                     1.233***    (0.142)
U2                      0.914**     (0.434)
X                       0.635***    (0.147)
Constant             -524.374***   (95.116)
                    ------------------------------
Observations           47
R2                      0.730
Adjusted R2             0.697
Residual Std. Error    21.301 (df = 41)
F Statistic            22.129*** (df = 5; 41)
                    ------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

The model shows that the variables Age, Ed, Ex0, U2, and X explain 73% of the variability in the crime rates.
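
A fitted model like this can also be used for prediction. A minimal sketch, comparing the fitted and observed crime rates for the first state:

predict(crime_rate_lm9, newdata = usc[1, ])  # fitted crime rate
usc$R[1]                                     # observed crime rate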

According to the AIC selection:

\[ \operatorname{\widehat{R}} = -618.5 + 1.13(\operatorname{Age}) + 1.82(\operatorname{Ed}) + 1.05(\operatorname{Ex0}) + 0.83(\operatorname{U2}) + 0.16(\operatorname{W}) + 0.82(\operatorname{X}) \]

                       Dependent variable: R
                    ------------------------------
Age                     1.125***    (0.351)
Ed                      1.818***    (0.480)
Ex0                     1.051***    (0.175)
U2                      0.828*      (0.427)
W                       0.160*      (0.094)
X                       0.824***    (0.181)
Constant             -618.503***  (108.246)
                    ------------------------------
Observations           47
R2                      0.748
Adjusted R2             0.710
Residual Std. Error    20.827 (df = 40)
F Statistic            19.771*** (df = 6; 40)
                    ------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

The model shows that the variables Age, Ed, Ex0, U2, W, and X explain 75% of the variability in the crime rates.

4 Further readings