library(ggplot2)
<- iris[iris$Species %in% c('versicolor', 'setosa'), c('Species', 'Sepal.Length', 'Petal.Length')]
iris2 $SpeciesLevel <- as.integer(iris2$Species) - 1 # 0: setosa. 1:versicolor
iris2ggplot(iris2) +
geom_point(aes(Sepal.Length, SpeciesLevel), alpha = 0.2, size = 6)
Logistic Regression
Categorical Vs. numerical
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
- What the logistic regression model is used for.
- Carry out the logistic regression with R functions.
- Interpret and visualize the model output.
2 Limitation of linear model
Example: iris data
Prepare the data:
According to the sepal length of an iris flower, can we guess what species it is?
ggplot(iris2) +
geom_point(aes(Sepal.Length, SpeciesLevel), alpha = 0.2, size = 6) +
geom_smooth(aes(Sepal.Length, SpeciesLevel), method = 'lm')
<- lm(SpeciesLevel ~ Sepal.Length, data = iris2)
lm_iris2 summary(lm_iris2)
However…
par(mfrow = c(2, 2))
plot(lm_iris2)
Problem:
- Out of the range of possible y
- The residuals-vs-predicted plot idicates two parallel lines, which is very unlikely if a linear model is appropriate.
How to solve them?
$SepalBin <- round(iris2$Sepal.Length * 4) / 4
iris2<- table(iris2$SepalBin, iris2$SpeciesLevel)
iris_tbl <- data.frame(cbind(iris_tbl))
iris_p $SepalBin <- as.numeric(row.names(iris_p))
iris_p$p <- iris_p$X1 / (iris_p$X0 + iris_p$X1) # Proportion of versicolor iris_p
ggplot(iris_p) +
geom_point(aes(SepalBin, p))
How can we describe this relationship in a model?
- Sigmoid curve:
\[\frac{\ 1}{\ 1 +e^{-x}}\]
curve(1 / (1 + exp(-x)), xlim = c(-10, 10))
abline(h = 0:1, col = "royalblue4",lty = 2)
3 Binary logistic regression
3.1 Principle
3.1.1 Definition
- (Binary) Logistic regression:
- Use a logistic function to model a binary dependent variable (1/0, yes/no, true/false, win/lose, male/female) against one or multiple qualitative variables.
Examples:
- Does someone have COVID-19?
- Does someone like AVATAR 2?
- Will someone vote for a certain student president?
3.1.2 Model
\[log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_pX_p\]
- \(X_i\): The \(i\)-th predictor variable
- \(\beta_i\): The coefficient estimate for the \(i\)-th predictor variable
\[p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p}}{\ (1 + e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p})}\]
3.2 Workflow
3.2.1 Fit the model
<- glm(SpeciesLevel ~ Sepal.Length, data = iris2, family = binomial)
lg_iris summary(lg_iris)
3.2.2 Assess the model
::pR2(lg_iris)["McFadden"] pscl
<- glm(SpeciesLevel ~ Sepal.Length + Petal.Length, data = iris2, family = binomial)
lg_iris2 ::varImp(lg_iris2) caret
The greater, the more important.
3.2.3 Results
Dependent variable: | |
SpeciesLevel | |
Sepal.Length | 5.140*** |
(1.007) | |
Constant | -27.831*** |
(5.434) | |
Observations | 100 |
Log Likelihood | -32.106 |
Akaike Inf. Crit. | 68.211 |
Note: | p<0.1; p<0.05; p<0.01 |
3.2.4 Visualization
ggplot(iris2, aes(x=Sepal.Length, y=SpeciesLevel)) +
geom_point() +
geom_smooth(method="glm", method.args = list(family=binomial)) +
labs(x = 'Sepal length (cm)', y = 'Species and p')
4 Multinomial and ordinal logistic regression
4.1 Definition
- Multinomial logistic regression:
- Use a logistic function to model a nominal variable with multiple levels against one or multiple qualitative variables.
- Ordinal logistic regression
- Use a logistic function to model a ordinal variable with multiple levels against one or multiple qualitative variables.
Examples:
- Which candidate will be the student president? (Vivi, Andy or Cocoa)
- What is someone’s favourite fruit? (Orange, watermelon, apple or others)
- The condition for someone with COVID-19. (no symptoms, mild or severe)
- The rank at which someone plays video games. (bronze, silver, gold, diamond or master)
4.2 Model
4.2.1 As a Model for Odds
\[\frac{P(Y{i}=j|X{i})}{P(Y{i}=J|X{i})}=exp[\alpha_{j}+\beta_{j}x_{i}]\]
- j: level of the dependent variable. 1, …, J
- \(P(Y{i}=j|X{i})\): The probability that individual i is in level j given a value of xi.
- The J-th* level: the baseline.
For example, assume there are three three insecticides (A,B,C), study their kill rate.
Step 1. choose one of insecticides as the baseline category.
A is chosen as the baseline category.
Step 2. Calculate the odds of the other two insecticides base on baseline category.
\[\frac{P(Y{i}=B|X{i})}{P(Y{i}=A|X{i})}=exp[\alpha_{1}+\beta_{1}x_{i}]\]
and
\[\frac{P(Y{i}=C|X{i})}{P(Y{i}=A|X{i})}=exp[\alpha_{2}+\beta_{2}x_{i}]\]
Step 3. The odds of other than the baseline category can be calculated based on step 2.
\[\begin{aligned} \frac{P(Y{i}=B|X{i})}{P(Y{i}=C|X{i})}=&\frac{exp[\alpha_{1}+\beta_{1}x_{i}]}{exp[\alpha_{2}+\beta_{2}x_{i}]}\\ = &exp[(\alpha_{1}-\alpha_{2})+(\beta_{1}-\beta_{1})x_{i}] \\= &exp[\alpha_{3}+\beta_{3}x_{i}]\\\end{aligned} \]
4.2.2 As a Model of Probabilities
\[{P(Y{i}=j|X{i})}=\frac{exp[\alpha_{j}+\beta_{j}x_{i}]}{\sum_{h=1}^{J}exp[\alpha_{h}+\beta_{h}x_{i}]},\quad where \,\, j = 1,...,J\]
4.3 Workflow
4.3.1 Prepare the data
The iris dataset. Can we predict the Species according to the sepal and petal lengths and widths?
- Dependent variable: Species (a categorical variable with 3 levels)
- Independent variables: Sepal and petal lengths and widths (4 numerical variables)?
Separate the dataset into training data (80%) and test data (20%) randomly.
set.seed(20221228)
<- sample(nrow(iris), round(0.8 * nrow(iris)))
train.sub <- iris[train.sub, ]
train.data <- iris[-train.sub, ] test.data
4.3.2 Fit the model
library(nnet)
<- multinom(Species ~., data = train.data) model
summary(model)
- Only two species displayed: setosa as the baseline level.
Equations:
\[O_{ve}=\frac{P(Y_i=versicolor)}{P(Y_i=setosa)} = exp(65.05608 - 11.15638 L_s - 20.66121 W_s + 19.55547 L_p - 2.623314 W_p)\]
\[O_{vi} = \frac{P(Y_i=virginica)}{P(Y_i=setosa)} = exp(-80.97686 -30.81517 L_s -23.41275 W_s + 62.11264 L_p + 40.254391 W_p)\]
4.3.3 Validate the model
Prediction:
1, ]
test.data[<- exp(65.05608 - 11.15638 * test.data$Sepal.Length[1] - 20.66121 * test.data$Sepal.Width[1] + 19.55547 * test.data$Petal.Length[1] - 2.623314 * test.data$Petal.Width[1])
o_ve <- exp(-80.97686 - 30.81517 * test.data$Sepal.Length[1] - 23.41275 * test.data$Sepal.Width[1] + 62.11264 * test.data$Petal.Length[1] + 40.254391 * test.data$Petal.Width[1])
o_vi <- o_ve / (o_ve + o_vi + 1)
p_ve <- o_vi / (o_ve + o_vi + 1)
p_vi <- 1 - p_ve - p_vi p_se
$predicted <- predict(model, newdata = test.data)
test.datamean(test.data$predicted == test.data$Species)
table(test.data$Species, test.data$predicted)
# a diagram showing the prediction result of the first 50 samples
<- data.frame(
predicted.data1 probability.of.Species = model$fitted.values,
Species=train.data$Species)
$n <- 1:nrow(predicted.data1)
predicted.data1
library(ggplot2)
ggplot(predicted.data1, aes(n,probability.of.Species.versicolor)) +
geom_point(aes(color=Species)) +
xlab("n") +
ylab("Predicted probability of versicolor")
- The p value: Wald test
- Determine the significance of odds ratios in logistic regression model and calculate confidence interval. H0: \(\beta = 0\) (the Wald statistic is an approximate standard normal distribution)
\[z_{score} =\frac{β}{s_e}\]
<- summary(model)$coefficients/summary(model)$standard.errors
z_score <- pnorm(abs(z_score), lower.tail = FALSE) * 2 p
Confidence intervals:
\[CI = \beta \pm{z_{(1-\alpha)/2}\cdot s_e}\]
summary(model)$coefficient - qnorm(0.975) * summary(model)$standard.errors
summary(model)$coefficient + qnorm(0.975) * summary(model)$standard.errors
confint(model)
4.3.4 Simply the model
<- step(model)
model2 summary(model2)
<- summary(model2)$coefficients/summary(model2)$standard.errors
z_score <- pnorm(abs(z_score), lower.tail = FALSE) * 2 p
4.3.5 Multicollinearity
<- train.data[, 1:4]
iriswor ::vif(iriswor) faraway
Rule of thumb: VIF > 10 (\(R^2 > 0.9\)).
<- iriswor[, c('Sepal.Length', 'Sepal.Width')] iriswor2
Update the VIFs and the model:
::vif(iriswor2)
faraway<- multinom(Species ~ Sepal.Length + Sepal.Width, data = train.data)
model3 summary(model3)
<- summary(model3)$coefficients/summary(model3)$standard.errors
z_score <- pnorm(abs(z_score), lower.tail = FALSE) * 2 p
4.3.6 Results
Dependent variable: | ||
versicolor | virginica | |
(1) | (2) | |
Sepal.Length | -11.156 | -30.815 |
(143.632) | (153.331) | |
Sepal.Width | -20.661 | -23.413 |
(220.048) | (220.005) | |
Petal.Length | 19.555 | 62.113 |
(51.635) | (87.708) | |
Petal.Width | -2.623 | 40.254 |
(207.199) | (213.999) | |
Constant | 65.056 | -80.977 |
(140.936) | (155.199) | |
Akaike Inf. Crit. | 20.114 | 20.114 |
Note: | p<0.1; p<0.05; p<0.01 |