Logistic Regression

Categorical Vs. numerical

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

What the logistic regression model is used for.
Carry out the logistic regression with R functions.
Interpret and visualize the model output.

2 Limitation of linear model

Example: iris data

Prepare the data:

library(ggplot2)
iris2 <- iris[iris$Species %in% c('versicolor', 'setosa'), c('Species', 'Sepal.Length', 'Petal.Length')]
iris2$SpeciesLevel <- as.integer(iris2$Species) - 1 # 0: setosa. 1:versicolor
ggplot(iris2) +
  geom_point(aes(Sepal.Length, SpeciesLevel), alpha = 0.2, size = 6)

According to the sepal length of an iris flower, can we guess what species it is?

ggplot(iris2) +
  geom_point(aes(Sepal.Length, SpeciesLevel), alpha = 0.2, size = 6) +
  geom_smooth(aes(Sepal.Length, SpeciesLevel), method = 'lm')

lm_iris2 <- lm(SpeciesLevel ~ Sepal.Length, data = iris2)
summary(lm_iris2)

However…

par(mfrow = c(2, 2))
plot(lm_iris2)

Problem:

Out of the range of possible y
The residuals-vs-predicted plot idicates two parallel lines, which is very unlikely if a linear model is appropriate.

How to solve them?

iris2$SepalBin <- round(iris2$Sepal.Length * 4) / 4
iris_tbl <- table(iris2$SepalBin, iris2$SpeciesLevel)
iris_p <- data.frame(cbind(iris_tbl))
iris_p$SepalBin <- as.numeric(row.names(iris_p))
iris_p$p <- iris_p$X1 / (iris_p$X0 + iris_p$X1) # Proportion of versicolor

ggplot(iris_p) + 
  geom_point(aes(SepalBin, p))

How can we describe this relationship in a model?

Sigmoid curve:

\[\frac{\ 1}{\ 1 +e^{-x}}\]

curve(1 / (1 + exp(-x)), xlim = c(-10, 10))
abline(h = 0:1, col = "royalblue4",lty = 2)

3 Binary logistic regression

3.1 Principle

3.1.1 Definition

(Binary) Logistic regression:: Use a logistic function to model a binary dependent variable (1/0, yes/no, true/false, win/lose, male/female) against one or multiple qualitative variables.

Examples:

Does someone have COVID-19?
Does someone like AVATAR 2?
Will someone vote for a certain student president?

3.1.2 Model

\[log \frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_pX_p\]

\(X_i\): The \(i\)-th predictor variable
\(\beta_i\): The coefficient estimate for the \(i\)-th predictor variable

\[p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p}}{\ (1 + e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p})}\]

3.2 Workflow

3.2.1 Fit the model

lg_iris <- glm(SpeciesLevel ~ Sepal.Length, data = iris2, family = binomial) 
summary(lg_iris)

3.2.2 Assess the model

pscl::pR2(lg_iris)["McFadden"]

lg_iris2 <- glm(SpeciesLevel ~ Sepal.Length + Petal.Length, data = iris2, family = binomial)
caret::varImp(lg_iris2)

The greater, the more important.

3.2.3 Results


	Dependent variable:

	SpeciesLevel

Sepal.Length	5.140^***
	(1.007)

Constant	-27.831^***
	(5.434)


Observations	100
Log Likelihood	-32.106
Akaike Inf. Crit.	68.211

Note:	p<0.1; p<0.05; p<0.01

3.2.4 Visualization

ggplot(iris2, aes(x=Sepal.Length, y=SpeciesLevel)) + 
  geom_point() +
  geom_smooth(method="glm", method.args = list(family=binomial)) +
  labs(x = 'Sepal length (cm)', y = 'Species and p')

4 Multinomial and ordinal logistic regression

4.1 Definition

Multinomial logistic regression:: Use a logistic function to model a nominal variable with multiple levels against one or multiple qualitative variables.
Ordinal logistic regression: Use a logistic function to model a ordinal variable with multiple levels against one or multiple qualitative variables.

Examples:

Which candidate will be the student president? (Vivi, Andy or Cocoa)
What is someone’s favourite fruit? (Orange, watermelon, apple or others)
The condition for someone with COVID-19. (no symptoms, mild or severe)
The rank at which someone plays video games. (bronze, silver, gold, diamond or master)

4.2 Model

4.2.1 As a Model for Odds

\[\frac{P(Y{i}=j|X{i})}{P(Y{i}=J|X{i})}=exp[\alpha_{j}+\beta_{j}x_{i}]\]

j: level of the dependent variable. 1, …, J
\(P(Y{i}=j|X{i})\): The probability that individual i is in level j given a value of x_i.
The J-th* level: the baseline.

For example, assume there are three three insecticides (A,B,C), study their kill rate.

Step 1. choose one of insecticides as the baseline category.
A is chosen as the baseline category.

Step 2. Calculate the odds of the other two insecticides base on baseline category.

\[\frac{P(Y{i}=B|X{i})}{P(Y{i}=A|X{i})}=exp[\alpha_{1}+\beta_{1}x_{i}]\]

and

\[\frac{P(Y{i}=C|X{i})}{P(Y{i}=A|X{i})}=exp[\alpha_{2}+\beta_{2}x_{i}]\]

Step 3. The odds of other than the baseline category can be calculated based on step 2.

\[\begin{aligned} \frac{P(Y{i}=B|X{i})}{P(Y{i}=C|X{i})}=&\frac{exp[\alpha_{1}+\beta_{1}x_{i}]}{exp[\alpha_{2}+\beta_{2}x_{i}]}\\ = &exp[(\alpha_{1}-\alpha_{2})+(\beta_{1}-\beta_{1})x_{i}] \\= &exp[\alpha_{3}+\beta_{3}x_{i}]\\\end{aligned} \]

4.2.2 As a Model of Probabilities

\[{P(Y{i}=j|X{i})}=\frac{exp[\alpha_{j}+\beta_{j}x_{i}]}{\sum_{h=1}^{J}exp[\alpha_{h}+\beta_{h}x_{i}]},\quad where \,\, j = 1,...,J\]

4.3 Workflow

4.3.1 Prepare the data

The iris dataset. Can we predict the Species according to the sepal and petal lengths and widths?

Dependent variable: Species (a categorical variable with 3 levels)
Independent variables: Sepal and petal lengths and widths (4 numerical variables)?

Separate the dataset into training data (80%) and test data (20%) randomly.

set.seed(20221228)

train.sub <- sample(nrow(iris), round(0.8 * nrow(iris)))
train.data <- iris[train.sub, ]
test.data <- iris[-train.sub, ]

4.3.2 Fit the model

library(nnet)
model <- multinom(Species ~., data = train.data)

summary(model)

Only two species displayed: setosa as the baseline level.

Equations:

\[O_{ve}=\frac{P(Y_i=versicolor)}{P(Y_i=setosa)} = exp(65.05608 - 11.15638 L_s - 20.66121 W_s + 19.55547 L_p - 2.623314 W_p)\]

\[O_{vi} = \frac{P(Y_i=virginica)}{P(Y_i=setosa)} = exp(-80.97686 -30.81517 L_s -23.41275 W_s + 62.11264 L_p + 40.254391 W_p)\]

4.3.3 Validate the model

Prediction:

test.data[1, ]
o_ve <- exp(65.05608 - 11.15638 * test.data$Sepal.Length[1] - 20.66121 * test.data$Sepal.Width[1] + 19.55547 * test.data$Petal.Length[1] - 2.623314 * test.data$Petal.Width[1])
o_vi <- exp(-80.97686 - 30.81517 * test.data$Sepal.Length[1] - 23.41275 * test.data$Sepal.Width[1] + 62.11264 * test.data$Petal.Length[1] + 40.254391 * test.data$Petal.Width[1])
p_ve <- o_ve / (o_ve + o_vi + 1)
p_vi <- o_vi / (o_ve + o_vi + 1)
p_se <- 1 - p_ve - p_vi

test.data$predicted <- predict(model, newdata = test.data)
mean(test.data$predicted == test.data$Species)
table(test.data$Species, test.data$predicted)

# a diagram showing the prediction result of the first 50 samples
predicted.data1 <- data.frame(
  probability.of.Species = model$fitted.values,
  Species=train.data$Species)
predicted.data1$n <- 1:nrow(predicted.data1)

library(ggplot2)

ggplot(predicted.data1, aes(n,probability.of.Species.versicolor)) +
  geom_point(aes(color=Species)) +
  xlab("n") +
  ylab("Predicted probability of versicolor")

The p value: Wald test: Determine the significance of odds ratios in logistic regression model and calculate confidence interval. H₀: \(\beta = 0\) (the Wald statistic is an approximate standard normal distribution)

\[z_{score} =\frac{β}{s_e}\]

z_score <- summary(model)$coefficients/summary(model)$standard.errors
p <- pnorm(abs(z_score), lower.tail = FALSE) * 2

Confidence intervals:

\[CI = \beta \pm{z_{(1-\alpha)/2}\cdot s_e}\]

summary(model)$coefficient - qnorm(0.975) *  summary(model)$standard.errors
summary(model)$coefficient + qnorm(0.975) *  summary(model)$standard.errors

confint(model)

4.3.4 Simply the model

model2 <- step(model)
summary(model2)
z_score <- summary(model2)$coefficients/summary(model2)$standard.errors
p <- pnorm(abs(z_score), lower.tail = FALSE) * 2

4.3.5 Multicollinearity

iriswor <- train.data[, 1:4]
faraway::vif(iriswor)

Rule of thumb: VIF > 10 (\(R^2 > 0.9\)).

iriswor2 <- iriswor[, c('Sepal.Length', 'Sepal.Width')]

Update the VIFs and the model:

faraway::vif(iriswor2)
model3 <- multinom(Species ~ Sepal.Length + Sepal.Width, data = train.data)
summary(model3)
z_score <- summary(model3)$coefficients/summary(model3)$standard.errors
p <- pnorm(abs(z_score), lower.tail = FALSE) * 2

4.3.6 Results


	Dependent variable:

	versicolor	virginica
	(1)	(2)

Sepal.Length	-11.156	-30.815
	(143.632)	(153.331)

Sepal.Width	-20.661	-23.413
	(220.048)	(220.005)

Petal.Length	19.555	62.113
	(51.635)	(87.708)

Petal.Width	-2.623	40.254
	(207.199)	(213.999)

Constant	65.056	-80.977
	(140.936)	(155.199)


Akaike Inf. Crit.	20.114	20.114

Note:	p<0.1; p<0.05; p<0.01

5 Further readings

R Tutorials on TechVidvan:
- R Logistic Regression