Poisson Regression
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
In this lecture, you will:
1. Understand the definition and scope of Poisson regression
2. Learn to code Poisson regression in R
3. Study real-life applications of the Poisson regression model through examples
2 Principle
2.1 Definition
Poisson distribution:
- A discrete probability distribution.
- Gives the probability that an event occurs a given number of times in a fixed interval of space or time.
- Arises as the limit of the binomial distribution when the number of trials tends to infinity and the probability of the event in each trial tends to zero, while their product stays equal to \(\lambda\).
Probability mass function (PMF) of the Poisson distribution (the mean is written as \(\lambda\) instead of \(\mu\)):
\[P\left( X=k \right)=\frac{\lambda^{k}}{k!}e^{-\lambda}\]
- \(k=0,1,2,...\)
- \(\lambda >0\)
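The link to the binomial distribution can be written out as a short sketch. Assume \(X\) is binomial with \(n\) trials and success probability \(p=\lambda/n\) (the symbols \(n\) and \(p\) are introduced here only for illustration). For a fixed \(k\):
\[P\left( X=k \right)=\binom{n}{k}\left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k}=\frac{\lambda^{k}}{k!}\cdot\frac{n(n-1)\cdots(n-k+1)}{n^{k}}\cdot\left(1-\frac{\lambda}{n}\right)^{n}\cdot\left(1-\frac{\lambda}{n}\right)^{-k}\]
As \(n\to\infty\), the second and fourth factors tend to 1 and the third tends to \(e^{-\lambda}\), which leaves \(\frac{\lambda^{k}}{k!}e^{-\lambda}\).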
Features for Poisson distribution:
- \(\mu=\sigma^2=\lambda\)
- If \(Y_1 \sim \text{Poisson}(\lambda_1)\) and \(Y_2 \sim \text{Poisson}(\lambda_2)\) are independent, then \(Y = Y_1 + Y_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)\).
- As \(\lambda\) increases, the center of the distribution moves to the right, and the Poisson distribution approaches the normal distribution.
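The PMF and the features listed above can be checked numerically. The following is a minimal sketch in base R; the values of \(\lambda\), the sample size, and the variable names are arbitrary choices made for illustration.

```r
# Poisson PMF written out by hand versus R's built-in dpois()
lambda <- 3
k <- 0:10
pmf_manual <- lambda^k / factorial(k) * exp(-lambda)
pmf_builtin <- dpois(k, lambda)
print(all.equal(pmf_manual, pmf_builtin))

# Simulation: the sample mean and variance should both be close to lambda
set.seed(1)
y <- rpois(1e5, lambda)
print(c(mean = mean(y), variance = var(y)))

# Additivity: the sum of independent Poisson(2) and Poisson(5) draws
# should have mean and variance close to 2 + 5 = 7
y1 <- rpois(1e5, 2)
y2 <- rpois(1e5, 5)
y_sum <- y1 + y2
print(c(mean = mean(y_sum), variance = var(y_sum)))
```

Increasing `lambda` and plotting `dpois(k, lambda)` over a wider range of `k` is a quick way to see the distribution become more symmetric.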
Poisson regression:
- Named after the French mathematician Siméon Denis Poisson (1838).
- Regression analysis for count response variables or contingency tables.
- Also called log-linear regression.
- Assumes that the dependent variable follows a Poisson distribution.
- Assumes that the logarithm of the expected value of the dependent variable is a linear function of the independent variables.
Assumptions:
- The occurrence of an event is completely random.
- The occurrence of an event is independent.
- The probability \(p\) of the occurrence of an event remains unchanged.
Examples:
- Count of cancers per million people
- Count of particles emitted by a radioactive substance per unit time
- Count of people who run red lights in an hour
- Count of insects per unit area of land
2.2 Model
The general mathematical equation of Poisson regression is
\[\log\left(E(Y)\right)=a + b_1x_1 + b_2x_2 + \cdots + b_nx_n\]
- \(Y\): response variable (a count)
- \(a\) and \(b_1, \ldots, b_n\): numerical coefficients (intercept and slopes)
- \(x_1, \ldots, x_n\): predictor variables
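Because the model is linear on the log scale, each predictor acts multiplicatively on the expected count: increasing \(x_i\) by one unit multiplies the expected value of \(Y\) by \(e^{b_i}\). The sketch below illustrates this with simulated data. The "true" coefficients and the predictors `x1` and `x2` are hypothetical values chosen only for the demonstration.

```r
set.seed(42)
n <- 2000
x1 <- runif(n, 0, 2)
x2 <- rnorm(n)

# Hypothetical "true" coefficients used to generate the data
a <- 0.5
b1 <- 0.8
b2 <- -0.3

# Expected count on the original scale, then Poisson-distributed counts
mu <- exp(a + b1 * x1 + b2 * x2)
y <- rpois(n, mu)

# Fit the Poisson regression and compare estimates with the true values
fit <- glm(y ~ x1 + x2, family = poisson())
est <- coef(fit)
print(round(est, 2))

# Exponentiated coefficients: multiplicative change in the expected count
print(round(exp(est), 2))
```

The estimates should land close to 0.5, 0.8, and -0.3, although the exact values depend on the random seed.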
3 Workflow
3.1 Data
Plant species on the Galápagos Islands:
library(faraway)
data(gala)
Goal: model the count of plant species as a function of the geographic variables.
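Before fitting, it can help to look at the data. The following is a minimal sketch, assuming the `faraway` package is installed; comparing the mean and variance of the raw counts gives a first, informal hint about whether the Poisson assumption \(\mu=\sigma^2\) is plausible.

```r
library(faraway)  # provides the gala data set
data(gala)

# Structure of the data: one row per island
str(gala)

# Distribution of the response variable
print(summary(gala$Species))

# Informal check: for Poisson data the mean and variance are similar
print(c(mean = mean(gala$Species), variance = var(gala$Species)))
```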
3.2 Fit model
# Fit a Poisson regression with all candidate predictors
fit.gala <- glm(
  Species ~ Endemics + Area + Elevation + Nearest + Scruz + Adjacent,
  data = gala,
  family = poisson()
)
summary(fit.gala)
- Significant variables: Endemics, Area, and Nearest.
3.3 Simplify the model
# Refit using only the significant predictors
fit.gala.reduced <- glm(Species ~ Endemics + Area + Nearest,
                        data = gala,
                        family = poisson())
summary(fit.gala.reduced)
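One common way to check whether the dropped predictors matter is a likelihood-ratio test between the two nested models. This is offered as a sketch of a standard technique, not as a step taken in the lecture; it assumes the `faraway` package is installed and refits both models so that the block runs on its own.

```r
library(faraway)
data(gala)

fit.gala <- glm(
  Species ~ Endemics + Area + Elevation + Nearest + Scruz + Adjacent,
  data = gala,
  family = poisson()
)
fit.gala.reduced <- glm(Species ~ Endemics + Area + Nearest,
                        data = gala,
                        family = poisson())

# Likelihood-ratio (deviance) test: does the full model fit significantly better?
lrt <- anova(fit.gala.reduced, fit.gala, test = "Chisq")
print(lrt)

# AIC comparison: lower values indicate a better trade-off of fit and complexity
print(AIC(fit.gala, fit.gala.reduced))
```

A large p-value in the `anova()` output would support keeping the simpler model.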
3.4 Results
exp(coef(fit.gala.reduced))
\[ \log ({ \widehat{E( \operatorname{Species} )} }) = 2.868 + 0.036(\operatorname{Endemics}) - 0.00005(\operatorname{Area}) + 0.009(\operatorname{Nearest}) \]
| | Dependent variable: Species |
| --- | --- |
| Endemics | 0.036*** (0.001) |
| Area | -0.00005*** (0.00002) |
| Nearest | 0.009*** (0.001) |
| Constant | 2.868*** (0.049) |
| Observations | 30 |
| Log Likelihood | -245.834 |
| Akaike Inf. Crit. | 499.668 |

Note: standard errors in parentheses; *p<0.1; **p<0.05; ***p<0.01
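The coefficients in the table are on the log scale, which is why `exp(coef(fit.gala.reduced))` is used to interpret them. The sketch below repeats that step using the rounded estimates from the table, so it runs without the data; the vector name `coefs` is introduced here for illustration. For example, \(e^{0.036}\approx 1.037\), so each additional endemic species is associated with an expected species count about 3.7% higher, with the other predictors held fixed.

```r
# Rounded estimates copied from the results table (log scale)
coefs <- c(Constant = 2.868, Endemics = 0.036, Area = -0.00005, Nearest = 0.009)

# Multiplicative effect on the expected count per one-unit increase
rate_ratios <- exp(coefs)
print(round(rate_ratios, 5))

# The same effects expressed as percentage changes (intercept excluded)
pct_change <- (rate_ratios - 1) * 100
print(round(pct_change[-1], 3))
```

Because the table values are rounded, these numbers can differ slightly from the output of `exp(coef(fit.gala.reduced))`.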
3.5 Overdispersion test
library(qcc)
qcc.overdispersion.test(gala$Species, type = "poisson")
The \(p\)-value is smaller than 0.05, so the species counts are overdispersed: their variance is larger than a Poisson distribution allows.
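The test above looks at the raw counts only. A complementary, model-based check is the Pearson dispersion statistic of the fitted Poisson model: the sum of squared Pearson residuals divided by the residual degrees of freedom. This is a sketch of a common diagnostic, not part of the original workflow, and it assumes the `faraway` package is installed.

```r
library(faraway)
data(gala)

fit.gala.reduced <- glm(Species ~ Endemics + Area + Nearest,
                        data = gala,
                        family = poisson())

# Pearson chi-squared statistic and residual degrees of freedom
pearson_chisq <- sum(residuals(fit.gala.reduced, type = "pearson")^2)
resid_df <- df.residual(fit.gala.reduced)

# Dispersion estimate: values well above 1 suggest overdispersion
dispersion <- pearson_chisq / resid_df
print(dispersion)

# Rough upper-tail p-value, treating the statistic as chi-squared distributed
p_value <- pchisq(pearson_chisq, df = resid_df, lower.tail = FALSE)
print(p_value)
```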
# Refit with the quasi-Poisson family to account for overdispersion
fit.gala.reduced.od <- glm(Species ~ Endemics + Area + Nearest,
                           data = gala,
                           family = quasipoisson())
summary(fit.gala.reduced.od)
After accounting for overdispersion, Area is no longer significant, whereas Endemics and Nearest remain significant.
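The change in significance comes from the standard errors, not from the estimates. The quasi-Poisson fit keeps the same coefficients as the Poisson fit and multiplies every standard error by the square root of the estimated dispersion. A minimal sketch that makes this visible, assuming the `faraway` package is installed (the names `fit_p`, `fit_q`, and `phi` are introduced here for illustration):

```r
library(faraway)
data(gala)

fit_p <- glm(Species ~ Endemics + Area + Nearest,
             data = gala, family = poisson())
fit_q <- glm(Species ~ Endemics + Area + Nearest,
             data = gala, family = quasipoisson())

# The coefficient estimates are identical in both fits
print(all.equal(coef(fit_p), coef(fit_q)))

# Estimated dispersion parameter of the quasi-Poisson fit
phi <- summary(fit_q)$dispersion

# Standard errors side by side; the ratio equals sqrt(phi) for every term
se_p <- coef(summary(fit_p))[, "Std. Error"]
se_q <- coef(summary(fit_q))[, "Std. Error"]
print(cbind(se_p, se_q, ratio = se_q / se_p, sqrt_phi = sqrt(phi)))
```

Wider standard errors give smaller test statistics, which is how a predictor with a very small effect, such as Area, can lose significance.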