Poisson Regression

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

After this lecture, you will be able to:

1. Understand the definition and scope of Poisson regression

2. Code Poisson regression in R

3. Apply Poisson regression models to real-life examples

2 Principle

2.1 Definition

Poisson distribution:

A discrete probability distribution.

Gives the probability that an event occurs a given number of times within a fixed interval of space or time.

Derived as the limit of the binomial distribution when the number of trials \(n\) tends to infinity and the probability \(p\) of each occurrence tends to zero, with \(np=\lambda\) held fixed.
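
The binomial limit can be checked numerically. The sketch below uses only base R; the rate `lambda = 3` and the values of `n` are arbitrary choices for illustration, not values from the lecture.

```r
# Compare Binomial(n, lambda / n) with Poisson(lambda) as n grows.
lambda <- 3          # assumed rate, chosen for illustration
k <- 0:10            # counts at which the two distributions are compared
n_values <- c(10, 100, 1000, 10000)

# Largest absolute difference between the two probability functions
max_abs_diff <- sapply(n_values, function(n) {
  max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
})
names(max_abs_diff) <- paste0("n=", n_values)
max_abs_diff  # the gap shrinks as n increases
```

The gap between the two distributions decreases roughly in proportion to \(1/n\), which is why the Poisson distribution is a convenient approximation for rare events in many trials.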

Probability mass function (PMF) of the Poisson distribution (using \(\lambda\) instead of \(\mu\)):

\[P\left( X=k \right)=\frac{\lambda^{k}}{k!}e^{-\lambda}\]

  • \(k=0,1,2,...\)
  • \(\lambda >0\)
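
As a quick check of the formula, the sketch below evaluates it by hand and compares the result with R's built-in `dpois()`. The value `lambda = 2.5` is an arbitrary illustrative choice.

```r
lambda <- 2.5        # assumed rate, chosen for illustration
k <- 0:5

# PMF written out exactly as in the formula above
pmf_manual <- lambda^k / factorial(k) * exp(-lambda)

# Built-in Poisson PMF from base R
pmf_builtin <- dpois(k, lambda = lambda)

all.equal(pmf_manual, pmf_builtin)  # the two agree up to floating-point error
```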

Features for Poisson distribution:

  • \(\mu=\sigma^2=\lambda\)
  • If \(Y_1\) ~ Poisson(\(\lambda_1\)) and \(Y_2\) ~ Poisson(\(\lambda_2\)) are independent, then \(Y=Y_1+Y_2\) ~ Poisson(\(\lambda_1+\lambda_2\))
  • As \(\lambda\) increases, the center of the distribution moves to the right, and the Poisson distribution approaches the normal distribution.
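
These features can be illustrated by simulation. The following is a minimal sketch in base R; the rates 2 and 5, the sample size, and the seed are assumptions made for the example.

```r
set.seed(1)
n <- 1e5
y1 <- rpois(n, lambda = 2)   # illustrative rate
y2 <- rpois(n, lambda = 5)   # illustrative rate, drawn independently of y1
y <- y1 + y2

# Mean and variance are both close to lambda
c(mean = mean(y1), variance = var(y1))

# The sum behaves like Poisson(2 + 5): mean and variance both close to 7
c(mean = mean(y), variance = var(y))

# Empirical frequencies of the sum versus the Poisson(7) PMF
emp <- as.numeric(table(factor(y, levels = 0:20))) / n
max_gap <- max(abs(emp - dpois(0:20, lambda = 7)))
max_gap
```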

Poisson regression:

Named after the French mathematician Siméon Denis Poisson, who described the distribution in 1838.

Regression analysis for count response variables or contingency tables.

Log-linear regression

Assumes that the dependent variable follows a Poisson distribution.

The logarithm of the expected value of the dependent variable is linearly related to the independent variables.

Assumptions:

  • The occurrence of an event is completely random.
  • The occurrence of an event is independent.
  • The probability \(p\) of the occurrence of an event remains unchanged.

Examples:

  • Count of cancers per million people
  • Count of particles emitted by a radioactive substance per unit time
  • Count of people who run red lights in an hour
  • Count of insects per unit area of land

2.2 Model

The general mathematical equation of Poisson regression is

\[\log(E(Y)) = a + b_1x_1 + b_2x_2 + \dots + b_nx_n\]

  • \(Y\): response variable (a count); \(E(Y)\) is its expected value
  • \(a\) and \(b_1, \dots, b_n\): numerical coefficients
  • \(x_1, \dots, x_n\): predictor variables
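
To see the model equation at work, the sketch below simulates counts from a log-linear model with known coefficients and recovers them with `glm()`. The coefficient values, predictor distributions, and sample size are hypothetical and chosen only for illustration.

```r
set.seed(42)
n <- 2000
x1 <- runif(n)
x2 <- rnorm(n)

# Hypothetical "true" coefficients for the simulation
a <- 0.5
b1 <- 1.2
b2 <- -0.4

# The log of the expected count is linear in the predictors
mu <- exp(a + b1 * x1 + b2 * x2)
y <- rpois(n, lambda = mu)

# Fit the Poisson regression (the log link is the default for poisson())
fit.sim <- glm(y ~ x1 + x2, family = poisson())
round(coef(fit.sim), 2)   # estimates should lie near a, b1 and b2
```

Because the coefficients act on the log scale, exponentiating them gives multiplicative effects on the expected count.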

3 Workflow

3.1 Data

Plant species in Galapagos:

library(faraway)
data(gala)

Response: the count of plant species on each island, modelled as a function of the geographic variables.

3.2 Fit model

fit.gala <-
  glm(
    Species ~ Endemics + Area + Elevation + Nearest + Scruz + Adjacent,
    data = gala,
    family = poisson()
  )
summary(fit.gala)

  • Significant variables: Endemics, Area, and Nearest.

3.3 Simplify the model

fit.gala.reduced <-
  glm(Species ~ Endemics + Area + Nearest,
      data = gala,
      family = poisson())
summary(fit.gala.reduced)
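
One common way to check that a simplification is justified is a likelihood ratio test between the nested models. The sketch below uses simulated data so that it runs without extra packages; the data frame `dat` and its variables are hypothetical. For the Galapagos models, the analogous call would be `anova(fit.gala.reduced, fit.gala, test = "Chisq")`.

```r
set.seed(7)
n <- 200
# Simulated data: x1 and x2 affect the count, noise does not
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- rpois(n, lambda = exp(1 + 0.5 * dat$x1 - 0.3 * dat$x2))

fit.full <- glm(y ~ x1 + x2 + noise, data = dat, family = poisson())
fit.reduced <- glm(y ~ x1 + x2, data = dat, family = poisson())

# Likelihood ratio (deviance) test: a large p-value suggests the
# dropped term does not improve the fit appreciably
lrt <- anova(fit.reduced, fit.full, test = "Chisq")
lrt

# AIC comparison: the smaller value indicates the preferred model
AIC(fit.reduced, fit.full)
```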

3.4 Results

exp(coef(fit.gala.reduced))

\[ \log ({ \widehat{E( \operatorname{Species} )} }) = 2.868 + 0.036(\operatorname{Endemics}) - 0.00005(\operatorname{Area}) + 0.009(\operatorname{Nearest}) \]

Dependent variable: Species

Term                 Estimate (standard error)
Endemics             0.036*** (0.001)
Area                 -0.00005*** (0.00002)
Nearest              0.009*** (0.001)
Constant             2.868*** (0.049)

Observations         30
Log Likelihood       -245.834
Akaike Inf. Crit.    499.668

Note: *p<0.1; **p<0.05; ***p<0.01
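
The coefficients are on the log scale, so exponentiating them gives the multiplicative change in the expected species count per unit increase in each predictor. The sketch below reuses the rounded estimates reported in the results; the island `new_island` is made up purely to show how a prediction is computed.

```r
# Coefficients on the log scale, copied from the reported results
intercept <- 2.868
b <- c(Endemics = 0.036, Area = -0.00005, Nearest = 0.009)

# Multiplicative effect of a one-unit increase in each predictor
rate_ratio <- exp(b)

# The same effect expressed as a percentage change in the expected count
percent_change <- 100 * (rate_ratio - 1)
round(percent_change, 3)

# Expected species count for a hypothetical island (made-up values)
new_island <- c(Endemics = 20, Area = 100, Nearest = 5)
predicted <- exp(intercept + sum(b * new_island))
predicted
```

For example, each additional endemic species is associated with an increase of roughly 3.7% in the expected total species count, holding the other predictors fixed.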

3.5 Overdispersion test

library(qcc)
qcc.overdispersion.test(gala$Species, type = "poisson")

The \(p\)-value is smaller than 0.05, so the data are overdispersed. Refit the model with the quasi-Poisson family:

fit.gala.reduced.od <-
  glm(Species ~ Endemics + Area + Nearest,
      data = gala,
      family = quasipoisson())
summary(fit.gala.reduced.od)

After accounting for overdispersion, Area is no longer significant, whereas Endemics and Nearest remain significant.
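
Significance changes because the quasi-Poisson family keeps the coefficient estimates but inflates their standard errors by the square root of the estimated dispersion. The sketch below demonstrates this on simulated overdispersed counts in base R; the negative binomial settings and the seed are assumptions made for the example, not part of the Galapagos analysis.

```r
set.seed(123)
n <- 500
x <- rnorm(n)

# Negative binomial counts: the variance exceeds the mean (overdispersion)
y <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 1.5)

fit.p <- glm(y ~ x, family = poisson())
fit.qp <- glm(y ~ x, family = quasipoisson())

# Pearson dispersion statistic: values well above 1 indicate overdispersion
dispersion <- sum(residuals(fit.p, type = "pearson")^2) / df.residual(fit.p)
dispersion

# Coefficient estimates are identical in the two fits
all.equal(coef(fit.p), coef(fit.qp))

# Standard errors are inflated by sqrt(dispersion) in the quasi-Poisson fit
se.p <- summary(fit.p)$coefficients[, "Std. Error"]
se.qp <- summary(fit.qp)$coefficients[, "Std. Error"]
se.qp / se.p
```

Larger standard errors give larger \(p\)-values, which is how a predictor that looked significant under the Poisson fit can lose significance after the correction.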

4 Further readings