Tests for distributions

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

  1. Understand the relationship between the R distribution functions.
  2. Understand the Kolmogorov-Smirnov distribution and the Kolmogorov-Smirnov test statistic.
  3. Use the Kolmogorov-Smirnov test for one-sample and two-sample hypothesis tests.

2 Distributions and their R functions

Distribution p- q- d- r-
Beta pbeta qbeta dbeta rbeta
Binomial pbinom qbinom dbinom rbinom
Cauchy pcauchy qcauchy dcauchy rcauchy
Chi-Square pchisq qchisq dchisq rchisq
Exponential pexp qexp dexp rexp
F pf qf df rf
Gamma pgamma qgamma dgamma rgamma
Geometric pgeom qgeom dgeom rgeom
Hypergeometric phyper qhyper dhyper rhyper
Logistic plogis qlogis dlogis rlogis
Log Normal plnorm qlnorm dlnorm rlnorm
Negative Binomial pnbinom qnbinom dnbinom rnbinom
Normal pnorm qnorm dnorm rnorm
Poisson ppois qpois dpois rpois
Student t pt qt dt rt
Studentized Range ptukey qtukey dtukey rtukey
Uniform punif qunif dunif runif
Weibull pweibull qweibull dweibull rweibull
Wilcoxon Rank Sum Statistic pwilcox qwilcox dwilcox rwilcox
Wilcoxon Signed Rank Statistic psignrank qsignrank dsignrank rsignrank

Distribution functions in R: prefix + root_name

  • d for “density”, the probability density function (PDF).
  • p for “probability”, the cumulative distribution function (CDF).
  • q for “quantile” or “critical”, the inverse of the CDF (the quantile function).
  • r for “random”, random numbers drawn from the specified distribution.
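
For example, the four prefixes applied to the normal distribution (a minimal sketch using base R only):

dnorm(1.96)   # d: density of N(0, 1) at x = 1.96
pnorm(1.96)   # p: CDF, P(X <= 1.96), approximately 0.975
qnorm(0.975)  # q: inverse CDF, returns approximately 1.96
set.seed(1)
rnorm(3)      # r: three random draws from N(0, 1)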

Distribution Parameters Expression Mean Variance
Bernoulli trial \(p\) \(p\) \(p\) \(pq\)
Binomial \(n, p\) \(C(n, x) p^x q^{n-x}\) \(np\) \(npq\)
Poisson \(\lambda\) \(e^{-\lambda} \lambda^x / x!\) \(\lambda\) \(\lambda\)
Normal \(\mu, \sigma\) \(\frac{1}{\sigma\sqrt{2 \pi}} e^{-(x-\mu)^2 / (2 \sigma^2)}\) \(\mu\) \(\sigma^2\)
Std. Normal (none) \(\frac{1}{\sqrt{2 \pi}} e^{-z^2 / 2}\) 0 1
t \(v\) \(\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v \pi} \Gamma\left(\frac{v}{2}\right)}\left(1+\frac{x^2}{v}\right)^{-\frac{v+1}{2}}\) 0 \(\left\{\begin{array}{cc}v /(v-2) & \text { for } v>2 \\ \infty & \text { for } 1<v \leq 2 \\ \text { undefined } & \text { otherwise }\end{array}\right.\)

Here \(q = 1 - p\) and \(C(n, x)\) is the binomial coefficient.
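
These means and variances can be checked by simulation (a quick sketch; the sample size and parameters below are arbitrary choices):

# check E(X) = np and Var(X) = npq for the binomial distribution
set.seed(1)
n <- 20; p <- 0.3
x <- rbinom(1e5, size = n, prob = p)
c(simulated = mean(x), theoretical = n * p)           # mean
c(simulated = var(x), theoretical = n * p * (1 - p))  # variance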

3 Empirical distribution function (EDF or eCDF)

3.1 Definition

Given an observed random sample \(X_{1}, X_{2}, \ldots, X_{n}\), the empirical distribution function \(F_{n}(x)\) is the fraction of sample observations less than or equal to the value \(x\). More specifically, if \(y_{1}<y_{2}<\cdots<y_{n}\) are the order statistics of the observed random sample, with no two observations being equal, then the empirical distribution function is defined as:

\(F_{n}(x)= \begin{cases}0, & \text { for } x<y_{1} \\ k / n, & \text { for } y_{k} \leqslant x<y_{k+1}, k=1,2, \ldots, n-1 \\ 1, & \text { for } x \geqslant y_{n}\end{cases}\)

It is also commonly called the empirical cumulative distribution function (eCDF).
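
The definition can be applied directly. A minimal sketch (with made-up data and distinct observations, as in the definition) that agrees with R’s built-in ecdf():

# Fn(x) = (number of observations <= x) / n, for a scalar x
obs <- c(2.1, 3.5, 1.8, 4.2, 2.9)
Fn <- function(x) mean(obs <= x)
Fn(3)         # 3 of the 5 observations are <= 3, so Fn(3) = 0.6
ecdf(obs)(3)  # the built-in eCDF returns the same value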

3.2 Example: normal distribution

Code
# eCDFs of N(0, 1) samples of size 10, 20, 50, 100, with the true CDF in red
par(mfrow = c(2, 2))
x1 <- rnorm(10)
x2 <- rnorm(20)
x3 <- rnorm(50)
x4 <- rnorm(100)
plot(ecdf(x1))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x2))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x3))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x4))
curve(pnorm, col = 'red', add = TRUE)

4 Kolmogorov-Smirnov test

4.1 Principle

Under \(H_0\), the Kolmogorov-Smirnov test statistic \(D\) follows the Kolmogorov-Smirnov distribution. The code below plots its CDF for several sample sizes \(n\):

Code
# The kolmim package provides pkolm() for the Kolmogorov-Smirnov distribution;
# install it first if needed.
library(kolmim)
# P(Dn <= 0.1) for n = 30; the p-value of an observed D = 0.1 is 1 minus this
p_D <- pkolm(d = 0.1, n = 30)
nks <- c(10, 20, 30, 50, 80, 100)
x <- seq(0.01, 1, 0.01)
y <- sapply(x, pkolm, n = nks[1])
plot(x, y, type = 'l', las = 1, xlab = 'D', ylab = 'CDF')
for (i in 2:length(nks)) lines(x, sapply(x, pkolm, n = nks[i]), col = i)

Critical values of \(D\) for one sample (\(n\le40\)):

Table of one-sample Kolmogorov-Smirnov test statistics D

Asymptotic critical values of \(D\) for one sample (large sample sizes, \(n>40\)):

\(\alpha\) 0.20 0.10 0.05 0.02 0.01
Critical \(D\) \(1.07\sqrt{\frac{1}{n}}\) \(1.22\sqrt{\frac{1}{n}}\) \(1.36\sqrt{\frac{1}{n}}\) \(1.52\sqrt{\frac{1}{n}}\) \(1.63\sqrt{\frac{1}{n}}\)
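
As a quick check (a sketch reusing pkolm() from the kolmim package loaded above), the asymptotic critical value at \(\alpha = 0.05\) can be compared with the value obtained by inverting the CDF of \(D\) numerically:

# asymptotic critical value at alpha = 0.05 for n = 30
n <- 30
D_asym <- 1.36 * sqrt(1 / n)
# exact critical value: the d satisfying P(Dn <= d) = 0.95
D_exact <- uniroot(function(d) pkolm(d, n = n) - 0.95, interval = c(0.01, 1))$root
c(asymptotic = D_asym, exact = D_exact)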

Critical values of \(D\) for two samples (\(n\le40\)):

Table of two-sample Kolmogorov-Smirnov test statistics D

Asymptotic critical values of \(D\) for two samples (large sample sizes):

\(\alpha\) 0.20 0.10 0.05 0.02 0.01
Critical \(D\) \(1.07\sqrt{\frac{m+n}{mn}}\) \(1.22\sqrt{\frac{m+n}{mn}}\) \(1.36\sqrt{\frac{m+n}{mn}}\) \(1.52\sqrt{\frac{m+n}{mn}}\) \(1.63\sqrt{\frac{m+n}{mn}}\)
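
For example (a sketch using the sample sizes of the two-sample example in Section 4.3), the asymptotic two-sample critical value at \(\alpha = 0.05\):

# asymptotic two-sample critical value at alpha = 0.05 for n = 6, m = 8
n <- 6; m <- 8
D_crit <- 1.36 * sqrt((m + n) / (m * n))
D_crit  # an observed D above this value would reject H0 at alpha = 0.05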

4.2 One-sample test

Hypothesis:

\(H_{0}: F(x)=F_{0}(x)\): the CDF of the sampled population equals a given reference distribution function \(F_0(x)\).

Test statistic:

\(D_{n}=\sup _{x}\left[\left|F_{n}(x)-F_{0}(x)\right|\right]\)

\(D_{n}\): the supremum distance between the empirical distribution function of the given sample and the cumulative distribution function of the given reference distribution.

Example:

x0 <- c(108, 112, 117, 130, 111, 131, 113, 113, 105, 128)
x0_mean <- mean(x0)
x0_sd <- sd(x0)
eCDF <- ecdf(x0)

# cumulative distribution function F(x) of the reference distribution
CDF <- pnorm(x0, x0_mean, x0_sd)

# create a data frame to put values into
df <- data.frame(data = x0, eCDF = eCDF(x0), CDF = CDF)

# visualization
library(ggplot2)
ggplot(df, aes(data)) +
  stat_ecdf(size = 1, aes(colour = "Empirical CDF (Fn(x))")) +
  stat_function(fun = pnorm, args = list(x0_mean, x0_sd),
                aes(colour = "Theoretical CDF (F(x))")) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2)) +
  theme(legend.title = element_blank())

# sort values of sample observations and remove duplicates
x <- unique(sort(x0))

# Calculate D: the eCDF jumps at each observation, so the largest distance to
# the reference CDF occurs either at an observation (Daft) or just before one (Dbef)
Daft <- abs(eCDF(x) - pnorm(x, x0_mean, x0_sd))
Dbef <- abs(c(0, eCDF(x)[-length(x)]) - pnorm(x, x0_mean, x0_sd))
D_score <- max(c(Daft, Dbef))  # matches the D reported by ks.test() below

One step in R:

ks.test(x0, 'pnorm', x0_mean, x0_sd)

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  x0
D = 0.25621, p-value = 0.5276
alternative hypothesis: two-sided

We cannot reject \(H_0\) at \(\alpha = 0.05\) (p-value = 0.5276 > 0.05). Conclusion: the data are consistent with a normal distribution.

4.3 Two-sample test

Hypothesis:

\(H_0\): \(F_{1}(x)=F_{2}(x)\), i.e. the two populations follow a common probability distribution.

Test statistic:

\(D_{n,m}=\sup _{x}\left[\left|F_{1, n}(x)-F_{2, m}(x)\right|\right]\)

Example:

sample1 <- c(165, 168, 172, 177, 180, 182)
sample2 <- c(163, 167, 169, 175, 175, 179, 183, 185)

# empirical distribution function Fn(x)
eCDF1 <- ecdf(sample1)
eCDF1(sample1)
[1] 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000
# empirical distribution function Fm(x)
eCDF2 <- ecdf(sample2)
eCDF2(sample2)
[1] 0.125 0.250 0.375 0.625 0.625 0.750 0.875 1.000
# visualization
group2 <- c(rep("sample1", length(sample1)), rep("sample2", length(sample2)))
df2 <- data.frame(all = c(sample1, sample2), group = group2)
library(ggplot2)
ggplot(df2, aes(x = all, group = group, colour = group)) +
  stat_ecdf(size = 1) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  theme(legend.title = element_blank())

# merge, sort observations of two samples, and remove duplicates 
x2 <- unique(sort(c(sample1,sample2)))

# calculate D and its location
D2 <- max(abs(eCDF1(x2) - eCDF2(x2)))
idxD2 <- which.max(abs(eCDF1(x2) - eCDF2(x2))) # the index of x-axis value
xD2 <- x2[idxD2] # corresponding x-axis value

One step in R:

ks.test(sample1, sample2)

    Exact two-sample Kolmogorov-Smirnov test

data:  sample1 and sample2
D = 0.25, p-value = 0.9281
alternative hypothesis: two-sided

We cannot reject the null hypothesis at \(\alpha = 0.05\) (p-value = 0.9281 > 0.05). Conclusion: sample 1 and sample 2 are consistent with a common distribution.

5 Further readings

  • The R Book, chapter 8.10.

6 Exercises

  1. Based on the data set state.x77, does the murder rate follow a normal distribution?
  2. Based on the iris dataset, does the sepal width of versicolor follow the same probability distribution as virginica?