Tests for distributions

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objective

Understand the relationship between the R distribution functions.
Understand the Kolmogorov-Smirnov distribution and the Kolmogorov-Smirnov test statistic.
Use the Kolmogorov-Smirnov test for one-sample and two-sample hypothesis tests.

2 Distributions and their R functions

Distribution	`p-`	`q-`	`d-`	`r-`
Beta	`pbeta`	`qbeta`	`dbeta`	`rbeta`
Binomial	`pbinom`	`qbinom`	`dbinom`	`rbinom`
Cauchy	`pcauchy`	`qcauchy`	`dcauchy`	`rcauchy`
Chi-Square	`pchisq`	`qchisq`	`dchisq`	`rchisq`
Exponential)	`pexp`	`qexp`	`dexp`	`rexp`
F	`pf`	`qf`	`df`	`rf`
Gamma	`pgamma`	`qgamma`	`dgamma`	`rgamma`
Geometric	`pgeom`	`qgeom`	`dgeom`	`rgeom`
Hypergeometrictm	`phyper`	`qhyper`	`dhyper`	`rhyper`
Logistic	`plogis`	`qlogis`	`dlogis`	`rlogis`
Log Normal	`plnorm`	`qlnorm`	`dlnorm`	`rlnorm`
Negative Binomial	`pnbinom`	`qnbinom`	`dnbinom`	`rnbinom`
Normal	`pnorm`	`qnorm`	`dnorm`	`rnorm`
Poisson	`ppois`	`qpois`	`dpois`	`rpois`
Student t	`pt`	`qt`	`dt`	`rt`
Studentized Range	`ptukey`	`qtukey`	`dtukey`	`rtukey`
Uniform	`punif`	`qunif`	`dunif`	`runif`
Weibull	`pweibull`	`qweibull`	`dweibull`	`rweibull`
Wilcoxon Rank Sum Statistic	`pwilcox`	`qwilcox`	`dwilcox`	`rwilcox`
Wilcoxon Signed Rank Statistic	`psignrank`	`qsignrank`	`dsignrank`	`rsignrank`

Distribution functions in R: prefix + root_name

d for “density”, the density function (PDF).
p for “probability”, the cumulative distribution function (CDF).
q for “quantile” or “critical”, the inverse of CDF.
r for “random”, a random variable having the specified distribution

Distribution	Parameters	Expression	Mean	Variance
Bernoulli trial	\(p\)	\(p\)	\(p\)	\(pq\)
Binomial	\(n, p\)	\(C(n, x) p^x q^{n-x}\)	\(np\)	\(npq\)
Poisson	\(\lambda\)	\(e^{-\lambda} \lambda^x / x !\)	\(\lambda\)	\(\lambda\)
Normal	\(\mu, \sigma\)	\(\frac{1}{\sqrt{2 \pi \sigma}} e^{-(x-\mu)^2 / 2 \sigma^2}\)	\(\mu\)	\(\sigma^2\)
Std. Normal	\(z\)	\(\frac{1}{\sqrt{2 \pi}} e^{-z^2 / 2}\)	0	1
t	\(v\)	\(\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v \pi} \Gamma\left(\frac{v}{2}\right)}\left(1+\frac{x^2}{v}\right)^{-\frac{v+1}{2}}\)	0	\(\left\{\begin{array}{cc}v /(v-2) & \text { for } v>2 \\ \infty & \text { for } 1<v \leq 2 \\ \text { undefined } & \text { otherwise }\end{array}\right.\)

3 Empirical distribution function (EDF or eCDF)

3.1 Definition

Given an observed random sample \({X_{1}}\), \({X_{2}}\), …, \({X_{n}}\), an empirical distribution function \({F_{n}}(X)\) is the fraction of sample observations less than or equal to the value x. More specifically, if \({y_{1}}\)<\({y_{2}}\)<..<\({y_{n}}\) are the order statistics of the observed random sample, with no two observations being equal, then the empirical distribution function is defined as:

\(F_{n}(x)= \begin{cases}0, & \text { for } x<y_{1} \\ k / n, & \text { for } y_{k} \leqslant x<y_{k+1}, k=1,2, \ldots, n-1 \\ 1, & \text { for } x \geqslant y_{n}\end{cases}\)

Commonly also called empirical Cumulative Distribution Function.

3.2 Example: normal distribution

Code

par(mfrow = c(2, 2))
x1 <- rnorm(10)
x2 <- rnorm(20)
x3 <- rnorm(50)
x4 <- rnorm(100)
plot(ecdf(x1))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x2))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x3))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x4))
curve(pnorm, col = 'red', add = TRUE)

4 Kolmogorov-Smirnov test

4.1 Principle

Kolmogorov-Smirnov distribution

Code

# Install the kolmin package locally.
library(kolmim)
pvalue <- pkolm(d = 0.1, n = 30)
nks <- c(10, 20, 30, 50, 80, 100)
x <- seq(0.01, 1, 0.01)
y <- sapply(x, pkolm, n = nks[1])
plot(x, y, type = 'l', las = 1, xlab = 'D', ylab = 'CDF')
for (i in 2:length(nks)) lines(x, sapply(x, pkolm, n = nks[i]), col = i)

critical values for D for one sample (\(n\le40\)):

Table of one-sample Kolmogorov-Smirnov test statistics D

Asymptotic critical values for D for one sample (small sample sizes):

\(\alpha\)	0.2	0.1	0.05	0.02	0.01
Critical \(D\)	\(1.07\sqrt{\frac{1}{n}}\)	\(1.14\cdot\sqrt{\frac{1}{n}}\)	\(1.22\cdot\sqrt{\frac{1}{n}}\)	\(1.52\cdot\sqrt{\frac{1}{n}}\)	\(1.63 \cdot\sqrt{\frac{1}{n}}\)

critical values for D for two samples (\(n\le40\)):

Table of two-sample Kolmogorov-Smirnov test statistics D

Asymptotic critical values for D for one sample (large samle sizes):

\(\alpha\)	0.2	0.1	0.05	0.02	0.01
Critical \(D\)	\(1.07\sqrt{\frac{m+n}{n}}\)	\(1.14\cdot\sqrt{\frac{m+n}{n}}\)	\(1.22\cdot\sqrt{\frac{m+n}{n}}\)	\(1.52\cdot\sqrt{\frac{m+n}{n}}\)	\(1.63 \cdot\sqrt{\frac{m+n}{n}}\)

4.2 One-sample test

Hypothesis:

\(H_{0}: F(x)=F_{0}(x)\) The CDF of x is a given reference distribution function \(F_0(x)\).

Test statistic:

\(D_{n}=\sup _{x}\left[\left|F_{n}(x)-F_{0}(x)\right|\right]\)

\(D_{n}\): the supremum distance between the empirical distribution function of given sample and the cumulative distribution function of the given reference distribution.

Example:

x0 <- c(108, 112, 117, 130, 111, 131, 113, 113, 105, 128)
x0_mean <- mean(x0)
x0_sd <- sd(x0)
eCDF <- ecdf(x0)

# cumulative distribution function F(x) of the reference distribution
CDF <- pnorm(x0, x0_mean, x0_sd)

# create a data frame to put values into
df <- data.frame(data = x0, eCDF = eCDF(x0), CDF = CDF)

# visualization
library(ggplot2)
ggplot(df, aes(data)) +
  stat_ecdf(size=1,aes(colour = "Empirical CDF (Fn(x))")) +
  stat_function(fun = pnorm, args = list(x0_mean, x0_sd), aes(colour = "Theoretical CDF (F(x))")) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  scale_y_continuous(breaks=seq(0, 1, by = 0.2))+
  theme(legend.title=element_blank())

# sort values of sample observations and remove duplicates
x <- unique(sort(x0))

# Calculate D
Daft <- abs(eCDF(x) - pnorm(x, x0_mean, x0_sd))
Dbef <- abs(c(0, eCDF(x)[-length(x)]) - pnorm(x, x0_mean, x0_sd))
D_score <- max(c(Daft, Dbef))

One step in R:

ks.test(x0,'pnorm', x0_mean, x0_sd)


    Asymptotic one-sample Kolmogorov-Smirnov test

data:  x0
D = 0.25621, p-value = 0.5276
alternative hypothesis: two-sided

We cannot reject \(H_0\) that the data are normally distributed at \(\alpha = 0.05\). Conclusion: The data follows a normal distribution

4.3 Two-sample test

Hypothesis:

\(H_0\): Two populations follow a common probability distribution.

Test statistic:

\(D_{n,m}=\sup _{x}\left[\left|F_{1, n}(x)-F_{2, m}(x)\right|\right]\)

Example:

sample1 <- c(165, 168, 172, 177, 180, 182)
sample2 <- c(163, 167, 169, 175, 175, 179, 183, 185)
mean1 <- mean(sample1)
mean2 <- mean(sample2)
sd1 <- sd(sample1)
sd2 <- sd(sample2)

# empirical distribution function Fn(x)
eCDF1 <- ecdf(sample1)
eCDF1(sample1)

[1] 0.1666667 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000

# empirical distribution function Fm(x)
eCDF2 <- ecdf(sample2)
eCDF2(sample2)

[1] 0.125 0.250 0.375 0.625 0.625 0.750 0.875 1.000

# visualization
group2 <- c(rep("sample1", length(sample1)), rep("sample2", length(sample2)))
df2 <- data.frame(all = c(sample1,sample2), group = group2)
library(ggplot2)
ggplot(df2, aes(x = all, group = group, color = group2)) +
  stat_ecdf(size=1) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  theme(legend.title=element_blank())

# merge, sort observations of two samples, and remove duplicates 
x2 <- unique(sort(c(sample1,sample2)))

# calculate D and its location
D2 <- max(abs(eCDF1(x2) - eCDF2(x2)))
idxD2 <- which.max(abs(eCDF1(x2) - eCDF2(x2))) # the index of x-axis value
xD2 <- x2[idxD2] # corresponding x-axis value

One step in R:

ks.test(sample1, sample2)


    Exact two-sample Kolmogorov-Smirnov test

data:  sample1 and sample2
D = 0.25, p-value = 0.9281
alternative hypothesis: two-sided

We cannot reject the null hypothesis at \(\alpha = 0.05\). We can conclude that sample 1 and sample 2 come from the same distributions.

5 Further readings

The R book. Chapter 8.10.

6 Exercises

Based on the data set state.x77, dose the murder rate follow a normal distribution?
Based on the iris dataset, does the sepal width of versicolor follow the same probability distribution as virginica?