Given an observed random sample \(X_{1}, X_{2}, \ldots, X_{n}\), the empirical distribution function \(F_{n}(x)\) is the fraction of sample observations less than or equal to the value \(x\). More specifically, if \(y_{1} < y_{2} < \ldots < y_{n}\) are the order statistics of the observed random sample, with no two observations being equal, then the empirical distribution function is defined as:
\(F_{n}(x)= \begin{cases}0, & \text { for } x<y_{1} \\ k / n, & \text { for } y_{k} \leqslant x<y_{k+1}, k=1,2, \ldots, n-1 \\ 1, & \text { for } x \geqslant y_{n}\end{cases}\)
This is also commonly called the empirical cumulative distribution function (eCDF).
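As a minimal illustration (my sketch, not part of the original notes), the definition above can be evaluated directly and compared with R's built-in ecdf() function:
Code
# F_n(x) is the fraction of observations <= x; compare a hand-rolled
# version with the step function returned by ecdf()
set.seed(1)
x <- rnorm(10)
Fn_manual <- function(t) mean(x <= t)   # fraction of observations <= t
Fn <- ecdf(x)                           # built-in empirical CDF
c(manual = Fn_manual(0.5), ecdf = Fn(0.5))  # both give the same value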
3.2 Example: normal distribution
Code
# eCDFs of normal samples of increasing size, compared with the theoretical N(0, 1) CDF
par(mfrow = c(2, 2))
x1 <- rnorm(10)
x2 <- rnorm(20)
x3 <- rnorm(50)
x4 <- rnorm(100)
plot(ecdf(x1))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x2))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x3))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x4))
curve(pnorm, col = 'red', add = TRUE)
4 Kolmogorov-Smirnov test
4.1 Principle
Kolmogorov-Smirnov distribution
Code
# The kolmim package must be installed first, e.g. install.packages("kolmim")
library(kolmim)
# pkolm(d, n) gives P(D_n <= d), the CDF of the one-sample K-S statistic,
# so the p-value of an observed statistic d is 1 minus this probability
pvalue <- 1 - pkolm(d = 0.1, n = 30)
# plot the CDF of D_n for several sample sizes
nks <- c(10, 20, 30, 50, 80, 100)
x <- seq(0.01, 1, 0.01)
y <- sapply(x, pkolm, n = nks[1])
plot(x, y, type = 'l', las = 1, xlab = 'D', ylab = 'CDF')
for (i in 2:length(nks)) lines(x, sapply(x, pkolm, n = nks[i]), col = i)
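To connect these finite-sample CDFs with the limiting Kolmogorov distribution, here is a small sketch (my addition, not from the original notes) of the asymptotic CDF \(K(x) = 1 - 2\sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2}x^{2}}\), to which \(P(\sqrt{n}\,D_{n} \le x)\) converges as \(n\) grows:
Code
# Asymptotic Kolmogorov CDF: P(sqrt(n) * D_n <= x) -> K(x) as n -> infinity
pkolmogorov_asym <- function(x, k_max = 100) {
  k <- 1:k_max
  1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
}
# for large n, the finite-sample CDF from kolmim is close to the asymptotic one
library(kolmim)
d <- 0.06; n <- 500
c(exact = pkolm(d, n), asymptotic = pkolmogorov_asym(sqrt(n) * d))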
Critical values of \(D\) for the one-sample test (\(n \le 40\)):
Table of one-sample Kolmogorov-Smirnov test statistics D
Asymptotic critical values of \(D\) for the one-sample test (large sample sizes, \(n > 40\)):
\(\alpha = 0.20\): critical \(D = 1.07\sqrt{\frac{1}{n}}\)
\(\alpha = 0.15\): critical \(D = 1.14\sqrt{\frac{1}{n}}\)
\(\alpha = 0.10\): critical \(D = 1.22\sqrt{\frac{1}{n}}\)
\(\alpha = 0.05\): critical \(D = 1.36\sqrt{\frac{1}{n}}\)
\(\alpha = 0.02\): critical \(D = 1.52\sqrt{\frac{1}{n}}\)
\(\alpha = 0.01\): critical \(D = 1.63\sqrt{\frac{1}{n}}\)
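As a quick check (my own sketch, not part of the original notes), these coefficients follow from the asymptotic formula \(c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}\), so the critical value is \(c(\alpha)/\sqrt{n}\):
# coefficients c(alpha) of the asymptotic one-sample critical value c(alpha) / sqrt(n)
alpha <- c(0.20, 0.15, 0.10, 0.05, 0.02, 0.01)
round(sqrt(-0.5 * log(alpha / 2)), 2)
# 1.07 1.14 1.22 1.36 1.52 1.63
# e.g. for n = 100 and alpha = 0.05, the critical D is about 1.36 / sqrt(100) = 0.136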
Critical values of \(D\) for the two-sample test (\(n \le 40\)):
Table of two-sample Kolmogorov-Smirnov test statistics D
Asymptotic critical values of \(D\) for the two-sample test (large sample sizes):
\(\alpha = 0.20\): critical \(D = 1.07\sqrt{\frac{m+n}{mn}}\)
\(\alpha = 0.15\): critical \(D = 1.14\sqrt{\frac{m+n}{mn}}\)
\(\alpha = 0.10\): critical \(D = 1.22\sqrt{\frac{m+n}{mn}}\)
\(\alpha = 0.05\): critical \(D = 1.36\sqrt{\frac{m+n}{mn}}\)
\(\alpha = 0.02\): critical \(D = 1.52\sqrt{\frac{m+n}{mn}}\)
\(\alpha = 0.01\): critical \(D = 1.63\sqrt{\frac{m+n}{mn}}\)
where \(m\) and \(n\) are the sizes of the two samples.
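A corresponding small sketch (my addition) of the asymptotic two-sample critical value \(c(\alpha)\sqrt{\frac{m+n}{mn}}\):
# asymptotic two-sample critical value: c(alpha) * sqrt((m + n) / (m * n))
ks_crit_two_sample <- function(alpha, m, n) {
  sqrt(-0.5 * log(alpha / 2)) * sqrt((m + n) / (m * n))
}
ks_crit_two_sample(0.05, m = 100, n = 80)  # reject H0 if the observed D exceeds this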
4.2 One-sample test
Hypothesis:
\(H_{0}: F(x) = F_{0}(x)\), i.e. the CDF of the sample equals a given reference distribution function \(F_{0}(x)\).
\(H_{1}: F(x) \ne F_{0}(x)\) for at least one \(x\) (two-sided alternative).
\(D_{n} = \sup_{x}\lvert F_{n}(x) - F_{0}(x)\rvert\): the supremum distance between the empirical distribution function of the sample, \(F_{n}(x)\), and the cumulative distribution function of the reference distribution, \(F_{0}(x)\).
Example:
x0 <- c(108, 112, 117, 130, 111, 131, 113, 113, 105, 128)
x0_mean <- mean(x0)
x0_sd <- sd(x0)
eCDF <- ecdf(x0)
# cumulative distribution function F(x) of the reference distribution
CDF <- pnorm(x0, x0_mean, x0_sd)
# create a data frame to put values into
df <- data.frame(data = x0, eCDF = eCDF(x0), CDF = CDF)
# visualization
library(ggplot2)
ggplot(df, aes(data)) +
  stat_ecdf(size = 1, aes(colour = "Empirical CDF (Fn(x))")) +
  stat_function(fun = pnorm, args = list(x0_mean, x0_sd), aes(colour = "Theoretical CDF (F(x))")) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2)) +
  theme(legend.title = element_blank())
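The code above only visualizes the two CDFs. A minimal sketch of actually computing \(D_{n}\) and running the one-sample test (my addition, assuming the fitted normal is used as the reference distribution) could look like this:
# D_n: largest distance between the empirical CDF and the reference CDF.
# Since the eCDF is a step function, check the gap just before and at each ordered observation.
x_sorted <- sort(x0)
n <- length(x0)
F0 <- pnorm(x_sorted, x0_mean, x0_sd)
Dn <- max(pmax(abs((1:n) / n - F0), abs((0:(n - 1)) / n - F0)))
Dn
# one step in R; x0 contains a tie (113 appears twice), so ks.test may warn about ties.
# Note: estimating the parameters from the same data makes the reported p-value only
# approximate (the Lilliefors test corrects for this).
ks.test(x0, "pnorm", x0_mean, x0_sd)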
# visualization (sample1 and sample2 are the two numeric vectors being compared)
group2 <- c(rep("sample1", length(sample1)), rep("sample2", length(sample2)))
df2 <- data.frame(all = c(sample1, sample2), group = group2)
library(ggplot2)
ggplot(df2, aes(x = all, group = group, colour = group)) +
  stat_ecdf(size = 1) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  theme(legend.title = element_blank())
# empirical CDFs of the two samples
eCDF1 <- ecdf(sample1)
eCDF2 <- ecdf(sample2)
# merge and sort the observations of the two samples, and remove duplicates
x2 <- unique(sort(c(sample1, sample2)))
# calculate D (the largest vertical distance between the two eCDFs) and its location
D2 <- max(abs(eCDF1(x2) - eCDF2(x2)))
idxD2 <- which.max(abs(eCDF1(x2) - eCDF2(x2)))  # the index of the x-axis value
xD2 <- x2[idxD2]  # corresponding x-axis value
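To relate this hand-computed \(D\) to the p-value that ks.test() reports below, here is a rough sketch (my addition) using the asymptotic Kolmogorov tail probability; ks.test() uses an exact method for small samples, so the numbers need not match exactly:
# asymptotic two-sample p-value: P(D > observed) ~ 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 t^2),
# with t = sqrt(m * n / (m + n)) * D
m <- length(sample1)
n <- length(sample2)
t_stat <- sqrt(m * n / (m + n)) * D2
k <- 1:100
p_asym <- min(1, 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * t_stat^2)))
p_asym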
One step in R:
ks.test(sample1, sample2)
Exact two-sample Kolmogorov-Smirnov test
data: sample1 and sample2
D = 0.25, p-value = 0.9281
alternative hypothesis: two-sided
We cannot reject the null hypothesis at \(\alpha = 0.05\) (p-value = 0.9281). The data are therefore consistent with sample 1 and sample 2 coming from the same distribution (note that failing to reject \(H_{0}\) does not prove that the distributions are identical).
5 Further reading
The R Book, chapter 8.10.
6 Exercises
Based on the data set state.x77, does the murder rate follow a normal distribution?
Based on the iris dataset, does the sepal width of versicolor follow the same probability distribution as that of virginica?