Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
Understand the relationship between the R distribution functions.
Understand the Kolmogorov-Smirnov distribution and the Kolmogorov-Smirnov test statistic.
Use the Kolmogorov-Smirnov test for one-sample and two-sample hypothesis tests.
2 Distributions and their R functions
Distribution                     p-          q-          d-          r-
Beta                             pbeta       qbeta       dbeta       rbeta
Binomial                         pbinom      qbinom      dbinom      rbinom
Cauchy                           pcauchy     qcauchy     dcauchy     rcauchy
Chi-Square                       pchisq      qchisq      dchisq      rchisq
Exponential                      pexp        qexp        dexp        rexp
F                                pf          qf          df          rf
Gamma                            pgamma      qgamma      dgamma      rgamma
Geometric                        pgeom       qgeom       dgeom       rgeom
Hypergeometric                   phyper      qhyper      dhyper      rhyper
Logistic                         plogis      qlogis      dlogis      rlogis
Log Normal                       plnorm      qlnorm      dlnorm      rlnorm
Negative Binomial                pnbinom     qnbinom     dnbinom     rnbinom
Normal                           pnorm       qnorm       dnorm       rnorm
Poisson                          ppois       qpois       dpois       rpois
Student t                        pt          qt          dt          rt
Studentized Range                ptukey      qtukey      dtukey      rtukey
Uniform                          punif       qunif       dunif       runif
Weibull                          pweibull    qweibull    dweibull    rweibull
Wilcoxon Rank Sum Statistic      pwilcox     qwilcox     dwilcox     rwilcox
Wilcoxon Signed Rank Statistic   psignrank   qsignrank   dsignrank   rsignrank
Distribution functions in R follow the pattern prefix + root_name:
d for "density": the probability density (or mass) function (PDF/PMF).
p for "probability": the cumulative distribution function (CDF).
q for "quantile" or "critical": the inverse of the CDF.
r for "random": random number generation from the specified distribution.
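The four families can be illustrated with the normal distribution; all four functions below are in base R:

```r
# d: density of N(0, 1) at x = 0, i.e. 1/sqrt(2*pi)
dnorm(0)          # 0.3989423
# p: cumulative probability P(X <= 1.96)
pnorm(1.96)       # 0.9750021
# q: the quantile function, the inverse of pnorm
qnorm(0.975)      # approximately 1.96
# r: three random draws from N(0, 1)
set.seed(1)
rnorm(3)
```

Note that q- and p- are inverses of each other: `qnorm(pnorm(x))` returns `x`.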
Distribution      Parameters          Mean      Variance
Bernoulli trial   p                   p         p(1 - p)
Binomial          n, p                np        np(1 - p)
Poisson           lambda              lambda    lambda
Normal            mu, sigma           mu        sigma^2
Std. Normal       mu = 0, sigma = 1   0         1
Student t         df = nu             0         nu / (nu - 2), for nu > 2
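The moments in the table can be checked by simulation with the r- functions. A small sketch for the binomial distribution (sample size and parameters chosen here for illustration):

```r
# Simulate Binomial(n = 20, p = 0.3) and compare the sample moments
# with the theoretical mean np = 6 and variance np(1 - p) = 4.2
set.seed(42)
x <- rbinom(1e5, size = 20, prob = 0.3)
mean(x)   # close to 6
var(x)    # close to 4.2
```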
3 Empirical distribution function (EDF or eCDF)
3.1 Definition
Given an observed random sample x1, x2, …, xn, the empirical distribution function Fn(x) is the fraction of sample observations less than or equal to the value x. More specifically, if x(1) < x(2) < … < x(n) are the order statistics of the observed random sample, with no two observations being equal, then the empirical distribution function is defined as:

Fn(x) = 0     for x < x(1)
Fn(x) = k/n   for x(k) <= x < x(k+1), k = 1, …, n - 1
Fn(x) = 1     for x >= x(n)
Commonly also called empirical Cumulative Distribution Function.
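The definition can be checked directly in R: `ecdf()` returns a step function, and Fn(t) equals the fraction of observations less than or equal to t. A small sketch (the data vector is made up for illustration):

```r
x  <- c(3, 1, 4, 1, 5)    # illustrative data, n = 5
Fn <- ecdf(x)             # the empirical CDF as a step function
# fraction of observations <= t, computed by hand for a few values of t
manual <- sapply(c(1, 3, 5), function(t) mean(x <= t))
manual            # 0.4 0.6 1.0
Fn(c(1, 3, 5))    # the same values
```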
3.2 Example: normal distribution
Code
par(mfrow = c(2, 2))
x1 <- rnorm(10)
x2 <- rnorm(20)
x3 <- rnorm(50)
x4 <- rnorm(100)
plot(ecdf(x1))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x2))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x3))
curve(pnorm, col = 'red', add = TRUE)
plot(ecdf(x4))
curve(pnorm, col = 'red', add = TRUE)
4 Kolmogorov-Smirnov test
4.1 Principle
Kolmogorov-Smirnov distribution
Code
# Install the kolmim package locally.
library(kolmim)
pvalue <- pkolm(d = 0.1, n = 30)
nks <- c(10, 20, 30, 50, 80, 100)
x <- seq(0.01, 1, 0.01)
y <- sapply(x, pkolm, n = nks[1])
plot(x, y, type = 'l', las = 1, xlab = 'D', ylab = 'CDF')
for (i in 2:length(nks)) lines(x, sapply(x, pkolm, n = nks[i]), col = i)
Critical values of D for the one-sample test (significance level alpha):

Table of one-sample Kolmogorov-Smirnov test statistics D

For large n, the critical value of D is approximately c(alpha) / sqrt(n), with:

alpha      0.20    0.10    0.05    0.02    0.01
c(alpha)   1.073   1.224   1.358   1.517   1.628

For small sample sizes, exact tabulated critical values should be used instead.
Critical values of D for the two-sample test (significance level alpha):

Table of two-sample Kolmogorov-Smirnov test statistics D

For large sample sizes n1 and n2, the critical value of D is approximately c(alpha) * sqrt((n1 + n2) / (n1 * n2)), with:

alpha      0.20    0.10    0.05    0.02    0.01
c(alpha)   1.073   1.224   1.358   1.517   1.628
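The asymptotic constants c(alpha), used as c(alpha)/sqrt(n) for one sample and c(alpha) * sqrt((n1 + n2)/(n1 * n2)) for two samples, come from the closed form c(alpha) = sqrt(-ln(alpha/2) / 2), so the tabulated values can be reproduced directly:

```r
alpha   <- c(0.20, 0.10, 0.05, 0.02, 0.01)
c_alpha <- sqrt(-log(alpha / 2) / 2)
round(c_alpha, 3)                              # 1.073 1.224 1.358 1.517 1.628
# one-sample critical D for, e.g., n = 30
round(c_alpha / sqrt(30), 3)
# two-sample critical D for, e.g., n1 = n2 = 30
round(c_alpha * sqrt((30 + 30) / (30 * 30)), 3)
```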
4.2 One-sample test
Hypothesis:
The CDF of x equals a given reference distribution function F(x).
Test statistic:
D = sup_x |Fn(x) - F(x)|: the supremum distance between the empirical distribution function Fn(x) of the given sample and the cumulative distribution function F(x) of the given reference distribution.
Example:
x0 <- c(108, 112, 117, 130, 111, 131, 113, 113, 105, 128)
x0_mean <- mean(x0)
x0_sd <- sd(x0)
eCDF <- ecdf(x0)
# cumulative distribution function F(x) of the reference distribution
CDF <- pnorm(x0, x0_mean, x0_sd)
# create a data frame to put values into
df <- data.frame(data = x0, eCDF = eCDF(x0), CDF = CDF)
# visualization
library(ggplot2)
ggplot(df, aes(data)) +
  stat_ecdf(size = 1, aes(colour = "Empirical CDF (Fn(x))")) +
  stat_function(fun = pnorm, args = list(x0_mean, x0_sd),
                aes(colour = "Theoretical CDF (F(x))")) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.2)) +
  theme(legend.title = element_blank())
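To finish the one-sample example, D can be computed from the order statistics, comparing F(x(i)) with both i/n and (i - 1)/n because Fn jumps at each observation, and `ks.test()` performs the whole test in one step. Note that estimating the mean and sd from the same data makes the reported p-value only approximate (a Lilliefors-type correction would be more accurate):

```r
x0 <- c(108, 112, 117, 130, 111, 131, 113, 113, 105, 128)
n  <- length(x0)
xs <- sort(x0)
F0 <- pnorm(xs, mean(x0), sd(x0))
# Fn jumps at each observation, so check the gap on both sides of each step
D  <- max(pmax(seq_len(n) / n - F0, F0 - (seq_len(n) - 1) / n))
D
# the same test in one step (the p-value is approximate: the parameters
# were estimated from the data, and x0 contains a tie)
ks.test(x0, "pnorm", mean(x0), sd(x0))
```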
4.3 Two-sample test

Example:

# sample1 and sample2 are the two observed samples (their definitions are not shown here)
# visualization
group2 <- c(rep("sample1", length(sample1)), rep("sample2", length(sample2)))
df2 <- data.frame(all = c(sample1, sample2), group = group2)
library(ggplot2)
ggplot(df2, aes(x = all, group = group, color = group)) +
  stat_ecdf(size = 1) +
  xlab("Sample data") +
  ylab("Cumulative probability") +
  theme(legend.title = element_blank())
# empirical CDFs of the two samples
eCDF1 <- ecdf(sample1)
eCDF2 <- ecdf(sample2)
# merge, sort observations of the two samples, and remove duplicates
x2 <- unique(sort(c(sample1, sample2)))
# calculate D and its location
D2 <- max(abs(eCDF1(x2) - eCDF2(x2)))
idxD2 <- which.max(abs(eCDF1(x2) - eCDF2(x2)))  # the index of the x-axis value
xD2 <- x2[idxD2]                                # corresponding x-axis value
One step in R:
ks.test(sample1, sample2)
Exact two-sample Kolmogorov-Smirnov test
data: sample1 and sample2
D = 0.25, p-value = 0.9281
alternative hypothesis: two-sided
We cannot reject the null hypothesis at alpha = 0.05: the data provide no evidence that sample 1 and sample 2 come from different distributions. (Note that failing to reject does not prove the two distributions are identical.)
5 Further readings
The R Book, Chapter 8.10.
6 Exercises
Based on the data set state.x77, does the murder rate follow a normal distribution?
Based on the iris dataset, does the sepal width of versicolor follow the same probability distribution as virginica?