Analysis of covariance (ANCOVA)
Numercial Vs. categorical, adapted for covariate
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
- What is a covariate, and why we need ANCOVA.
- Carry out ANCOVA for a scientific question.
- Visualize the ANCOVA model.
2 Principle
2.1 Definition
- Analysis of covariance (ANCOVA):
-
ANOVA tuned with linear regression.
Dependent variable (DV): One continuous variable
Independent variables (IVs): One or multiple categorical variables, one or multiple continuous variables (covariate, CV)
Analysis for the differences among group means for a linear combination of the dependent variable after adjusted for the covariate
Test whether the independent variable(s) has a significant influence on the dependent variable, excluding the influence of the covariate (preferably highly correlated with the dependent variable)
- Covariate (CV):
-
An independent variable that is not manipulated by experimenters but still influences experimental results.
Example: The influence of the sex and age on the body weight.
2.2 Model
\[Y_{ij} = (\mu+\tau_{i})+\beta(x_{ij}-\bar{x})+\epsilon_{ij}\]
- \(Y_{ij}\): the j-th observation under the i-th categorical group
- \(\mu\): the population mean
- \(i\): groups, 1,2, …, \(r\)
- \(j\): observations, 1,2,…, \(n_i\)
- \(\tau_i\): an adjustment to the \(y\) intercept for the i-th regression line
- \(\mu + \tau_i\): the intercept for group \(i\)
- \(\beta\): the slope of the line
- \(x_{ij}\): the j-th observation of the continuous variable under the i-th group
- \(\bar x\): the global mean of the variable \(x\)
- \(\epsilon _{ij}\): the associated unobserved error
Example: The influence of the sex (S, m for male and f for female) and age (A) on the body weight (W).
\[W_m = a_m + b_m A\]
\[W_f = a_f + b_f A\] Model simplification:
3 One-way ANCOVA
3.1 Questions
Q1: Does grazing have influence on the fruit production? Are grazed plants have more fruit production than ungrazed ones?
- Dependent variable: Fruit production (continuous)
- Independent variable: Grazing (categorical)
<- read.table("data/ipomopsis.txt", header = TRUE, stringsAsFactors = TRUE)
df1 tapply(df1$Fruit,df1$Grazing, mean)
Grazed Ungrazed
67.9405 50.8805
library(ggplot2)
ggplot(df1) +
geom_boxplot(aes(Fruit, Grazing))
What hypothesis test?
t.test(Fruit ~ Grazing, data = df1, alternative = c("greater"))
How about the root diameter…
Q2: what is the influence of grazing and root diameter on the fruit production of a plant?
Dependent variable:
- fruit production (mg)
Independent variables:
- grazing (categorical: grazed or ungrazed)
- root diameter (continuous, mm, covariate)
ggplot(df1, aes(Root, Fruit))+
geom_point() +
geom_smooth(method = 'lm') +
geom_point(aes(color = Grazing)) +
geom_smooth(aes(color = Grazing), method = 'lm')
- The initial root size of each plant is different
- The grazed plants have bigger root sizes
- The regression lines have a same slope and two different intercepts
Three possible factors:
- Root size (intercept)
- Grazing (intercept)
- Interaction between root size and grazing (slope)
3.2 Maximal model
Symbol | Meaning |
---|---|
~ |
Separating DV (left) and IV (right) |
: |
Interaction effect of two factors |
* |
Main effect of the two factors and the interaction effect. f1 * f2 is equivalent to f1 + f2 + f1:f2 |
^ |
Square the sum of several terms. The main effect of these terms and the interaction between them |
. |
All variables except the DV |
# The maximal model
<- lm(Fruit ~ Grazing * Root, data = df1)
df1_ancova summary(df1_ancova)
Call:
lm(formula = Fruit ~ Grazing * Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-17.3177 -2.8320 0.1247 3.8511 17.1313
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -125.173 12.811 -9.771 1.15e-11 ***
GrazingUngrazed 30.806 16.842 1.829 0.0757 .
Root 23.240 1.531 15.182 < 2e-16 ***
GrazingUngrazed:Root 0.756 2.354 0.321 0.7500
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.831 on 36 degrees of freedom
Multiple R-squared: 0.9293, Adjusted R-squared: 0.9234
F-statistic: 157.6 on 3 and 36 DF, p-value: < 2.2e-16
# The ANOVA table for the maximal model
anova(df1_ancova)
Analysis of Variance Table
Response: Fruit
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910.4 2910.4 62.3795 2.262e-09 ***
Root 1 19148.9 19148.9 410.4201 < 2.2e-16 ***
Grazing:Root 1 4.8 4.8 0.1031 0.75
Residuals 36 1679.6 46.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# other method to see the ANOVA table
<- aov(Fruit ~ Grazing * Root, data = df1)
df1_aov summary(df1_aov)
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910 2910 62.380 2.26e-09 ***
Root 1 19149 19149 410.420 < 2e-16 ***
Grazing:Root 1 5 5 0.103 0.75
Residuals 36 1680 47
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(df1_ancova)
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910 2910 62.380 2.26e-09 ***
Root 1 19149 19149 410.420 < 2e-16 ***
Grazing:Root 1 5 5 0.103 0.75
Residuals 36 1680 47
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3.3 Minimal model
# Delete the interaction factor
<- update(df1_ancova, ~ . - Grazing:Root)
df1_ancova2 summary(df1_ancova2)
Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-17.1920 -2.8224 0.3223 3.9144 17.3290
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -127.829 9.664 -13.23 1.35e-15 ***
GrazingUngrazed 36.103 3.357 10.75 6.11e-13 ***
Root 23.560 1.149 20.51 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared: 0.9291, Adjusted R-squared: 0.9252
F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16
# Compare the simplified model with the maximal model
anova(df1_ancova, df1_ancova2)
Analysis of Variance Table
Model 1: Fruit ~ Grazing * Root
Model 2: Fruit ~ Grazing + Root
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 1679.7
2 37 1684.5 -1 -4.8122 0.1031 0.75
\(p = 0.75 > 0.05\):
- the two models have no significant difference at \(\alpha = 0.05\)
- the model simplification was justified.
# Delete the grazing factor
<- update(df1_ancova2, ~ . - Grazing)
df1_ancova3 summary(df1_ancova3)
Call:
lm(formula = Fruit ~ Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-29.3844 -10.4447 -0.7574 10.7606 23.7556
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.286 10.723 -3.850 0.000439 ***
Root 14.022 1.463 9.584 1.1e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.52 on 38 degrees of freedom
Multiple R-squared: 0.7073, Adjusted R-squared: 0.6996
F-statistic: 91.84 on 1 and 38 DF, p-value: 1.099e-11
# Compare the two models
anova(df1_ancova2, df1_ancova3)
Analysis of Variance Table
Model 1: Fruit ~ Grazing + Root
Model 2: Fruit ~ Root
Res.Df RSS Df Sum of Sq F Pr(>F)
1 37 1684.5
2 38 6948.8 -1 -5264.4 115.63 6.107e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(p < 0.05\):
- removing the Grazing factor causes a significant change to the model.
- The effect of grazing on fruit production is highly significant and needs to be retained in the model.
df1_ancova2
is the minimal adequate model.
summary(df1_ancova2)
Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)
Residuals:
Min 1Q Median 3Q Max
-17.1920 -2.8224 0.3223 3.9144 17.3290
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -127.829 9.664 -13.23 1.35e-15 ***
GrazingUngrazed 36.103 3.357 10.75 6.11e-13 ***
Root 23.560 1.149 20.51 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared: 0.9291, Adjusted R-squared: 0.9252
F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16
anova(df1_ancova2)
Analysis of Variance Table
Response: Fruit
Df Sum Sq Mean Sq F value Pr(>F)
Grazing 1 2910.4 2910.4 63.929 1.397e-09 ***
Root 1 19148.9 19148.9 420.616 < 2.2e-16 ***
Residuals 37 1684.5 45.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Grazing is associated with a 36.103 mg reduction in fruit production.
3.4 One step
Criterion: Akaike’s information criterion (AIC). The model is worse if AIC gets greater.
step(df1_ancova)
Start: AIC=157.5
Fruit ~ Grazing * Root
Df Sum of Sq RSS AIC
- Grazing:Root 1 4.8122 1684.5 155.61
<none> 1679.7 157.50
Step: AIC=155.61
Fruit ~ Grazing + Root
Df Sum of Sq RSS AIC
<none> 1684.5 155.61
- Grazing 1 5264.4 6948.8 210.30
- Root 1 19148.9 20833.4 254.22
Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)
Coefficients:
(Intercept) GrazingUngrazed Root
-127.83 36.10 23.56
3.5 Results
::extract_eq(df1_ancova2, use_coefs = TRUE) equatiomatic
\[ \operatorname{\widehat{Fruit}} = -127.83 + 36.1(\operatorname{Grazing}_{\operatorname{Ungrazed}}) + 23.56(\operatorname{Root}) \]
::stargazer(df1_ancova2, type = 'html') stargazer
Dependent variable: | |
Fruit | |
GrazingUngrazed | 36.103*** |
(3.357) | |
Root | 23.560*** |
(1.149) | |
Constant | -127.829*** |
(9.664) | |
Observations | 40 |
R2 | 0.929 |
Adjusted R2 | 0.925 |
Residual Std. Error | 6.747 (df = 37) |
F Statistic | 242.272*** (df = 2; 37) |
Note: | p<0.1; p<0.05; p<0.01 |
4 Two-way ANCOVA
- Dependent variables: One continuous variable
- Independent variables: Two categorical variables, one or multiple continuous variables (covariate)
4.1 Question
Previous experiments have shown that both genotype and sex of an organism affect body weight gain. However, a scientist believes that after adjusting for age, there was no significant difference in means of weight gain between groups at different levels of sex and Genotype. Can experiments support this claim?1
1 The R book, Chapter 12.3
- Dependent variable:
- Weight gain (continuous)
- Independent variables:
- genotype (categorical)
- sex (categorical)
- age (covariate)
<- read.table("data/Gain.txt", header = T) Gain
4.2 Models
<- lm(Weight ~ Sex * Age * Genotype, data = Gain)
m1 summary(m1)
Call:
lm(formula = Weight ~ Sex * Age * Genotype, data = Gain)
Residuals:
Min 1Q Median 3Q Max
-0.40218 -0.12043 -0.01065 0.12592 0.44687
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.80053 0.24941 31.276 < 2e-16 ***
Sexmale -0.51966 0.35272 -1.473 0.14936
Age 0.34950 0.07520 4.648 4.39e-05 ***
GenotypeCloneB 1.19870 0.35272 3.398 0.00167 **
GenotypeCloneC -0.41751 0.35272 -1.184 0.24429
GenotypeCloneD 0.95600 0.35272 2.710 0.01023 *
GenotypeCloneE -0.81604 0.35272 -2.314 0.02651 *
GenotypeCloneF 1.66851 0.35272 4.730 3.41e-05 ***
Sexmale:Age -0.11283 0.10635 -1.061 0.29579
Sexmale:GenotypeCloneB -0.31716 0.49882 -0.636 0.52891
Sexmale:GenotypeCloneC -1.06234 0.49882 -2.130 0.04010 *
Sexmale:GenotypeCloneD -0.73547 0.49882 -1.474 0.14906
Sexmale:GenotypeCloneE -0.28533 0.49882 -0.572 0.57087
Sexmale:GenotypeCloneF -0.19839 0.49882 -0.398 0.69319
Age:GenotypeCloneB -0.10146 0.10635 -0.954 0.34643
Age:GenotypeCloneC -0.20825 0.10635 -1.958 0.05799 .
Age:GenotypeCloneD -0.01757 0.10635 -0.165 0.86970
Age:GenotypeCloneE -0.03825 0.10635 -0.360 0.72123
Age:GenotypeCloneF -0.05512 0.10635 -0.518 0.60743
Sexmale:Age:GenotypeCloneB 0.15469 0.15040 1.029 0.31055
Sexmale:Age:GenotypeCloneC 0.35322 0.15040 2.349 0.02446 *
Sexmale:Age:GenotypeCloneD 0.19227 0.15040 1.278 0.20929
Sexmale:Age:GenotypeCloneE 0.13203 0.15040 0.878 0.38585
Sexmale:Age:GenotypeCloneF 0.08709 0.15040 0.579 0.56616
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2378 on 36 degrees of freedom
Multiple R-squared: 0.9742, Adjusted R-squared: 0.9577
F-statistic: 59.06 on 23 and 36 DF, p-value: < 2.2e-16
<- step(m1) m2
Start: AIC=-155.01
Weight ~ Sex * Age * Genotype
Df Sum of Sq RSS AIC
- Sex:Age:Genotype 5 0.34912 2.3849 -155.51
<none> 2.0358 -155.01
Step: AIC=-155.51
Weight ~ Sex + Age + Genotype + Sex:Age + Sex:Genotype + Age:Genotype
Df Sum of Sq RSS AIC
- Sex:Genotype 5 0.146901 2.5318 -161.92
- Age:Genotype 5 0.168136 2.5531 -161.42
- Sex:Age 1 0.048937 2.4339 -156.29
<none> 2.3849 -155.51
Step: AIC=-161.92
Weight ~ Sex + Age + Genotype + Sex:Age + Age:Genotype
Df Sum of Sq RSS AIC
- Age:Genotype 5 0.168136 2.7000 -168.07
- Sex:Age 1 0.048937 2.5808 -162.78
<none> 2.5318 -161.92
Step: AIC=-168.07
Weight ~ Sex + Age + Genotype + Sex:Age
Df Sum of Sq RSS AIC
- Sex:Age 1 0.049 2.749 -168.989
<none> 2.700 -168.066
- Genotype 5 54.958 57.658 5.612
Step: AIC=-168.99
Weight ~ Sex + Age + Genotype
Df Sum of Sq RSS AIC
<none> 2.749 -168.989
- Sex 1 10.374 13.122 -77.201
- Age 1 10.770 13.519 -75.415
- Genotype 5 54.958 57.707 3.662
- The slope of weight gain against age does not vary with sex or genotype
- The intercepts vary
- The minimal adequate model: three main effects but no interactions
summary(m2)
Call:
lm(formula = Weight ~ Sex + Age + Genotype, data = Gain)
Residuals:
Min 1Q Median 3Q Max
-0.40005 -0.15120 -0.01668 0.16953 0.49227
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.93701 0.10066 78.851 < 2e-16 ***
Sexmale -0.83161 0.05937 -14.008 < 2e-16 ***
Age 0.29958 0.02099 14.273 < 2e-16 ***
GenotypeCloneB 0.96778 0.10282 9.412 8.07e-13 ***
GenotypeCloneC -1.04361 0.10282 -10.149 6.21e-14 ***
GenotypeCloneD 0.82396 0.10282 8.013 1.21e-10 ***
GenotypeCloneE -0.87540 0.10282 -8.514 1.98e-11 ***
GenotypeCloneF 1.53460 0.10282 14.925 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2299 on 52 degrees of freedom
Multiple R-squared: 0.9651, Adjusted R-squared: 0.9604
F-statistic: 205.7 on 7 and 52 DF, p-value: < 2.2e-16
Can we reduce the genotypes’ factor level from the present six to four without losing any explanatory power?
<- as.factor(Gain$Genotype)
newGenotype levels(newGenotype)
[1] "CloneA" "CloneB" "CloneC" "CloneD" "CloneE" "CloneF"
levels(newGenotype)[c(3,5)] <- "ClonesCandE"
levels(newGenotype)[c(2,4)] <- "ClonesBandD"
levels(newGenotype)
[1] "CloneA" "ClonesBandD" "ClonesCandE" "CloneF"
<- lm(data=Gain,Weight~Sex+Age+newGenotype)
m3 anova(m2,m3)
Analysis of Variance Table
Model 1: Weight ~ Sex + Age + Genotype
Model 2: Weight ~ Sex + Age + newGenotype
Res.Df RSS Df Sum of Sq F Pr(>F)
1 52 2.7489
2 54 2.9938 -2 -0.24489 2.3163 0.1087
4.3 Results
As \(p =0.1087\), there is no significant difference between the two models. Therefore, the new model uses NewGenotype (four levels) instead of Genotype (six levels).
summary(m3)
Call:
lm(formula = Weight ~ Sex + Age + newGenotype, data = Gain)
Residuals:
Min 1Q Median 3Q Max
-0.42651 -0.16687 0.01211 0.18776 0.47736
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.93701 0.10308 76.996 < 2e-16 ***
Sexmale -0.83161 0.06080 -13.679 < 2e-16 ***
Age 0.29958 0.02149 13.938 < 2e-16 ***
newGenotypeClonesBandD 0.89587 0.09119 9.824 1.28e-13 ***
newGenotypeClonesCandE -0.95950 0.09119 -10.522 1.10e-14 ***
newGenotypeCloneF 1.53460 0.10530 14.574 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2355 on 54 degrees of freedom
Multiple R-squared: 0.962, Adjusted R-squared: 0.9585
F-statistic: 273.7 on 5 and 54 DF, p-value: < 2.2e-16
::extract_eq(m3, use_coefs = TRUE) equatiomatic
\[ \operatorname{\widehat{Weight}} = 7.94 - 0.83(\operatorname{Sex}_{\operatorname{male}}) + 0.3(\operatorname{Age}) + 0.9(\operatorname{newGenotype}_{\operatorname{ClonesBandD}}) - 0.96(\operatorname{newGenotype}_{\operatorname{ClonesCandE}}) + 1.53(\operatorname{newGenotype}_{\operatorname{CloneF}}) \]
::stargazer(m3, type = 'html') stargazer
Dependent variable: | |
Weight | |
Sexmale | -0.832*** |
(0.061) | |
Age | 0.300*** |
(0.021) | |
newGenotypeClonesBandD | 0.896*** |
(0.091) | |
newGenotypeClonesCandE | -0.960*** |
(0.091) | |
newGenotypeCloneF | 1.535*** |
(0.105) | |
Constant | 7.937*** |
(0.103) | |
Observations | 60 |
R2 | 0.962 |
Adjusted R2 | 0.959 |
Residual Std. Error | 0.235 (df = 54) |
F Statistic | 273.652*** (df = 5; 54) |
Note: | p<0.1; p<0.05; p<0.01 |
4.4 Visualization
plot(Weight~Age,data=Gain,type="n")
<- c("green","red","black","blue")
colours <- c(1,2)
lines <- c(16,17)
symbols <-as.factor(Gain$Sex)
NewSexpoints(Weight~Age,data=Gain,pch=symbols[as.numeric(NewSex)],
col=colours[as.numeric(newGenotype)])
<- c(1,5)
xv for (i in 1:2) {
for (j in 1:4){
<- coef(m3)[1]+(i>1)* coef(m3)[2]+(j>1)*coef(m3)[j+2]
a
<- coef(m3)[3]
b <- a+b*xv
yv lines(xv,yv,lty=lines[i],col=colours[j]) } }
5 Further readings
- The R Book, Chapter 12
- Linear Models with R, Chapter 13
6 Exercises
The effect of a drug on the birth weight of baby mice was studied. Pregnant mice were divided into four groups, and each group was treated with different doses (0, 5, 50 or 500). The gestation and the birth weight of baby mice were recorded. The dataset is available as litter in the multcomp package.
data(litter, package="multcomp")
help('litter', package="multcomp")
Does the drug dose have effect on the weight?
- Carry out ANOVA with the drug dose as the independent variable and weight as the dependent variable. What is your conclusion?
- Carry out ANCOVa with the drug dose as the independent variable, weight as the dependent variable, and gestation as the covariate. What is your conclusion?