Analysis of covariance (ANCOVA)

Numercial Vs. categorical, adapted for covariate

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

What is a covariate, and why we need ANCOVA.
Carry out ANCOVA for a scientific question.
Visualize the ANCOVA model.

2 Principle

2.1 Definition

Analysis of covariance (ANCOVA):

ANOVA tuned with linear regression.

Dependent variable (DV): One continuous variable

Independent variables (IVs): One or multiple categorical variables, one or multiple continuous variables (covariate, CV)

Analysis for the differences among group means for a linear combination of the dependent variable after adjusted for the covariate

Test whether the independent variable(s) has a significant influence on the dependent variable, excluding the influence of the covariate (preferably highly correlated with the dependent variable)

Covariate (CV):

An independent variable that is not manipulated by experimenters but still influences experimental results.

Example: The influence of the sex and age on the body weight.

2.2 Model

\[Y_{ij} = (\mu+\tau_{i})+\beta(x_{ij}-\bar{x})+\epsilon_{ij}\]

\(Y_{ij}\): the j-th observation under the i-th categorical group
\(\mu\): the population mean
\(i\): groups, 1,2, …, \(r\)
\(j\): observations, 1,2,…, \(n_i\)
\(\tau_i\): an adjustment to the \(y\) intercept for the i-th regression line
\(\mu + \tau_i\): the intercept for group \(i\)
\(\beta\): the slope of the line
\(x_{ij}\): the j-th observation of the continuous variable under the i-th group
\(\bar x\): the global mean of the variable \(x\)
\(\epsilon _{ij}\): the associated unobserved error

Example: The influence of the sex (S, m for male and f for female) and age (A) on the body weight (W).

\[W_m = a_m + b_m A\]

\[W_f = a_f + b_f A\] Model simplification:

3 One-way ANCOVA

3.1 Questions

Q1: Does grazing have influence on the fruit production? Are grazed plants have more fruit production than ungrazed ones?

Dependent variable: Fruit production (continuous)
Independent variable: Grazing (categorical)

df1 <- read.table("data/ipomopsis.txt", header = TRUE, stringsAsFactors = TRUE)
tapply(df1$Fruit,df1$Grazing, mean)

  Grazed Ungrazed 
 67.9405  50.8805

library(ggplot2)
ggplot(df1) +
  geom_boxplot(aes(Fruit, Grazing))

What hypothesis test?

t.test(Fruit ~ Grazing, data = df1, alternative = c("greater"))

How about the root diameter…

Q2: what is the influence of grazing and root diameter on the fruit production of a plant?

Dependent variable:

fruit production (mg)

Independent variables:

grazing (categorical: grazed or ungrazed)
root diameter (continuous, mm, covariate)

ggplot(df1, aes(Root, Fruit))+
  geom_point() +
  geom_smooth(method = 'lm') +
  geom_point(aes(color = Grazing)) + 
  geom_smooth(aes(color = Grazing), method = 'lm')

The initial root size of each plant is different
The grazed plants have bigger root sizes
The regression lines have a same slope and two different intercepts

Three possible factors:

Root size (intercept)
Grazing (intercept)
Interaction between root size and grazing (slope)

3.2 Maximal model

Symbol	Meaning
`~`	Separating DV (left) and IV (right)
`:`	Interaction effect of two factors
`*`	Main effect of the two factors and the interaction effect. `f1 * f2` is equivalent to `f1 + f2 + f1:f2`
`^`	Square the sum of several terms. The main effect of these terms and the interaction between them
`.`	All variables except the DV

# The maximal model
df1_ancova <- lm(Fruit ~ Grazing * Root, data = df1)
summary(df1_ancova)


Call:
lm(formula = Fruit ~ Grazing * Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.3177  -2.8320   0.1247   3.8511  17.1313 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -125.173     12.811  -9.771 1.15e-11 ***
GrazingUngrazed        30.806     16.842   1.829   0.0757 .  
Root                   23.240      1.531  15.182  < 2e-16 ***
GrazingUngrazed:Root    0.756      2.354   0.321   0.7500    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.831 on 36 degrees of freedom
Multiple R-squared:  0.9293,    Adjusted R-squared:  0.9234 
F-statistic: 157.6 on 3 and 36 DF,  p-value: < 2.2e-16

# The ANOVA table for the maximal model
anova(df1_ancova)

Analysis of Variance Table

Response: Fruit
             Df  Sum Sq Mean Sq  F value    Pr(>F)    
Grazing       1  2910.4  2910.4  62.3795 2.262e-09 ***
Root          1 19148.9 19148.9 410.4201 < 2.2e-16 ***
Grazing:Root  1     4.8     4.8   0.1031      0.75    
Residuals    36  1679.6    46.7                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# other method to see the ANOVA table 
df1_aov <- aov(Fruit ~ Grazing * Root, data = df1)
summary(df1_aov)

             Df Sum Sq Mean Sq F value   Pr(>F)    
Grazing       1   2910    2910  62.380 2.26e-09 ***
Root          1  19149   19149 410.420  < 2e-16 ***
Grazing:Root  1      5       5   0.103     0.75    
Residuals    36   1680      47                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.aov(df1_ancova)

             Df Sum Sq Mean Sq F value   Pr(>F)    
Grazing       1   2910    2910  62.380 2.26e-09 ***
Root          1  19149   19149 410.420  < 2e-16 ***
Grazing:Root  1      5       5   0.103     0.75    
Residuals    36   1680      47                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3 Minimal model

# Delete the interaction factor 
df1_ancova2 <- update(df1_ancova, ~ . - Grazing:Root)
summary(df1_ancova2)


Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1920  -2.8224   0.3223   3.9144  17.3290 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -127.829      9.664  -13.23 1.35e-15 ***
GrazingUngrazed   36.103      3.357   10.75 6.11e-13 ***
Root              23.560      1.149   20.51  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared:  0.9291,    Adjusted R-squared:  0.9252 
F-statistic: 242.3 on 2 and 37 DF,  p-value: < 2.2e-16

# Compare the simplified model with the maximal model
anova(df1_ancova, df1_ancova2)

Analysis of Variance Table

Model 1: Fruit ~ Grazing * Root
Model 2: Fruit ~ Grazing + Root
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     36 1679.7                           
2     37 1684.5 -1   -4.8122 0.1031   0.75

\(p = 0.75 > 0.05\):

the two models have no significant difference at \(\alpha = 0.05\)
the model simplification was justified.

# Delete the grazing factor 
df1_ancova3 <- update(df1_ancova2, ~ . - Grazing)
summary(df1_ancova3)


Call:
lm(formula = Fruit ~ Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.3844 -10.4447  -0.7574  10.7606  23.7556 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.286     10.723  -3.850 0.000439 ***
Root          14.022      1.463   9.584  1.1e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.52 on 38 degrees of freedom
Multiple R-squared:  0.7073,    Adjusted R-squared:  0.6996 
F-statistic: 91.84 on 1 and 38 DF,  p-value: 1.099e-11

# Compare the two models
anova(df1_ancova2, df1_ancova3)

Analysis of Variance Table

Model 1: Fruit ~ Grazing + Root
Model 2: Fruit ~ Root
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     37 1684.5                                  
2     38 6948.8 -1   -5264.4 115.63 6.107e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(p < 0.05\):

removing the Grazing factor causes a significant change to the model.
The effect of grazing on fruit production is highly significant and needs to be retained in the model.
df1_ancova2 is the minimal adequate model.

summary(df1_ancova2)


Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1920  -2.8224   0.3223   3.9144  17.3290 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -127.829      9.664  -13.23 1.35e-15 ***
GrazingUngrazed   36.103      3.357   10.75 6.11e-13 ***
Root              23.560      1.149   20.51  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared:  0.9291,    Adjusted R-squared:  0.9252 
F-statistic: 242.3 on 2 and 37 DF,  p-value: < 2.2e-16

anova(df1_ancova2)

Analysis of Variance Table

Response: Fruit
          Df  Sum Sq Mean Sq F value    Pr(>F)    
Grazing    1  2910.4  2910.4  63.929 1.397e-09 ***
Root       1 19148.9 19148.9 420.616 < 2.2e-16 ***
Residuals 37  1684.5    45.5                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Grazing is associated with a 36.103 mg reduction in fruit production.

3.4 One step

Criterion: Akaike’s information criterion (AIC). The model is worse if AIC gets greater.

step(df1_ancova)

Start:  AIC=157.5
Fruit ~ Grazing * Root

               Df Sum of Sq    RSS    AIC
- Grazing:Root  1    4.8122 1684.5 155.61
<none>                      1679.7 157.50

Step:  AIC=155.61
Fruit ~ Grazing + Root

          Df Sum of Sq     RSS    AIC
<none>                  1684.5 155.61
- Grazing  1    5264.4  6948.8 210.30
- Root     1   19148.9 20833.4 254.22


Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)

Coefficients:
    (Intercept)  GrazingUngrazed             Root  
        -127.83            36.10            23.56

3.5 Results

equatiomatic::extract_eq(df1_ancova2, use_coefs = TRUE)

\[ \operatorname{\widehat{Fruit}} = -127.83 + 36.1(\operatorname{Grazing}_{\operatorname{Ungrazed}}) + 23.56(\operatorname{Root}) \]

stargazer::stargazer(df1_ancova2, type = 'html')


	Dependent variable:

	Fruit

GrazingUngrazed	36.103^***
	(3.357)

Root	23.560^***
	(1.149)

Constant	-127.829^***
	(9.664)


Observations	40
R²	0.929
Adjusted R²	0.925
Residual Std. Error	6.747 (df = 37)
F Statistic	242.272^*** (df = 2; 37)

Note:	p<0.1; p<0.05; p<0.01

4 Two-way ANCOVA

Dependent variables: One continuous variable
Independent variables: Two categorical variables, one or multiple continuous variables (covariate)

4.1 Question

Previous experiments have shown that both genotype and sex of an organism affect body weight gain. However, a scientist believes that after adjusting for age, there was no significant difference in means of weight gain between groups at different levels of sex and Genotype. Can experiments support this claim?¹

¹ The R book, Chapter 12.3

Dependent variable:
- Weight gain (continuous)
Independent variables:
- genotype (categorical)
- sex (categorical)
- age (covariate)

Gain <- read.table("data/Gain.txt", header = T)

4.2 Models

m1 <- lm(Weight ~ Sex * Age * Genotype, data = Gain)
summary(m1)


Call:
lm(formula = Weight ~ Sex * Age * Genotype, data = Gain)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40218 -0.12043 -0.01065  0.12592  0.44687 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 7.80053    0.24941  31.276  < 2e-16 ***
Sexmale                    -0.51966    0.35272  -1.473  0.14936    
Age                         0.34950    0.07520   4.648 4.39e-05 ***
GenotypeCloneB              1.19870    0.35272   3.398  0.00167 ** 
GenotypeCloneC             -0.41751    0.35272  -1.184  0.24429    
GenotypeCloneD              0.95600    0.35272   2.710  0.01023 *  
GenotypeCloneE             -0.81604    0.35272  -2.314  0.02651 *  
GenotypeCloneF              1.66851    0.35272   4.730 3.41e-05 ***
Sexmale:Age                -0.11283    0.10635  -1.061  0.29579    
Sexmale:GenotypeCloneB     -0.31716    0.49882  -0.636  0.52891    
Sexmale:GenotypeCloneC     -1.06234    0.49882  -2.130  0.04010 *  
Sexmale:GenotypeCloneD     -0.73547    0.49882  -1.474  0.14906    
Sexmale:GenotypeCloneE     -0.28533    0.49882  -0.572  0.57087    
Sexmale:GenotypeCloneF     -0.19839    0.49882  -0.398  0.69319    
Age:GenotypeCloneB         -0.10146    0.10635  -0.954  0.34643    
Age:GenotypeCloneC         -0.20825    0.10635  -1.958  0.05799 .  
Age:GenotypeCloneD         -0.01757    0.10635  -0.165  0.86970    
Age:GenotypeCloneE         -0.03825    0.10635  -0.360  0.72123    
Age:GenotypeCloneF         -0.05512    0.10635  -0.518  0.60743    
Sexmale:Age:GenotypeCloneB  0.15469    0.15040   1.029  0.31055    
Sexmale:Age:GenotypeCloneC  0.35322    0.15040   2.349  0.02446 *  
Sexmale:Age:GenotypeCloneD  0.19227    0.15040   1.278  0.20929    
Sexmale:Age:GenotypeCloneE  0.13203    0.15040   0.878  0.38585    
Sexmale:Age:GenotypeCloneF  0.08709    0.15040   0.579  0.56616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2378 on 36 degrees of freedom
Multiple R-squared:  0.9742,    Adjusted R-squared:  0.9577 
F-statistic: 59.06 on 23 and 36 DF,  p-value: < 2.2e-16

m2 <- step(m1)

Start:  AIC=-155.01
Weight ~ Sex * Age * Genotype

                   Df Sum of Sq    RSS     AIC
- Sex:Age:Genotype  5   0.34912 2.3849 -155.51
<none>                          2.0358 -155.01

Step:  AIC=-155.51
Weight ~ Sex + Age + Genotype + Sex:Age + Sex:Genotype + Age:Genotype

               Df Sum of Sq    RSS     AIC
- Sex:Genotype  5  0.146901 2.5318 -161.92
- Age:Genotype  5  0.168136 2.5531 -161.42
- Sex:Age       1  0.048937 2.4339 -156.29
<none>                      2.3849 -155.51

Step:  AIC=-161.92
Weight ~ Sex + Age + Genotype + Sex:Age + Age:Genotype

               Df Sum of Sq    RSS     AIC
- Age:Genotype  5  0.168136 2.7000 -168.07
- Sex:Age       1  0.048937 2.5808 -162.78
<none>                      2.5318 -161.92

Step:  AIC=-168.07
Weight ~ Sex + Age + Genotype + Sex:Age

           Df Sum of Sq    RSS      AIC
- Sex:Age   1     0.049  2.749 -168.989
<none>                   2.700 -168.066
- Genotype  5    54.958 57.658    5.612

Step:  AIC=-168.99
Weight ~ Sex + Age + Genotype

           Df Sum of Sq    RSS      AIC
<none>                   2.749 -168.989
- Sex       1    10.374 13.122  -77.201
- Age       1    10.770 13.519  -75.415
- Genotype  5    54.958 57.707    3.662

The slope of weight gain against age does not vary with sex or genotype
The intercepts vary
The minimal adequate model: three main effects but no interactions

summary(m2)


Call:
lm(formula = Weight ~ Sex + Age + Genotype, data = Gain)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40005 -0.15120 -0.01668  0.16953  0.49227 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.93701    0.10066  78.851  < 2e-16 ***
Sexmale        -0.83161    0.05937 -14.008  < 2e-16 ***
Age             0.29958    0.02099  14.273  < 2e-16 ***
GenotypeCloneB  0.96778    0.10282   9.412 8.07e-13 ***
GenotypeCloneC -1.04361    0.10282 -10.149 6.21e-14 ***
GenotypeCloneD  0.82396    0.10282   8.013 1.21e-10 ***
GenotypeCloneE -0.87540    0.10282  -8.514 1.98e-11 ***
GenotypeCloneF  1.53460    0.10282  14.925  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2299 on 52 degrees of freedom
Multiple R-squared:  0.9651,    Adjusted R-squared:  0.9604 
F-statistic: 205.7 on 7 and 52 DF,  p-value: < 2.2e-16

Can we reduce the genotypes’ factor level from the present six to four without losing any explanatory power?

newGenotype <- as.factor(Gain$Genotype)
levels(newGenotype)

[1] "CloneA" "CloneB" "CloneC" "CloneD" "CloneE" "CloneF"

levels(newGenotype)[c(3,5)] <- "ClonesCandE"
levels(newGenotype)[c(2,4)] <- "ClonesBandD"
levels(newGenotype)

[1] "CloneA"      "ClonesBandD" "ClonesCandE" "CloneF"

m3 <- lm(data=Gain,Weight~Sex+Age+newGenotype)
anova(m2,m3)

Analysis of Variance Table

Model 1: Weight ~ Sex + Age + Genotype
Model 2: Weight ~ Sex + Age + newGenotype
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     52 2.7489                           
2     54 2.9938 -2  -0.24489 2.3163 0.1087

4.3 Results

As \(p =0.1087\), there is no significant difference between the two models. Therefore, the new model uses NewGenotype (four levels) instead of Genotype (six levels).

summary(m3)


Call:
lm(formula = Weight ~ Sex + Age + newGenotype, data = Gain)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42651 -0.16687  0.01211  0.18776  0.47736 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             7.93701    0.10308  76.996  < 2e-16 ***
Sexmale                -0.83161    0.06080 -13.679  < 2e-16 ***
Age                     0.29958    0.02149  13.938  < 2e-16 ***
newGenotypeClonesBandD  0.89587    0.09119   9.824 1.28e-13 ***
newGenotypeClonesCandE -0.95950    0.09119 -10.522 1.10e-14 ***
newGenotypeCloneF       1.53460    0.10530  14.574  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2355 on 54 degrees of freedom
Multiple R-squared:  0.962, Adjusted R-squared:  0.9585 
F-statistic: 273.7 on 5 and 54 DF,  p-value: < 2.2e-16

equatiomatic::extract_eq(m3, use_coefs = TRUE)

\[ \operatorname{\widehat{Weight}} = 7.94 - 0.83(\operatorname{Sex}_{\operatorname{male}}) + 0.3(\operatorname{Age}) + 0.9(\operatorname{newGenotype}_{\operatorname{ClonesBandD}}) - 0.96(\operatorname{newGenotype}_{\operatorname{ClonesCandE}}) + 1.53(\operatorname{newGenotype}_{\operatorname{CloneF}}) \]

stargazer::stargazer(m3, type = 'html')


	Dependent variable:

	Weight

Sexmale	-0.832^***
	(0.061)

Age	0.300^***
	(0.021)

newGenotypeClonesBandD	0.896^***
	(0.091)

newGenotypeClonesCandE	-0.960^***
	(0.091)

newGenotypeCloneF	1.535^***
	(0.105)

Constant	7.937^***
	(0.103)


Observations	60
R²	0.962
Adjusted R²	0.959
Residual Std. Error	0.235 (df = 54)
F Statistic	273.652^*** (df = 5; 54)

Note:	p<0.1; p<0.05; p<0.01

4.4 Visualization

plot(Weight~Age,data=Gain,type="n")
colours <- c("green","red","black","blue")
lines <- c(1,2)
symbols <- c(16,17)
NewSex<-as.factor(Gain$Sex)
points(Weight~Age,data=Gain,pch=symbols[as.numeric(NewSex)],
       col=colours[as.numeric(newGenotype)])
xv <- c(1,5)
for (i in 1:2) {
  for (j in 1:4){
    a <- coef(m3)[1]+(i>1)* coef(m3)[2]+(j>1)*coef(m3)[j+2]
 
    b <- coef(m3)[3]
    yv <- a+b*xv
    lines(xv,yv,lty=lines[i],col=colours[j]) } }

5 Further readings

The R Book, Chapter 12
Linear Models with R, Chapter 13

6 Exercises

The effect of a drug on the birth weight of baby mice was studied. Pregnant mice were divided into four groups, and each group was treated with different doses (0, 5, 50 or 500). The gestation and the birth weight of baby mice were recorded. The dataset is available as litter in the multcomp package.

data(litter, package="multcomp")
help('litter', package="multcomp")

Does the drug dose have effect on the weight?

Carry out ANOVA with the drug dose as the independent variable and weight as the dependent variable. What is your conclusion?
Carry out ANCOVa with the drug dose as the independent variable, weight as the dependent variable, and gestation as the covariate. What is your conclusion?