Analysis of covariance (ANCOVA)

Numercial Vs. categorical, adapted for covariate

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

  1. What is a covariate, and why we need ANCOVA.
  2. Carry out ANCOVA for a scientific question.
  3. Visualize the ANCOVA model.

2 Principle

2.1 Definition

Analysis of covariance (ANCOVA):

ANOVA tuned with linear regression.

Dependent variable (DV): One continuous variable

Independent variables (IVs): One or multiple categorical variables, one or multiple continuous variables (covariate, CV)

Analysis for the differences among group means for a linear combination of the dependent variable after adjusted for the covariate

Test whether the independent variable(s) has a significant influence on the dependent variable, excluding the influence of the covariate (preferably highly correlated with the dependent variable)

Covariate (CV):

An independent variable that is not manipulated by experimenters but still influences experimental results.

Example: The influence of the sex and age on the body weight.

2.2 Model

\[Y_{ij} = (\mu+\tau_{i})+\beta(x_{ij}-\bar{x})+\epsilon_{ij}\]

  • \(Y_{ij}\): the j-th observation under the i-th categorical group
  • \(\mu\): the population mean
  • \(i\): groups, 1,2, …, \(r\)
  • \(j\): observations, 1,2,…, \(n_i\)
  • \(\tau_i\): an adjustment to the \(y\) intercept for the i-th regression line
  • \(\mu + \tau_i\): the intercept for group \(i\)
  • \(\beta\): the slope of the line
  • \(x_{ij}\): the j-th observation of the continuous variable under the i-th group
  • \(\bar x\): the global mean of the variable \(x\)
  • \(\epsilon _{ij}\): the associated unobserved error

Example: The influence of the sex (S, m for male and f for female) and age (A) on the body weight (W).

\[W_m = a_m + b_m A\]

\[W_f = a_f + b_f A\] Model simplification:

3 One-way ANCOVA

3.1 Questions

Q1: Does grazing have influence on the fruit production? Are grazed plants have more fruit production than ungrazed ones?

  • Dependent variable: Fruit production (continuous)
  • Independent variable: Grazing (categorical)
df1 <- read.table("data/ipomopsis.txt", header = TRUE, stringsAsFactors = TRUE)
tapply(df1$Fruit,df1$Grazing, mean)
  Grazed Ungrazed 
 67.9405  50.8805 
library(ggplot2)
ggplot(df1) +
  geom_boxplot(aes(Fruit, Grazing))

What hypothesis test?
t.test(Fruit ~ Grazing, data = df1, alternative = c("greater"))

How about the root diameter…

Q2: what is the influence of grazing and root diameter on the fruit production of a plant?

Dependent variable:

  • fruit production (mg)

Independent variables:

  • grazing (categorical: grazed or ungrazed)
  • root diameter (continuous, mm, covariate)
ggplot(df1, aes(Root, Fruit))+
  geom_point() +
  geom_smooth(method = 'lm') +
  geom_point(aes(color = Grazing)) + 
  geom_smooth(aes(color = Grazing), method = 'lm')

  1. The initial root size of each plant is different
  2. The grazed plants have bigger root sizes
  3. The regression lines have a same slope and two different intercepts

Three possible factors:

  1. Root size (intercept)
  2. Grazing (intercept)
  3. Interaction between root size and grazing (slope)

3.2 Maximal model

Symbol Meaning
~ Separating DV (left) and IV (right)
: Interaction effect of two factors
* Main effect of the two factors and the interaction effect. f1 * f2 is equivalent to f1 + f2 + f1:f2
^ Square the sum of several terms. The main effect of these terms and the interaction between them
. All variables except the DV
# The maximal model
df1_ancova <- lm(Fruit ~ Grazing * Root, data = df1)
summary(df1_ancova)

Call:
lm(formula = Fruit ~ Grazing * Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.3177  -2.8320   0.1247   3.8511  17.1313 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -125.173     12.811  -9.771 1.15e-11 ***
GrazingUngrazed        30.806     16.842   1.829   0.0757 .  
Root                   23.240      1.531  15.182  < 2e-16 ***
GrazingUngrazed:Root    0.756      2.354   0.321   0.7500    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.831 on 36 degrees of freedom
Multiple R-squared:  0.9293,    Adjusted R-squared:  0.9234 
F-statistic: 157.6 on 3 and 36 DF,  p-value: < 2.2e-16
# The ANOVA table for the maximal model
anova(df1_ancova)
Analysis of Variance Table

Response: Fruit
             Df  Sum Sq Mean Sq  F value    Pr(>F)    
Grazing       1  2910.4  2910.4  62.3795 2.262e-09 ***
Root          1 19148.9 19148.9 410.4201 < 2.2e-16 ***
Grazing:Root  1     4.8     4.8   0.1031      0.75    
Residuals    36  1679.6    46.7                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# other method to see the ANOVA table 
df1_aov <- aov(Fruit ~ Grazing * Root, data = df1)
summary(df1_aov)
             Df Sum Sq Mean Sq F value   Pr(>F)    
Grazing       1   2910    2910  62.380 2.26e-09 ***
Root          1  19149   19149 410.420  < 2e-16 ***
Grazing:Root  1      5       5   0.103     0.75    
Residuals    36   1680      47                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(df1_ancova)
             Df Sum Sq Mean Sq F value   Pr(>F)    
Grazing       1   2910    2910  62.380 2.26e-09 ***
Root          1  19149   19149 410.420  < 2e-16 ***
Grazing:Root  1      5       5   0.103     0.75    
Residuals    36   1680      47                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3 Minimal model

# Delete the interaction factor 
df1_ancova2 <- update(df1_ancova, ~ . - Grazing:Root)
summary(df1_ancova2)

Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1920  -2.8224   0.3223   3.9144  17.3290 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -127.829      9.664  -13.23 1.35e-15 ***
GrazingUngrazed   36.103      3.357   10.75 6.11e-13 ***
Root              23.560      1.149   20.51  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared:  0.9291,    Adjusted R-squared:  0.9252 
F-statistic: 242.3 on 2 and 37 DF,  p-value: < 2.2e-16
# Compare the simplified model with the maximal model
anova(df1_ancova, df1_ancova2)
Analysis of Variance Table

Model 1: Fruit ~ Grazing * Root
Model 2: Fruit ~ Grazing + Root
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     36 1679.7                           
2     37 1684.5 -1   -4.8122 0.1031   0.75

\(p = 0.75 > 0.05\):

  • the two models have no significant difference at \(\alpha = 0.05\)
  • the model simplification was justified.
# Delete the grazing factor 
df1_ancova3 <- update(df1_ancova2, ~ . - Grazing)
summary(df1_ancova3)

Call:
lm(formula = Fruit ~ Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.3844 -10.4447  -0.7574  10.7606  23.7556 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.286     10.723  -3.850 0.000439 ***
Root          14.022      1.463   9.584  1.1e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.52 on 38 degrees of freedom
Multiple R-squared:  0.7073,    Adjusted R-squared:  0.6996 
F-statistic: 91.84 on 1 and 38 DF,  p-value: 1.099e-11
# Compare the two models
anova(df1_ancova2, df1_ancova3)
Analysis of Variance Table

Model 1: Fruit ~ Grazing + Root
Model 2: Fruit ~ Root
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     37 1684.5                                  
2     38 6948.8 -1   -5264.4 115.63 6.107e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(p < 0.05\):

  • removing the Grazing factor causes a significant change to the model.
  • The effect of grazing on fruit production is highly significant and needs to be retained in the model.
  • df1_ancova2 is the minimal adequate model.
summary(df1_ancova2)

Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1920  -2.8224   0.3223   3.9144  17.3290 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -127.829      9.664  -13.23 1.35e-15 ***
GrazingUngrazed   36.103      3.357   10.75 6.11e-13 ***
Root              23.560      1.149   20.51  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared:  0.9291,    Adjusted R-squared:  0.9252 
F-statistic: 242.3 on 2 and 37 DF,  p-value: < 2.2e-16
anova(df1_ancova2)
Analysis of Variance Table

Response: Fruit
          Df  Sum Sq Mean Sq F value    Pr(>F)    
Grazing    1  2910.4  2910.4  63.929 1.397e-09 ***
Root       1 19148.9 19148.9 420.616 < 2.2e-16 ***
Residuals 37  1684.5    45.5                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Grazing is associated with a 36.103 mg reduction in fruit production.

3.4 One step

Criterion: Akaike’s information criterion (AIC). The model is worse if AIC gets greater.

step(df1_ancova)
Start:  AIC=157.5
Fruit ~ Grazing * Root

               Df Sum of Sq    RSS    AIC
- Grazing:Root  1    4.8122 1684.5 155.61
<none>                      1679.7 157.50

Step:  AIC=155.61
Fruit ~ Grazing + Root

          Df Sum of Sq     RSS    AIC
<none>                  1684.5 155.61
- Grazing  1    5264.4  6948.8 210.30
- Root     1   19148.9 20833.4 254.22

Call:
lm(formula = Fruit ~ Grazing + Root, data = df1)

Coefficients:
    (Intercept)  GrazingUngrazed             Root  
        -127.83            36.10            23.56  

3.5 Results

equatiomatic::extract_eq(df1_ancova2, use_coefs = TRUE)

\[ \operatorname{\widehat{Fruit}} = -127.83 + 36.1(\operatorname{Grazing}_{\operatorname{Ungrazed}}) + 23.56(\operatorname{Root}) \]

stargazer::stargazer(df1_ancova2, type = 'html')
Dependent variable:
Fruit
GrazingUngrazed 36.103***
(3.357)
Root 23.560***
(1.149)
Constant -127.829***
(9.664)
Observations 40
R2 0.929
Adjusted R2 0.925
Residual Std. Error 6.747 (df = 37)
F Statistic 242.272*** (df = 2; 37)
Note: p<0.1; p<0.05; p<0.01

4 Two-way ANCOVA

  • Dependent variables: One continuous variable
  • Independent variables: Two categorical variables, one or multiple continuous variables (covariate)

4.1 Question

Previous experiments have shown that both genotype and sex of an organism affect body weight gain. However, a scientist believes that after adjusting for age, there was no significant difference in means of weight gain between groups at different levels of sex and Genotype. Can experiments support this claim?1

  • 1 The R book, Chapter 12.3

    • Dependent variable:
      • Weight gain (continuous)
    • Independent variables:
      • genotype (categorical)
      • sex (categorical)
      • age (covariate)
    Gain <- read.table("data/Gain.txt", header = T)

    4.2 Models

    m1 <- lm(Weight ~ Sex * Age * Genotype, data = Gain)
    summary(m1)
    
    Call:
    lm(formula = Weight ~ Sex * Age * Genotype, data = Gain)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.40218 -0.12043 -0.01065  0.12592  0.44687 
    
    Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
    (Intercept)                 7.80053    0.24941  31.276  < 2e-16 ***
    Sexmale                    -0.51966    0.35272  -1.473  0.14936    
    Age                         0.34950    0.07520   4.648 4.39e-05 ***
    GenotypeCloneB              1.19870    0.35272   3.398  0.00167 ** 
    GenotypeCloneC             -0.41751    0.35272  -1.184  0.24429    
    GenotypeCloneD              0.95600    0.35272   2.710  0.01023 *  
    GenotypeCloneE             -0.81604    0.35272  -2.314  0.02651 *  
    GenotypeCloneF              1.66851    0.35272   4.730 3.41e-05 ***
    Sexmale:Age                -0.11283    0.10635  -1.061  0.29579    
    Sexmale:GenotypeCloneB     -0.31716    0.49882  -0.636  0.52891    
    Sexmale:GenotypeCloneC     -1.06234    0.49882  -2.130  0.04010 *  
    Sexmale:GenotypeCloneD     -0.73547    0.49882  -1.474  0.14906    
    Sexmale:GenotypeCloneE     -0.28533    0.49882  -0.572  0.57087    
    Sexmale:GenotypeCloneF     -0.19839    0.49882  -0.398  0.69319    
    Age:GenotypeCloneB         -0.10146    0.10635  -0.954  0.34643    
    Age:GenotypeCloneC         -0.20825    0.10635  -1.958  0.05799 .  
    Age:GenotypeCloneD         -0.01757    0.10635  -0.165  0.86970    
    Age:GenotypeCloneE         -0.03825    0.10635  -0.360  0.72123    
    Age:GenotypeCloneF         -0.05512    0.10635  -0.518  0.60743    
    Sexmale:Age:GenotypeCloneB  0.15469    0.15040   1.029  0.31055    
    Sexmale:Age:GenotypeCloneC  0.35322    0.15040   2.349  0.02446 *  
    Sexmale:Age:GenotypeCloneD  0.19227    0.15040   1.278  0.20929    
    Sexmale:Age:GenotypeCloneE  0.13203    0.15040   0.878  0.38585    
    Sexmale:Age:GenotypeCloneF  0.08709    0.15040   0.579  0.56616    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.2378 on 36 degrees of freedom
    Multiple R-squared:  0.9742,    Adjusted R-squared:  0.9577 
    F-statistic: 59.06 on 23 and 36 DF,  p-value: < 2.2e-16
    m2 <- step(m1)
    Start:  AIC=-155.01
    Weight ~ Sex * Age * Genotype
    
                       Df Sum of Sq    RSS     AIC
    - Sex:Age:Genotype  5   0.34912 2.3849 -155.51
    <none>                          2.0358 -155.01
    
    Step:  AIC=-155.51
    Weight ~ Sex + Age + Genotype + Sex:Age + Sex:Genotype + Age:Genotype
    
                   Df Sum of Sq    RSS     AIC
    - Sex:Genotype  5  0.146901 2.5318 -161.92
    - Age:Genotype  5  0.168136 2.5531 -161.42
    - Sex:Age       1  0.048937 2.4339 -156.29
    <none>                      2.3849 -155.51
    
    Step:  AIC=-161.92
    Weight ~ Sex + Age + Genotype + Sex:Age + Age:Genotype
    
                   Df Sum of Sq    RSS     AIC
    - Age:Genotype  5  0.168136 2.7000 -168.07
    - Sex:Age       1  0.048937 2.5808 -162.78
    <none>                      2.5318 -161.92
    
    Step:  AIC=-168.07
    Weight ~ Sex + Age + Genotype + Sex:Age
    
               Df Sum of Sq    RSS      AIC
    - Sex:Age   1     0.049  2.749 -168.989
    <none>                   2.700 -168.066
    - Genotype  5    54.958 57.658    5.612
    
    Step:  AIC=-168.99
    Weight ~ Sex + Age + Genotype
    
               Df Sum of Sq    RSS      AIC
    <none>                   2.749 -168.989
    - Sex       1    10.374 13.122  -77.201
    - Age       1    10.770 13.519  -75.415
    - Genotype  5    54.958 57.707    3.662
    • The slope of weight gain against age does not vary with sex or genotype
    • The intercepts vary
    • The minimal adequate model: three main effects but no interactions
    summary(m2)
    
    Call:
    lm(formula = Weight ~ Sex + Age + Genotype, data = Gain)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.40005 -0.15120 -0.01668  0.16953  0.49227 
    
    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)     7.93701    0.10066  78.851  < 2e-16 ***
    Sexmale        -0.83161    0.05937 -14.008  < 2e-16 ***
    Age             0.29958    0.02099  14.273  < 2e-16 ***
    GenotypeCloneB  0.96778    0.10282   9.412 8.07e-13 ***
    GenotypeCloneC -1.04361    0.10282 -10.149 6.21e-14 ***
    GenotypeCloneD  0.82396    0.10282   8.013 1.21e-10 ***
    GenotypeCloneE -0.87540    0.10282  -8.514 1.98e-11 ***
    GenotypeCloneF  1.53460    0.10282  14.925  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.2299 on 52 degrees of freedom
    Multiple R-squared:  0.9651,    Adjusted R-squared:  0.9604 
    F-statistic: 205.7 on 7 and 52 DF,  p-value: < 2.2e-16

    Can we reduce the genotypes’ factor level from the present six to four without losing any explanatory power?

    newGenotype <- as.factor(Gain$Genotype)
    levels(newGenotype)
    [1] "CloneA" "CloneB" "CloneC" "CloneD" "CloneE" "CloneF"
    levels(newGenotype)[c(3,5)] <- "ClonesCandE"
    levels(newGenotype)[c(2,4)] <- "ClonesBandD"
    levels(newGenotype)
    [1] "CloneA"      "ClonesBandD" "ClonesCandE" "CloneF"     
    m3 <- lm(data=Gain,Weight~Sex+Age+newGenotype)
    anova(m2,m3)
    Analysis of Variance Table
    
    Model 1: Weight ~ Sex + Age + Genotype
    Model 2: Weight ~ Sex + Age + newGenotype
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     52 2.7489                           
    2     54 2.9938 -2  -0.24489 2.3163 0.1087

    4.3 Results

    As \(p =0.1087\), there is no significant difference between the two models. Therefore, the new model uses NewGenotype (four levels) instead of Genotype (six levels).

    summary(m3)
    
    Call:
    lm(formula = Weight ~ Sex + Age + newGenotype, data = Gain)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.42651 -0.16687  0.01211  0.18776  0.47736 
    
    Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
    (Intercept)             7.93701    0.10308  76.996  < 2e-16 ***
    Sexmale                -0.83161    0.06080 -13.679  < 2e-16 ***
    Age                     0.29958    0.02149  13.938  < 2e-16 ***
    newGenotypeClonesBandD  0.89587    0.09119   9.824 1.28e-13 ***
    newGenotypeClonesCandE -0.95950    0.09119 -10.522 1.10e-14 ***
    newGenotypeCloneF       1.53460    0.10530  14.574  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.2355 on 54 degrees of freedom
    Multiple R-squared:  0.962, Adjusted R-squared:  0.9585 
    F-statistic: 273.7 on 5 and 54 DF,  p-value: < 2.2e-16
    equatiomatic::extract_eq(m3, use_coefs = TRUE)

    \[ \operatorname{\widehat{Weight}} = 7.94 - 0.83(\operatorname{Sex}_{\operatorname{male}}) + 0.3(\operatorname{Age}) + 0.9(\operatorname{newGenotype}_{\operatorname{ClonesBandD}}) - 0.96(\operatorname{newGenotype}_{\operatorname{ClonesCandE}}) + 1.53(\operatorname{newGenotype}_{\operatorname{CloneF}}) \]

    stargazer::stargazer(m3, type = 'html')
    Dependent variable:
    Weight
    Sexmale -0.832***
    (0.061)
    Age 0.300***
    (0.021)
    newGenotypeClonesBandD 0.896***
    (0.091)
    newGenotypeClonesCandE -0.960***
    (0.091)
    newGenotypeCloneF 1.535***
    (0.105)
    Constant 7.937***
    (0.103)
    Observations 60
    R2 0.962
    Adjusted R2 0.959
    Residual Std. Error 0.235 (df = 54)
    F Statistic 273.652*** (df = 5; 54)
    Note: p<0.1; p<0.05; p<0.01

    4.4 Visualization

    plot(Weight~Age,data=Gain,type="n")
    colours <- c("green","red","black","blue")
    lines <- c(1,2)
    symbols <- c(16,17)
    NewSex<-as.factor(Gain$Sex)
    points(Weight~Age,data=Gain,pch=symbols[as.numeric(NewSex)],
           col=colours[as.numeric(newGenotype)])
    xv <- c(1,5)
    for (i in 1:2) {
      for (j in 1:4){
        a <- coef(m3)[1]+(i>1)* coef(m3)[2]+(j>1)*coef(m3)[j+2]
     
        b <- coef(m3)[3]
        yv <- a+b*xv
        lines(xv,yv,lty=lines[i],col=colours[j]) } }

    5 Further readings

    • The R Book, Chapter 12
    • Linear Models with R, Chapter 13

    6 Exercises

    The effect of a drug on the birth weight of baby mice was studied. Pregnant mice were divided into four groups, and each group was treated with different doses (0, 5, 50 or 500). The gestation and the birth weight of baby mice were recorded. The dataset is available as litter in the multcomp package.

    data(litter, package="multcomp")
    help('litter', package="multcomp")

    Does the drug dose have effect on the weight?

    1. Carry out ANOVA with the drug dose as the independent variable and weight as the dependent variable. What is your conclusion?
    2. Carry out ANCOVa with the drug dose as the independent variable, weight as the dependent variable, and gestation as the covariate. What is your conclusion?