Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
Learning objectives
What is one-way and two-way MANOVA.
The hypotheses and usages of MANOVA.
The meanings of the assumptions of MANOVA.
Principle
Definition
Univariate Analysis of Variance (ANOVA):
one dependent variable (continuous) ~ one or multiple independent variables (categorical).
Multivariate Analysis of Variance (MANOVA):
multiple dependent variables (continuous) ~ one or multiple independent variables (categorical). Comparing multivariate sample means. It uses the covariance between outcome variables in testing the statistical significance of the mean differences when there are multiple dependent variables.
Hypothesis:
H 0 : the group mean vectors are the same for all groups (the levels of the categorical variable has no influence on the numerical variables)
Why MANOVA
Reduce the Type I error
Control for inter-correlations among the multiple dependent variables
Uses more information from the dependent variables i.e., MANOVA may find differences between groups based on combined information from the multiple dependent variables.
Example: Influence of teaching methods on student satisfaction scores and exam scores.
1 https://statisticsbyjim.com/anova/multivariate-anova-manova-benefits-use/
dtf <- read.csv ('data/teaching_methods.csv' )
summary (aov (Test ~ Method, data = dtf))
Df Sum Sq Mean Sq F value Pr(>F)
Method 1 0.000578 0.0005780 2.426 0.126
Residuals 46 0.010958 0.0002382
summary (aov (Satisfaction ~ Method, data = dtf))
Df Sum Sq Mean Sq F value Pr(>F)
Method 1 0.000032 0.0000320 0.135 0.715
Residuals 46 0.010944 0.0002379
library (ggplot2)
library (tidyr)
dtf |>
pivot_longer (- Method) |>
ggplot () +
geom_dotplot (aes (x = Method, y = value, group = Method), binaxis = "y" , stackdir = "center" ) +
facet_wrap (name~ .)
par (mfrow = c (1 , 3 ))
boxplot (Test ~ Method, data = dtf)
boxplot (Satisfaction ~ Method, data = dtf)
plot (dtf$ Satisfaction, dtf$ Test, col = dtf$ Method, pch = 16 , xlab = 'Satisfaction' , ylab = 'Test' )
tm_manova <- manova (cbind (dtf$ Test, dtf$ Satisfaction) ~ dtf$ Method)
summary (tm_manova)
Df Pillai approx F num Df den Df Pr(>F)
dtf$Method 1 0.45766 18.987 2 45 1.05e-06 ***
Residuals 46
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis test: F-test, converted from
Pillai’s Trace test (most robust to departures from assumptions),
Wilk’s Lambda test
Roy’s Largest Root test
Hotelling-Lawley’s test
Assumptions
Multivariate normality
Homogeneity of the variance-covariance matrix
If the multivariate normality or the homogeneity of the variance–covariance matrix is not satisfied, use the Wilks Lambda test.
No outliers in dependent variables
Linearly relation for each group of the independent variables
No multicollinearity among dependent variables
One-way MANOVA
Multiple dependent numerical variables ~ one independent categorical variable
Example: The iris dataset. Do the species have influence on the sepal size?
Visualization:
library (ggplot2)
library (tidyr)
iris[, c ('Species' , 'Sepal.Length' , 'Sepal.Width' )] |>
pivot_longer (cols = c (Sepal.Length, Sepal.Width)) |>
ggplot () +
geom_boxplot (aes (Species, value, fill = name)) +
labs (y = 'Size (cm)' , fill = '' )
library (gplots)
par (mfrow = c (1 , 2 ))
plotmeans (iris$ Sepal.Length ~ iris$ Species)
plotmeans (iris$ Sepal.Width ~ iris$ Species)
Hypothesis:
H 0 : The population means of the sepal length and the sepal width are not different across the species.
Collect data
# multivariate normality test
SepalSize <- cbind (iris$ Sepal.Length, iris$ Sepal.Width)
# or
SepalSize <- as.matrix (iris[, c ('Sepal.Length' , 'Sepal.Width' )])
Calculate the test statistic: the Pillai’s Trace test or the F-test
iris_manova <- manova (SepalSize ~ iris$ Species)
summary (iris_manova)
Df Pillai approx F num Df den Df Pr(>F)
iris$Species 2 0.94531 65.878 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary (iris_manova, test = 'Pillai' )
Df Pillai approx F num Df den Df Pr(>F)
iris$Species 2 0.94531 65.878 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary (iris_manova, test = 'Wilks' )
Df Wilks approx F num Df den Df Pr(>F)
iris$Species 2 0.16654 105.88 4 292 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary (iris_manova, test = 'Roy' )
Df Roy approx F num Df den Df Pr(>F)
iris$Species 2 4.1718 306.63 2 147 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary (iris_manova, test = 'Hotelling-Lawley' )
Df Hotelling-Lawley approx F num Df den Df Pr(>F)
iris$Species 2 4.3328 157.06 4 290 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Univariate ANOVAs for each dependent variable:
Response Sepal.Length :
Df Sum Sq Mean Sq F value Pr(>F)
iris$Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response Sepal.Width :
Df Sum Sq Mean Sq F value Pr(>F)
iris$Species 2 11.345 5.6725 49.16 < 2.2e-16 ***
Residuals 147 16.962 0.1154
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: The species has a statistically significant effect on the sepal width and sepal length.
Post-hoc test
After MANOVA gives a significant result, which group(s) is/are different from other(s)?
Linear Discriminant Analysis (LDA)
library (MASS)
iris_lda <- lda (iris$ Species ~ SepalSize, CV = FALSE )
plot_lda <- data.frame (Species = iris$ Species,
lda = predict (iris_lda)$ x)
ggplot (plot_lda) +
geom_point (aes (x = lda.LD1, y = lda.LD2, colour = Species))
Conclusion: The sepal size of the setosa species is different from other species.
Test the assumptions
Multivariate normality
Univariate normality: Shapiro-Wilk test
H 0 : The variable follows a normal distribution.
tapply (iris$ Sepal.Length, iris$ Species, shapiro.test)
$setosa
Shapiro-Wilk normality test
data: X[[i]]
W = 0.9777, p-value = 0.4595
$versicolor
Shapiro-Wilk normality test
data: X[[i]]
W = 0.97784, p-value = 0.4647
$virginica
Shapiro-Wilk normality test
data: X[[i]]
W = 0.97118, p-value = 0.2583
tapply (iris$ Sepal.Width, iris$ Species, shapiro.test)
$setosa
Shapiro-Wilk normality test
data: X[[i]]
W = 0.97172, p-value = 0.2715
$versicolor
Shapiro-Wilk normality test
data: X[[i]]
W = 0.97413, p-value = 0.338
$virginica
Shapiro-Wilk normality test
data: X[[i]]
W = 0.96739, p-value = 0.1809
library (rstatix)
iris |>
group_by (Species) |>
shapiro_test (Sepal.Length, Sepal.Width)
# A tibble: 6 × 4
Species variable statistic p
<fct> <chr> <dbl> <dbl>
1 setosa Sepal.Length 0.978 0.460
2 setosa Sepal.Width 0.972 0.272
3 versicolor Sepal.Length 0.978 0.465
4 versicolor Sepal.Width 0.974 0.338
5 virginica Sepal.Length 0.971 0.258
6 virginica Sepal.Width 0.967 0.181
Tip:
If the sample size is large (say n > 50), the visual approaches such as QQ-plot and histogram will be better for assessing the normality assumption.
iris[, c ('Species' , 'Sepal.Length' , 'Sepal.Width' )] |>
pivot_longer (cols = c (Sepal.Length, Sepal.Width)) |>
ggplot () +
geom_histogram (aes (value)) +
facet_grid (name ~ Species)
Conclusion: As \(p > 0.05\) , the sepal length and the width for each species are normally distributed.
Multivariate normality: Mardia’s skewness and kurtosis test
H 0 : The variables follow a multivariate normal distribution.
library (mvnormalTest)
mardia (iris[, c ('Sepal.Length' , 'Sepal.Width' )])$ mv.test
Test Statistic p-value Result
1 Skewness 9.4614 0.0505 YES
2 Kurtosis -0.691 0.4896 YES
3 MV Normality <NA> <NA> YES
Conclusion: As \(p > 0.05\) , the sepal length and the width follow a multivariate normal distribution.
Tip:
When n > 20 for each combination of the independent and dependent variable, the multivariate normality can be assumed (Multivariate Central Limit Theorem).
Homogeneity of the variance-covariance matrix
Box’s M test : Use a lower \(\alpha\) level such as \(\alpha = 0.001\) to assess the p value for significance.
Hypothesis:
H 0 : The variance-covariance matrices are equal for each combination formed by each group in the independent variable.
library (biotools)
boxM (iris[, c ('Sepal.Length' , 'Sepal.Width' )], iris$ Species)
Box's M-test for Homogeneity of Covariance Matrices
data: iris[, c("Sepal.Length", "Sepal.Width")]
Chi-Sq (approx.) = 35.655, df = 6, p-value = 3.217e-06
Conclusion: As \(p < 0.001\) , the variance-covariance matrices for the sepal length and width are not equal for each combination formed by each species.
Which variable fails in equal variance? Use the Bartlett’s or Levene’s test for testing the homogeneity of variance assumption.
Multivariate outliers
MANOVA is highly sensitive to outliers and may produce Type I or II errors.
Multivariate outliers can be detected using the Mahalanobis Distance test. The larger the Mahalanobis Distance, the more likely it is an outlier.
library (rstatix)
iris_outlier <- mahalanobis_distance (iris[, c ('Sepal.Length' , 'Sepal.Width' )])
Linearity
Visualize the pairwise scatterplot for the dependent variable for each group:
ggplot (iris, aes (x = Sepal.Length, y = Sepal.Width)) +
geom_point () +
geom_smooth (method = 'lm' ) +
facet_wrap (Species ~ .)
Or test the regression or the slope (see ENV221).
Multicollinearity
Correlation between the dependent variable.
For three or more dependent variables, use a correlation matrix or variance inflation factor (VIF).
cor.test (x = iris$ Sepal.Length, y = iris$ Sepal.Width)
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.27269325 0.04351158
sample estimates:
cor
-0.1175698
ggplot (iris, aes (Sepal.Length, Sepal.Width)) +
geom_point () +
geom_smooth (method = 'lm' )
If \(| r |\) > 0.9, there is multicollinearity.
If r is too low, perform separate univariate ANOVA for each dependent variable.
Two-way MANOVA
Multiple dependent numerical variables ~ two independent categorical variables
Example: Plastic. Do the rate of extrusion and the additive have influence on the plastic quality?
data ('Plastic' , package = 'heplots' )
Plastic_matrix <- as.matrix (Plastic[, c ('tear' ,'gloss' ,'opacity' )])
Plastic_manova <- manova (Plastic_matrix ~ Plastic$ rate * Plastic$ additive)
summary (Plastic_manova)
Df Pillai approx F num Df den Df Pr(>F)
Plastic$rate 1 0.61814 7.5543 3 14 0.003034 **
Plastic$additive 1 0.47697 4.2556 3 14 0.024745 *
Plastic$rate:Plastic$additive 1 0.22289 1.3385 3 14 0.301782
Residuals 16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Each of the three response variables separately:
summary.aov (Plastic_manova)
Response tear :
Df Sum Sq Mean Sq F value Pr(>F)
Plastic$rate 1 1.7405 1.74050 15.7868 0.001092 **
Plastic$additive 1 0.7605 0.76050 6.8980 0.018330 *
Plastic$rate:Plastic$additive 1 0.0005 0.00050 0.0045 0.947143
Residuals 16 1.7640 0.11025
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response gloss :
Df Sum Sq Mean Sq F value Pr(>F)
Plastic$rate 1 1.3005 1.30050 7.9178 0.01248 *
Plastic$additive 1 0.6125 0.61250 3.7291 0.07139 .
Plastic$rate:Plastic$additive 1 0.5445 0.54450 3.3151 0.08740 .
Residuals 16 2.6280 0.16425
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response opacity :
Df Sum Sq Mean Sq F value Pr(>F)
Plastic$rate 1 0.421 0.4205 0.1036 0.7517
Plastic$additive 1 4.901 4.9005 1.2077 0.2881
Plastic$rate:Plastic$additive 1 3.960 3.9605 0.9760 0.3379
Residuals 16 64.924 4.0578
Conclusion: There are significant main effects for both rate and additive at \(\alpha = 0.05\) , but no interaction. The opacity is not significantly associated with either of the explanatory variables.
Further readings
Applied Statistics for Environmental Science. Chapter 6
https://www.r-bloggers.com/2021/11/manovamultivariate-analysis-of-variance-using-r/#google_vignette
Exercises
The iris dataset. Do the species have influence on the petal size? Carry out MANOVA with the post-hoc test and give your conclusion.