Principal Component Analysis

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning Objectives

  1. Understand the concept of Principal Component Analysis (PCA)
  2. Understand the algorithmic principles and theory of PCA
  3. Learn how to use R for PCA and apply it to data analysis

2 Data transformation

2.1 Scaling

| Normalization | Standardization |
|---------------|-----------------|
| \(x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}\) | \(z = \frac{x-\mu}{\sigma}\) |
| The minimum and maximum values of the feature are used for scaling. | The mean and standard deviation are used for scaling. |
| Used when features are on different scales. | Used when we want zero mean and unit standard deviation. |
| Bounded to \([0, 1]\). | Not bounded to a certain range. |
| Strongly affected by outliers. | Less affected by outliers. |
| Useful when we do not know the feature's distribution. | Useful when the feature distribution is normal (Gaussian). |
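As a quick illustration, here is a minimal R sketch of both scalings applied to a single iris variable (the choice of variable is purely for demonstration):

x <- iris$Sepal.Length
x_norm <- (x - min(x)) / (max(x) - min(x))  # min-max normalization, bounded to [0, 1]
z <- (x - mean(x)) / sd(x)                  # standardization, zero mean and unit sd
all.equal(as.vector(scale(x)), z)           # scale() computes the same z-scores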

3 Principle

3.1 Definition

Component:

A new variable constructed as a linear combination of the original variables.

Principal Component Analysis (PCA):

A dimensionality reduction technique that identifies correlations and patterns in a data set so that the data can be transformed into a data set of significantly lower dimension without losing the important information.

# PCA of the first seven iris observations on two variables
x <- iris[1:7, 1:2]
xpc <- princomp(x, cor = TRUE, scores = TRUE)
par(mfrow = c(1, 2))
# raw data, with points labelled by observation number
plot(x$Sepal.Width, x$Sepal.Length, pch = as.character(1:7))
# biplot: observations as numbers, variables as arrows
biplot(xpc)

3.2 Procedure

  1. Move the origin of the coordinate system to the center of the data, then rotate the axes so that the variance of the data projected onto the first axis, C1, is maximized. C1 is called “the first principal component”: the projections of the data onto this direction are as spread out as possible.
  2. Find a second axis, C2, such that the covariance (correlation coefficient) between C2 and C1 is 0, so that C2 does not repeat the information in C1, and such that the variance of the data projected onto C2 is as large as possible. C2 is called “the second principal component”.
  3. Similarly, find the third principal component, the fourth principal component, and so on up to the \(n^{th}\) principal component; \(n\) random variables have at most \(n\) principal components. These properties can be checked numerically, as in the sketch below.
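A minimal sketch of this check in R, using the iris data and prcomp() (introduced in Section 3.3.6): the variances of the component scores decrease, and scores on different components are uncorrelated.

pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
apply(pc$x, 2, var)   # variances decrease from PC1 to PC4
round(cov(pc$x), 10)  # off-diagonal covariances are (numerically) zero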

3.3 Model

3.3.1 Standardization

Suppose there are \(m\) variables (indicators) for principal component analysis, \(x_{1}, x_{2}, \cdots, x_{m}\), and \(n\) evaluation objects. The \(j^{th}\) indicator of the \(i^{th}\) evaluation object is \(a_{ij}\), and the evaluation matrix is \(A = (a_{ij})_{n \times m}\).

Convert each indicator value into a standardized indicator \(\tilde{a}_{ij}\):

\[\tilde{a}_{ij}=\frac{a_{ij}-\mu_{j}}{s_{j}}\]

  • \(i=1,2,\cdots,n; j=1,2,\cdots,m\)
  • \(\mu_j=\frac{1}{n}\sum_{i=1}^{n} a_{ij}\), the sample mean of the \(j^{th}\) indicator.
  • \(s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (a_{ij}-\mu_j)^2}\), the sample standard deviation of the \(j^{th}\) indicator.

Correspondingly, the standardized indicator variables are:

\[\tilde{x}_i=\frac{x_i-\mu_i}{s_i} \quad (i= 1,2,\cdots, m)\]

The standardized evaluation matrix is

\[\tilde{A}=(\tilde{a}_{ij})_ {n \times m}\]

3.3.2 Correlation coefficient matrix

Correlation coefficient matrix:

\[R=(r_{ij})_{m \times m}, \quad r_{ij}=\frac{\sum_{k = 1}^n \tilde{a}_{ki} \cdot \tilde{a}_{kj}}{n - 1} \quad (i,j = 1,2,\cdots,m)\]

  • \(r_{ii}=1\)
  • \(r_{ij}=r_{ji}\)
  • \(r_{ij}\): the correlation coefficient between the \(i^{th}\) index and the \(j^{th}\) index.
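This formula can be verified directly in R; a small sketch using the iris data as the evaluation matrix:

A_tilde <- scale(iris[, 1:4])          # standardized evaluation matrix
n <- nrow(A_tilde)
R <- t(A_tilde) %*% A_tilde / (n - 1)  # the formula above
all.equal(R, cor(iris[, 1:4]), check.attributes = FALSE)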

3.3.3 Eigenvalues and eigenvectors

Calculate the eigenvalues of the correlation coefficient matrix \(R\), \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0\), and the corresponding eigenvectors \(\mu_1,\mu_2,\cdots,\mu_m\), where \(\mu_j = (\mu_{1j}, \mu_{2j}, \cdots, \mu_{mj})^T\). The eigenvectors define \(m\) new indicator variables:

\[y_1=\mu_{11} \tilde{x}_{1}+\mu_{21} \tilde{x}_{2}+\cdots+\mu_{m1} \tilde{x}_m\]

\[y_2=\mu_{12} \tilde{x}_{1}+\mu_{22} \tilde{x}_{2}+\cdots+\mu_{m2} \tilde{x}_m\]

\[\vdots\]

\[y_m=\mu_{1m} \tilde{x}_{1}+\mu_{2m} \tilde{x}_{2}+\cdots+\mu_{mm} \tilde{x}_m\]

  • \(y_i\): the \(i^{th}\) principal component.

In matrix form, \(Y = \mu^T \tilde{X}\), where

  • \(\tilde{X} = (\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_m)^T\)
  • \(Y = (y_{1}, y_{2}, \cdots, y_{m})^T\)
  • \(\mu = (\mu_1, \mu_2, \cdots, \mu_m)\), the matrix whose columns are the eigenvectors.

The standardized evaluation matrix \(\tilde{A}\) is transformed into a new evaluation matrix \(\tilde{B}\):

\[\tilde{B}= \tilde{A}\mu\]
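This is exactly the score matrix that prcomp() stores in its x component; a sketch of the check (eigenvector signs are arbitrary, hence the abs()):

A_tilde <- scale(iris[, 1:4])
mu <- eigen(cor(iris[, 1:4]))$vectors  # eigenvectors as columns
B_tilde <- A_tilde %*% mu              # new evaluation matrix
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
all.equal(abs(unname(B_tilde)), abs(unname(pc$x)))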

3.3.4 Selection

Calculate the information contribution rates and the cumulative contribution rates of the eigenvalues \(\lambda_j \ (j = 1,2,\cdots,m)\).

The information contribution rate of the principal component \(y_{j}\):

\[b_{j} = \frac{\lambda_{j}} {\sum_{k=1}^{m} \lambda_{k}}, (j = 1, 2,\cdots,m)\]

The cumulative contribution rate of the principal components \(y_1,y_2,\cdots,y_p\):

\[\alpha_p = \frac{\sum_{k = 1}^{p} \lambda_k} {\sum_{k=1}^{m} \lambda_k}\]

When \(\alpha_p\) is close to 1 (commonly \(\alpha_p = 0.85\), \(0.90\), or \(0.95\)), the first \(p\) indicator variables \(y_1,y_2,\cdots, y_p\) are selected as the \(p\) principal components in place of the original \(m\) indicator variables. After selecting \(p\) principal components, the new evaluation matrix \(\tilde{C}\) consists of the first \(p\) columns of \(\tilde{B}\), as illustrated in the sketch below:

\[\tilde{C} = \tilde{B}_{(n \times p)}\]
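A sketch of this selection rule in R, using a 0.90 threshold with the iris data:

lambda <- eigen(cor(iris[, 1:4]))$values
b <- lambda / sum(lambda)     # information contribution rates
alpha <- cumsum(b)            # cumulative contribution rates
p <- which(alpha >= 0.90)[1]  # smallest p reaching the threshold
p                             # 2 for the iris data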

3.3.5 Comprehensive evaluation value

The overall score:

\[z = \sum_{j=1}^pb_jy_j\]

  • \(b_j\): the information contribution rate of the \(j^{th}\) principal component, used as its weight.

The comprehensive evaluation value vector of the \(n\) evaluation objects is then:

\[Z = \tilde{C}b\]

  • \(b = (b_1,b_2,\cdots,b_p)^T\).

The \(n\) objects can then be compared and ranked according to this evaluation value vector, as in the sketch below.
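Putting the pieces together, a purely illustrative sketch of the scoring pipeline on the iris data (iris is not a real evaluation problem, so the scores are only for demonstration):

A_tilde <- scale(iris[, 1:4])
e <- eigen(cor(iris[, 1:4]))
p <- 2
C_tilde <- (A_tilde %*% e$vectors)[, 1:p]  # first p principal components
b <- (e$values / sum(e$values))[1:p]       # weights: information contribution rates
Z <- C_tilde %*% b                         # comprehensive evaluation values
head(Z)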

3.3.6 Usages in R

  • prcomp()
# standardize (center and scale) the variables, then fit the PCA
com1 <- prcomp(iris[,1:4], center = TRUE, scale. = TRUE)
summary(com1)
  • princomp()

Standardize the data before using it:

  • It has no standardization arguments and uses the covariance matrix by default, which gives slightly different results from using the correlation matrix.

  • For standardized data, the covariance matrix is almost equal to the correlation matrix (princomp() divides by \(n\) rather than \(n-1\)).

dt <- as.matrix(scale(iris[, 1:4]))  # standardization
com2 <- princomp(dt, cor = TRUE)     # based on the correlation matrix
summary(com2)
com3 <- princomp(dt)                 # based on the covariance matrix (divisor n)
summary(com3)

Warning: princomp() only applies to data sets with more rows (observations) than columns (variables).
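The relationship between the three fits can be checked; a small sketch (com1, com2, and com3 as above):

all.equal(unname(com1$sdev), unname(com2$sdev))  # identical: both use the correlation matrix
n <- nrow(dt)
all.equal(unname(com3$sdev), unname(com1$sdev) * sqrt((n - 1) / n))  # differ only in the divisor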

  • Visualization
library(ggplot2)
pc_score <- com1$x  # extract the PC scores
pc_score <- data.frame(pc_score, iris$Species)
summ <- summary(com1)
xlab <- paste0("PC1 (", round(summ$importance[2, 1] * 100, 2), "%)")
ylab <- paste0("PC2 (", round(summ$importance[2, 2] * 100, 2), "%)")
ggplot(data = pc_score, aes(x = PC1, y = PC2, color = iris.Species)) +
  geom_point() + 
  labs(x = xlab, y = ylab) +
  stat_ellipse(
    aes(fill = iris.Species),
    type = "norm",
    geom = "polygon",
    alpha = 0.2,
    color = NA
  )

4 Workflow

4.1 Data

Standardization: put the variables on a common scale (unify their dimensions) and center the data.

dt <- as.matrix(scale(iris[, 1:4]))

4.2 Correlation coefficient (covariance) matrix

rm1 <- cor(dt)

4.3 Eigenvalues and eigenvectors

rs1 <- eigen(rm1)
# Extract the eigenvalue in the result, that is, the variance of each principal component
val <- rs1$values
# Convert to standard deviation
Standard_deviation <- sqrt(val)
# Calculate variance contribution rate and cumulative contribution rate
Proportion_of_Variance <- val / sum(val)
Cumulative_Proportion <- cumsum(Proportion_of_Variance)
# Draw scree plot
plot(val,
     type = "b", las = 1,
     xlab = "Principal component number", ylab = "Principal component variance")

4.4 Principal component score

# Feature vectors in extraction results
U <- as.matrix(rs1$vectors)
# Perform matrix multiplication to obtain PC score
PC <- dt %*% U
colnames(PC) <- c("PC1","PC2","PC3","PC4")
df <- data.frame(PC, iris$Species)

4.5 Draw the principal component scatter plot

library(ggplot2)
# Extract the variance contribution rate of the principal component and generate the coordinate axis title
xlab <- paste0("PC1 (", round(Proportion_of_Variance[1] * 100, 2), "%)")
ylab <- paste0("PC2 (", round(Proportion_of_Variance[2] * 100, 2), "%)")
# Draw a scatter plot and add a confidence ellipse
p <-
  ggplot(data = df, aes(x = PC1, y = PC2, color = iris.Species)) +
  geom_point() + labs(x = xlab, y = ylab) +
  stat_ellipse(
    aes(fill = iris.Species),
    type = "norm",
    geom = "polygon",
    alpha = 0.2,
    color = NA
  )
p  # display the plot

4.6 In one step

com <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # standardize and fit in one call
df1 <- data.frame(com$x, iris$Species)                    # PC scores with species labels

5 Further readings