Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, September 22, 2016

prcomp and svd for Principle Component Analysis (PCA)

Principle Component Analysis(PCA) is a very important skill for dimention reduction to analyze high-dimentional data. High-dimentional data are data with features (p) a lot more than observations (n). This types of data are very commonly generated from high-throuput sequencing experiments. For example, an RNA-seq or microarry experiment measures expression of tens of thousands of genes for only 8 samples (4 controls and 4 treatments).
Let’s use a microarray data for demonstration. One thing to note is that in linear algebra, a matrix is coded as n (rows are observations) X p (columns are features).However, in the microarray/RNA-seq case, the matrix is represented as p (rows are features/genes) X n (columns are observations/samples). That’s why we need to transpose the matrix before feeding the matrix to prcomp or SVD.
library(ISLR)

# transpose the data
ncidat<- t(NCI60$data)
colnames(ncidat)<- NCI60$labs

dim(ncidat)
## [1] 6830   64
## it is a data matrix with 64 columns (different tissues) and 6830 rows (genes)
ncidat[1:6,1:6]
##      CNS      CNS    CNS         RENAL BREAST    CNS
## 1  0.300 0.679961  0.940  2.800000e-01  0.485  0.310
## 2  1.180 1.289961 -0.040 -3.100000e-01 -0.465 -0.030
## 3  0.550 0.169961 -0.170  6.800000e-01  0.395 -0.100
## 4  1.140 0.379961 -0.040 -8.100000e-01  0.905 -0.460
## 5 -0.265 0.464961 -0.605  6.250000e-01  0.200 -0.205
## 6 -0.070 0.579961  0.000 -1.387779e-17 -0.005 -0.540

Use prcomp

The default R package stats comes with function prcomp() to perform principal component analysis. This means that we don’t need to install anything (although there are other options using external packages).
We take the transpose of ncidat because prcomp assumes: units/samples in row and features (genes) in columns.
## please look at the help page to see the meanings of  center and scale. parameters.
## center and scale can affect the result a lot. Usually we center the data.

pca_prcomp<- prcomp(t(ncidat), center = TRUE, scale. = FALSE)
How much variantion is explained by each component:
plot(pca_prcomp)
summary(pca_prcomp)
## Importance of components:
##                            PC1      PC2      PC3      PC4      PC5
## Standard deviation     25.1638 18.78637 16.73078 13.53082 12.78895
## Proportion of Variance  0.1489  0.08301  0.06584  0.04306  0.03847
## Cumulative Proportion   0.1489  0.23194  0.29777  0.34083  0.37930
##                             PC6      PC7      PC8      PC9    PC10   PC11
## Standard deviation     12.21052 11.05840 10.94492 10.59140 9.57657 9.4493
## Proportion of Variance  0.03507  0.02876  0.02817  0.02638 0.02157 0.0210
## Cumulative Proportion   0.41437  0.44313  0.47130  0.49769 0.51926 0.5403
##                           PC12   PC13    PC14    PC15   PC16    PC17
## Standard deviation     9.22659 8.8220 8.66863 8.43185 8.2746 8.19031
## Proportion of Variance 0.02002 0.0183 0.01767 0.01672 0.0161 0.01578
## Cumulative Proportion  0.56028 0.5786 0.59626 0.61298 0.6291 0.64486
##                           PC18    PC19    PC20    PC21    PC22   PC23
## Standard deviation     7.86272 7.84561 7.74753 7.64167 7.41462 7.3207
## Proportion of Variance 0.01454 0.01448 0.01412 0.01373 0.01293 0.0126
## Cumulative Proportion  0.65940 0.67388 0.68799 0.70173 0.71466 0.7273
##                           PC24   PC25    PC26    PC27   PC28    PC29
## Standard deviation     7.09512 7.0817 6.86791 6.71857 6.6190 6.57295
## Proportion of Variance 0.01184 0.0118 0.01109 0.01062 0.0103 0.01016
## Cumulative Proportion  0.73910 0.7509 0.76199 0.77261 0.7829 0.79307
##                           PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     6.50142 6.38411 6.31688 6.10274 6.07004 5.96433
## Proportion of Variance 0.00994 0.00959 0.00938 0.00876 0.00867 0.00837
## Cumulative Proportion  0.80302 0.81260 0.82199 0.83075 0.83941 0.84778
##                           PC36    PC37    PC38    PC39    PC40   PC41
## Standard deviation     5.93322 5.87286 5.82866 5.67923 5.63268 5.5707
## Proportion of Variance 0.00828 0.00811 0.00799 0.00759 0.00746 0.0073
## Cumulative Proportion  0.85606 0.86417 0.87216 0.87975 0.88721 0.8945
##                           PC42   PC43    PC44    PC45   PC46    PC47
## Standard deviation     5.51265 5.4555 5.37942 5.32142 5.1743 5.14470
## Proportion of Variance 0.00715 0.0070 0.00681 0.00666 0.0063 0.00623
## Cumulative Proportion  0.90165 0.9086 0.91546 0.92212 0.9284 0.93464
##                           PC48    PC49    PC50    PC51   PC52    PC53
## Standard deviation     5.01790 4.82436 4.77879 4.69168 4.5637 4.49039
## Proportion of Variance 0.00592 0.00547 0.00537 0.00518 0.0049 0.00474
## Cumulative Proportion  0.94057 0.94604 0.95141 0.95659 0.9615 0.96623
##                           PC54   PC55    PC56    PC57    PC58    PC59
## Standard deviation     4.41142 4.2741 4.21355 4.08613 3.91956 3.78810
## Proportion of Variance 0.00458 0.0043 0.00418 0.00393 0.00361 0.00337
## Cumulative Proportion  0.97081 0.9751 0.97928 0.98320 0.98682 0.99019
##                           PC60    PC61    PC62   PC63      PC64
## Standard deviation     3.52405 3.22882 3.15278 2.9856 1.341e-14
## Proportion of Variance 0.00292 0.00245 0.00234 0.0021 0.000e+00
## Cumulative Proportion  0.99311 0.99557 0.99790 1.0000 1.000e+00
#sdev refers to the standard deviation of principal components.
pca_prcomp$sdev
##  [1] 2.516378e+01 1.878637e+01 1.673078e+01 1.353082e+01 1.278895e+01
##  [6] 1.221052e+01 1.105840e+01 1.094492e+01 1.059140e+01 9.576574e+00
## [11] 9.449313e+00 9.226590e+00 8.821954e+00 8.668634e+00 8.431849e+00
## [16] 8.274578e+00 8.190308e+00 7.862721e+00 7.845612e+00 7.747529e+00
## [21] 7.641665e+00 7.414624e+00 7.320674e+00 7.095120e+00 7.081674e+00
## [26] 6.867907e+00 6.718573e+00 6.618968e+00 6.572955e+00 6.501420e+00
## [31] 6.384107e+00 6.316878e+00 6.102743e+00 6.070035e+00 5.964333e+00
## [36] 5.933221e+00 5.872856e+00 5.828663e+00 5.679232e+00 5.632684e+00
## [41] 5.570718e+00 5.512649e+00 5.455510e+00 5.379416e+00 5.321422e+00
## [46] 5.174272e+00 5.144699e+00 5.017899e+00 4.824356e+00 4.778789e+00
## [51] 4.691679e+00 4.563740e+00 4.490394e+00 4.411423e+00 4.274070e+00
## [56] 4.213548e+00 4.086132e+00 3.919560e+00 3.788098e+00 3.524054e+00
## [61] 3.228818e+00 3.152782e+00 2.985601e+00 1.341106e-14
## variance explained by each PC cumulatively
cumsum(pca_prcomp$sdev^2)/sum(pca_prcomp$sdev^2)
##  [1] 0.1489294 0.2319364 0.2977720 0.3408323 0.3793002 0.4143671 0.4431287
##  [8] 0.4713030 0.4976867 0.5192567 0.5402571 0.5602793 0.5785838 0.5962576
## [15] 0.6129791 0.6290826 0.6448598 0.6594001 0.6738773 0.6879947 0.7017289
## [22] 0.7146592 0.7272638 0.7391037 0.7508988 0.7619925 0.7726091 0.7829132
## [29] 0.7930745 0.8030158 0.8126016 0.8219866 0.8307461 0.8394120 0.8477786
## [36] 0.8560582 0.8641702 0.8721606 0.8797465 0.8872086 0.8945074 0.9016548
## [43] 0.9086548 0.9154609 0.9221211 0.9284180 0.9346431 0.9405652 0.9460392
## [50] 0.9514103 0.9565874 0.9614860 0.9662284 0.9708055 0.9751019 0.9792776
## [57] 0.9832045 0.9868178 0.9901928 0.9931137 0.9955657 0.9979035 1.0000000
## [64] 1.0000000
what’s in the prca_prcomp object
names(pca_prcomp)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
## the first two PCs for the first 6 tissues
head(pca_prcomp$x[,1:2])
##              PC1         PC2
## CNS    -19.79578   0.1152691
## CNS    -21.54610  -1.4573503
## CNS    -25.05662   1.5260929
## RENAL  -37.40954 -11.3894784
## BREAST -50.21864  -1.3461737
## CNS    -26.43520   0.4629819
PC1_and_PC2<- data.frame(PC1=pca_prcomp$x[,1], PC2= pca_prcomp$x[,2], type = rownames(pca_prcomp$x))

## plot PCA plot
library(ggplot2)
ggplot(PC1_and_PC2, aes(x=PC1, y=PC2, col=type)) + geom_point() + geom_text(aes(label = type), hjust=0, vjust=0)
#This returns each principal components loadings
pca_prcomp$rotation[1:6, 1:6]
##            PC1           PC2         PC3          PC4          PC5
## 1 -0.005096247  0.0009839929 0.002116058  0.007628801 -0.012118316
## 2 -0.001642354  0.0034355664 0.008049350  0.004910196 -0.007412249
## 3 -0.002509243 -0.0015838271 0.004746350  0.008769557 -0.012426296
## 4  0.004940063  0.0078435347 0.013716214  0.011378816 -0.014851587
## 5 -0.003365039 -0.0002680479 0.010677453 -0.005249648 -0.003317312
## 6  0.001382038 -0.0034431320 0.003167842  0.007425971 -0.002879251
##             PC6
## 1  0.0061392242
## 2  0.0114429879
## 3 -0.0002860562
## 4 -0.0065009935
## 5 -0.0003513787
## 6  0.0003210365

Use singluar value decomposition

in a svd analysis, a matrix n x p matrix X is decomposed by X = U*D*V: 1.U is an m×n orthogonal matrix. 2.V is an n×n orthogonal matrix. 3.D is an n×n diagonal matrix.
PCs: Z = XV or Z = UD (U are un-scaled PCs)
Some facts of PCA:
k th column of Z, Z(k), is the k th PC.(the k th pattern)
PC loadings: V k th column of V, V(k) is the k th PC loading (feature weights). aka. the k th column of V encodes the associated k th pattern in feature space.
PC loadings: U k th column of U, U(k) is the k th PC loading (observation weights). aka. the k th column of U encodes the associated k th pattern in observation space.
Diagnal matrix: D diagnals in D: d(k) gives the strength of the k th pattern.
Variance explained by k th PC: d(k)^2 Total variance of the data: sum(d(k1)^2 + d(k2)^2 + …..d(k)^2+….)
proportion of variane explained by k th PC: d(k)^2 / sum(d(k1)^2 + d(k2)^2 + …..d(k)^2+….)
X<- t(scale(t(ncidat),center=TRUE,scale=FALSE))
# we transpose X again for svd
# usually there is no need to transpose the matrix and then transpose it back, but svd was written that rows are observations and columns are 
# features.however, most genomic data represent observations (e.g. samples) in columns and features (e.g. genes) in columns.

sv<- svd(t(X))
U<- sv$u
V<- sv$v
D<- sv$d


## U are un-scaled PC, Z is scaled
Z<- t(X)%*%V

## PCs
Z[1:6, 1:6]
##             [,1]        [,2]       [,3]      [,4]       [,5]     [,6]
## CNS    -19.79578   0.1152691  -5.968917  4.753293  -4.882164 18.92591
## CNS    -21.54610  -1.4573503  -9.019584  6.767942  -2.247604 17.07273
## CNS    -25.05662   1.5260929  -6.959653  2.785913 -10.819648 16.45389
## RENAL  -37.40954 -11.3894784  -5.407097 15.442094 -16.011475 33.09651
## BREAST -50.21864  -1.3461737 -17.599944 15.099862 -13.852847 16.94340
## CNS    -26.43520   0.4629819 -16.931456 11.389195  -6.742920 11.85838
## is the same as
pca_prcomp$x[1:6, 1:6]
##              PC1         PC2        PC3       PC4        PC5      PC6
## CNS    -19.79578   0.1152691  -5.968917  4.753293  -4.882164 18.92591
## CNS    -21.54610  -1.4573503  -9.019584  6.767942  -2.247604 17.07273
## CNS    -25.05662   1.5260929  -6.959653  2.785913 -10.819648 16.45389
## RENAL  -37.40954 -11.3894784  -5.407097 15.442094 -16.011475 33.09651
## BREAST -50.21864  -1.3461737 -17.599944 15.099862 -13.852847 16.94340
## CNS    -26.43520   0.4629819 -16.931456 11.389195  -6.742920 11.85838
## PC loadings
V[1:6, 1:6]
##              [,1]          [,2]        [,3]         [,4]         [,5]
## [1,] -0.005096247  0.0009839929 0.002116058  0.007628801 -0.012118316
## [2,] -0.001642354  0.0034355664 0.008049350  0.004910196 -0.007412249
## [3,] -0.002509243 -0.0015838271 0.004746350  0.008769557 -0.012426296
## [4,]  0.004940063  0.0078435347 0.013716214  0.011378816 -0.014851587
## [5,] -0.003365039 -0.0002680479 0.010677453 -0.005249648 -0.003317312
## [6,]  0.001382038 -0.0034431320 0.003167842  0.007425971 -0.002879251
##               [,6]
## [1,]  0.0061392242
## [2,]  0.0114429879
## [3,] -0.0002860562
## [4,] -0.0065009935
## [5,] -0.0003513787
## [6,]  0.0003210365
## is the same as 
pca_prcomp$rotation[1:6, 1:6]
##            PC1           PC2         PC3          PC4          PC5
## 1 -0.005096247  0.0009839929 0.002116058  0.007628801 -0.012118316
## 2 -0.001642354  0.0034355664 0.008049350  0.004910196 -0.007412249
## 3 -0.002509243 -0.0015838271 0.004746350  0.008769557 -0.012426296
## 4  0.004940063  0.0078435347 0.013716214  0.011378816 -0.014851587
## 5 -0.003365039 -0.0002680479 0.010677453 -0.005249648 -0.003317312
## 6  0.001382038 -0.0034431320 0.003167842  0.007425971 -0.002879251
##             PC6
## 1  0.0061392242
## 2  0.0114429879
## 3 -0.0002860562
## 4 -0.0065009935
## 5 -0.0003513787
## 6  0.0003210365
# plot PC1 vs PC2

pc_dat<- data.frame(type = rownames(Z), PC1 = Z[,1], PC2= Z[,2])
ggplot(pc_dat,aes(x=PC1, y=PC2, col=type)) + geom_point() + geom_text(aes(label = type), hjust=0, vjust=0)
We get the same results as from the prcomp function.

Variance explained for each PC

varex = 0
cumvar = 0
denom = sum(D^2)
for(i in 1:length(D)){
  varex[i] = D[i]^2/denom
  cumvar[i] = sum(D[1:i]^2)/denom
}

## variance explained by each PC cumulatively
cumvar
##  [1] 0.1489294 0.2319364 0.2977720 0.3408323 0.3793002 0.4143671 0.4431287
##  [8] 0.4713030 0.4976867 0.5192567 0.5402571 0.5602793 0.5785838 0.5962576
## [15] 0.6129791 0.6290826 0.6448598 0.6594001 0.6738773 0.6879947 0.7017289
## [22] 0.7146592 0.7272638 0.7391037 0.7508988 0.7619925 0.7726091 0.7829132
## [29] 0.7930745 0.8030158 0.8126016 0.8219866 0.8307461 0.8394120 0.8477786
## [36] 0.8560582 0.8641702 0.8721606 0.8797465 0.8872086 0.8945074 0.9016548
## [43] 0.9086548 0.9154609 0.9221211 0.9284180 0.9346431 0.9405652 0.9460392
## [50] 0.9514103 0.9565874 0.9614860 0.9662284 0.9708055 0.9751019 0.9792776
## [57] 0.9832045 0.9868178 0.9901928 0.9931137 0.9955657 0.9979035 1.0000000
## [64] 1.0000000
It is the same as the result of cumsum(pca_prcomp$sdev^2)/sum(pca_prcomp$sdev^2) above.
Screeplot
par(mfrow=c(1,2))
plot(1:length(D),varex,type="l",lwd=2,xlab="PC",ylab="% Variance Explained")
plot(1:length(D),cumvar,type="l",lwd=2,xlab="PC",ylab="Cummulative Variance Explained")