Thursday, September 22, 2016

prcomp and svd for Principle Component Analysis (PCA)

Principle Component Analysis(PCA) is a very important skill for dimention reduction to analyze high-dimentional data. High-dimentional data are data with features (p) a lot more than observations (n). This types of data are very commonly generated from high-throuput sequencing experiments. For example, an RNA-seq or microarry experiment measures expression of tens of thousands of genes for only 8 samples (4 controls and 4 treatments).
Let’s use a microarray data for demonstration. One thing to note is that in linear algebra, a matrix is coded as n (rows are observations) X p (columns are features).However, in the microarray/RNA-seq case, the matrix is represented as p (rows are features/genes) X n (columns are observations/samples). That’s why we need to transpose the matrix before feeding the matrix to `prcomp` or `SVD`.
``````library(ISLR)

# transpose the data
ncidat<- t(NCI60\$data)
colnames(ncidat)<- NCI60\$labs

dim(ncidat)``````
``## [1] 6830   64``
``````## it is a data matrix with 64 columns (different tissues) and 6830 rows (genes)
ncidat[1:6,1:6]``````
``````##      CNS      CNS    CNS         RENAL BREAST    CNS
## 1  0.300 0.679961  0.940  2.800000e-01  0.485  0.310
## 2  1.180 1.289961 -0.040 -3.100000e-01 -0.465 -0.030
## 3  0.550 0.169961 -0.170  6.800000e-01  0.395 -0.100
## 4  1.140 0.379961 -0.040 -8.100000e-01  0.905 -0.460
## 5 -0.265 0.464961 -0.605  6.250000e-01  0.200 -0.205
## 6 -0.070 0.579961  0.000 -1.387779e-17 -0.005 -0.540``````

Use `prcomp`

The default R package stats comes with function `prcomp()` to perform principal component analysis. This means that we don’t need to install anything (although there are other options using external packages).
We take the transpose of `ncidat` because prcomp assumes: units/samples in row and features (genes) in columns.
``````## please look at the help page to see the meanings of  center and scale. parameters.
## center and scale can affect the result a lot. Usually we center the data.

pca_prcomp<- prcomp(t(ncidat), center = TRUE, scale. = FALSE)``````
How much variantion is explained by each component:
``plot(pca_prcomp)``
``summary(pca_prcomp)``
``````## Importance of components:
##                            PC1      PC2      PC3      PC4      PC5
## Standard deviation     25.1638 18.78637 16.73078 13.53082 12.78895
## Proportion of Variance  0.1489  0.08301  0.06584  0.04306  0.03847
## Cumulative Proportion   0.1489  0.23194  0.29777  0.34083  0.37930
##                             PC6      PC7      PC8      PC9    PC10   PC11
## Standard deviation     12.21052 11.05840 10.94492 10.59140 9.57657 9.4493
## Proportion of Variance  0.03507  0.02876  0.02817  0.02638 0.02157 0.0210
## Cumulative Proportion   0.41437  0.44313  0.47130  0.49769 0.51926 0.5403
##                           PC12   PC13    PC14    PC15   PC16    PC17
## Standard deviation     9.22659 8.8220 8.66863 8.43185 8.2746 8.19031
## Proportion of Variance 0.02002 0.0183 0.01767 0.01672 0.0161 0.01578
## Cumulative Proportion  0.56028 0.5786 0.59626 0.61298 0.6291 0.64486
##                           PC18    PC19    PC20    PC21    PC22   PC23
## Standard deviation     7.86272 7.84561 7.74753 7.64167 7.41462 7.3207
## Proportion of Variance 0.01454 0.01448 0.01412 0.01373 0.01293 0.0126
## Cumulative Proportion  0.65940 0.67388 0.68799 0.70173 0.71466 0.7273
##                           PC24   PC25    PC26    PC27   PC28    PC29
## Standard deviation     7.09512 7.0817 6.86791 6.71857 6.6190 6.57295
## Proportion of Variance 0.01184 0.0118 0.01109 0.01062 0.0103 0.01016
## Cumulative Proportion  0.73910 0.7509 0.76199 0.77261 0.7829 0.79307
##                           PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     6.50142 6.38411 6.31688 6.10274 6.07004 5.96433
## Proportion of Variance 0.00994 0.00959 0.00938 0.00876 0.00867 0.00837
## Cumulative Proportion  0.80302 0.81260 0.82199 0.83075 0.83941 0.84778
##                           PC36    PC37    PC38    PC39    PC40   PC41
## Standard deviation     5.93322 5.87286 5.82866 5.67923 5.63268 5.5707
## Proportion of Variance 0.00828 0.00811 0.00799 0.00759 0.00746 0.0073
## Cumulative Proportion  0.85606 0.86417 0.87216 0.87975 0.88721 0.8945
##                           PC42   PC43    PC44    PC45   PC46    PC47
## Standard deviation     5.51265 5.4555 5.37942 5.32142 5.1743 5.14470
## Proportion of Variance 0.00715 0.0070 0.00681 0.00666 0.0063 0.00623
## Cumulative Proportion  0.90165 0.9086 0.91546 0.92212 0.9284 0.93464
##                           PC48    PC49    PC50    PC51   PC52    PC53
## Standard deviation     5.01790 4.82436 4.77879 4.69168 4.5637 4.49039
## Proportion of Variance 0.00592 0.00547 0.00537 0.00518 0.0049 0.00474
## Cumulative Proportion  0.94057 0.94604 0.95141 0.95659 0.9615 0.96623
##                           PC54   PC55    PC56    PC57    PC58    PC59
## Standard deviation     4.41142 4.2741 4.21355 4.08613 3.91956 3.78810
## Proportion of Variance 0.00458 0.0043 0.00418 0.00393 0.00361 0.00337
## Cumulative Proportion  0.97081 0.9751 0.97928 0.98320 0.98682 0.99019
##                           PC60    PC61    PC62   PC63      PC64
## Standard deviation     3.52405 3.22882 3.15278 2.9856 1.341e-14
## Proportion of Variance 0.00292 0.00245 0.00234 0.0021 0.000e+00
## Cumulative Proportion  0.99311 0.99557 0.99790 1.0000 1.000e+00``````
``````#sdev refers to the standard deviation of principal components.
pca_prcomp\$sdev``````
``````##  [1] 2.516378e+01 1.878637e+01 1.673078e+01 1.353082e+01 1.278895e+01
##  [6] 1.221052e+01 1.105840e+01 1.094492e+01 1.059140e+01 9.576574e+00
## [11] 9.449313e+00 9.226590e+00 8.821954e+00 8.668634e+00 8.431849e+00
## [16] 8.274578e+00 8.190308e+00 7.862721e+00 7.845612e+00 7.747529e+00
## [21] 7.641665e+00 7.414624e+00 7.320674e+00 7.095120e+00 7.081674e+00
## [26] 6.867907e+00 6.718573e+00 6.618968e+00 6.572955e+00 6.501420e+00
## [31] 6.384107e+00 6.316878e+00 6.102743e+00 6.070035e+00 5.964333e+00
## [36] 5.933221e+00 5.872856e+00 5.828663e+00 5.679232e+00 5.632684e+00
## [41] 5.570718e+00 5.512649e+00 5.455510e+00 5.379416e+00 5.321422e+00
## [46] 5.174272e+00 5.144699e+00 5.017899e+00 4.824356e+00 4.778789e+00
## [51] 4.691679e+00 4.563740e+00 4.490394e+00 4.411423e+00 4.274070e+00
## [56] 4.213548e+00 4.086132e+00 3.919560e+00 3.788098e+00 3.524054e+00
## [61] 3.228818e+00 3.152782e+00 2.985601e+00 1.341106e-14``````
``````## variance explained by each PC cumulatively
cumsum(pca_prcomp\$sdev^2)/sum(pca_prcomp\$sdev^2)``````
``````##  [1] 0.1489294 0.2319364 0.2977720 0.3408323 0.3793002 0.4143671 0.4431287
##  [8] 0.4713030 0.4976867 0.5192567 0.5402571 0.5602793 0.5785838 0.5962576
## [15] 0.6129791 0.6290826 0.6448598 0.6594001 0.6738773 0.6879947 0.7017289
## [22] 0.7146592 0.7272638 0.7391037 0.7508988 0.7619925 0.7726091 0.7829132
## [29] 0.7930745 0.8030158 0.8126016 0.8219866 0.8307461 0.8394120 0.8477786
## [36] 0.8560582 0.8641702 0.8721606 0.8797465 0.8872086 0.8945074 0.9016548
## [43] 0.9086548 0.9154609 0.9221211 0.9284180 0.9346431 0.9405652 0.9460392
## [50] 0.9514103 0.9565874 0.9614860 0.9662284 0.9708055 0.9751019 0.9792776
## [57] 0.9832045 0.9868178 0.9901928 0.9931137 0.9955657 0.9979035 1.0000000
## [64] 1.0000000``````
what’s in the prca_prcomp object
``names(pca_prcomp)``
``## [1] "sdev"     "rotation" "center"   "scale"    "x"``
``````## the first two PCs for the first 6 tissues
``````##              PC1         PC2
## CNS    -19.79578   0.1152691
## CNS    -21.54610  -1.4573503
## CNS    -25.05662   1.5260929
## RENAL  -37.40954 -11.3894784
## BREAST -50.21864  -1.3461737
## CNS    -26.43520   0.4629819``````
``````PC1_and_PC2<- data.frame(PC1=pca_prcomp\$x[,1], PC2= pca_prcomp\$x[,2], type = rownames(pca_prcomp\$x))

## plot PCA plot
library(ggplot2)
ggplot(PC1_and_PC2, aes(x=PC1, y=PC2, col=type)) + geom_point() + geom_text(aes(label = type), hjust=0, vjust=0)``````
``````#This returns each principal components loadings
pca_prcomp\$rotation[1:6, 1:6]``````
``````##            PC1           PC2         PC3          PC4          PC5
## 1 -0.005096247  0.0009839929 0.002116058  0.007628801 -0.012118316
## 2 -0.001642354  0.0034355664 0.008049350  0.004910196 -0.007412249
## 3 -0.002509243 -0.0015838271 0.004746350  0.008769557 -0.012426296
## 4  0.004940063  0.0078435347 0.013716214  0.011378816 -0.014851587
## 5 -0.003365039 -0.0002680479 0.010677453 -0.005249648 -0.003317312
## 6  0.001382038 -0.0034431320 0.003167842  0.007425971 -0.002879251
##             PC6
## 1  0.0061392242
## 2  0.0114429879
## 3 -0.0002860562
## 4 -0.0065009935
## 5 -0.0003513787
## 6  0.0003210365``````

Use singluar value decomposition

in a svd analysis, a matrix n x p matrix X is decomposed by `X = U*D*V`: 1.U is an m×n orthogonal matrix. 2.V is an n×n orthogonal matrix. 3.D is an n×n diagonal matrix.
PCs: Z = XV or Z = UD (U are un-scaled PCs)
Some facts of PCA:
k th column of Z, Z(k), is the k th PC.(the k th pattern)
PC loadings: V k th column of V, V(k) is the k th PC loading (feature weights). aka. the k th column of V encodes the associated k th pattern in feature space.
PC loadings: U k th column of U, U(k) is the k th PC loading (observation weights). aka. the k th column of U encodes the associated k th pattern in observation space.
Diagnal matrix: D diagnals in D: d(k) gives the strength of the k th pattern.
Variance explained by k th PC: d(k)^2 Total variance of the data: sum(d(k1)^2 + d(k2)^2 + …..d(k)^2+….)
proportion of variane explained by k th PC: d(k)^2 / sum(d(k1)^2 + d(k2)^2 + …..d(k)^2+….)
``````X<- t(scale(t(ncidat),center=TRUE,scale=FALSE))
# we transpose X again for svd
# usually there is no need to transpose the matrix and then transpose it back, but svd was written that rows are observations and columns are
# features.however, most genomic data represent observations (e.g. samples) in columns and features (e.g. genes) in columns.

sv<- svd(t(X))
U<- sv\$u
V<- sv\$v
D<- sv\$d

## U are un-scaled PC, Z is scaled
Z<- t(X)%*%V

## PCs
Z[1:6, 1:6]``````
``````##             [,1]        [,2]       [,3]      [,4]       [,5]     [,6]
## CNS    -19.79578   0.1152691  -5.968917  4.753293  -4.882164 18.92591
## CNS    -21.54610  -1.4573503  -9.019584  6.767942  -2.247604 17.07273
## CNS    -25.05662   1.5260929  -6.959653  2.785913 -10.819648 16.45389
## RENAL  -37.40954 -11.3894784  -5.407097 15.442094 -16.011475 33.09651
## BREAST -50.21864  -1.3461737 -17.599944 15.099862 -13.852847 16.94340
## CNS    -26.43520   0.4629819 -16.931456 11.389195  -6.742920 11.85838``````
``````## is the same as
pca_prcomp\$x[1:6, 1:6]``````
``````##              PC1         PC2        PC3       PC4        PC5      PC6
## CNS    -19.79578   0.1152691  -5.968917  4.753293  -4.882164 18.92591
## CNS    -21.54610  -1.4573503  -9.019584  6.767942  -2.247604 17.07273
## CNS    -25.05662   1.5260929  -6.959653  2.785913 -10.819648 16.45389
## RENAL  -37.40954 -11.3894784  -5.407097 15.442094 -16.011475 33.09651
## BREAST -50.21864  -1.3461737 -17.599944 15.099862 -13.852847 16.94340
## CNS    -26.43520   0.4629819 -16.931456 11.389195  -6.742920 11.85838``````
``````## PC loadings
V[1:6, 1:6]``````
``````##              [,1]          [,2]        [,3]         [,4]         [,5]
## [1,] -0.005096247  0.0009839929 0.002116058  0.007628801 -0.012118316
## [2,] -0.001642354  0.0034355664 0.008049350  0.004910196 -0.007412249
## [3,] -0.002509243 -0.0015838271 0.004746350  0.008769557 -0.012426296
## [4,]  0.004940063  0.0078435347 0.013716214  0.011378816 -0.014851587
## [5,] -0.003365039 -0.0002680479 0.010677453 -0.005249648 -0.003317312
## [6,]  0.001382038 -0.0034431320 0.003167842  0.007425971 -0.002879251
##               [,6]
## [1,]  0.0061392242
## [2,]  0.0114429879
## [3,] -0.0002860562
## [4,] -0.0065009935
## [5,] -0.0003513787
## [6,]  0.0003210365``````
``````## is the same as
pca_prcomp\$rotation[1:6, 1:6]``````
``````##            PC1           PC2         PC3          PC4          PC5
## 1 -0.005096247  0.0009839929 0.002116058  0.007628801 -0.012118316
## 2 -0.001642354  0.0034355664 0.008049350  0.004910196 -0.007412249
## 3 -0.002509243 -0.0015838271 0.004746350  0.008769557 -0.012426296
## 4  0.004940063  0.0078435347 0.013716214  0.011378816 -0.014851587
## 5 -0.003365039 -0.0002680479 0.010677453 -0.005249648 -0.003317312
## 6  0.001382038 -0.0034431320 0.003167842  0.007425971 -0.002879251
##             PC6
## 1  0.0061392242
## 2  0.0114429879
## 3 -0.0002860562
## 4 -0.0065009935
## 5 -0.0003513787
## 6  0.0003210365``````
``````# plot PC1 vs PC2

pc_dat<- data.frame(type = rownames(Z), PC1 = Z[,1], PC2= Z[,2])
ggplot(pc_dat,aes(x=PC1, y=PC2, col=type)) + geom_point() + geom_text(aes(label = type), hjust=0, vjust=0)``````
We get the same results as from the `prcomp` function.

Variance explained for each PC

``````varex = 0
cumvar = 0
denom = sum(D^2)
for(i in 1:length(D)){
varex[i] = D[i]^2/denom
cumvar[i] = sum(D[1:i]^2)/denom
}

## variance explained by each PC cumulatively
cumvar``````
``````##  [1] 0.1489294 0.2319364 0.2977720 0.3408323 0.3793002 0.4143671 0.4431287
##  [8] 0.4713030 0.4976867 0.5192567 0.5402571 0.5602793 0.5785838 0.5962576
## [15] 0.6129791 0.6290826 0.6448598 0.6594001 0.6738773 0.6879947 0.7017289
## [22] 0.7146592 0.7272638 0.7391037 0.7508988 0.7619925 0.7726091 0.7829132
## [29] 0.7930745 0.8030158 0.8126016 0.8219866 0.8307461 0.8394120 0.8477786
## [36] 0.8560582 0.8641702 0.8721606 0.8797465 0.8872086 0.8945074 0.9016548
## [43] 0.9086548 0.9154609 0.9221211 0.9284180 0.9346431 0.9405652 0.9460392
## [50] 0.9514103 0.9565874 0.9614860 0.9662284 0.9708055 0.9751019 0.9792776
## [57] 0.9832045 0.9868178 0.9901928 0.9931137 0.9955657 0.9979035 1.0000000
## [64] 1.0000000``````
It is the same as the result of `cumsum(pca_prcomp\$sdev^2)/sum(pca_prcomp\$sdev^2)` above.
Screeplot
``````par(mfrow=c(1,2))
plot(1:length(D),varex,type="l",lwd=2,xlab="PC",ylab="% Variance Explained")
plot(1:length(D),cumvar,type="l",lwd=2,xlab="PC",ylab="Cummulative Variance Explained")``````