Diving into Genetics and Genomics: Heatmap demystified part II: Using heatmap to represent continuous values

Using heatmap to represent continuous values.

In my last blog post, I showed an example to use heatmap to represent discrete values. I am going to continue the theme to introduce heatmap to represent continuous values. As I mentioned before, I will stress on three main points:

Whether or not scale your data (center/standardize your data or not).

Make sure you know the range of the data and do reasonable color mapping.

How to perform the clustering. (which distance measure and linkage method to use).

In order to make the example reproducible and interesting to biologists, I am going to use an RNAseq data set downloaded by the `recount`package. The data contains 347 single cell RNA-seq data of 11 different distinct populations. Please read this Nature biotechnology paper: Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. I will try to reproduce the figures in the paper.

Download the data by recount

#source("https://bioconductor.org/biocLite.R")

## use development branch of bioc
# useDevel()
#biocLite("recount")
library(recount)
library(dplyr)

## if you go to the paper you can find the project name
project = 'SRP041736'

# I already downloaded it, so comment it out
# download_study(project)

## load the data
load(file.path(project, 'rse_gene.Rdata'))

#We can explore a bit this RangedSummarizedExperiment 
rse_gene

## class: RangedSummarizedExperiment 
## dim: 23779 694 
## metadata(0):
## assays(1): counts
## rownames(23779): 1 10 ... 9994 9997
## rowData names(3): gene_id bp_length symbol
## colnames(694): SRR1274192 SRR1274193 ... SRR1275349 SRR1275353
## colData names(21): project sample ... title characteristics

colData(rse_gene) %>% head()

## DataFrame with 6 rows and 21 columns
##                project      sample  experiment         run
##            <character> <character> <character> <character>
## SRR1274192   SRP041736   SRS603194   SRX534477  SRR1274192
## SRR1274193   SRP041736   SRS603194   SRX534477  SRR1274193
## SRR1274194   SRP041736   SRS603195   SRX534478  SRR1274194
## SRR1274195   SRP041736   SRS603195   SRX534478  SRR1274195
## SRR1274196   SRP041736   SRS603196   SRX534479  SRR1274196
## SRR1274197   SRP041736   SRS603196   SRX534479  SRR1274197
##            read_count_as_reported_by_sra reads_downloaded
##                                <integer>        <integer>
## SRR1274192                      16260602         16260602
## SRR1274193                         76776            76776
## SRR1274194                      13225500         13225500
## SRR1274195                         64552            64552
## SRR1274196                      16521630         16521630
## SRR1274197                         83534            83534
##            proportion_of_reads_reported_by_sra_downloaded paired_end
##                                                 <numeric>  <logical>
## SRR1274192                                              1       TRUE
## SRR1274193                                              1       TRUE
## SRR1274194                                              1       TRUE
## SRR1274195                                              1       TRUE
## SRR1274196                                              1       TRUE
## SRR1274197                                              1       TRUE
##            sra_misreported_paired_end mapped_read_count        auc
##                             <logical>         <integer>  <numeric>
## SRR1274192                      FALSE          16090529 1541642379
## SRR1274193                      FALSE             69879    2016766
## SRR1274194                      FALSE          13092059 1258982469
## SRR1274195                      FALSE             59210    1708001
## SRR1274196                      FALSE          16368219 1580611206
## SRR1274197                      FALSE             76808    2217637
##            sharq_beta_tissue sharq_beta_cell_type
##                  <character>          <character>
## SRR1274192                NA                   NA
## SRR1274193                NA                   NA
## SRR1274194                NA                   NA
## SRR1274195                NA                   NA
## SRR1274196                NA                   NA
## SRR1274197                NA                   NA
##            biosample_submission_date biosample_publication_date
##                          <character>                <character>
## SRR1274192   2014-04-22T17:28:44.407    2014-08-02T00:44:06.583
## SRR1274193   2014-04-22T17:28:44.407    2014-08-02T00:44:06.583
## SRR1274194   2014-04-22T17:28:45.247    2014-08-02T00:44:07.100
## SRR1274195   2014-04-22T17:28:45.247    2014-08-02T00:44:07.100
## SRR1274196   2014-04-22T17:28:45.343    2014-08-02T00:44:07.150
## SRR1274197   2014-04-22T17:28:45.343    2014-08-02T00:44:07.150
##              biosample_update_date avg_read_length geo_accession
##                        <character>       <integer>   <character>
## SRR1274192 2014-08-02T00:44:06.583             202            NA
## SRR1274193 2014-08-02T00:44:06.583              60            NA
## SRR1274194 2014-08-02T00:44:07.100             202            NA
## SRR1274195 2014-08-02T00:44:07.100              60            NA
## SRR1274196 2014-08-02T00:44:07.150             202            NA
## SRR1274197 2014-08-02T00:44:07.150              60            NA
##              bigwig_file       title characteristics
##              <character> <character> <CharacterList>
## SRR1274192 SRR1274192.bw          NA              NA
## SRR1274193 SRR1274193.bw          NA              NA
## SRR1274194 SRR1274194.bw          NA              NA
## SRR1274195 SRR1274195.bw          NA              NA
## SRR1274196 SRR1274196.bw          NA              NA
## SRR1274197 SRR1274197.bw          NA              NA

## At the gene level, the row data includes the gene ENTREZ ids, the gene
## symbols and the sum of the reduced exons widths, which can be used for 
## taking into account the gene length.
rowData(rse_gene)

## DataFrame with 23779 rows and 3 columns
##           gene_id bp_length      symbol
##       <character> <integer> <character>
## 1               1      4027        A1BG
## 2              10      1317        NAT2
## 3             100      1532         ADA
## 4            1000      4473        CDH2
## 5       100008589      5071     RNA28S5
## ...           ...       ...         ...
## 23775        9991      8234       PTBP3
## 23776        9992       803       KCNE2
## 23777        9993      4882       DGCR2
## 23778        9994      6763    CASP8AP2
## 23779        9997      1393        SCO2

# browse_study(project)

Preparing the metadata

It turns out that the metadata returned by recount are many NAs. I have to go to SRA run selector to find the metadata. great thanks to@Ayush.

search SRP041736 in the search box, and it will return all the runs for the project, and I downloaded the RunInfo Table.

SRAdb bioconductor package maybe able to search and download the metadata programmatically.

Aso check sra-search from NCBI.

Now, read in the metadata downloaded.

library(readr)
sra.runs<- read_tsv("~/Downloads/SraRunTable.txt")

## much more info I need!
head(sra.runs)

## Source: local data frame [6 x 35]
## 
##    BioSample_s Experiment_s Library_Name_s LoadDate_s MBases_l MBytes_l
##          (chr)        (chr)          (chr)     (date)    (int)    (int)
## 1 SAMN02731786    SRX534477         2338_1 2014-05-08     1566     1094
## 2 SAMN02731786    SRX534477         2338_1 2014-05-08        2        1
## 3 SAMN02731795    SRX534478        2338_10 2014-05-08     1273      892
## 4 SAMN02731795    SRX534478        2338_10 2014-05-08        1        1
## 5 SAMN02731796    SRX534479        2338_11 2014-05-08     1591     1114
## 6 SAMN02731796    SRX534479        2338_11 2014-05-08        2        1
## Variables not shown: Run_s (chr), SRA_Sample_s (chr), Sample_Name_s (chr),
##   age_s (chr), biomaterial_provider_s (chr), cell_line_s (chr), disease_s
##   (chr), disease_stage_s (chr), population_s (chr), sex_s (chr), tissue_s
##   (chr), Assay_Type_s (chr), AssemblyName_s (chr), BioProject_s (chr),
##   BioSampleModel_s (chr), Center_Name_s (chr), Consent_s (chr),
##   InsertSize_l (int), LibraryLayout_s (chr), LibrarySelection_s (chr),
##   LibrarySource_s (chr), Organism_s (chr), Platform_s (chr), ReleaseDate_s
##   (date), SRA_Study_s (chr), g1k_analysis_group_s (chr), g1k_pop_code_s
##   (chr), isolate_s (chr), source_s (chr)

# View(sra.runs)

The function scale_counts() helps you scale them in a way that is tailored to Rail-RNA output.

## Scale counts by taking into account the total coverage per sample
rse <- scale_counts(rse_gene)

## check the count table

assay(rse)[1:6, 1:20]

##           SRR1274192 SRR1274193 SRR1274194 SRR1274195 SRR1274196
## 1                330          0          6          0        265
## 10                94          0          0          0          0
## 100             1905       3451          0          0          4
## 1000               3          0          8          0          6
## 100008589       2026       1567        921       1358       1622
## 100009613          0          0          0          0          0
##           SRR1274197 SRR1274198 SRR1274199 SRR1274200 SRR1274201
## 1                  0          1          0        430          0
## 10                 0          0          0          0          0
## 100                0        730          0          0          0
## 1000               0          6          0          6          0
## 100008589       1046       1445       1809       1283       2364
## 100009613          0          0          0          0          0
##           SRR1274202 SRR1274203 SRR1274204 SRR1274205 SRR1274206
## 1                 54          0        931        959          0
## 10                 0          0          0          0          0
## 100               12          0          0          0        452
## 1000               0          0          0          0         12
## 100008589       1703       1455       1812        843       2942
## 100009613          0          0          0          0          0
##           SRR1274207 SRR1274208 SRR1274209 SRR1274210 SRR1274211
## 1                  0          5          0        141          0
## 10                 0          0          0          0          0
## 100             2400         23          0        218          0
## 1000               0          3          0          2          0
## 100008589       4737       1047          0       3030       1993
## 100009613          0          0          0          0          0

match the sra.runs$Run_s to the colData(rse)$run

sra.runs$Run_s %in% colData(rse)$run %>% table()

## .
## FALSE  TRUE 
##     2   694

colData(rse)$run  %in% sra.runs$Run_s %>% table()

## .
## TRUE 
##  694

# which runs are not avaiable in the count matrix?
sra.runs$Run_s[! sra.runs$Run_s %in% colData(rse)$run]

## [1] "SRR1274250" "SRR1274344"

## there should be two same samples, one is deep sequenced, the other is shallow sequenced (downsampled)
colData(rse) %>% as.data.frame() %>% group_by(sample) %>% summarise(n=n()) %>% arrange(n) %>% head()

## Source: local data frame [6 x 2]
## 
##      sample     n
##       (chr) (int)
## 1 SRS603223     1
## 2 SRS603270     1
## 3 SRS603194     2
## 4 SRS603195     2
## 5 SRS603196     2
## 6 SRS603197     2

sra.runs %>% dplyr::filter(Run_s == "SRR1274250") %>% select(SRA_Sample_s)

## Source: local data frame [1 x 1]
## 
##   SRA_Sample_s
##          (chr)
## 1    SRS603223

sra.runs %>% dplyr::filter(Run_s == "SRR1274344") %>% select(SRA_Sample_s)

## Source: local data frame [1 x 1]
## 
##   SRA_Sample_s
##          (chr)
## 1    SRS603270

colData(rse) %>% as.data.frame() %>% dplyr::filter(sample == "SRS603223")

##     project    sample experiment        run read_count_as_reported_by_sra
## 1 SRP041736 SRS603223  SRX534506 SRR1274251                         76576
##   reads_downloaded proportion_of_reads_reported_by_sra_downloaded
## 1            76576                                              1
##   paired_end sra_misreported_paired_end mapped_read_count     auc
## 1       TRUE                      FALSE             68808 1984754
##   sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
## 1              <NA>                 <NA>   2014-04-22T17:28:47.330
##   biosample_publication_date   biosample_update_date avg_read_length
## 1    2014-08-02T00:44:08.360 2014-08-02T00:44:08.360              60
##   geo_accession   bigwig_file title characteristics
## 1          <NA> SRR1274251.bw  <NA>              NA

colData(rse) %>% as.data.frame() %>% dplyr::filter(sample == "SRS603270")

##     project    sample experiment        run read_count_as_reported_by_sra
## 1 SRP041736 SRS603270  SRX534553 SRR1274345                        169528
##   reads_downloaded proportion_of_reads_reported_by_sra_downloaded
## 1           169528                                              1
##   paired_end sra_misreported_paired_end mapped_read_count      auc
## 1       TRUE                      FALSE            150720 14263581
##   sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
## 1              <NA>                 <NA>   2014-04-22T17:29:04.727
##   biosample_publication_date   biosample_update_date avg_read_length
## 1    2014-08-02T00:44:25.317 2014-08-02T00:44:25.317             196
##   geo_accession   bigwig_file title characteristics
## 1          <NA> SRR1274345.bw  <NA>              NA

These two samples are shallow sequenced ones according to the mapped_read_count number above. select out the smaller one to be the shallow sequenced ones.

shallow.run<- colData(rse) %>% as.data.frame() %>% group_by(sample) %>% dplyr::slice(which.min(mapped_read_count)) %>% ungroup()

boxplot(shallow.run$mapped_read_count)
title(main = "number of mapped reads for the shallow sequenced samples")

## merge the meta data
shallow.run <- shallow.run %>% left_join(sra.runs, by= c("run" = "Run_s"))

## summary of mapped_read_count in shallow ones. median of  ~0.2 million reads
summary(shallow.run$mapped_read_count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11540  128900  201600  239800  323400  917300

## exclude the shallow runs to find the deep runs
shallow.runs<- shallow.run$run

deep.run<- colData(rse) %>% as.data.frame() %>% dplyr::filter(! run %in% shallow.runs) %>% ungroup()

boxplot(deep.run$mapped_read_count)
title(main = "number of mapped reads for the deep sequenced samples")

## merge the meta data
deep.run <- deep.run %>% left_join(sra.runs, by = c("run" = "Run_s"))

The data are always messy, the cell line name is spread to different columns. I will use regular expression to extract the first word until the space from the population_s column.

shallow.run %>% mutate(cell_line = gsub("([^ ]+).+", "\\1", population_s)) %>% select(cell_line) %>% table()

## .
##   2338   2339     BJ   chip   GW16   GW21 GW21+3   HL60    iPS   K562 
##     22     17     37     46     26      8     16     54     24     42 
##   Kera    NPC  plate 
##     40     15      1

The names start with chip are K562 cells if you check cell_line_s column. The one starts with plate is the bulk sample (100 cells for K562 cells).

from the method session of the paper:

Hierarchical clustering of the top 500 PCA genes across 301 cells was also performed in the Fluidigm SINGuLAR package. Genes are clustered on the basis of Pearson correlation. Samples are clustered on the basis of a Euclidian distance matrix with complete linkage.

excluding the samples start with chip and plate:

22 + 17 + 37 + 26 + 8 + 16 + 54 + 24 + 42 + 40 + 15

## [1] 301

It is 301, the same number of samples used in the paper, so I guess they excluded those k562 cells

exclude those k562 cells to get only 301 single cells.

shallow.run <- shallow.run %>% mutate(cell_line = gsub("([^ ]+).+", "\\1", population_s)) 
shallow.run<- shallow.run %>% filter(! cell_line %in% c("chip", "plate"))

To reproduce figure 2C in the paper, I need to make a PCA plot. I will use SVD sigular value decomposition to accomplish that. I had a pretty detailed write-up for SVD and PCA here. Of course, one can use the functions prcomp() and princomp() from the built-in R stats package orPCA() from FactoMineR package.

## a peek into the count matrix
assay(rse)[1:6, 1:6]

##           SRR1274192 SRR1274193 SRR1274194 SRR1274195 SRR1274196
## 1                330          0          6          0        265
## 10                94          0          0          0          0
## 100             1905       3451          0          0          4
## 1000               3          0          8          0          6
## 100008589       2026       1567        921       1358       1622
## 100009613          0          0          0          0          0
##           SRR1274197
## 1                  0
## 10                 0
## 100                0
## 1000               0
## 100008589       1046
## 100009613          0

## select out only the counts for shallow ones
shallow.counts<- assay(rse)[, colnames(rse) %in% shallow.run$run]

## I need the cell type, tissue type info for each run in the order of runs (each column is a run) in the matrix

shallow.run.cell<- left_join(data.frame(run = colnames(shallow.counts)), shallow.run) %>% select(run, cell_line, tissue_s)

## Joining by: "run"

## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector

## tidy the dataframe a bit, set the NPC tissue from N/A to brain for comparisons with GW lines, and iPS tissue from N/A to pluripotent

shallow.run.cell<- shallow.run.cell %>% 
        mutate(tissue_s = gsub("(mammary gland);.+", "\\1", tissue_s)) %>% 
        mutate(tissue_s = replace(tissue_s, tissue_s== "N/A" & cell_line == "iPS", "pluripotent")) %>% 
        mutate(tissue_s = replace(tissue_s, tissue_s== "N/A" & cell_line == "NPC", "brain"))

PCA in action

## use log2 count for PCA and center the data for each sample.
X<- t(scale(t(log2(shallow.counts+1)),center=TRUE,scale=FALSE))
sv<- svd(t(X))
U<- sv$u
V<- sv$v
D<- sv$d
Z<- t(X)%*%V

## put PC1 and PC2 into a dataframe and add the cell line info side by side
pc_dat<- data.frame(cell.line = shallow.run.cell$cell_line, PC1 = Z[,1], PC2 = Z[,2])

## make figure with ggplot2
library(ggplot2)
library(ggthemes)
ggplot(pc_dat, aes(x=PC1, y=PC2, col=cell.line)) + 
        geom_point() + 
        theme_bw() +
        theme(panel.border = element_blank(), 
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(), 
              axis.line.x = element_line(color="black", size = 0.6),
              axis.line.y = element_line(color="black", size = 0.6))

Well, it is very similar to the original figure in the paper except the color and shape of the points. I mannually selected colors fromhttp://www.color-hex.com/

p<- ggplot(pc_dat, aes(x=PC1, y=PC2)) + 
        geom_point(aes(color = cell.line, shape = cell.line)) + 
        scale_colour_manual(values=c("#24cdd4","#9e379f","#6673b7","#9FD175", "#9FD175", 
                                     "#9FD175", "#ea018d", "#0d060f", "#eede51", "#6673b7", "#9FD175")) +   
        scale_shape_manual(values=c(19, 19, 4, 19, 8, 3,19,19,19,19, 4))+
        theme_bw() +
        theme(panel.border = element_blank(), 
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(), 
              axis.line.x = element_line(color="black", size = 0.6),
              axis.line.y = element_line(color="black", size = 0.6))

p

Let’s fix the legend a bit to mimic the original figure.

p + guides(col = guide_legend(ncol=2)) + theme(legend.position = c(.9, .75))

It is still not quite the same as the original figure as the legend of cell types should be grouped together. I will leave it to you to google and find a way to change the order of the legend. I usually import the figure to inkscape and mannually change the positions and/or orders. You may also notice that the X and Y axis scales are different from the orignal paper, I guess it is because that the authors used RSEM to process the RNA-seq data and used TPM (rather than raw counts in my example) to plot PCA. Nevertheless, the distribution of the clusters are very similar.

Now it is time to reproduce figure 3b.

From the paper:

Hierarchical clustering of 65 single cells across 500 genes with the strongest PC1–PC3 loading scores identifies four major groups of cells (I–IV), and
k-means clustering identifies three clusters of genes (red, yellow and green)

So figure 3b only used cells that are from brain tissues. I want to show you how to do clustering for all the 11 cell types first then we can adventure ourselves by focusing on the brain tissue cells only (NPC, GW16, GW21, GW21+3). It is just a matter of subsetting only the cell lines that are of brain origin.

Before drawing a biclustered heatmap, one needs to select features (genes) first becasue hierachincal clustering can be quite time/memory consuming if you have too many features. In addition, many features do not contribute to the clustering, so we can drop them.

There are many ways to select the features, one of the most commonly used method is to use the most variable genes across samples.

library(genefilter)
rv<- rowVars(X)
## select the top 250 most variable genes for clustering
idx<- order(-rv)[1:250]

library(ComplexHeatmap) ## a very nice package from Zuguang Gu, he is very responsive.
Heatmap(X[idx,], name = "log2 RNAseq\ncounts scaled", 
        show_row_names = FALSE, show_column_names = FALSE, 
        row_dend_reorder = TRUE, column_dend_reorder = TRUE)

Continue reading at https://rpubs.com/crazyhottommy/heatmap_demystified

Diving into Genetics and Genomics

My github papge

Monday, September 12, 2016

Heatmap demystified part II: Using heatmap to represent continuous values

Using heatmap to represent continuous values.

Download the data by recount

Preparing the metadata

PCA in action

No comments:

Post a Comment

Labels

My Blog List