Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, August 27, 2013

How to make a heatmap based on ChIP-seq data by R

update on 04/20/2016.
I noticed that this post is the most frequently visited one. it has been almost 3 years
since I wrote this post. Now, there are various tools for this purpose. See a
post on biostars.

well, I recently just went through the whole process for making a heatmap based on a ChIP-seq data set. If you do not know the technique, google it :)  http://en.wikipedia.org/wiki/ChIP-sequencing

Often, you have a ChIP-seq data that are mapped to the reference genome ( a bam file). You want to plot the sequence tag intensity around certain features ( transcription start sites, gene body, enhancers, or any other genomic region you defined).

you can make an average plot http://crazyhottommy.blogspot.com/2013/04/how-to-make-tss-plot-using-rna-seq-and.html. I will need to re-write this one though, the code format is just too bad (I have to learn how to embed R and python code into the blog) ....and the Y axis is not normalized to counts per million.

you may also want to generate a heatmap with the same data.  see ngsplot for examples https://code.google.com/p/ngsplot/.  If you do not want to code R by yourself, try it. It has been improved a lot since last time I checked it. I once asked a question in the google group : https://groups.google.com/forum/#!topic/ngsplot-discuss/efHQ-P-14XM.

Seqmonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/  from Simon Andrews can also plot this kind of figure very easily, I am just not satisfied with the picture quality, and I want more customized control of the picture.


I will just paste my code below, and it is heavily commented, you should be able to follow it fairly easily.
update on 05/05/2015, I put the code in a gist instead:
# This R script is to generate the TF or histone modification heatmap
# at certain genomic features (TSS, enhancers) from the ChIP-seq data
# the input matrix is got from Homer software. alternative to R, use cluster3 to cluster, and visualize by # java Treeviewer
# generate the matrix by Homer: annotatePeaks.pl peak_file.txt hg19 -size 6000 -hist 10 -ghist -d TF1/ # > outputfile_matrix.txt
# see several posts for heatmap:
# http://davetang.org/muse/2010/12/06/making-a-heatmap-with-r/
# http://www.r-bloggers.com/r-using-rcolorbrewer-to-colour-your-figures-in-r/
# 08/20/13 by Tommy Tang
# it is such a simple script but took me several days to get it work...I mean the desired
# color setting of the heatmap. Making a pretty figure takes time :)
# use either heatmap.2 from gplots or pheatmap
library(gplots)
library(pheatmap)
# you can compare the heatmaps generated by heatmap.2 and pheatmap
setwd("/home/tommy/homer")
d1 <- read.table("outputfile_matrix.txt", header=T)
# have a look at the matrix
head(d1)
d1$Gene # the dataframe generated by homer has a column named Gene
m1<- as.matrix( d1[,2:ncol(d1)]) # the first column is the TSS id,
rownames(m1)<- d1$Gene # heatmap.2 works only on matrix, turn the dataframe to matrix, and # add the TSS id as the row name
m1<- log2(m1+1) # log2 transform the raw counts
#exam the data range
range(m1)
hist(m1) # histogram to look at distributions
# it is from the ChIP-seq count data, many of them are 0s. I transfromed the raw data
# by log2(m+1), so if the log2 value is 0, the raw number is also 0 (count)
# to compare different heatmaps, I have to map the value to the color using the
# break argument in pheatmap or heatmap.2 That's why I examed the data range.
# http://stackoverflow.com/questions/17820143/how-to-change-heatmap-2-color-range-in-r
# https://stat.ethz.ch/pipermail/bioconductor/2011-November/041866.html
# http://seqanswers.com/forums/showthread.php?p=114275&posted=1#post114275
bk = unique(c(seq(-0.1,3, length=100),seq(3,9.7,length=100)))
hmcols<- colorRampPalette(c("white","red"))(length(bk)-1)
# you can play around with the break points, 9.7 is the max of the matrix m1
# if just use bk = c(seq(-0.1,1, length=100),seq(1,12.5,length=100)) without the unique function
# the pheatmap gave me error message:
# Error in cut.default(x, breaks = breaks, include.lowest = T) :
# 'breaks' are not unique
# It is caused by concatenating several seq() results together, which share the same boundaries.
# try ?dist ?hclust
png("test.png", width=300, height = 800) # width and height are in pixel
heatmap.2(m1, col=hmcols, breaks = bk, Rowv= TRUE , Colv=FALSE, dendrogram="row", useRaster = TRUE, symkey=FALSE, symm=F, symbreaks=T, scale="none", trace="none", labRow=NA, labCol=NA)
# do not show the row and column labels
# trace argument should be put to "none", otherwise the trace is cyan and it will "eat" the heatmap
# I asked in Biostar http://www.biostars.org/p/79444/#79457
# look at the documentation for #heatmap.2 http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/gplots/html/heatmap.2.html
# or you can use pheatmap
pheatmap(m1, color=hmcols, breaks= bk, cluster_rows=TRUE, cluster_cols=FALSE, legend=FALSE, show_rownames=FALSE, show_colnames=FALSE)
dev.off()



The second gist:
#The X axis is -3 kb to 3 kb around TSS.
#It turns out that the heatmap.2 function use default hclust ( Hierachical Clustering) to cluster the #matrix.
#alternatively, we can use K-means clustering to cluster the data and to see what's the pattern look like.
km<- kmeans(m1,2) # determine how many cluster you want, I specify 2 here
m1.kmeans<- cbind(m1, km$cluster) # combine the cluster with the matrix
dim(m1.kmeans)
# [1] 903 602
# the last column is 602
o<- order(m1.kmeans[,602]) # order the last column
m1.keans<- m1.kmeans[o,] # order the matrix according to the order of the last column
pheatmap( m1.kmeans[,1:601], cluster_rows = F, cluster_cols = F, col= hmcols, breaks = bk, legend=FALSE, show_rownames=FALSE, show_colnames=FALSE)
# no need to cluster by pheatmap, so I set cluster_rows and cluster_cols to FALSE







That's all!

I hope you have learned something after reading it:)
===============================
update on 09/17/13
arrange the rows in the heatmap by the coverage from strong to weak

m.row.sum<- cbind(m1, rowSums(m1))
o1<- rev(order(m.row.sum[,602]))
m.row.sum<- m.row.sum[o1,]
bk = unique(c(seq(-0.1,3, length=100),seq(3,10.35,length=100)))
hmcols<- colorRampPalette(c("white","red"))(length(bk)-1)
pheatmap( m.row.sum[,1:601], cluster_rows = F, cluster_cols = F, col= hmcols, breaks = bk, legend=FALSE, show_rownames=FALSE, show_colnames=FALSE)





12 comments:

  1. I asked Simon Anders in Seqanwser how the Seqmonk software generates similar figures. It turns out that no special algorithm is used in the clustering. Basically, the rows in the heatmap are arranged by the coverage from strong to weak.
    And it looks much better.

    ReplyDelete
  2. Nice plot. Can I make it using TSS of all genes of human (bed file) and ChIP-seq peak (bed files)?

    ReplyDelete
    Replies
    1. No, you can only use ChIP-seq bam files or wig files.

      Delete
  3. Great tutorial. How do I add colory key for the heatmap?

    ReplyDelete
    Replies
    1. if you use heatmap.2, it will generate a color key. see this post for more tools and easier way to do it https://www.biostars.org/p/180314/

      Delete
  4. I had a lit of genes, which are significantly DE from RANseq and I also have same sample H3K9AC14 ChIPseq data. I would like to plot the group of genes from RNAseq promoter region binding profile based on the ChIPseq data. Do you know how to do it. My current situation is I extract the list from RNAseq and I know how to plot the TSS plot. But I could not extract the list of genes' chrom information from mm10 db for TSS plot.

    ReplyDelete
    Replies
    1. It seems you need some help to format the data. If you have the gene list and have the mm10 annotation, you can extract all the TSSs with 5kb window easily. Ask a local bioinformatic people for help. Or see a post here for more solutions https://www.biostars.org/p/180314/

      Delete
  5. Hi Tommy,

    How can I plot this over an entire genebody, using rat genome.

    Thank you

    ReplyDelete
    Replies
    1. please refer to this post https://www.biostars.org/p/180314/ ngsplot can do it, Genomation can do it as well

      Delete
  6. lets say the peakfile had 70000 rows , should i need to filter down a bit all plot all?

    ReplyDelete
  7. All thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
    Email: tdameritrade077@gmail.com

    ReplyDelete