Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, December 19, 2013

log2 fold gene expression change heatmap

I was looking at two different microarray data sets. In each experiment, the authors knocked down a transcription factor (TF) and examined the gene expression change respectively. I wanted to know whether there are any co-upregulated or co-downregulated genes after knocking down the TFs.

I analyzed the microarray data by bioconductor. See one of the examples I blogged before
http://crazyhottommy.blogspot.com/2013/12/geoquery-to-access-geo-datasets.html

Because those are two different experiments done in different platforms and in different labs, I can not put the raw expression data together. Instead, I got the log2 fold change (KD vs Control) for each data set.

I got a topgene list for each data set from the limma topTable. The third column is the gene name, the fourth column is the logFC. I cut these two columns out, delete the first line (the header) and get rid of the double quote around the gene names:

tommy@tommy-ThinkPad-T420:~/tommy$ cut -f3,4 topgenes_shTF1_FDR0.05_lfc1.txt | sed '1 d' | sed 's/"//g' | sort | uniq > shTF1_topgenes.txt

tommy@tommy-ThinkPad-T420:~/tommy$ cut -f3,4 topgenes_shTF2.txt | sed '1 d' | sed 's/"//g' | sort | uniq > shTF2_topgenes.txt


Inner join these two lists together produced a file with three columns. After join command, the file became white space delimited, I changed it back to tab delimited.

tommy@tommy-ThinkPad-T420:~/tommy$ join shTF1_topgenes.txt shTF2_topgenes.txt | tr ' ' '\t' > co_regulated.txt

only keep the co-upregualted and co-downregulated genes:

tommy@tommy-ThinkPad-T420:~/tommy$ cat co_regulated.txt | awk '{if($2>0 && $3>0 || $2<0 && $3<0) print}' | head
Abhd5 1.22313483440383 0.718522081750001
Acta2 2.42294712761151 0.602606251249999
Acta2 2.7279547932607 0.602606251249999
Acta2 2.92237008906738 0.602606251249999
Akt3 1.62314377335472 0.717014153000002
Akt3 1.62314377335472 0.900214081749999
Akt3 1.62314377335472 1.10733403775
Akt3 2.21617654271768 0.717014153000002
Akt3 2.21617654271768 0.900214081749999
Akt3 2.21617654271768 1.10733403775

There are multiple probes for the same gene, only keep the first instance for each gene:

tommy@tommy-ThinkPad-T420:~/tommy$ cat co_up_or_down.txt | awk '!a[$1]++' > co_up_or_down_uniq.txt

Among these 116 genes, 85 are co-upregulated, 31 are co-down regulated.
Now, I can draw a heatmap based on the log2 fold changes.



The code above are heavily commented, you should be able to follow easily.
I once asked about the colour setting in the heatmap in Seqanwser http://seqanswers.com/forums/showthread.php?t=33023
Also see here http://stackoverflow.com/questions/17820143/how-to-change-heatmap-2-color-range-in-r
In the script, I mapped the colours to values:

bk = unique(c(seq(-5.73,-0.5, length=100),seq(-0.5,0, length=100), seq(0,5,length=100)))
hmcols<- colorRampPalette(c("green","black", "red"))(length(bk)-1)

So, the black and green coloured cells are down-regulated, the red cells are up-regulated.
After add the side bar:
In the dendrogram, I split the genes to two groups (co-up and co-down). In this particular example, I already pre-selected co-up and co-down regulated genes before I constructed the heatmap. If one did not do this and wants to find the clustered (heatmap.2 use hierarchical clustering as default, type ?hclust in R to see more) groups of gene members, one can find them by the code above.


The easiest way is to  use cluster3.0 to cluster and Java tree viewer to visualize the result http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
It is interactive, you can choose the sub-groups just by selecting  them directly using mouse.


Heatmap is no mystery. It just uses colours to represent the values for visualization. However, you need to know what you are doing. Make sure you know the different distance calculations among rows http://stat.ethz.ch/R-manual/R-patched/library/stats/html/dist.html
and the different clustering methods (hierarchical, K-means etc)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html

See also my previous post http://crazyhottommy.blogspot.com/2013/08/how-to-make-heatmap-based-on-chip-seq.html

4 comments:

  1. Hi Tommy,

    I have to ask you something, in this heatmap are computing the z scores on columns? I have a matrix of up and down regulated genes where I want to generate the heatmap based on log2(FC) of the fragment counts( not gene expression values). So my matrix is a log2 transformed value of ratio of fragments between 2 conditions like tumor vs non tumor having 4 patients. Now my matrix is clearly separated between up and down based on the positive and negative values. I find your code of high value as I want to generate such kind of heatmap for my data. I did produce making some tweaks in this code , only thing I want to understand if this is heatmap zscore is computed or not? and if so is it on columns or not? I feel the z score is not computed, you are taking directly the value of log2 fold change and using pair breaks to get a better visualization of the up and down regulation right? is it important in heatmaps like this that I have to compute the zscore and that too on columns? I am asking since my log2FC values are between tumor vs normal for each row of a gene and has 4 columns where each column represents each patient. Please let me know if you understand my question and if you can reply to my query. Below is the code tweak I did. I get confused between z score and breaks usage. if you can make some clarification of my question. I am clustering the dendrogram on col.

    data4 is my matrix

    bk = unique(c(seq(-5.73,-0.5, length=100),seq(-0.5,0, length=100), seq(0,5,length=100)))
    hmcols<- colorRampPalette(c("green","black", "red"))(length(bk)-1)
    heatmap.2(data4, breaks=bk, col=hmcols,trace="none", dendrogram = "col",density.info="none",cexCol=0.5, labRow=NA, symm=F,symkey=F,symbreaks=T, scale="none",keysize=1.5)

    ReplyDelete
    Replies
    1. Sorry, I did not know someone commented on this post...
      I did not use z-score, in the heatmap.2 function, you can specify an argument scale="row" or scale="column", the default is "none".

      Delete
  2. Hi Tommy
    I am getting lost in the output of the clustering, I have 335 rows and I only get one group after

    >table(ct)

    My >plot (hc) has different clusters but maybe the heigh is not fine? The plot shows a maximun height of 7. Don't know how to attach a figure or table in this page to show my example.

    Thanks,

    Cata




    names(hm)
    m.afterclust<- data[rev(hm$rowInd),rev(hm$colInd)]

    hc<- as.hclust(hm$rowDendrogram)
    names(hc)
    plot(hc) # rotate the dendrogram 90 degree, it is the same as in the heatmap

    rect.hclust(hc,h=7) # based on the height of the tree, you can specify h

    ct<- cutree(hc,h=7)
    table(ct)
    ct[hc$order]

    ReplyDelete
    Replies
    1. you may need to specify a smaller height (your max height is 7). say h=3. It all depends on your clustering.
      ct<- cutree(hc,h=3)

      Delete