Diving into Genetics and Genomics: log2 fold gene expression change heatmap

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, December 19, 2013

log2 fold gene expression change heatmap

I was looking at two different microarray data sets. In each experiment, the authors knocked down a transcription factor (TF) and examined the gene expression change respectively. I wanted to know whether there are any co-upregulated or co-downregulated genes after knocking down the TFs.

I analyzed the microarray data by bioconductor. See one of the examples I blogged before
http://crazyhottommy.blogspot.com/2013/12/geoquery-to-access-geo-datasets.html

Because those are two different experiments done in different platforms and in different labs, I can not put the raw expression data together. Instead, I got the log2 fold change (KD vs Control) for each data set.

I got a topgene list for each data set from the limma topTable. The third column is the gene name, the fourth column is the logFC. I cut these two columns out, delete the first line (the header) and get rid of the double quote around the gene names:

tommy@tommy-ThinkPad-T420:~/tommy$ cut -f3,4 topgenes_shTF1_FDR0.05_lfc1.txt | sed '1 d' | sed 's/"//g' | sort | uniq > shTF1_topgenes.txt

tommy@tommy-ThinkPad-T420:~/tommy$ cut -f3,4 topgenes_shTF2.txt | sed '1 d' | sed 's/"//g' | sort | uniq > shTF2_topgenes.txt

Inner join these two lists together produced a file with three columns. After join command, the file became white space delimited, I changed it back to tab delimited.

tommy@tommy-ThinkPad-T420:~/tommy$ join shTF1_topgenes.txt shTF2_topgenes.txt | tr ' ' '\t' > co_regulated.txt

only keep the co-upregualted and co-downregulated genes:

tommy@tommy-ThinkPad-T420:~/tommy$ cat co_regulated.txt | awk '{if($2>0 && $3>0 || $2<0 && $3<0) print}' | head

Abhd5 1.22313483440383 0.718522081750001

Acta2 2.42294712761151 0.602606251249999

Acta2 2.7279547932607 0.602606251249999

Acta2 2.92237008906738 0.602606251249999

Akt3 1.62314377335472 0.717014153000002

Akt3 1.62314377335472 0.900214081749999

Akt3 1.62314377335472 1.10733403775

Akt3 2.21617654271768 0.717014153000002

Akt3 2.21617654271768 0.900214081749999

Akt3 2.21617654271768 1.10733403775

There are multiple probes for the same gene, only keep the first instance for each gene:

tommy@tommy-ThinkPad-T420:~/tommy$ cat co_up_or_down.txt | awk '!a[$1]++' > co_up_or_down_uniq.txt

Among these 116 genes, 85 are co-upregulated, 31 are co-down regulated.

Now, I can draw a heatmap based on the log2 fold changes.

The code above are heavily commented, you should be able to follow easily.
I once asked about the colour setting in the heatmap in Seqanwser http://seqanswers.com/forums/showthread.php?t=33023
Also see here http://stackoverflow.com/questions/17820143/how-to-change-heatmap-2-color-range-in-r
In the script, I mapped the colours to values:


bk = unique(c(seq(-5.73,-0.5, length=100),seq(-0.5,0, length=100), seq(0,5,length=100)))

hmcols<- colorRampPalette(c("green","black", "red"))(length(bk)-1)

So, the black and green coloured cells are down-regulated, the red cells are up-regulated.

After add the side bar:

In the dendrogram, I split the genes to two groups (co-up and co-down). In this particular example, I already pre-selected co-up and co-down regulated genes before I constructed the heatmap. If one did not do this and wants to find the clustered (heatmap.2 use hierarchical clustering as default, type ?hclust in R to see more) groups of gene members, one can find them by the code above.

The easiest way is to use cluster3.0 to cluster and Java tree viewer to visualize the result http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
It is interactive, you can choose the sub-groups just by selecting them directly using mouse.

Heatmap is no mystery. It just uses colours to represent the values for visualization. However, you need to know what you are doing. Make sure you know the different distance calculations among rows http://stat.ethz.ch/R-manual/R-patched/library/stats/html/dist.html
and the different clustering methods (hierarchical, K-means etc)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html

See also my previous post http://crazyhottommy.blogspot.com/2013/08/how-to-make-heatmap-based-on-chip-seq.html

5 comments:

UnknownJanuary 8, 2015 at 7:15 AM
Hi Tommy,

I have to ask you something, in this heatmap are computing the z scores on columns? I have a matrix of up and down regulated genes where I want to generate the heatmap based on log2(FC) of the fragment counts( not gene expression values). So my matrix is a log2 transformed value of ratio of fragments between 2 conditions like tumor vs non tumor having 4 patients. Now my matrix is clearly separated between up and down based on the positive and negative values. I find your code of high value as I want to generate such kind of heatmap for my data. I did produce making some tweaks in this code , only thing I want to understand if this is heatmap zscore is computed or not? and if so is it on columns or not? I feel the z score is not computed, you are taking directly the value of log2 fold change and using pair breaks to get a better visualization of the up and down regulation right? is it important in heatmaps like this that I have to compute the zscore and that too on columns? I am asking since my log2FC values are between tumor vs normal for each row of a gene and has 4 columns where each column represents each patient. Please let me know if you understand my question and if you can reply to my query. Below is the code tweak I did. I get confused between z score and breaks usage. if you can make some clarification of my question. I am clustering the dendrogram on col.

data4 is my matrix

bk = unique(c(seq(-5.73,-0.5, length=100),seq(-0.5,0, length=100), seq(0,5,length=100)))
hmcols<- colorRampPalette(c("green","black", "red"))(length(bk)-1)
heatmap.2(data4, breaks=bk, col=hmcols,trace="none", dendrogram = "col",density.info="none",cexCol=0.5, labRow=NA, symm=F,symkey=F,symbreaks=T, scale="none",keysize=1.5)
ReplyDelete
Replies
cataguiJune 3, 2015 at 10:04 PM
Hi Tommy
I am getting lost in the output of the clustering, I have 335 rows and I only get one group after

>table(ct)

My >plot (hc) has different clusters but maybe the heigh is not fine? The plot shows a maximun height of 7. Don't know how to attach a figure or table in this page to show my example.

Thanks,

Cata

names(hm)
m.afterclust<- data[rev(hm$rowInd),rev(hm$colInd)]

hc<- as.hclust(hm$rowDendrogram)
names(hc)
plot(hc) # rotate the dendrogram 90 degree, it is the same as in the heatmap

rect.hclust(hc,h=7) # based on the height of the tree, you can specify h

ct<- cutree(hc,h=7)
table(ct)
ct[hc$order]
ReplyDelete
Replies
MARGARET MAGOTHENovember 1, 2021 at 10:46 PM
All thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
Email: tdameritrade077@gmail.com
ReplyDelete
Replies

Add comment

Diving into Genetics and Genomics

My github papge

Thursday, December 19, 2013

log2 fold gene expression change heatmap

5 comments:

Labels

My Blog List