Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, December 19, 2013

log2 fold gene expression change heatmap

I was looking at two different microarray data sets. In each experiment, the authors knocked down a transcription factor (TF) and examined the gene expression change respectively. I wanted to know whether there are any co-upregulated or co-downregulated genes after knocking down the TFs.

I analyzed the microarray data by bioconductor. See one of the examples I blogged before
http://crazyhottommy.blogspot.com/2013/12/geoquery-to-access-geo-datasets.html

Because those are two different experiments done in different platforms and in different labs, I can not put the raw expression data together. Instead, I got the log2 fold change (KD vs Control) for each data set.

I got a topgene list for each data set from the limma topTable. The third column is the gene name, the fourth column is the logFC. I cut these two columns out, delete the first line (the header) and get rid of the double quote around the gene names:

tommy@tommy-ThinkPad-T420:~/tommy$ cut -f3,4 topgenes_shTF1_FDR0.05_lfc1.txt | sed '1 d' | sed 's/"//g' | sort | uniq > shTF1_topgenes.txt

tommy@tommy-ThinkPad-T420:~/tommy$ cut -f3,4 topgenes_shTF2.txt | sed '1 d' | sed 's/"//g' | sort | uniq > shTF2_topgenes.txt


Inner join these two lists together produced a file with three columns. After join command, the file became white space delimited, I changed it back to tab delimited.

tommy@tommy-ThinkPad-T420:~/tommy$ join shTF1_topgenes.txt shTF2_topgenes.txt | tr ' ' '\t' > co_regulated.txt

only keep the co-upregualted and co-downregulated genes:

tommy@tommy-ThinkPad-T420:~/tommy$ cat co_regulated.txt | awk '{if($2>0 && $3>0 || $2<0 && $3<0) print}' | head
Abhd5 1.22313483440383 0.718522081750001
Acta2 2.42294712761151 0.602606251249999
Acta2 2.7279547932607 0.602606251249999
Acta2 2.92237008906738 0.602606251249999
Akt3 1.62314377335472 0.717014153000002
Akt3 1.62314377335472 0.900214081749999
Akt3 1.62314377335472 1.10733403775
Akt3 2.21617654271768 0.717014153000002
Akt3 2.21617654271768 0.900214081749999
Akt3 2.21617654271768 1.10733403775

There are multiple probes for the same gene, only keep the first instance for each gene:

tommy@tommy-ThinkPad-T420:~/tommy$ cat co_up_or_down.txt | awk '!a[$1]++' > co_up_or_down_uniq.txt

Among these 116 genes, 85 are co-upregulated, 31 are co-down regulated.
Now, I can draw a heatmap based on the log2 fold changes.



The code above are heavily commented, you should be able to follow easily.
I once asked about the colour setting in the heatmap in Seqanwser http://seqanswers.com/forums/showthread.php?t=33023
Also see here http://stackoverflow.com/questions/17820143/how-to-change-heatmap-2-color-range-in-r
In the script, I mapped the colours to values:

bk = unique(c(seq(-5.73,-0.5, length=100),seq(-0.5,0, length=100), seq(0,5,length=100)))
hmcols<- colorRampPalette(c("green","black", "red"))(length(bk)-1)

So, the black and green coloured cells are down-regulated, the red cells are up-regulated.
After add the side bar:
In the dendrogram, I split the genes to two groups (co-up and co-down). In this particular example, I already pre-selected co-up and co-down regulated genes before I constructed the heatmap. If one did not do this and wants to find the clustered (heatmap.2 use hierarchical clustering as default, type ?hclust in R to see more) groups of gene members, one can find them by the code above.


The easiest way is to  use cluster3.0 to cluster and Java tree viewer to visualize the result http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
It is interactive, you can choose the sub-groups just by selecting  them directly using mouse.


Heatmap is no mystery. It just uses colours to represent the values for visualization. However, you need to know what you are doing. Make sure you know the different distance calculations among rows http://stat.ethz.ch/R-manual/R-patched/library/stats/html/dist.html
and the different clustering methods (hierarchical, K-means etc)
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html

See also my previous post http://crazyhottommy.blogspot.com/2013/08/how-to-make-heatmap-based-on-chip-seq.html

Tuesday, December 17, 2013

ChIP-seq peaks overlapping significance test

The question is straightforward:

I have two ChIP-seq bed files (transcription factor binding) obtained by MACS peak calling, and I want to know whether these two transcription factors co-bind to the same sites with a probability greater than expected.

The answer is not trivial.
See a  post here http://www.biostars.org/p/5484/

update on 10/30/2014: one can use an R package for this genometricorr

One can use bedtools shuffle http://bedtools.readthedocs.org/en/latest/content/tools/shuffle.html developed by Aaron Quinlan@UVA as suggested by the link above,  but it is not automatic. 
The coming bedtools2 seems to have this function implemented http://www.ashg.org/2013meeting/abstracts/fulltext/f130122205.htm

"In the context of biological discovery, the most exciting new functionality in bedtools2 is a comprehensive set of statistical measures for revealing associations between sets of genomic features (e.g., do these transcription factors co-associate more than expected by chance?). Here we present the new statistical tests in bedtools2, compare our tests to existing approaches such as the ENCODE Genome Structure Correction (GSC) metric, and provide needed insights into which metrics are most appropriate to common biological questions. We demonstrate typical misuses of these metrics and illustrate how our tests and associated visualization tools can reveal new biological insights. Given the speed and analytical flexibility of bedtools2, we anticipate that our new toolkit will be an invaluable resource for geneticists studying the impact of genetic variation and regulatory elements on human disease phenotypes."

I am looking forward to it.
Luckily, I found other two tools that automate this process and are very handy to use.

poverlap https://github.com/brentp/poverlap
and 
Genomic Association Tester (GAT) http://www.cgat.org/~andreas/documentation/gat/contents.html

I will use GAT as a demonstration since it is very well documented and seems to be more powerful.

Following the tutorial here http://www.cgat.org/~andreas/documentation/gat/tutorialIntervalOverlap.html
I will need three bed files ready: a.bed file for TF1, b.bed file for TF2 and a workspace (genome) bed file that specifies the chromosome coordinates limits.

I have peak bed files for TFs, and I need to get the workspace bedfile.
See a post here http://www.biostars.org/p/16396/#16399
I downloaded the  chromInfo.txt.gz file here ( I am working with mouse data)
 http://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/

Some manipulation to get the bed file:
tommy@tommy-ThinkPad-T420:~/mm9_genome_info$ cat chromInfo.txt
chr1 197195432 /gbdb/mm9/mm9.2bit
chr2 181748087 /gbdb/mm9/mm9.2bit
chr3 159599783 /gbdb/mm9/mm9.2bit
chr4 155630120 /gbdb/mm9/mm9.2bit
chr5 152537259 /gbdb/mm9/mm9.2bit
chr6 149517037 /gbdb/mm9/mm9.2bit
chr7 152524553 /gbdb/mm9/mm9.2bit
chr8 131738871 /gbdb/mm9/mm9.2bit
chr9 124076172 /gbdb/mm9/mm9.2bit
chr10 129993255 /gbdb/mm9/mm9.2bit
chr11 121843856 /gbdb/mm9/mm9.2bit
chr12 121257530 /gbdb/mm9/mm9.2bit
chr13 120284312 /gbdb/mm9/mm9.2bit
chr14 125194864 /gbdb/mm9/mm9.2bit
chr15 103494974 /gbdb/mm9/mm9.2bit
chr16 98319150 /gbdb/mm9/mm9.2bit
chr17 95272651 /gbdb/mm9/mm9.2bit
chr18 90772031 /gbdb/mm9/mm9.2bit
chr19 61342430 /gbdb/mm9/mm9.2bit
chrX 166650296 /gbdb/mm9/mm9.2bit
chrY 15902555 /gbdb/mm9/mm9.2bit
chrM 16299 /gbdb/mm9/mm9.2bit
chr13_random 400311 /gbdb/mm9/mm9.2bit
chr16_random 3994 /gbdb/mm9/mm9.2bit
chr17_random 628739 /gbdb/mm9/mm9.2bit
chr1_random 1231697 /gbdb/mm9/mm9.2bit
chr3_random 41899 /gbdb/mm9/mm9.2bit
chr4_random 160594 /gbdb/mm9/mm9.2bit
chr5_random 357350 /gbdb/mm9/mm9.2bit
chr7_random 362490 /gbdb/mm9/mm9.2bit
chr8_random 849593 /gbdb/mm9/mm9.2bit
chr9_random 449403 /gbdb/mm9/mm9.2bit
chrUn_random 5900358 /gbdb/mm9/mm9.2bit
chrX_random 1785075 /gbdb/mm9/mm9.2bit
chrY_random 58682461 /gbdb/mm9/mm9.2bit

tommy@tommy-ThinkPad-T420:~/mm9_genome_info$ cat chromInfo.txt | cut -f1-2 | awk '{print $1"\t"0"\t"$2}'
chr1 0 197195432
chr2 0 181748087
chr3 0 159599783
chr4 0 155630120
chr5 0 152537259
chr6 0 149517037
chr7 0 152524553
chr8 0 131738871
chr9 0 124076172
chr10 0 129993255
chr11 0 121843856
chr12 0 121257530
chr13 0 120284312
chr14 0 125194864
chr15 0 103494974
chr16 0 98319150
chr17 0 95272651
chr18 0 90772031
chr19 0 61342430
chrX 0 166650296
chrY 0 15902555
chrM 0 16299
chr13_random 0 400311
chr16_random 0 3994
chr17_random 0 628739
chr1_random 0 1231697
chr3_random 0 41899
chr4_random 0 160594
chr5_random 0 357350
chr7_random 0 362490
chr8_random 0 849593
chr9_random 0 449403
chrUn_random 0 5900358
chrX_random 0 1785075
chrY_random 0 58682461

tommy@tommy-ThinkPad-T420:~/Tet1/mm9_genome_info$ cat chromInfo.txt | cut -f1-2 | awk '{print $1"\t"0"\t"$2}' > mm9_contigs.bed

Notice the weird chromosome names like chrUn and chr4_random.  See a post here http://www.biostars.org/p/9577/
  • The chr*_random sequences are unplaced sequence on those reference chromosomes.
  • The chrUn_* sequences are unlocalized sequences where the corresponding reference chromosome has not been determined
Now, I have all the files ready, I will do the significance test by GAT:
On HPC@UFL:

[mtang@dev1 mm9_genome_info]$ gat-run.py --segments=a.bed --annotations=b.bed --workspace=mm9_contigs.bed --ignore-segment-tracks --num-samples=1000 --log=gat.log > gat.tsv

a.bed file contains about 30,000 peaks and  b.bed file contains 80,000 peaks. It took me 1 min to finish 1000 simulations:


observed expected CI95low CI95high stddev fold l2fold pvalue qvalue
3998559 282473.979 267444 298581 9379.0988 14.1554 3.8233 1.0000e-03 1.0000e-03

For 1000 simulation, I got a minimal p value 0.001 indicating that these two TFs co-occupy the same genomic sites not randomly.





Wednesday, December 11, 2013

Tet1 enrichment at CpG rich-promoters

Continue with my last post http://crazyhottommy.blogspot.com/2013/12/geoquery-to-access-geo-datasets.html, I am going to analyze the Tet1 ChIP-seq data in the paper Dual functions of Tet1 in transcriptional regulation in mouse embryonic stem cells.

To reproduce fig1, I have to group the promoters into CpG-rich and CpG-poor promoters first.

Get the CpG island bed file and the refseq gene table first.

I got the file from UCSC table browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start
choose Expression and regulation and CpG islands track, and download it.


similarly, get the refSeq gene table, send the output to Galaxy ( in order to get the -5kb promoter
regions, one needs to do some manipulation of the file, several options are there: bedtools, awk and Galaxy. The easiest way is through Galaxy if you are not familiar with linux command line)


Click get franks, and get the 5kb upstream coordinates, and cut columns that we need:
tommy@tommy-ThinkPad-T420:~/Tet1$ head 5kb_promoter_mm9.bed
chr1 134207701 134212701 NM_028778 +
chr1 134207701 134212701 NM_001195025 +
chr1 58752833 58757833 NM_175370 -
chr1 9289811 9294811 NM_027671 -
chr1 25886552 25891552 NM_175642 -
chr1 33726639 33731639 NM_008922 -
chr1 125942136 125947136 NM_199021 -
chr1 75503027 75508027 NM_178884 -
chr1 109057691 109062691 NM_198680 -
chr1 184490137 184495137 NM_130890 +

you can do the same thing with CpG islands with Galaxy, but I show you the linux command below:
tommy@tommy-ThinkPad-T420:~/Tet1$ head mm9_CpG_island.txt
#bin chrom chromStart chromEnd name length cpgNum gcNum perCpg perGc obsExp
107 chr1 36568608 36569851 CpG: 128 1243 128 766 20.6 61.6 1.09
151 chr1 82313020 82313491 CpG: 56 471 56 319 23.8 67.7 1.06
234 chr1 169213895 169215283 CpG: 145 1388 145 958 20.9 69 0.88
611 chr1 3521705 3521924 CpG: 27 219 27 167 24.7 76.3 0.86
612 chr1 3660700 3661155 CpG: 34 455 34 293 14.9 64.4 0.72
612 chr1 3661735 3662237 CpG: 45 502 45 348 17.9 69.3 0.75
619 chr1 4481782 4483754 CpG: 165 1972 165 1286 16.7 65.2 0.79
619 chr1 4487028 4487689 CpG: 47 661 47 393 14.2 59.5 0.81
619 chr1 4561722 4562156 CpG: 44 434 44 265 20.3 61.1 1.09

I only need the second, third and fourth  column of the CpG file:
tommy@tommy-ThinkPad-T420:~/Tet1$ cat mm9_CpG_island.txt | cut -f2-4 > mm9_CpG.bed

tommy@tommy-ThinkPad-T420:~/Tet1$ head mm9_CpG.bed
chr1 36568608 36569851
chr1 82313020 82313491
chr1 169213895 169215283
chr1 3521705 3521924
chr1 3660700 3661155
chr1 3661735 3662237
chr1 4481782 4483754
chr1 4487028 4487689
chr1 4561722 4562156
chr1 4679265 4679478

How many CpG island and refGenes are there?
tommy@tommy-ThinkPad-T420:~/Tet1$ wc -l 5kb_promoter_mm9.bed mm9_CpG.bed 
  32555 5kb_promoter_mm9.bed
  16026 mm9_CpG.bed
  48581 total

Now use bedtools (you can also use the intersect, substract tool in Galaxy) to get promoters that contain CpG island:
tommy@tommy-ThinkPad-T420:~/Tet1$ bedtools intersect -a 5kb_promoter_mm9.bed -b mm9_CpG.bed -wa -u | wc -l
16171
tommy@tommy-ThinkPad-T420:~/Tet1$ bedtools intersect -a 5kb_promoter_mm9.bed -b mm9_CpG.bed -wa -u > CpG_rich_promoters.bed

promoters without CpG island:
tommy@tommy-ThinkPad-T420:~/Tet1$ bedtools intersect -a 5kb_promoter_mm9.bed -b mm9_CpG.bed -v | wc -l
16384
tommy@tommy-ThinkPad-T420:~/Tet1$ bedtools intersect -a 5kb_promoter_mm9.bed -b mm9_CpG.bed -v > CpG_poor_promoters.bed

I have mapped the Tet1 ChIP-seq raw sequence data by bowtie at HPC@UFL. I got the sam file and converted it to sorted bam file with samtools. Then, I called peaks with MACS1.4, and got the peaks and bedgraph file that I can visualize in IGV genome browser (see my previous post http://crazyhottommy.blogspot.com/2013/11/chip-exo-data-analysis.html).

A partial reproduction of fig1b in the paper:

the first track is the IgG control, the second is the Tet1 bedgraph signal, the third is the Tet1 peaks called by MACS, and the last one is the mm9_CpG.bed track. I observed many CpG islands are overlapped with the Tet1 peaks. The peaks are very sharp indicating a successful ChIP-seq experiment.

There are total 30946 peaks that are called by MACS. The number is 35,564 in the paper, but they used an older version of MACS. Nevertheless, the numbers are very close.

tommy@tommy-ThinkPad-T420:~/Tet1$ wc -l Tet1_ChIP-seq_peaks.bed 
30946 Tet1_ChIP-seq_peaks.bed

using CEAS http://liulab.dfci.harvard.edu/CEAS/usermanual.html from Shirely Liu's lab, I investigated the distribution of the peaks:
tommy@tommy-ThinkPad-T420:~/Desktop/Gene_Annotation_for_CEAS_package$ ceas -g mm9.refGene -b ~/Tet1/Tet1_ChIP-seq_peaks.bed 
Most peaks are located at gene-rich regions, only 23.3% of them are located at distal intergenic regions. It will be very interesting to study the function of these distal Tet1 binding sites.
A significant portion of peaks (33.2%) are at introns, indicating that Tet1 plays a role in DNA methylation in the gene bodies.
Note that CEAS can also produce figures as in fig1a, but it requires a wig file as input. I will stick to my python script which only need a bam file ( to convert bam/sam files to wig file, see here 

To reproduce fig1a, I have all the files ready: a sorted Tet1 ChIP-seq bam file, CpG_rich_promoters.bed and CpG_poor_promoters.bed
See the python code below, it uses HTSeq python package written by Simon Anders. http://www.embl.de/research/units/genome_biology/huber/members/?s_personId=CP-60002977. He is an expert in RNA-Seq analysis ( He wrote the famous DESeq package)

it generates a figure as shown below:
One can count the mapped the reads number by samtools flagstat first, and then plug in the number in the above script. It is pretty slow to count the mapped reads by the aligned_counts function above.
I tricked by counting the reads first by samtools and the whole script only took me 3 minutes to run which is pretty fast!
Different from fig1a in the original paper, I used normalized counts per million for Y axis. I do not know why the authors used the peak P value as Y axis. Anyway, the conclusion is still the same: Tet1 preferentially binds to CpG-rich promoters. A close inspection of the figure generated by ipython ( interactive visualization) reveals that Tet1 peaks at around 100bp downstream of TSS.

I also reproduced part of Fig3a where the authors show Tet1 binds to bivalent promoters:


Annotate the peaks by Homer:
tommy@tommy-ThinkPad-T420:~/Tet1$ annotatePeaks.pl Tet1_ChIP-seq_peaks.bed mm9 -noann > Tet1_targeted_genes.txt

Homer finds the closest TSS for each peak. If I set the closest distance limit to 10kb upstream and downstream of TSS, there are still around ~12,000 genes that are associated with these peaks (total 30946 peaks).

the 10th column is the distance to TSS, the 16th column is the gene name
tommy@tommy-ThinkPad-T420:~/Tet1$ cat Tet1_targeted_genes.txt | awk '{ if ($10<10000 &&  $10>-10000) print}'| cut -f16 | sort | uniq | wc -l
12202


Summary:
1. It is a very successful ChIP-seq experiment with sharp peaks over IgG background. Majority of Tet1 binding sites are at gene-rich regions. 23.3% of Tet1 binding sites are in distal intergenic regions with unknow functions. Tet1 tends to bind to CpG-rich promoters.

2. If one assigns the nearest genes to the peaks, ~12,000 genes are associated with Tet1 binding sites. compare with the differentially expressed genes in my last post-only 1292 genes (~10%) are affected by Tet1 KD- the number of genes that are regulated by Tet1 is much smaller than the peak associated ones.

Several reasons: KD is not complete (~20% residue Tet1 remains according to the microarray data), the remaining Tet1 can still maintain expression of some genes. If we knock out Tet1 (by TALENT or CRISPR http://zlab.mit.edu/publications.html), it may have a bigger effect on gene expression. In addition, RNA-seq may reveal more differentially expressed genes compared with microarray.

Another reason is that binding does not infer function (especially for weak binding sites) and I think this is probably the mostly likely reason. Several papers have shown that changes of adjacent TF binding poorly correlates with gene expression change:


Extensive Divergence of Transcription Factor Binding in Drosophila Embryos with Highly Conserved Gene Expression 

http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003748

  

Transcription Factors Bind Thousands of Active and Inactive Regions in theDrosophila Blastoderm

The Functional Consequences of Variation in Transcription Factor Binding

http://arxiv.org/abs/1310.5166

" On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. "

To assign functional binding sites to their target genes, one may consider to use BETA developed in Shirely Liu's lab http://cistrome.org/BETA/. It integrates ChIP-seq data and gene expression data to infer  the TF target genes.
You may be also interested in this paper:

Assessing Computational Methods for Transcription Factor Target Gene Identification Based on ChIP-seq Data 

http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003342#pcbi.1003342.s019

That's all for today! I will continue to analyze the DNA methylation data.



Saturday, December 7, 2013

GEOquery to access GEO datasets

I was reading a paper Dual functions of Tet1 in transcriptional regulation in mouse embryonic stem cells , and wanted to re-analyze the microarray data.
In the paper, Hao wu et.al. knocked down Tet1, a protein can convert 5-methylcytosine to 5-hydroxymethlycytosine, and looked at gene expression change in mouse ES cells.

It was my first time to look at mouse microarray data set, but I was aware of that GEOquery, a bioconductor package, can do the job very easily. I will lay down the R code first.


some output from the code:

> show(pData(phenoData(gse[[1]]))[,c(1,6,8)])
                                              title type                   source_name_ch1
GSM659775                          Control KD, rep1  RNA         control KD mouse ES cells
GSM659776                          Control KD, rep2  RNA         control KD mouse ES cells
GSM659777                          Control KD, rep3  RNA         control KD mouse ES cells
GSM659778                          Control KD, rep4  RNA         control KD mouse ES cells
GSM659779                             Tet1 KD, rep1  RNA            Tet1 KD mouse ES cells
GSM659780                             Tet1 KD, rep2  RNA            Tet1 KD mouse ES cells
GSM659781                             Tet1 KD, rep3  RNA            Tet1 KD mouse ES cells
GSM659782                             Tet1 KD, rep4  RNA            Tet1 KD mouse ES cells
GSM659783 Tet1 KD + Nanog overexpression (OE), rep1  RNA Tet1 KD + Nanog OE mouse ES cells
GSM659784 Tet1 KD + Nanog overexpression (OE), rep2  RNA Tet1 KD + Nanog OE mouse ES cells
GSM659785 Tet1 KD + Nanog overexpression (OE), rep3  RNA Tet1 KD + Nanog OE mouse ES cells
GSM659786 Tet1 KD + Nanog overexpression (OE), rep4  RNA Tet1 KD + Nanog OE mouse ES cells


I am only interested in gene expression change after Tet1 Knock down, so I set coef =1 in the command ( or you can specify a vector c(1,2,3) to look at comparisons among all three groups)

topgenes<- topTable(fit2, coef=1,number=2000, p.value=0.23, lfc=0.6, adjust="BH")

The original paper claims that there are total 1332 genes that are differentially expressed (788 upregulated and 544 downregulated in Tet1 KD cells) with FDR=0.05 ( they used another bioconductor package called NIA array analysis tool.  I am not sure what are the fold changesfor those genes). However, I only found 610 probes ( less genes, some genes have multiple probes) are differentially expressed with adjusted p.value = 0.05 ( the p.value argument in the above command is  actually the adjust p.value which is similar with the FDR) and a minimal fold change of 1.5 ( log fold change = 0.6, 2^0.6 = 1.51).

Nevertheless, I went on and sanity checked my gene list. It is a Tet1 KD experiment, so Tet1 should be downregulated.

head(topgenes)
                       ID Symbol     logFC   AveExpr         t      P.Value    adj.P.Val        B
1425532_a_at 1425532_a_at   Bin1  1.481071  7.455563  12.40969 1.996707e-08 0.0009005347 9.103139
1429448_s_at 1429448_s_at   Tet1 -2.246819 10.157653 -11.13936 6.989828e-08 0.0010641629 8.115987
1434369_a_at 1434369_a_at  Cryab  3.615875  8.073085  11.12716 7.078532e-08 0.0010641629 8.105799
1455425_at     1455425_at   Tet1 -2.145540 11.007240 -10.83405 9.615218e-08 0.0010841398 7.856887
1452253_at     1452253_at  Crim1  1.629262  5.253925  10.40753 1.520152e-07 0.0010997147 7.479479
1450780_s_at 1450780_s_at  Hmga2  2.138363  9.856595  10.36264 1.596633e-07 0.0010997147 7.438677

It showed up in the top list. logFC is -2.24 and -2.14 for two probes, demonstrating a really good KD. p value and adjust p value are both very small. This result reassured the correctness of my analysis.

In figure 4c, the authors used several gene examples to show Ezh2 binding is impaired after Tet1 KD, but they did not show the RT-qPCR result ( or at least need to check the microarray data). Those genes are supposed to be upregulated after Tet1 KD.

Of all the 8 examples, Lhx2, Eomes, Cdx2 did not show up in my gene list (even after I set the adjust p.value to 0.4). The others- Cyr61, Hoxa1, Sox17, Pcdh8, Gata6 are all upregulated in my list with relatively big adjust p value (0.1).
                     ID Symbol     logFC   AveExpr         t      P.Value    adj.P.Val        B

1457823_at     1457823_at  Cyr61 0.8698561 5.751129 3.763281 2.498490e-03 0.097058042 -1.4601779
1420565_at     1420565_at  Hoxa1 0.9393792   8.349 3.755187 0.002536592   0.09759434 -1.474719
1421657_a_at 1421657_a_at  Sox17 1.383838 6.663369 4.282790 9.577202e-04 0.057282062 -0.5375835
1447825_x_at 1447825_x_at  Pcdh8 1.388784 7.709455 3.422685 0.004741331    0.1354996 -2.074440
1417051_at     1417051_at  Pcdh8 1.480749 7.636717 3.380923 0.005131004    0.1412807 -2.149954
1425463_at 1425463_at  Gata6 0.8957893 6.507380 3.872267 0.0020389307    0.08715337 -1.2648024
It is not surprising that Tet1 maintains the expression of a group of active genes by keeping the promoters hypomethylated. Interestingly, Tet1 also maintains hypomethlyation state of bivalent gene promoters that favours PRC2, a silencing complex, to bind to theses promoters. Consequently, Tet1 contributes to the repression of Polycomb-targeted genes.
With this take home message in mind, I have several questions concerning the paper:
1. in figure 4, the authors showed several example genes that are repressed by Tet1. Are all these genes are bivalent? The authors only showed a H3k27Me3 track in figure 4b. I would at least do a ChIP-qPCR to confirm H3k4me3 and H3k27me3 occupancy at these promoters.
2. what is the exact mechanism that Tet1 represses gene expression?
The authors provided an explanation that Tet1-mediated hypomethylation is required for PRC2 silencing complex recruitment at bivalent promoters ( at least in figure 4c, the primers for Ezh2 ChIP-qPCR located at promoters). However, many shaded regions are in gene bodies. Note that the scale of the genomic regions is big. A close inspection reveals that Lhx2, Cyr61, Hoxa1, Pcdh8 gained DNA methylation in the gene bodies. It partially explains the upregulation of the gene expression after Tet1 KD (DNA methlyation in the gene body positively correlates with active gene expression). Do H3k27me3 occupancies decrease at these promoters? A ChIP-qPCR in Tet1 KD cells and control cells will answer this question.
In contrast, Eomes, Sox17 and Gata6 gained DNA methylation at promoters which may prevent the PRC2 binding and thus increase the expression. Do these genes also loss H3k27me3? It is still hard for me to believe that gain of DNA methylation at promoters can increase the gene expression. Histone modifications are more reversible, but DNA methylation is hard to reverse and is considered as a late stage of tumor suppressor silencing during tumor progression.
A quick google I found this http://epigenie.com/dna-methylation-in-gene-activation-with-dr-robert-waterland/
3'CpG island methlyation activates gene expression by interfering  enhancer blocking activity mediated by CTCF, an insulator protein that blocks enhancers from activating unintended targets. I worked on CTCF for a while, and knew  that DNA methylation abolishes CTCF binding at Igf2/H19 imprinting control region. It makes sense that 3'CpG methylation can enhancer gene expression by disrupting CTCF binding, thus allows 3' enhancers to induce the gene expression. Same rationale could be applied to 5'CpG island though. 5' enhancers are made accessible if the 5'CpG island is methylated resulting in a diminished CTCF binding at promoter region. Collectively, It is context dependent that whether DNA methylation can increase or decrease gene expression.   
Apparently, we have not  fully understood the function of DNA methylation, although it is commonly thought to contribute to gene silencing. However, we often find exceptions in biology, and  the exceptions lead to good papers if you have enough evidence to support. That's the beauty of science.

3. For heatmap in figure 3c, DNA methlyation data did not plot together with the histone modification profile and gene expression profile. Ideally, the Tet1-activated targets gain DNA methlyation at promoters after Tet1 KD, but I guess the pattern did not look good (see point 2, many are gained at gene bodies). There are many reasons for that. one of the reasons is that the upregulation of these group of genes are indirect effect of Tet1 KD, although the Tet1 ChIP-seq showed binding at promoters. It will be interesting to examine where the gain of DNA methylation occur at the Tet1-activated targets and where the loss of DNA methylation occur at the Tet1-repressed targets in terms of promoters or gene bodies ( I bet the authors have done that).
Several other genes mentioned in the paper:
                      ID Symbol    logFC  AveExpr        t      P.Value   adj.P.Val        B
1423691_x_at 1423691_x_at   Krt8 3.163397 10.06588 8.237845 2.023202e-06 0.002343863 5.243374
1435989_x_at 1435989_x_at   Krt8 2.655900 10.08545 7.189101 8.477813e-06 0.004726375 3.942816
1420647_a_at 1420647_a_at   Krt8 2.956087 10.54748 7.154126 8.913416e-06 0.004902488 3.896674
1421657_a_at 1421657_a_at  Sox17 1.383838 6.663369 4.282790 9.577202e-04 0.057282062 -0.5375835
1429388_at 1429388_at  Nanog -1.381565 12.00124 -3.010861 0.01035011 0.2014412 -2.817704
1441921_x_at 1441921_x_at  Esrrb -1.00944 9.764298 -3.406862 0.004885327 0.1374205 -2.103051
1422458_at 1422458_at   Tcl1 -1.30934 10.30728 -2.86801 0.01356834 0.2250628 -3.073405
Note that Nanog and Tcl1 have relatively big adjust p value (~0.2). A  false discovery rate of 0.2 means in 100 genes, there are 20 will be false positives.
Reading this post on Biostar http://www.biostars.org/p/18470/  cleared some of my doubts on FDR choosing. The bottom line is that there is no "right" cut-off for FDR. As long as the result make biological sense, we can play around with it.
In my analysis, I set the FDR to 0.23 to include Nanog and Tcl1. I have total 1292 (1067/1332 have a Tet1 binding within 5kb up- or down-stream of TSS in the paper ) unique genes that are differentially expressed which is pretty close to the original paper. Among them, 828 (677 in the paper)  are upregulated while 464(390 in the paper) are downregulated. This is consistent with the paper: more Tet1 targets are upregulated rather than downregulated with Tet1 depletion. I will need to work on the Tet1 ChIP-seq data to determine the exact number.

tommy@tommy-ThinkPad-T420:~/Tet1$ cat  topgenes_Fdr_0.23.txt | cut -f 3 | sort | uniq | grep -v "NA" | wc -l
1292
tommy@tommy-ThinkPad-T420:~/Tet1$ cat  topgenes_Fdr_0.23.txt | grep -v "NA" > topgenes_Fdr_0.23_no_NA.txt
tommy@tommy-ThinkPad-T420:~/Tet1$ cat topgenes_Fdr_0.23_no_NA.txt | cut -f 3,4 | awk '!a[$1]++' | awk '{if($2>0) a++; else b++} END{ print a,b}' 828 464
Look at the distribution of the p value and adjust p values.
> hist(topgenes$adj.P.Val)
> hist(topgenes$P.Val)
P values are all very small. 
For downstream analysis, read this post from Stephen Turner: http://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html
Conclusions:
1. It is a successful Tet1 KD experiment with a very good Tet1 KD and many differentially expressed genes.
2. There is no "right" way to analyze the data. Analyzing the same data by different people and different software may give different results, and it is not surprising to me. Actually, for the RNA-seq analysis, many different tools are developed ( most famous ones are cuff-diff, DESeq), and they all give different list of differentially expressed genes.  See papers here:
http://genomebiology.com/2013/14/9/R95
http://www.biomedcentral.com/1471-2105/14/91
Thus, it is critical to provide the raw code, the parameters and the software version etc to ensure a reproducible research. 
3. conclusions drawn from the global analysis tend to be "exaggerated", many patterns we observe are  only true for a sub set of data sets, and if we combine all the things together, the patterns conflict each other somehow. That's why when we come down to several genes to verify and those are the best examples ( I saw many global Genomic study papers use examples with "weird" gene names).  
These are my little thoughts, and I am happy to discuss with the authors.

Tuesday, December 3, 2013

explain shell

I just got to know this website from twitter explain shell
It is very useful, especially when I have a complex linux command and can not figure out what
does each flag mean.

For example:  du -h -d 1 ~

This command is very useful to check the folder sizes.
you can sort the folder by size:

tommy@tommy-ThinkPad-T420:~$ du -h -d 1 | sort -hr
63G .
16G ./homer
14G ./Desktop
13G ./Downloads
3.3G ./MochiView_v1.46
1.9G ./anaconda
1.2G ./scientific_writing
1022M ./.cache
938M ./SeqMonk
845M ./mochi_view
625M ./jorg_bungert
587M ./.thunderbird
400M ./.local
379M ./R
331M ./Datasets
257M ./gatk-protected
........