Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, January 28, 2014

Thursday, January 23, 2014

MeDIP-seq and histone modification ChIP-seq analysis

I was reading a cell paper:

Dnmt3L Antagonizes DNA Methylation at Bivalent Promoters and Favors DNA Methylation at Gene Bodies in ESCs
I want to re-analyze the MeDIP-seq data and the histone modification ChIP-seq data.

I downloaded the MeDIP-seq data here:
and the H3k4me3 and H3k27me3 ChIP-seq data here:

The paper claimed that Dnmt3L maintains DNA hypomethylation at bivalent promoters ( with both H3k4me3 and H3k27me3 marks).  The authors used a Dnmt3L deficient mouse ES cell line and did MeDIP-seq on it. They found that many bivalent promoters gain DNA methylation in Dnmt3L negative cells. In contrast, DNA methylation loss mainly occurs in gene body.

I will work on the histone modification ChIP-seq data first.
I took a look at many different H3k4me3 and H3k27me3 ChIP-seq data sets in mES cells. H3k4me3 ChIP-seq data quality is usually very good with very specific and sharp peaks after MACS peak calling. However, H3k27me3 data are not that great. I did regular H3k4me3 ChIP, the antibody is very good and the enrichment of input is usually above 10% in the active loci I tested.

Similar to H3k36me3, an active mark in gene body, H3k27me3 (repressive) mark usually form broader but less-enriched regions. I played around with the MACS argument setting.

I will skip the process of transforming sra format data to fastq and mapping them to the reference genome. I have several posts for that already:
bowtie for ChIp-seq
Tophat for RNA-seq

For H3k4me3 data, I only downloaded one, and called peaks:
[mtang@dev1 H3k4me3]$ macs -t SRR038984.sam -n initial_test -g mm 
INFO  @ Thu, 23 Jan 2014 13:37:47: 
# name = initial_test
# format = AUTO
# ChIP-seq file = SRR038984.sam
# control file = None
# effective genome size = 1.87e+09
# band width = 300
# model fold = 10,30
# pvalue cutoff = 1.00e-05
# Large dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 10000 bps

INFO  @ Thu, 23 Jan 2014 13:37:47: #1 read tag files... 
INFO  @ Thu, 23 Jan 2014 13:37:47: #1 read treatment tags... 
INFO  @ Thu, 23 Jan 2014 13:37:47: Detected format is: SAM 
INFO  @ Thu, 23 Jan 2014 13:37:56:  1000000 
INFO  @ Thu, 23 Jan 2014 13:38:05:  2000000 
INFO  @ Thu, 23 Jan 2014 13:38:13:  3000000 
INFO  @ Thu, 23 Jan 2014 13:38:21:  4000000 
INFO  @ Thu, 23 Jan 2014 13:38:29:  5000000 
INFO  @ Thu, 23 Jan 2014 13:38:38:  6000000 
INFO  @ Thu, 23 Jan 2014 13:38:46:  7000000 
WARNING @ Thu, 23 Jan 2014 13:38:56: NO records for chromosome chr16_random, minus strand! 
INFO  @ Thu, 23 Jan 2014 13:38:56: #1 tag size is determined as 36 bps 
INFO  @ Thu, 23 Jan 2014 13:38:56: #1 tag size = 36 
INFO  @ Thu, 23 Jan 2014 13:38:56: #1  total tags in treatment: 5000614 
INFO  @ Thu, 23 Jan 2014 13:38:56: #1 user defined the maximum tags... 
INFO  @ Thu, 23 Jan 2014 13:38:56: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) 
INFO  @ Thu, 23 Jan 2014 13:38:57: #1  tags after filtering in treatment: 3982997 
INFO  @ Thu, 23 Jan 2014 13:38:57: #1  Redundant rate of treatment: 0.20 
INFO  @ Thu, 23 Jan 2014 13:38:57: #1 finished! 
INFO  @ Thu, 23 Jan 2014 13:38:57: #2 Build Peak Model... 
INFO  @ Thu, 23 Jan 2014 13:39:15: #2 number of paired peaks: 22337 
INFO  @ Thu, 23 Jan 2014 13:39:16: #2 finished! 
INFO  @ Thu, 23 Jan 2014 13:39:16: #2 predicted fragment length is 164 bps 
INFO  @ Thu, 23 Jan 2014 13:39:16: #2.2 Generate R script for model : initial_test_model.r 
INFO  @ Thu, 23 Jan 2014 13:39:16: #3 Call peaks... 
INFO  @ Thu, 23 Jan 2014 13:39:16: #3 shift treatment data 
INFO  @ Thu, 23 Jan 2014 13:39:17: #3 merge +/- strand of treatment data 
INFO  @ Thu, 23 Jan 2014 13:39:18: #3 call peak candidates 
INFO  @ Thu, 23 Jan 2014 13:40:14: #3 use self to calculate local lambda and  filter peak candidates... 
INFO  @ Thu, 23 Jan 2014 13:40:27: #3 Finally, 32919 peaks are called! 
INFO  @ Thu, 23 Jan 2014 13:40:27: #4 Write output xls file... initial_test_peaks.xls 
INFO  @ Thu, 23 Jan 2014 13:40:27: #4 Write peak bed file... initial_test_peaks.bed 
INFO  @ Thu, 23 Jan 2014 13:40:27: #4 Write summits bed file... initial_test_summits.bed 
INFO  @ Thu, 23 Jan 2014 13:40:27: #5 Done! Check the output files! 

MACS successfully built the model and identified 32919 peaks.

For H3k27me3 data, there are two replicates, I merged them with samtools merge first

[mtang@dev1 H3k27me3]$ samtools merge merged_H3k27me3.bam SRR038975.bam SRR038976.bam 

then at HPC@UFL:

[mtang@dev1 H3k27me3]$ macs -t merged_H3k27me3.bam  -n initial_test -g mm 

it gave me only 4941 peaks.

[mtang@dev1 H3k27me3]$ wc -l merged_H3k27me3_peaks.bed 
4941 merged_H3k27me3_peaks.bed

I changed the arguments:
[mtang@dev1 H3k27me3]$ macs -t merge.bam -n no_model  -g mm --nomodel --shiftsize 73 --pvalue 1e-3 

note that I specify no peaking model building, shift size 73 (  fragment size 73x2=146 bp, a nuclesome size), and a less stringent pvalue 1e-3 ( the default is 1e-5)

This time I got 24522 peaks.

[mtang@dev1 H3k27me3]$ wc -l no_model_peaks.bed 
24522 no_model_peaks.bed

In my next post, I will use those two peak bed files generated by MACS to find the bivalent promoters.

I will switch gear and analyze the MeDIP-seq data. MeDIP-seq data are essentially similar to ChIP-seq data. Although there are specific software to analyze it, MACS  works well with this kind of data.

see a list of programs here

I have  MeDIP_shGFP.sam and MeDIP-shDnmt3L.sam files ready. 

To visualize the MeDIP data in IGV, I have the bedgraph files generated by MACS respectively. I also want to have a bedgraph file that with MeDIP-shGFP signal subtracted from MeDIP-shDnmt3L signal.

MACS2 has a bdgcpm program for that purpose

macs2 bdgcmp -t shDnmt3L_treat_afterfiting_all.bdg -c shGFP_treat_afterfiting_all.bdg -o subtracted.bdg -m subtract

Now, I have almost every file for visualization. Just to reproduce fig4E in the paper:
The first track is the MeDIP-seq signal for shGFP; the second is the MeDIP-seq signal for shDnmt3L; the third is the shDnmt3L signal subtracted with shGFP signal; the fourth track is the H3k4me3 ChIP-seq signal; the fifth is the H3k4me3 peaks called by MACS; the sixth is the H3k27me3 ChIP-seq signal; the last is the H3k27me3 peaks called by MACS.

The next question is to find the differentially changed peaks after ablating Dnmt3L. I have a post discussing this topic before

MACS2 has some new implements for this kind of purpose, but it requires a IgG or Input control file for each condition.

I only have two bam files, so I will use
diffReps needs  bed files to be the input. I converted the bam files to bed files with bedtools first:

[mtang@dev1 MeDIP-seq]$ bamToBed -i SRR764931_shGFP.bam > SRR764931_shGFP.bed
[mtang@dev1 MeDIP-seq]$ bamToBed -i SRR764932_shDnmt3L.bam > SRR764932_shDnmt3L.bed

since I do not have biological replicates, I used G-test 

[mtang@dev1 MeDIP-seq]$ -tr SRR764932_shDnmt3L.bed -co SRR764931_shGFP.bed -gn mm9 -re diff.nb.txt -me gt

It generates three files:

for more information of these three files, see the diffReps link above.

Monday, January 20, 2014

ChIP-seq analysis programs

This is the first blog in the year of 2014.  I am graduating in the coming August, there are so many things to do and I do not have much time to blog. However, I do not want it to be inactive.

I have extensive experience with ChIP-seq data analysis, and I want to list all the programs that I usually use.

With raw Fastq data, one needs to align it to the reference genome by an aligner. I usually use Bowtie, others like BWA (developed by Heng Li, a legendary bioinformatician), MAQ, SOAP etc are also very popular (

Many times, I download data from SRA, sra toolkit is used to convert the sra format to fastq, and then one can align it with bowtie.

After mapping with Bowtie, one get a sam file samtools (also developed by Heng Li) is used to convert it to bam file, a binary form of sam file.

MACS is the most widely used peak calling program developed by Tao Li, previously in shirley Liu's lab at Harvard. MACS2 now can detect differentially changed peaks (see the link above). It also generates a bedgraph file to be visualized by IGV

For motif enrichment analysis:
GREAT for GO analysis
Homer, CEAS or PAVIS to annotate peaks

you might be interested in my previous blogs also:

2014 will be an exciting year for me! Follow me and many others I follow on twitter