Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Tuesday, May 27, 2014

MACS for DNase-seq data

It has been a long time since my last post. I will be defending at the end of next month, so I am busy writing my thesis. Anyway, I want to keep a note here on using MACS to identify DNase-seq peaks.

Although MACS was developed for ChIP-seq, it can also be used for DNase-seq data. See the discussion here:
https://groups.google.com/forum/#!topic/macs-announcement/D1ZlzIJMBB8

Several responses from Tao Liu, the author of MACS:
"Hi Zoello and Batool,
Any extension size is better than no extension at all. Even if there is no meaningful fragment size, signal pileup and data smoothing are still essential for peak detection algorithm. If you consider every tag only represents 0bp fragment, how can you decide where the enrichment is?
As for DNAseI hypersensitive studies, there are two scenarios. Using human ENCODE data as example, data from Duke university and University of Washington are generated by two different protocols. The key difference is the depth of digestion. DNA fragment captured is smaller in UW library comparing to Duke library due to different levels of digestion.
In UW library, deep digestion and a gel cut for more smaller fragments can make sure the sequencing ends enriched at the boundaries of regions where the DNA is less accessible by the enzyme. These regions are more likely protected from DNaseI because of protein (TF, histone or other chromatin factors) binding in nuclei. So the following analysis can be considered similar to ChIP-seq where sonication tends to attack the boundaries of TF binding sites. In this case, tag extension towards 3' direction with hundreds of basepairs either from MACS prediction or an arbitrary setting, would work perfect. You can simply apply MACS on this kind of data. At the end, you will more likely predict where the DNA footprints are.
However the sequencing tags from Duke are ends of bigger DNA fragments. So tag extension towards 3' direction may less likely reach where the real footprint is. In this case, the aim of the study should be to look for where the DNAseI hypersensitive sites are instead of to find footprints. My opinion is to extend every tag towards both 5' and 3' directions then pile them up, therefore at the end, the regions with more pileup would be more vulnerable to DNAseI digestion. If you want to use MACS for this purpose, you may need to manipulate the raw data then turn off model building in MACS.
That's my point of view. If anyone has difference opinion, please let us know."
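
To make the two extension strategies in the reply above concrete, here is a minimal Python sketch of a tag pileup. It is only an illustration and not MACS code: the function name `pileup`, the `(position, strand)` tag representation, and the `mode` argument are all my own assumptions. `"3prime"` extends each tag downstream from its 5' end in a strand-aware way (the UW-style, ChIP-seq-like case), and `"both"` centers the extended fragment on the 5' end (the Duke-style case, where the cut site itself is of interest).

```python
from collections import Counter


def pileup(tags, ext, mode="3prime"):
    """Return per-base coverage after extending every tag to `ext` bp.

    tags: iterable of (pos, strand); pos is the 0-based 5' end of the read,
          strand is '+' or '-'.
    mode: "3prime" extends ext bp downstream of the 5' end (strand-aware);
          "both" centers an ext-bp fragment on the 5' end.
    """
    coverage = Counter()
    for pos, strand in tags:
        if mode == "3prime":
            if strand == "+":
                start, end = pos, pos + ext          # extend to the right
            else:
                start, end = pos - ext + 1, pos + 1  # extend to the left
        elif mode == "both":
            half = ext // 2
            start, end = pos - half, pos - half + ext
        else:
            raise ValueError("mode must be '3prime' or 'both'")
        for x in range(max(start, 0), end):          # clip at chromosome start
            coverage[x] += 1
    return coverage


# Two toy tags: a forward read at 100 and a reverse read at 110.
toy_tags = [(100, "+"), (110, "-")]
three_prime = pileup(toy_tags, ext=20, mode="3prime")
centered = pileup(toy_tags, ext=20, mode="both")
```

With `"3prime"` the two toy tags pile up between their 5' ends, which is where a protected footprint would sit; with `"both"` the signal is smoothed around each cut site instead.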

"Of course you can use MACS2 on DNAse-seq data analysis. As for Hotspot, it uses a 250bps sliding window for calculation tag enrichment which is equivalent to a fixed 250bps extension in MACS2 setting.  As for number of peaks, it mainly depends on cutoff. Although DHS sites range from narrow to broad, there may still be an intrinsic fragment size in the library -- check your _model.pdf file from MACS2 or try Anshul's PhantomPeak tool. After all, this 'fragment size' is a factor for smoothing method, and an approximately correct smoothing is enough to improve peak detection. You may also try other methods without data smoothing, for example, SPP from Peter Park's lab or GPS from David Gifford's lab which can detect regions mainly based on forward and reverse reads balancing."
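
As a rough sketch of what "turn off model building" with a fixed extension could look like in MACS2, the snippet below assembles a `macs2 callpeak` command line in Python. The file name `dnase.bam`, the sample name, and the helper `macs2_dnase_command` are hypothetical. The 250 bp default just mirrors the Hotspot window size mentioned above, and using a negative `--shift` of half the extension size to center the fragment on the cut site is my assumption about how to get the "extend in both directions" behaviour, so please check it against the MACS2 documentation for your version.

```python
import shlex


def macs2_dnase_command(treatment, name, extsize=250, center=True, genome="hs"):
    """Build a macs2 callpeak argument list with model building turned off.

    center=True shifts each tag upstream by half of extsize before extending,
    so the extended fragment is centered on the 5' end (the cut site).
    """
    shift = -(extsize // 2) if center else 0
    return [
        "macs2", "callpeak",
        "-t", treatment,          # aligned DNase-seq reads
        "-f", "BAM",
        "-g", genome,             # effective genome size shortcut, e.g. hs
        "-n", name,               # prefix for output files
        "--nomodel",              # skip fragment-size model building
        "--shift", str(shift),
        "--extsize", str(extsize),
    ]


cmd = macs2_dnase_command("dnase.bam", "dnase_sample")
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Setting `center=False` keeps the plain 3' extension, which matches the UW-style scenario described in the first reply.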


Please also read the related papers:

Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification 

http://www.nature.com/nmeth/journal/v11/n1/full/nmeth.2762.html

Current bioinformatic approaches to identify DNase I hypersensitive sites and genomic footprints from DNase-seq data
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3484326/

A Comparison of Peak Callers Used for DNase-Seq Data

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0096303

And a discussion on Biostars:
https://www.biostars.org/p/70087/