
Sunday, May 29, 2016

My first ever minimal working ChIP-seq pipeline using snakemake

Why use snakemake

Snakemake is a python3-based pipeline-building tool (a python variant of GNU make) specialized for bioinformatics. I put my notes on managing different versions of python here. You can write any python code inside the Snakefile. The point of using snakemake is to simplify the tedious pre-processing of large genomic data sets and to make the work reproducible. There are many other tools for this purpose, which you can find here.

Key features of snakemake

  • Snakemake automatically creates missing directories.
  • wildcards and input functions
To access a wildcard in a shell command: {wildcards.sample}
{wildcards} matching is greedy (.+): {sample}.fastq can match sampleA.fastq if there are no sub-folders, but whateverfolder/sampleA.fastq can be matched as well.
One needs to think about snakemake in a bottom-up way: snakemake first looks for the output files, substitutes the {wildcards} with the file names, figures out which rule can be used to create the output, and then looks for the input files defined by the {wildcards}.
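For example, a minimal rule sketch (just an illustration, not the pipeline below; the bowtie2 command and the hg19 index path are placeholders):

## a minimal sketch to illustrate wildcards; the aligner command is a placeholder
SAMPLES = ["sampleA", "sampleB"]

rule all:
    input: expand("mapped/{sample}.bam", sample=SAMPLES)

## requesting mapped/sampleA.bam sets the wildcard sample=sampleA; that value fills in
## the input and log, and is available in the shell command as {wildcards.sample}
rule align:
    input:  "rawfastqs/{sample}.fastq"
    output: "mapped/{sample}.bam"
    log:    "logs/{sample}.log"
    shell:  "bowtie2 -x hg19 -U {input} 2> {log} | samtools view -b - > {output}"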

Read the following examples.

A working snakemake pipeline for ChIP-seq

The folder structure is like this:
├── README.md
├── Snakefile
├── config.yaml
└── rawfastqs
    ├── sampleA
    │   ├── sampleA_L001.fastq.gz
    │   ├── sampleA_L002.fastq.gz
    │   └── sampleA_L003.fastq.gz
    ├── sampleB
    │   ├── sampleB_L001.fastq.gz
    │   ├── sampleB_L002.fastq.gz
    │   └── sampleB_L003.fastq.gz
    ├── sampleG1
    │   ├── sampleG1_L001.fastq.gz
    │   ├── sampleG1_L002.fastq.gz
    │   └── sampleG1_L003.fastq.gz
    └── sampleG2
        ├── sampleG2_L001.fastq.gz
        ├── sampleG2_L002.fastq.gz
        └── sampleG2_L003.fastq.gz

There is a folder named rawfastqs containing all the raw fastqs. Each sample sub-folder contains multiple fastq files from different lanes.
In this example, I have two control (Input) samples and two corresponding case (IP) samples:
CONTROLS = ["sampleG1","sampleG2"]
CASES = ["sampleA", "sampleB"]
They are put into lists inside the Snakefile. If there are many more samples, one needs to generate the lists programmatically with python, as sketched below.
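For example, a rough sketch of deriving the lists from the sub-folder names (the assumption that the Input control samples are the ones named sampleG* is made up for this illustration):

## build the sample lists from the sub-folders of rawfastqs instead of hard coding them
import os

ALL_SAMPLES = sorted(os.listdir("rawfastqs"))
## assumption for this sketch: Input control samples are the ones named sampleG*
CONTROLS = [s for s in ALL_SAMPLES if s.startswith("sampleG")]
CASES    = [s for s in ALL_SAMPLES if s not in CONTROLS]
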
## dry run
snakemake -np

## workflow diagram
snakemake --forceall --dag | dot -Tpng | display

To Do:

  • Make the pipeline more flexible, e.g. allow the folder name containing the raw fastqs to be specified instead of hard coding it.
  • Write a wrapper script for submitting jobs on moab, figuring out job dependencies and --immediate-submit (a rough sketch below).
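Something along these lines (untested; the msub resource options and the msub_wrapper.py script are placeholders):

## submit each snakemake job to moab; the resource options are placeholders
snakemake --jobs 100 --cluster "msub -l nodes=1:ppn=4,mem=8gb,walltime=08:00:00"

## with --immediate-submit all jobs are submitted at once, so a wrapper script has to
## translate snakemake's {dependencies} placeholder into moab job dependencies
snakemake --jobs 100 --immediate-submit --notemp --cluster "./msub_wrapper.py {dependencies}"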

Thursday, May 5, 2016

Downsampling bam files to a certain number of reads

When comparing different ChIP-seq data sets or analyzing a set of ChIP-seq data sets together (e.g. for ChromHMM analysis), it is desirable to subsample the deeply sequenced ones down to a certain number of reads (say 15 million or 30 million). From the Roadmap Epigenomics paper:
To avoid artificial differences in signal strength due to differences in sequencing depth, all consolidated histone mark data sets (except the additional histone marks in the seven deeply profiled epigenomes, Fig. 2j) were uniformly subsampled to a maximum depth of 30 million reads (the median read depth over all consolidated samples). For the seven deeply profiled reference epigenomes (Fig. 2j), histone mark data sets were subsampled to a maximum of 45 million reads (median depth). The consolidated DNase-seq data sets were subsampled to a maximum depth of 50 million reads (median depth). These uniformly subsampled data sets were then used for all further processing steps (peak calling, signal coverage tracks, chromatin states).
After reading several posts here and here, it seems samtools and sambamba are the tools to use, but they both subsample by a proportion of the reads rather than an exact number.
time samtools view -s 3.6 -b my.bam -o subsample.bam
real    6m9.141s
user    5m59.842s
sys 0m8.912s

time sambamba view -f bam -t 10 --subsampling-seed=3 -s 0.6 my.bam -o subsample.bam
real    1m34.937s
user    11m55.222s
sys 0m29.872s
-s 3.6 sets a seed of 3 and keeps 60% of the reads with samtools. Using multiple CPUs with sambamba is much faster, and an index file is generated on the fly.
If one wants to get, say, 15 million reads, one needs to run samtools flagstat or samtools idxstats to get the total number of reads, and then calculate the proportion: 15 million/total = proportion.
samtools idxstats is much faster, but it requires the bam to be sorted and indexed:
Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.
## total number of mapped reads
samtools idxstats example.bam | cut -f3 | awk 'BEGIN {total=0} {total += $1} END {print total}'
Finally, feed the proportion to the -s flag. One might want to remove the unmapped reads and the duplicated reads in the bam file before downsampling, and one might also need to sort the subsampled bam file again and index it.
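Putting it together, a rough sketch of the whole procedure (assuming my.bam is sorted and duplicates have already been marked, e.g. with picard MarkDuplicates; the 15 million target is just an example):

## keep only mapped, non-duplicate reads (flag 1028 = unmapped 4 + duplicate 1024)
samtools view -b -F 1028 my.bam -o filtered.bam
samtools index filtered.bam

## total number of mapped reads left after filtering
total=$(samtools idxstats filtered.bam | cut -f3 | awk '{sum += $1} END {print sum}')

## fraction of reads needed to end up with ~15 million
## (for samtools view -s one would prepend the seed, e.g. 3.57 for seed 3 and 57%)
frac=$(awk -v target=15000000 -v total=$total 'BEGIN {printf "%.2f", target/total}')

## subsample with sambamba, then sort and index the result
sambamba view -f bam -t 10 --subsampling-seed=3 -s $frac filtered.bam -o subsample.bam
sambamba sort -t 10 -o subsample.sorted.bam subsample.bam
sambamba index subsample.sorted.bam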