Diving into Genetics and Genomics: My first ever minimal working ChIP-seq pipeline using snakemake

Sunday, May 29, 2016


Why use snakemake

Snakemake is a Python 3 based pipeline-building tool (a Python variant of GNU make) specialized for bioinformatics. I put my notes on managing different versions of Python here. You can write any Python code inside the Snakefile. The point of using snakemake is to simplify the tedious pre-processing work for large genomic data sets and to make the analysis reproducible. There are many other tools for this purpose, which you can find here.

Key features of snakemake

Snakemake automatically creates missing directories.
Wildcards and input functions

To access wildcards in a shell command: {wildcards.sample}

{wildcards} is greedy (.+): {sample}.fastq matches sampleA.fastq when the file is not in a sub-folder, but it can also match whateverfolder/sampleA.fastq, in which case {sample} becomes whateverfolder/sampleA.
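The greedy matching can be reproduced with a plain regular expression. The sketch below is not Snakemake's own code: it only assumes, as stated above, that an unconstrained wildcard behaves like `.+`. The helper name `pattern_to_regex` and its `constraint` argument are made up for illustration.

```python
import re


def pattern_to_regex(pattern, constraint=".+"):
    """Turn a '{name}.fastq'-style pattern into a compiled regex.

    Hypothetical helper for illustration only. Every {name} becomes a
    named group matching `constraint`; each wildcard name may appear
    only once in the pattern.
    """
    # re.split with one capture group alternates literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal piece of the file name
        else:
            regex += "(?P<{}>{})".format(part, constraint)  # wildcard
    return re.compile(regex)


# Unconstrained wildcard: '.+' happily swallows the folder separator.
greedy = pattern_to_regex("{sample}.fastq")
flat = greedy.fullmatch("sampleA.fastq").group("sample")
nested = greedy.fullmatch("whateverfolder/sampleA.fastq").group("sample")

# Constrained wildcard: '[^/]+' refuses anything containing a slash.
strict = pattern_to_regex("{sample}.fastq", constraint="[^/]+")
rejected = strict.fullmatch("whateverfolder/sampleA.fastq")
```

Here `flat` is `sampleA`, `nested` is `whateverfolder/sampleA`, and `rejected` is `None`. In a real Snakefile the same kind of restriction can be written inline as `{sample,[^/]+}` or with a `wildcard_constraints` directive.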

One needs to think about snakemake in a bottom-up way: snakemake starts from the requested output files, looks for a rule whose output pattern can create them, substitutes the {wildcards} with the matching parts of the file names, and then looks for the input files defined by those {wildcards}.
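To make the bottom-up idea concrete, here is a toy resolver in plain Python. It is a sketch of the idea only, not how Snakemake is implemented: the rule names, folder names (`01seq`, `02aln`) and the `RULES` table are hypothetical, and each rule has exactly one input and one output.

```python
import re

# Toy rule table: output pattern -> input pattern (all names are hypothetical).
RULES = {
    "merge_fastqs": {"output": "01seq/{sample}.fastq", "input": "rawfastqs/{sample}"},
    "align": {"output": "02aln/{sample}.bam", "input": "01seq/{sample}.fastq"},
}


def output_regex(pattern):
    """Compile an output pattern, treating each {wildcard} as a greedy '.+'."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += re.escape(part) if i % 2 == 0 else "(?P<{}>.+)".format(part)
    return re.compile(regex)


def resolve(target):
    """Find the first rule whose output matches `target`.

    Returns (rule_name, wildcards, input_file), or None if no rule fits.
    """
    for name, rule in RULES.items():
        match = output_regex(rule["output"]).fullmatch(target)
        if match:
            wildcards = match.groupdict()
            # The wildcards recovered from the output fill in the input.
            return name, wildcards, rule["input"].format(**wildcards)
    return None


def plan(target):
    """Walk backwards from `target` and return rule names in execution order."""
    steps = []
    hit = resolve(target)
    while hit is not None:
        name, _, needed_input = hit
        steps.append(name)
        hit = resolve(needed_input)  # who can make the input we now need?
    steps.reverse()
    return steps


order = plan("02aln/sampleA.bam")
```

Asking for `02aln/sampleA.bam` first selects `align`, which fixes `sample` to `sampleA` and requests `01seq/sampleA.fastq`; that in turn selects `merge_fastqs`. The resulting `order` is `["merge_fastqs", "align"]`: the plan is discovered from the final output backwards and then executed forwards.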

Read the following

flexible bioinformatics pipelines with snakemake
Build bioinformatics pipelines with Snakemake
snakemake ChIP-seq pipeline example
submit all the jobs immediately
snakemake-parallel-bwa
RNA-seq snakemake example
functions as inputs and derived parameters
snakemake FAQ
snakemake tutorial from the developer

examples

https://github.com/slowkow/snakefiles/blob/master/bsub.py
https://github.com/broadinstitute/viral-ngs/tree/master/pipes

A working snakemake pipeline for ChIP-seq

The folder structure is like this:

├── README.md
├── Snakemake
├── config.yaml
└── rawfastqs
    ├── sampleA
    │   ├── sampleA_L001.fastq.gz
    │   ├── sampleA_L002.fastq.gz
    │   └── sampleA_L003.fastq.gz
    ├── sampleB
    │   ├── sampleB_L001.fastq.gz
    │   ├── sampleB_L002.fastq.gz
    │   └── sampleB_L003.fastq.gz
    ├── sampleG1
    │   ├── sampleG1_L001.fastq.gz
    │   ├── sampleG1_L002.fastq.gz
    │   └── sampleG1_L003.fastq.gz
    └── sampleG2
        ├── sampleG2_L001.fastq.gz
        ├── sampleG2_L002.fastq.gz
        └── sampleG2_L003.fastq.gz

There is a folder named rawfastqs containing all the raw fastqs. Each sample subfolder contains multiple fastq files from different lanes.

In this example, I have two control (Input) samples and two corresponding case (IP) samples.

CONTROLS = ["sampleG1","sampleG2"]
CASES = ["sampleA", "sampleB"]

I put them in lists inside the Snakefile. If there are many more samples, the lists need to be generated programmatically with Python.
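One possible way to generate the lists is to scan the rawfastqs folder, since each sample has its own sub-folder. This is a minimal sketch, not the code used in the pipeline. The function names are made up, and the rule for telling controls from cases (a `sampleG` name prefix) is purely an assumption based on the example names above; a real project would more likely read that mapping from a sample sheet or config.yaml.

```python
import os
from glob import glob


def collect_fastqs(fastq_dir="rawfastqs"):
    """Map each sample sub-folder name to its sorted list of .fastq.gz files."""
    samples = {}
    for sample_dir in sorted(glob(os.path.join(fastq_dir, "*"))):
        if os.path.isdir(sample_dir):
            sample = os.path.basename(sample_dir)
            samples[sample] = sorted(glob(os.path.join(sample_dir, "*.fastq.gz")))
    return samples


def split_cases_controls(samples, control_prefix="sampleG"):
    """Split sample names into (cases, controls).

    Assumption for illustration: control samples share a name prefix.
    """
    controls = [s for s in samples if s.startswith(control_prefix)]
    cases = [s for s in samples if not s.startswith(control_prefix)]
    return cases, controls
```

Inside the Snakefile this could be used as `CASES, CONTROLS = split_cases_controls(collect_fastqs())`, and the per-sample file lists returned by `collect_fastqs` are a natural fit for an input function that gathers all lanes of one sample.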

## dry run
snakemake -np

## work flow diagram
snakemake --forceall --dag | dot -Tpng | display

To Do:

Make the pipeline more flexible, e.g. specify the name of the folder containing the raw fastqs; it is hard-coded now.
Write a wrapper script for submitting jobs in Moab. Figure out dependencies and --immediate-submit.

3 comments:

Joyce Faler, June 8, 2016 at 10:02 PM
Just found your website by following you through twitter. You have lots of great stuff! I will be back frequently to check out your past (and future) posts. I love R and plan to learn more about python and perl. I am also a wet scientist, sometimes literally, having worked on salmonid genetics for over 18 years. Now working on sugar beet genetics. Cheers! -Joyce
