Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Sunday, May 29, 2016

My first ever minimal working ChIP-seq pipeline using snakemake

Why using snakemake

Snakemake is a python3 based pipeline building tool (a python variant of GNU make) specialized for bioinformatics. I put my notes managing different versions of python here. You can write any python codes inside the Snakefile. Using snakemake is to simplify the tedious pre-processing work for large genomic data sets and for the sake of reproducibility. There are many other tools you can find here for this purpose.

Key features of snakemake

  • Snakemake automatically creates missing directories.
  • wildcards and Input function
To access wildcards in a shell command: {wildcards.sample}
{wildcards} is greedy (.+){sample}.fastq could be matching sampleA.fastq if there is no sub-folder anymore, but evenwhateverfolder/sampleA.fastq can be matched as well.
One needs to think snakemake in a bottom-up way: snakemake will first look for the output files, and substitue the {wildcards} with the file names, and look for which rule can be used to creat the output, and then look for input files that are defined by the {wildcards}.

Read the following

examples

A working snakemake pipeline for ChIP-seq

The folder structure is like this:
├── README.md
├── Snakemake
├── config.yaml
└── rawfastqs
    ├── sampleA
    │   ├── sampleA_L001.fastq.gz
    │   ├── sampleA_L002.fastq.gz
    │   └── sampleA_L003.fastq.gz
    ├── sampleB
    │   ├── sampleB_L001.fastq.gz
    │   ├── sampleB_L002.fastq.gz
    │   └── sampleB_L003.fastq.gz
    ├── sampleG1
    │   ├── sampleG1_L001.fastq.gz
    │   ├── sampleG1_L002.fastq.gz
    │   └── sampleG1_L003.fastq.gz
    └── sampleG2
        ├── sampleG2_L001.fastq.gz
        ├── sampleG2_L002.fastq.gz
        └── sampleG2_L003.fastq.gz

There is a folder named rawfastqs containing all the raw fastqs. each sample subfolder contains multiple fastq files from different lanes.
In this example, I have two control (Input) samples and two corresponding case(IP) samples.
CONTROLS = ["sampleG1","sampleG2"]
CASES = ["sampleA", "sampleB"]
putting them in a list inside the Snakefile. If there are many more samples, need to generate it with pythonprogrammatically.
## dry run
snakemake -np

## work flow diagram
snakemake --forceall --dag | dot -Tpng | display

To Do:

  • Make the pipeline more flexiable. e.g. specify the folder name containing raw fastqs, now it is hard coded.
  • write a wrapper script for submitting jobs in moab. Figuring out dependencies and --immediate-submit

2 comments:

  1. Just found your website by following you through twitter. You have lots of great stuff! I will be back frequently to check out your past (and future) posts. I love R and plan to learn more about python and perl. I am also a wet scientist, sometimes literally, having worked on salmonid genetics for over 18 years. Now working on sugar beet genetics. Cheers! -Joyce

    ReplyDelete
    Replies
    1. Hi Joyce, thanks very much for your good words. it is good to learn!

      Delete