Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Sunday, May 29, 2016

My first ever minimal working ChIP-seq pipeline using snakemake

Why using snakemake

Snakemake is a python3 based pipeline building tool (a python variant of GNU make) specialized for bioinformatics. I put my notes managing different versions of python here. You can write any python codes inside the Snakefile. Using snakemake is to simplify the tedious pre-processing work for large genomic data sets and for the sake of reproducibility. There are many other tools you can find here for this purpose.

Key features of snakemake

  • Snakemake automatically creates missing directories.
  • wildcards and Input function
To access wildcards in a shell command: {wildcards.sample}
{wildcards} is greedy (.+){sample}.fastq could be matching sampleA.fastq if there is no sub-folder anymore, but evenwhateverfolder/sampleA.fastq can be matched as well.
One needs to think snakemake in a bottom-up way: snakemake will first look for the output files, and substitue the {wildcards} with the file names, and look for which rule can be used to creat the output, and then look for input files that are defined by the {wildcards}.

Read the following

examples

A working snakemake pipeline for ChIP-seq

The folder structure is like this:
├── README.md
├── Snakemake
├── config.yaml
└── rawfastqs
    ├── sampleA
    │   ├── sampleA_L001.fastq.gz
    │   ├── sampleA_L002.fastq.gz
    │   └── sampleA_L003.fastq.gz
    ├── sampleB
    │   ├── sampleB_L001.fastq.gz
    │   ├── sampleB_L002.fastq.gz
    │   └── sampleB_L003.fastq.gz
    ├── sampleG1
    │   ├── sampleG1_L001.fastq.gz
    │   ├── sampleG1_L002.fastq.gz
    │   └── sampleG1_L003.fastq.gz
    └── sampleG2
        ├── sampleG2_L001.fastq.gz
        ├── sampleG2_L002.fastq.gz
        └── sampleG2_L003.fastq.gz

There is a folder named rawfastqs containing all the raw fastqs. each sample subfolder contains multiple fastq files from different lanes.
In this example, I have two control (Input) samples and two corresponding case(IP) samples.
CONTROLS = ["sampleG1","sampleG2"]
CASES = ["sampleA", "sampleB"]
putting them in a list inside the Snakefile. If there are many more samples, need to generate it with pythonprogrammatically.
## dry run
snakemake -np

## work flow diagram
snakemake --forceall --dag | dot -Tpng | display

To Do:

  • Make the pipeline more flexiable. e.g. specify the folder name containing raw fastqs, now it is hard coded.
  • write a wrapper script for submitting jobs in moab. Figuring out dependencies and --immediate-submit

3 comments:

  1. Just found your website by following you through twitter. You have lots of great stuff! I will be back frequently to check out your past (and future) posts. I love R and plan to learn more about python and perl. I am also a wet scientist, sometimes literally, having worked on salmonid genetics for over 18 years. Now working on sugar beet genetics. Cheers! -Joyce

    ReplyDelete
    Replies
    1. Hi Joyce, thanks very much for your good words. it is good to learn!

      Delete
  2. salg kopi eksklusive klokker, der kombinerer elegant stil og avanceret teknologi, en række forskellige stilarter af salg accessoires kopi eksklusive klokker, går markøren mellem din eksklusive smagstil.

    ReplyDelete