Why use snakemake
Snakemake is a Python 3 based pipeline building tool (a Python variant of GNU make) specialized for bioinformatics. I put my notes on managing different versions of Python here. You can write any Python code inside the Snakefile. I use snakemake to simplify the tedious pre-processing work for large genomic data sets and for the sake of reproducibility. There are many other tools for this purpose; you can find them here.
Key features of snakemake
- Snakemake automatically creates missing directories.
- Wildcards and input functions.
To access wildcards in a shell command, use {wildcards.sample}.
A wildcard such as {sample} is greedy (it matches the regex .+): {sample}.fastq matches sampleA.fastq when there are no sub-folders, but it can match whateverfolder/sampleA.fastq as well.
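The greedy matching described above can be imitated with plain regular expressions. The following is a minimal sketch, not Snakemake's own code: the helper names pattern_to_regex and match_wildcards are hypothetical, and the only assumption taken from Snakemake is that an unconstrained wildcard behaves like the regex .+.

```python
import re

def pattern_to_regex(pattern, constraints=None):
    """Turn a pattern like '{sample}.fastq' into a compiled regex.

    Assumption: every {name} becomes a named group that defaults to '.+',
    which imitates (but is not) Snakemake's wildcard handling.
    """
    constraints = constraints or {}
    # re.split with a capture group yields: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal text between wildcards
        else:
            regex += "(?P<%s>%s)" % (part, constraints.get(part, ".+"))
    return re.compile(regex)

def match_wildcards(pattern, path, constraints=None):
    """Return the wildcard values if path matches pattern, else None."""
    m = pattern_to_regex(pattern, constraints).fullmatch(path)
    return m.groupdict() if m else None

print(match_wildcards("{sample}.fastq", "sampleA.fastq"))
# {'sample': 'sampleA'}
print(match_wildcards("{sample}.fastq", "whateverfolder/sampleA.fastq"))
# {'sample': 'whateverfolder/sampleA'}
print(match_wildcards("{sample}.fastq", "whateverfolder/sampleA.fastq",
                      {"sample": "[^/]+"}))
# None
```

The last call shows why restricting a wildcard to [^/]+ keeps it from swallowing folder names. In a real Snakefile the same kind of restriction is expressed with wildcard_constraints.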
One needs to think about snakemake in a bottom-up way: snakemake first looks at the requested output files, matches the file names against the {wildcards} in the rules' output patterns, finds which rule can be used to create the output, and then looks for the input files that are defined by the {wildcards}.
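The bottom-up resolution described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the two rules, their file patterns, and the plan function are hypothetical, and real Snakemake does much more (multiple inputs, timestamps, ambiguity checks).

```python
import re

# Hypothetical toy rules: each maps one output pattern to one input pattern.
RULES = {
    "align": {"output": "bams/{sample}.bam", "input": "fastqs/{sample}.fastq"},
    "call_peaks": {"output": "peaks/{sample}.bed", "input": "bams/{sample}.bam"},
}

def to_regex(pattern):
    """Convert '{name}' placeholders into greedy named regex groups."""
    parts = re.split(r"\{(\w+)\}", pattern)
    return "".join(
        re.escape(p) if i % 2 == 0 else "(?P<%s>.+)" % p
        for i, p in enumerate(parts)
    )

def plan(target, existing_files):
    """Return (rule, output) pairs needed to build target, in run order."""
    if target in existing_files:
        return []  # nothing to do, the file is already there
    for name, rule in RULES.items():
        m = re.fullmatch(to_regex(rule["output"]), target)
        if m:
            # The wildcards recovered from the output define the input.
            needed_input = rule["input"].format(**m.groupdict())
            return plan(needed_input, existing_files) + [(name, target)]
    raise ValueError("no rule produces " + target)

print(plan("peaks/sampleA.bed", {"fastqs/sampleA.fastq"}))
# [('align', 'bams/sampleA.bam'), ('call_peaks', 'peaks/sampleA.bed')]
```

The request starts from the final file (peaks/sampleA.bed) and walks backwards until it reaches a file that already exists, which is the direction to keep in mind when writing rules.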
Read the following
- flexible bioinformatics pipelines with snakemake
- Build bioinformatics pipelines with Snakemake
- snakemake ChIP-seq pipeline example
- submit all the jobs immediately
- snakemake-parallel-bwa
- RNA-seq snakemake example
- functions as inputs and derived parameters
- snakemake FAQ
- snakemake tutorial from the developer
Examples
https://github.com/slowkow/snakefiles/blob/master/bsub.py
https://github.com/broadinstitute/viral-ngs/tree/master/pipes
A working snakemake pipeline for ChIP-seq
The folder structure is like this:
├── README.md
├── Snakemake
├── config.yaml
└── rawfastqs
    ├── sampleA
    │   ├── sampleA_L001.fastq.gz
    │   ├── sampleA_L002.fastq.gz
    │   └── sampleA_L003.fastq.gz
    ├── sampleB
    │   ├── sampleB_L001.fastq.gz
    │   ├── sampleB_L002.fastq.gz
    │   └── sampleB_L003.fastq.gz
    ├── sampleG1
    │   ├── sampleG1_L001.fastq.gz
    │   ├── sampleG1_L002.fastq.gz
    │   └── sampleG1_L003.fastq.gz
    └── sampleG2
        ├── sampleG2_L001.fastq.gz
        ├── sampleG2_L002.fastq.gz
        └── sampleG2_L003.fastq.gz
There is a folder named rawfastqs containing all the raw fastqs. Each sample subfolder contains multiple fastq files from different lanes.
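Given this layout, the lane-level files of one sample can be collected with a small helper. This is a minimal sketch: the function name lane_fastqs and the _L*.fastq.gz naming pattern are assumptions read off the tree above, not code from the pipeline itself.

```python
from pathlib import Path

def lane_fastqs(sample, raw_dir="rawfastqs"):
    """Return the sorted lane-level fastq.gz files for one sample.

    Assumes the layout <raw_dir>/<sample>/<sample>_L<lane>.fastq.gz.
    """
    sample_dir = Path(raw_dir) / sample
    return sorted(str(p) for p in sample_dir.glob(sample + "_L*.fastq.gz"))
```

In a Snakefile, a helper like this could serve as an input function, for example input: lambda wildcards: lane_fastqs(wildcards.sample), so that a rule merging lanes receives all the files of the sample it is building.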
In this example, I have two control (Input) samples and two corresponding case (IP) samples.
CONTROLS = ["sampleG1","sampleG2"]
CASES = ["sampleA", "sampleB"]
Put them in lists inside the Snakefile. If there are many more samples, generate the lists programmatically with Python.
## dry run
snakemake -np
## work flow diagram
snakemake --forceall --dag | dot -Tpng | display
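The CONTROLS and CASES lists above are hard-coded. For many samples they could be derived from the folder layout instead. The following is a sketch under assumptions: list_samples and split_samples are hypothetical helpers, and the set of control names still has to come from somewhere (for example a config file), because the folder names alone do not say which samples are controls.

```python
from pathlib import Path

def list_samples(raw_dir="rawfastqs"):
    """Return the names of all sample sub-folders of raw_dir, sorted."""
    return sorted(p.name for p in Path(raw_dir).iterdir() if p.is_dir())

def split_samples(samples, control_names):
    """Split samples into (controls, cases) given the known control names."""
    controls = [s for s in samples if s in control_names]
    cases = [s for s in samples if s not in control_names]
    return controls, cases

# Example with the sample names used in this pipeline:
CONTROLS, CASES = split_samples(
    ["sampleA", "sampleB", "sampleG1", "sampleG2"],
    control_names={"sampleG1", "sampleG2"},
)
print(CONTROLS)  # ['sampleG1', 'sampleG2']
print(CASES)     # ['sampleA', 'sampleB']
```

Passing raw_dir as a parameter also avoids hard-coding the folder name inside the helper.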
To Do:
- Make the pipeline more flexible, e.g. allow specifying the name of the folder containing the raw fastqs; it is currently hard-coded.
- Write a wrapper script for submitting jobs in moab. Figure out dependencies and --immediate-submit.