Why using snakemake
Snakemake is a python3 based pipeline building tool (a python variant of GNU make) specialized for bioinformatics. I put my notes managing different versions of python here. You can write any python codes inside the Snakefile. Using snakemake is to simplify the tedious pre-processing work for large genomic data sets and for the sake of reproducibility. There are many other tools you can find here for this purpose.
Key features of snakemake
- Snakemake automatically creates missing directories.
- wildcards and Input function
To access wildcards in a shell command:
{wildcards.sample}
{wildcards}
is greedy (.+)
: {sample}.fastq
could be matching sampleA.fastq
if there is no sub-folder anymore, but evenwhateverfolder/sampleA.fastq
can be matched as well.
One needs to think snakemake in a bottom-up way: snakemake will first look for the output files, and substitue the
{wildcards} with
the file names, and look for which rule can be used to creat the output, and then look for input files that are defined by the {wildcards}
.Read the following
flexible bioinformatics pipelines with snakemake
Build bioinformatics pipelines with Snakemake
snakemake ChIP-seq pipeline example
submit all the jobs immediately
snakemake-parallel-bwa
RNA-seq snakemake example
functions as inputs and derived parameters
snakemake FAQ
snakemake tutorial from the developer
Build bioinformatics pipelines with Snakemake
snakemake ChIP-seq pipeline example
submit all the jobs immediately
snakemake-parallel-bwa
RNA-seq snakemake example
functions as inputs and derived parameters
snakemake FAQ
snakemake tutorial from the developer
examples
https://github.com/slowkow/snakefiles/blob/master/bsub.py
https://github.com/broadinstitute/viral-ngs/tree/master/pipes
https://github.com/broadinstitute/viral-ngs/tree/master/pipes
A working snakemake pipeline for ChIP-seq
The folder structure is like this:
├── README.md
├── Snakemake
├── config.yaml
└── rawfastqs
├── sampleA
│ ├── sampleA_L001.fastq.gz
│ ├── sampleA_L002.fastq.gz
│ └── sampleA_L003.fastq.gz
├── sampleB
│ ├── sampleB_L001.fastq.gz
│ ├── sampleB_L002.fastq.gz
│ └── sampleB_L003.fastq.gz
├── sampleG1
│ ├── sampleG1_L001.fastq.gz
│ ├── sampleG1_L002.fastq.gz
│ └── sampleG1_L003.fastq.gz
└── sampleG2
├── sampleG2_L001.fastq.gz
├── sampleG2_L002.fastq.gz
└── sampleG2_L003.fastq.gz
There is a folder named
rawfastqs
containing all the raw fastqs. each sample subfolder contains multiple fastq files from different lanes.
In this example, I have two control (Input) samples and two corresponding case(IP) samples.
CONTROLS = ["sampleG1","sampleG2"]
CASES = ["sampleA", "sampleB"]
putting them in a list inside the
Snakefile
. If there are many more samples, need to generate it with python
programmatically.## dry run
snakemake -np
## work flow diagram
snakemake --forceall --dag | dot -Tpng | display
To Do:
- Make the pipeline more flexiable. e.g. specify the folder name containing raw fastqs, now it is hard coded.
- write a wrapper script for submitting jobs in
moab
. Figuring out dependencies and--immediate-submit
Just found your website by following you through twitter. You have lots of great stuff! I will be back frequently to check out your past (and future) posts. I love R and plan to learn more about python and perl. I am also a wet scientist, sometimes literally, having worked on salmonid genetics for over 18 years. Now working on sugar beet genetics. Cheers! -Joyce
ReplyDeleteHi Joyce, thanks very much for your good words. it is good to learn!
Deletesalg kopi eksklusive klokker, der kombinerer elegant stil og avanceret teknologi, en række forskellige stilarter af salg accessoires kopi eksklusive klokker, går markøren mellem din eksklusive smagstil.
ReplyDelete