Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Monday, July 10, 2017

cores, cpus and threads

Some reading for the basics

cores, cpus and threads:
http://www.slac.stanford.edu/comp/unix/package/lsf/currdoc/lsf_admin/index.htm?lim_core_detection.html~main
Traditionally, the value of ncpus has been equal to the number of physical CPUs. However, many CPUs consist of multiple cores and threads, so the traditional 1:1 mapping is no longer useful. A more useful approach is to set ncpus to equal one of the following:
  • The number of processors
  • Cores—the number of cores (per processor) * the number of processors (this is the ncpus default setting)
  • Threads—the number of threads (per core) * the number of cores (per processor) * the number of processors
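These numbers can be checked directly on a Linux node; a quick sketch (nproc and getconf are standard, lscpu ships with util-linux):

```shell
# logical CPUs visible to the OS (sockets x cores/socket x threads/core)
nproc
getconf _NPROCESSORS_ONLN

# break the count down into sockets, cores per socket, and threads per core
lscpu | grep -E "^(Socket|Core|Thread|CPU\(s\))"
```

On a node without hyperthreading, nproc equals sockets * cores per socket.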
Understanding Linux CPU Load - when should you be worried?
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

Quote from our HPC Admin

From our HPC admin Sally Boyd:
On our systems there are actually 2 CPUs with 12 Cores each for a total of 24 ppn (processors per node).
We use CPU and Core interchangeably, but we shouldn’t. We do not use hyperthreading on any of our clusters because it breaks the MPI software (message passing interface, used for multi-node processing). You can consider one thread per processor/core. So the most threads you can have is 24. If various parts of your pipeline use multiple threads and they’re running at the same time, you might want to be sure that all of those add up to 24 and no more. The other thing is that there is some relatively new (to us) code out there that calls a multi-threaded R without specifying number of threads, or else it starts up several iterations of itself, such that the scheduler is not aware. This causes lots of issues. I don’t recall if the code you were running previously that used so many resources was one of those or not.

My problem

I was running parallelized freebayes on the cluster and needed to specify the number of cores: https://github.com/ekg/freebayes/blob/master/scripts/freebayes-parallel
The command I ran:
./freebayes-parallel regions_to_include_freebayes.bed 4 -f {config[ref_fa]} \
        --genotype-qualities \
        --ploidy 2 \
        --min-repeat-entropy 1 \
        --no-partial-observations \
        --report-genotype-likelihood-max \
        {params.outputdir}/{input[0]} {params.outputdir}/{output} 2> {params.outputdir}/{log} 
        
It uses GNU parallel under the hood. The core of the freebayes-parallel script:
regionsfile=$1
shift
ncpus=$1
shift

command=("freebayes" "$@")

(
#$command | head -100 | grep "^#" # generate header
# iterate over regions using gnu parallel to dispatch jobs
cat "$regionsfile" | parallel -k -j "$ncpus" "${command[@]}" --region {}
) | ../vcflib/scripts/vcffirstheader \
  | ../vcflib/bin/vcfstreamsort -w 1000 \
  | vcfuniq # remove duplicates at region edges
Note that freebayes-parallel hard-codes the ../vcflib/ path; one can add the vcflib bin and scripts directories to PATH and call vcffirstheader and vcfstreamsort directly.
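For example (assuming vcflib was cloned to $HOME/vcflib; adjust the path to your install):

```shell
# put the vcflib binaries and helper scripts on PATH once, e.g. in ~/.bashrc
export PATH="$HOME/vcflib/bin:$HOME/vcflib/scripts:$PATH"

# then the pipeline stages resolve by name, with no hard-coded ../vcflib/ prefix:
# ... | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq
command -v vcffirstheader || echo "vcffirstheader not on PATH yet"
```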
How many threads will be used? In my command, I specified -j 4. Effectively, the command is:
(cat regions_to_include_freebayes.bed \
| parallel -k -j 4 "freebayes --region {} -f {config[ref_fa]} \
        --genotype-qualities \
        --ploidy 2 \
        --min-repeat-entropy 1 \
        --no-partial-observations \
        --report-genotype-likelihood-max \
        {params.outputdir}/my.sorted.bam 2> {params.outputdir}/{log})  \
| vcffirstheader \
| vcfstreamsort -w 1000 \
| vcfuniq > {params.outputdir}/{output}

At least 1 (cat) + 4 (freebayes, from -j 4) + 3 (the downstream pipe stages) = 8 processes will be used.
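One way to verify this on the node is to ask ps for the kernel thread count (NLWP) of a process; a sketch using the current shell's PID ($$) as a stand-in for the real job's PID:

```shell
# NLWP = number of lightweight processes (threads) for a PID;
# substitute the PID of freebayes-parallel to audit the real job
ps -o nlwp= -p $$

# or sum threads across every process you own on the node
ps -u "$(id -un)" -o nlwp= | awk '{s += $1} END {print s}'
```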
Checking how many cores the computing node has:
cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
... (24 identical lines in total)

grep "model name" /proc/cpuinfo | wc -l
24
I reserved 12 cores to run the job. Checking the job after submitting:
bjobs -l 220806 

## some output
 RUNLIMIT                
 1440.0 min of chms025

 MEMLIMIT
     32 G 
Mon Jul  3 16:06:49: Started 12 Task(s) on Host(s) <chms025> <chms025> <chms025
                     > <chms025> <chms025> <chms025> <chms025> <chms025> <chms0
                     25> <chms025> <chms025> <chms025>, Allocated 12 Slot(s) on
                      Host(s) <chms025> <chms025> <chms025> <chms025> <chms025>
                     <chms025> <chms025> <chms025> <chms025> <chms025> <chms025
                     > <chms025>, Execution Home </rsrch2/genomic_med/krai>, Ex
                     ecution CWD </rsrch2/genomic_med/krai/scratch/TCGA_CCLE_SK
                     CM/TCGA_SKCM_FINAL_downsample_RUN/SNV_calling>;
Mon Jul  3 21:15:41: Resource usage collected.
                     The CPU time used is 2132 seconds.
                     MEM: 1.1 Gbytes;  SWAP: 2.3 Gbytes;  **NTHREAD: 17**
                     PGID: 26713;  PIDs: 26713 26719 26722 26729 26734 26783 
                     26784 26786 26788 1301 1302 1303 1304 26785 26787 26789 


 MEMORY USAGE:
 MAX MEM: 1.9 Gbytes;  AVG MEM: 1 Gbytes
It says 17 threads are used.
I went to the computing nodes, and checked PIDs related to my job:
ssh chms025
uptime
21:19:39 up 410 days,  9:33,  1 user,  load average: **5.94, 5.91, 5.87**
 
top -u krai -M -n 1 -b | grep krai
32381 krai      20   0  486m 314m 1808 R 100.0  0.1   0:01.37 freebayes                                                 
32382 krai      20   0  240m 224m 1808 R 98.4  0.1   0:01.15 freebayes                                                  
32360 krai      20   0  195m 179m 1912 R 92.6  0.0   0:02.95 freebayes                                                  
32390 krai      20   0  204m 188m 1808 R 54.0  0.0   0:00.28 freebayes                                                  
32388 krai      20   0 15568 1648  848 R  1.9  0.0   0:00.02 top                                                        
26713 krai      20   0 20388 2684 1460 S  0.0  0.0   0:41.56 res                                                        
26719 krai      20   0  103m 1256 1032 S  0.0  0.0   0:00.00 1499116008.2208                                            
26722 krai      20   0  103m  804  556 S  0.0  0.0   0:00.00 1499116008.2208                                            
26729 krai      20   0  258m  22m 4352 S  0.0  0.0   0:02.19 python                                                     
26734 krai      20   0  105m 1420 1144 S  0.0  0.0   0:00.00 bash                                                       
26783 krai      20   0  103m 1300 1060 S  0.0  0.0   0:00.00 freebayes-paral                                            
26784 krai      20   0  103m  488  244 S  0.0  0.0   0:00.00 freebayes-paral                                            
26785 krai      20   0  115m 4872 1928 S  0.0  0.0   0:05.03 python                                                     
26786 krai      20   0  100m 1288  480 S  0.0  0.0   0:00.00 cat                                                        
26787 krai      20   0 29152  11m 1344 S  0.0  0.0   1:46.80 vcfstreamsort                                              
26788 krai      20   0  139m 9.9m 2036 S  0.0  0.0   1:11.87 perl                                                       
26789 krai      20   0 21156 1580 1308 S  0.0  0.0   1:34.24 vcfuniq                                                    
31906 krai      20   0 96072 1768  840 S  0.0  0.0   0:00.00 sshd                                                       
31907 krai      20   0  106m 2076 1464 S  0.0  0.0   0:00.07 bash                                                       
32389 krai      20   0  100m  836  732 S  0.0  0.0   0:00.00 grep       
Indeed, there are 4 freebayes processes running (-j 4 from parallel), plus 1 cat, 1 vcfstreamsort and 1 vcfuniq. As for the rest: the perl process is most likely GNU parallel itself (it is a Perl program), the python processes are likely wrapper scripts such as vcffirstheader, the bash processes come from the job's shell wrappers, and the grep is from my own top | grep krai command.
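To trace where stray processes come from, one can walk the parent-PID chain; a sketch using the current shell as the starting point (substitute the mystery PID from top):

```shell
# show pid, parent pid, and command name; follow PPID upward to find
# which wrapper launched a given process
ps -o pid=,ppid=,comm= -p $$

# procps can also draw the whole ancestry tree for your user
ps -u "$(id -un)" -o pid,ppid,comm --forest
```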

Wednesday, June 28, 2017

install VEP for annotating variants

VEP is one of the most commonly used variant annotation tools, along with ANNOVAR and SnpEff, but the installation and configuration can be very intimidating.

I just went through an installation process, and put down a gist:


Thursday, June 15, 2017

bwa aln or bwa mem for short reads (36bp)

My ChIP-seq data are 36 bp single-end reads. I usually use bowtie1 for mapping ChIP-seq reads, but bowtie1 does not handle indels. Since I want to call mutations on the ChIP-seq reads, I have to use another aligner, BWA, the most popular mapper, written by Heng Li.

The GitHub page says bwa aln should be used for reads shorter than 70 bp; otherwise, bwa mem should be used.
bwa mem is the more recent algorithm (so it should be better?).
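For reference, the two invocation shapes look like this (file names are hypothetical placeholders; the block only runs bwa if it is installed and the inputs exist):

```shell
# Hypothetical inputs -- substitute your own reference and FASTQ
ref=hg19.fa
fq=chip_36bp.fq.gz

if command -v bwa >/dev/null 2>&1 && [ -f "$ref" ] && [ -f "$fq" ]; then
    bwa index "$ref"                                # index the reference once
    # bwa aln is two steps for single-end reads:
    bwa aln -t 4 "$ref" "$fq" > chip.sai            # 1) suffix-array coordinates
    bwa samse "$ref" chip.sai "$fq" > chip_aln.sam  # 2) convert to SAM
    # bwa mem maps in a single step:
    bwa mem -t 4 "$ref" "$fq" > chip_mem.sam
else
    echo "bwa or input files not found; commands shown for reference only"
fi
```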

I searched on biostar, and found When and why is bwa aln better then bwa mem?

I did a simulation test using Teaser with the default settings for each aligner.

The results are shown below:

The mapping rate:


Memory usage:


Run time:

Indeed, BWA aln is a little better than BWA mem for short reads.

For a real data set, the samtools flagstat results are shown below:

bwa aln:
282967631 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
18963259 + 0 duplicates
240660130 + 0 mapped (85.05% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem:
282967631 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
18332921 + 0 duplicates
236558306 + 0 mapped (83.60% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Indeed, bwa aln has a moderately higher mapping rate (85.05% vs 83.60%) and a shorter run time for short 36 bp reads.

Monday, June 5, 2017

Weather and crimes in Chicago

How the story starts

I was attending the 2017 American Society of Clinical Oncology (ASCO) annual meeting in Chicago. It was my first time taking an Uber from the airport to our hotel. We had a nice Uber driver and we talked a lot about the city. He said he grew up in the south part of Chicago, where there is a lot of crime. He knows the city so well that he tries his best to avoid the bad districts when picking up customers. When strangers meet, the topic is always the weather. I certainly joked about the badness of winter in Chicago, and he replied back with the badness of the heat in Houston.
He said there are around 600 shootings per year in Chicago, and on holidays such as Memorial Day, shootings rise to 40 per day. I asked him why he thought there was such an increase. He said: "I think it has to do with the weather. On Memorial Day it becomes warmer; everybody has access to everybody. That's why the crime rate is higher as well."
As a budding data scientist (that's how I define myself) who interrogates data from DNA sequencing of tumor samples, I should not just take his word for it; I wanted to verify it with some evidence. Initially, I was surprised that weather could be associated with crime rate. Then I googled around when I got to the hotel, and it turns out this notion is very common.
See a post on this topic Chicago Crime vs. Weather: Under the Hood. The author did a very good job to demonstrate the association of weather and crime rate in Chicago.
There is even a Kaggle channel for this topic.
In the post I linked, the analysis was done in python. I want to do similar and further analysis using R. That's why I am firing up RStudio and starting to type.

Download the crime data

To associate crime rate with weather, I need data sets for them respectively.
The city of Chicago has a very nice portal for the crime data.
I downloaded the crime data with the csv format. Note that the data set is relatively big (~1.5G). My computer has enough memory to hold the whole data set.
less -S ~/Downloads/Crimes_-_2001_to_present.csv | wc -l
# 6 million lines
##  6346217
# load in the libraries
library(tidyverse)
library(ggplot2)
library(lubridate)

crime<- read_csv("~/Downloads/Crimes_-_2001_to_present.csv", col_names =T)
## split the Date column, keep only the day information

crime<- crime %>% mutate(Day = gsub("([0-9]{2}/[0-9]{2}/[0-9]{4}) .+", "\\1", Date))

## change the Day column to a date type using mdy function from lubridate
crime<- crime %>% mutate(Day_date = mdy(Day)) %>% dplyr::select(-Date, -Day)

## what's the date range for the crime data?
range(crime$Day_date)
## [1] "2001-01-01" "2017-05-27"

Get the weather data

The weather data is a bit harder to get if you follow the linked post above. Luckily, I found a prepared weather data set on the Kaggle channel (it was uploaded just two days before I wrote this! Lucky me).
UPDATE: unfortunately, that weather data set was not clean; there are many missing dates.
On Twitter, Kevin Johnson @bigkage suggested the weatherData package.
Only the development branch on GitHub works; the package on CRAN does not…
Take a look at the rnoaa package as well; a post on it: https://recology.info/2015/07/weather-data-with-rnoaa/
library("devtools")
#install_github("Ram-N/weatherData")
library(weatherData)

# O'Hare airport code is ORD, worked like a charm!
# other columns can be fetched. ?getWeatherForDate. for me, only temperature is needed
weather <- getWeatherForDate("ORD", "2001-01-01", "2017-05-27")
dim(weather)
However, only 388 records were found.
I opened an issue on github.
The reason for this is that weatherUnderground doesn’t want us to pull a huge volume of data in a single file. (For some reason, they truncate the number of rows to be 400 rows or less.)
To fetch all the data:
multi_weather <- lapply(as.character(2001:2017), 
                        function(x){getWeatherForYear("ORD", x)})

weather <- do.call(rbind, multi_weather)
Merge the two data sets
str(weather)
## 'data.frame':    5977 obs. of  4 variables:
##  $ Date             : POSIXlt, format: "2001-01-01" "2001-01-02" ...
##  $ Max_TemperatureF : int  17 44 50 57 46 54 39 45 46 51 ...
##  $ Mean_TemperatureF: int  6 24 34 42 36 40 32 32 32 40 ...
##  $ Min_TemperatureF : int  -6 3 19 26 26 26 26 18 17 28 ...
str(crime)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6346216 obs. of  22 variables:
##  $ ID                  : int  5437456 5437458 5437459 5437460 5437461 5437462 5437463 5437464 5437465 5437466 ...
##  $ Case Number         : chr  "HN270582" "HN237768" "HN247565" "HN270399" ...
##  $ Block               : chr  "028XX E 90TH ST" "013XX S CHRISTIANA AVE" "024XX S SAWYER AVE" "086XX S LAFLIN ST" ...
##  $ IUCR                : chr  "1822" "2024" "2022" "1320" ...
##  $ Primary Type        : chr  "NARCOTICS" "NARCOTICS" "NARCOTICS" "CRIMINAL DAMAGE" ...
##  $ Description         : chr  "MANU/DEL:CANNABIS OVER 10 GMS" "POSS: HEROIN(WHITE)" "POSS: COCAINE" "TO VEHICLE" ...
##  $ Location Description: chr  "STREET" "SIDEWALK" "SIDEWALK" "OTHER" ...
##  $ Arrest              : chr  "true" "true" "true" "false" ...
##  $ Domestic            : chr  "false" "false" "false" "true" ...
##  $ Beat                : chr  "0423" "1021" "1024" "0614" ...
##  $ District            : chr  "004" "010" "010" "006" ...
##  $ Ward                : int  10 24 22 21 4 10 28 32 16 7 ...
##  $ Community Area      : int  46 29 30 71 35 55 27 7 61 48 ...
##  $ FBI Code            : chr  "18" "18" "18" "14" ...
##  $ X Coordinate        : int  1196743 1154244 1155082 1167841 1180453 1199105 1154131 1165540 1164423 1192802 ...
##  $ Y Coordinate        : int  1845841 1893642 1887583 1847481 1881415 1816541 1900784 1917732 1870859 1843576 ...
##  $ Year                : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Updated On          : chr  "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" ...
##  $ Latitude            : num  41.7 41.9 41.8 41.7 41.8 ...
##  $ Longitude           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ Location            : chr  "(41.73185728, -87.55483341)" "(41.863979388, -87.709253509)" "(41.84733605, -87.706339649)" "(41.737026255, -87.660665675)" ...
##  $ Day_date            : Date, format: "2007-04-07" "2007-03-20" ...
# change the POSIXlt to date 
weather$Date<- as.Date(weather$Date)
crime_weather<- inner_join(crime, weather, by = c("Day_date" = "Date"))

## Note that weather information on some dates were missing
anti_join(crime, weather, by = c("Day_date" = "Date")) %>% .$Day_date %>% unique()
##  [1] "2005-01-23" "2004-07-28" "2001-12-25" "2001-12-24" "2001-12-22"
##  [6] "2001-11-04" "2001-11-03" "2004-08-21" "2001-05-28" "2004-08-22"
## [11] "2002-11-24" "2002-02-05" "2001-07-30" "2001-06-18" "2006-01-07"
## [16] "2001-12-23" "2004-08-20" "2006-01-06" "2006-01-08" "2007-01-01"
## [21] "2007-01-04" "2007-01-05" "2001-07-29"
Continue reading at Rpub.

Monday, May 22, 2017

when NAs are not NAs

Using TCGAbiolinks, I downloaded RNAseq data for LUAD and LUSC
library(TCGAbiolinks)
library(SummarizedExperiment)

# query_rna_LUAD.hg38 <- GDCquery(project = "TCGA-LUAD", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# 
# query_rna_LUSC.hg38 <- GDCquery(project = "TCGA-LUSC", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# GDCdownload(query_rna_LUAD.hg38, method = "client")
# GDCdownload(query_rna_LUSC.hg38, method = "client")
# 
# LUAD_rna_data <- GDCprepare(query_rna_LUAD.hg38)
# LUSC_rna_data <- GDCprepare(query_rna_LUSC.hg38)

# I have saved both R objects into disk
load("~/projects/mix_histology/data/TCGA_rna/TCGA_lung_rna.rda")
# a RangedSummarizedExperiment object
LUSC_rna_data
## class: RangedSummarizedExperiment 
## dim: 57035 551 
## metadata(0):
## assays(1): HTSeq - Counts
## rownames(57035): ENSG00000000003 ENSG00000000005 ...
##   ENSG00000281912 ENSG00000281920
## rowData names(3): ensembl_gene_id external_gene_name
##   original_ensembl_gene_id
## colnames(551): TCGA-77-8009-01A-11R-2187-07
##   TCGA-34-5239-01A-21R-1820-07 ... TCGA-NK-A7XE-01A-12R-A405-07
##   TCGA-43-6773-11A-01R-1949-07
## colData names(69): patient barcode ...
##   subtype_Homozygous.Deletions subtype_Expression.Subtype
The problem is with the metadata:
LUSC_coldata<- colData(LUSC_rna_data)
LUAD_coldata<- colData(LUAD_rna_data)

We will see the different representations of NAs:

table(LUAD_coldata$subtype_Smoking.Status, useNA = "ifany")
## 
##      Current reformed smoker for > 15 years 
##                                          78 
## Current reformed smoker for < or = 15 years 
##                                          77 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                             [Not Available] 
##                                          11 
##                                        <NA> 
##                                         349
table(LUSC_coldata$subtype_Smoking.Status, useNA = "ifany")
## 
##      Current reformed smoker for > 15 years 
##                                          51 
## Current reformed smoker for < or = 15 years 
##                                          87 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                         N/A 
##                                           6 
##                                        <NA> 
##                                         372
We see that NAs are represented either as real <NA>, as [Not Available], or as N/A. The first thing to do is to tidy the metadata, changing all of them to <NA>.
I will use stringr from the tidyverse packages.
library(stringr)
# one needs to \\ to escape [ and ]
str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA)
## Error: `replacement` must be a character vector
This does not quite work: stringr enforces that the replacement is a character vector. Use NA_character_ instead:
str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA_character_) %>% table(useNA = "ifany")
## .
## Current reformed smoker for < or = 15 years 
##                                          77 
##      Current reformed smoker for > 15 years 
##                                          78 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                                        <NA> 
##                                         360
Similarly:
str_replace(LUSC_coldata$subtype_Smoking.Status, "N/A", NA_character_) %>% table(useNA = "ifany")
## .
## Current reformed smoker for < or = 15 years 
##                                          87 
##      Current reformed smoker for > 15 years 
##                                          51 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                        <NA> 
##                                         378
To test whether a vector contains any NAs:
# this is invalid
# LUAD_coldata$subtype_Smoking.Status == NA, instead, use is.na()
is.na(LUAD_coldata$subtype_Smoking.Status) %>% table()
## .
## FALSE  TRUE 
##   245   349
The traditional base-R way to change a string to NA is:
myvector<- c("NA", "NA", "a", "b", "c", NA)
myvector
## [1] "NA" "NA" "a"  "b"  "c"  NA
myvector[myvector =="NA"]<- NA
myvector
## [1] NA  NA  "a" "b" "c" NA

Conclusions

  1. Data are messy, even when they are packaged by a big project such as TCGA.
  2. Different representations of NA can be a devil in your data; tidying them to R’s native NA is needed.
  3. stringr is a great tool to have in your belt for tidying data.