Diving into Genetics and Genomics: 2015

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Wednesday, December 23, 2015

Using wget to download specific files from ftp but avoiding the directory structure

I want to download some files from a ftp site, and I only want to download some files with names matching a pattern. How can I do it?
Use wget ! It is a very versatile command and I just got to know several tricks.

When there are many levels of folder, you want to search down to all the folders:

-r --recursive Turn on recursive retrieving.

   -l depth
   --level=depth
       Specify recursion maximum depth level depth.  The default maximum depth is 5.

You can specify what files you want to download or reject using wild cards:

Recursive Accept/Reject Options
-A acclist --accept acclist
-R rejlist --reject rejlist
Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.

If you want to save the file to a different name:

-O file --output-document=file
The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If - is used as file, documents will be printed to standard output, disabling link conversion. (Use ./- to print to a file literally named -.)

       Use of -O is not intended to mean simply "use the name file instead of the
       one in the URL;" rather, it is analogous to shell redirection: wget -O file
       http://foo is intended to work like wget -O - http://foo > file; file will be
       truncated immediately, and all downloaded content will be written there.

       For this reason, -N (for timestamp-checking) is not supported in combination
       with -O: since file is always newly created, it will always have a very new
       timestamp. A warning will be issued if this combination is used.

       Similarly, using -r or -p with -O may not work as you expect: Wget won’t just
       download the first file to file and then download the rest to their normal
       names: all downloaded content will be placed in file. This was disabled in
       version 1.11, but has been reinstated (with a warning) in 1.11.2, as there
       are some cases where this behavior can actually have some use.

       Note that a combination with -k is only permitted when downloading a single
       document, as in that case it will just convert all relative URIs to external
       ones; -k makes no sense for multiple URIs when they’re all being downloaded
       to a single file.

If you do not need the folder structure:

-nd --no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).

one alternative way is to specify -nH and --cut-dirs=10 together

-nH --no-host-directories
Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.

--cut-dirs=number
Ignore number directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved.

       Take, for example, the directory at
       ftp://ftp.xemacs.org/pub/xemacs/.  If you retrieve it with -r, it
       will be saved locally under ftp.xemacs.org/pub/xemacs/.  While the
       -nH option can remove the ftp.xemacs.org/ part, you are still stuck
       with pub/xemacs.  This is where --cut-dirs comes in handy; it makes
       Wget not "see" number remote directory components.  Here are
       several examples of how --cut-dirs option works.

               No options        -> ftp.xemacs.org/pub/xemacs/
               -nH               -> pub/xemacs/
               -nH --cut-dirs=1  -> xemacs/
               -nH --cut-dirs=2  -> .

               --cut-dirs=1      -> ftp.xemacs.org/xemacs/
               ...

       If you just want to get rid of the directory structure, this option
       is similar to a combination of -nd and -P.  However, unlike -nd,
       --cut-dirs does not lose with subdirectories---for instance, with
       -nH --cut-dirs=1, a beta/ subdirectory will be placed to
       xemacs/beta, as one would expect.

If you want to save files to a different folder name:

-P prefix --directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).

Continue to download a file:

-c --continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of Wget, or by another program

There are so many different options, just man wget to see all of them! I am impressed on how versatile this command is!

Thursday, November 5, 2015

How to make a box-plot with jittered points for multiple groups

Make a box-plot with jittered points for multiple groups using ggplot2

Let me demonstrate it with an example. Use the ToothGrowth data set in the ggplot2 library

library(ggplot2)

ggplot(ToothGrowth, aes(x=as.factor(dose), y=len, color=supp)) + 
        geom_boxplot(position=position_dodge(0.9))+
        geom_jitter(position=position_dodge(0.9)) +
        xlab("dose")

I want to make the points separate from each other rather than on the same vertical line.

ggplot(ToothGrowth, aes(x= as.factor(dose), y=len, color= supp,fill= supp)) + 
        geom_point(position=position_jitterdodge(dodge.width=0.9)) +
        geom_boxplot(fill="white", alpha=0.1, outlier.colour = NA, 
                     position = position_dodge(width=0.9)) +
        xlab("dose")

Monday, November 2, 2015

convert html to pdf by pandoc

Pandoc is a very useful tool to convert common formats.

First install pandoc on mac by:

brew install pandoc

pandoc requires pdflatex to convert to pdfs.

install mactex:
download it and just double click it should install it.

next, put executables including latexpdf to the path.

echo export PATH=$PATH:/usr/texbin/ >> .bashrc
source ~/.bashrc

You can specify margins of the pdf by -V geometry:margin=1in:

pandoc http://quinlanlab.org/tutorials/cshl2013/gemini.html -V geometry:margin=1in -o gemini.pdf`

when codes in html are too long, they get cut-off

Very thankful, I found the answer in this post:

Save the following as listings-setup.tex

% Contents of listings-setup.tex
\usepackage{xcolor}

\lstset{
    basicstyle=\ttfamily,
    numbers=left,
    keywordstyle=\color[rgb]{0.13,0.29,0.53}\bfseries,
    stringstyle=\color[rgb]{0.31,0.60,0.02},
    commentstyle=\color[rgb]{0.56,0.35,0.01}\itshape,
    numberstyle=\footnotesize,
    stepnumber=1,
    numbersep=5pt,
    backgroundcolor=\color[RGB]{248,248,248},
    showspaces=false,
    showstringspaces=false,
    showtabs=false,
    tabsize=2,
    captionpos=b,
    breaklines=true,
    breakatwhitespace=true,
    breakautoindent=true,
    escapeinside={\%*}{*)},
    linewidth=\textwidth,
    basewidth=0.5em,
}

Then, invoke pandoc:

pandoc https://cran.r-project.org/web/packages/gapmap/vignettes/tcga_example.html --listings -H listings-setup.tex -o gapmap_TCGA.pdf

Thursday, October 29, 2015

merge the sub-cytobands to major cytoband and get the coordinates

In my pervious blog post, I wrote How to get the cooridnates of the cytobands

Now, how about I want to get coordinates of the bigger band merging all the sub-bands.
ex. chr1 p36.33 p36.32, p36.31, p36.23, p36.22, p36.13, p36.12, p36.11....p31.1 to p3 chr1 0 84900000
This is a task that requires some efforts for text processing. First look at what the data look like:

head -30 cytoBand.txt  
chr1    0   2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
chr1    5400000 7200000 p36.31  gneg
chr1    7200000 9200000 p36.23  gpos25
chr1    9200000 12700000    p36.22  gneg
chr1    12700000    16200000    p36.21  gpos50
chr1    16200000    20400000    p36.13  gneg
chr1    20400000    23900000    p36.12  gpos25
chr1    23900000    28000000    p36.11  gneg
chr1    28000000    30200000    p35.3   gpos25
chr1    30200000    32400000    p35.2   gneg
chr1    32400000    34600000    p35.1   gpos25
chr1    34600000    40100000    p34.3   gneg
chr1    40100000    44100000    p34.2   gpos25
chr1    44100000    46800000    p34.1   gneg
chr1    46800000    50700000    p33 gpos75
chr1    50700000    56100000    p32.3   gneg
chr1    56100000    59000000    p32.2   gpos50
chr1    59000000    61300000    p32.1   gneg
chr1    61300000    68900000    p31.3   gpos50
chr1    68900000    69700000    p31.2   gneg
chr1    69700000    84900000    p31.1   gpos100
chr1    84900000    88400000    p22.3   gneg
chr1    88400000    92000000    p22.2   gpos75
chr1    92000000    94700000    p22.1   gneg
chr1    94700000    99700000    p21.3   gpos75
chr1    99700000    102200000   p21.2   gneg
chr1    102200000   107200000   p21.1   gpos100
chr1    107200000   111800000   p13.3   gneg
chr1    111800000   116100000   p13.2   gpos50

The coordinates are sorted in a way that larger cytoband is also sorted: p3, p2, p1, q1, q2... so we only need to keep the major chromosome band, and keep expanding the end coordinates until it hits a new major band. At the same time, track which chromosome is processed and which major band is processed.

A python script using a list of lists data structure can serve this purpose:

import re
with open("/Users/Tammy/annotations/human/hg19_UCSC_genome/cytoBand.txt", "r") as f:
    # A list of lists
    # { ['chr1', 'p1','107200000', '125000000'], ['chr1',  'p2','84900000', '107200000'] ... ['chr2', 'p1' , '47800000', '93300000'],...} }
    genome_list = []
    # a set of chromosomes seen so far
    chrSet = set()
    # a set of chromosome arms seen so far
    armSet = set()
    # loop over each line
    for line in f:
        # split each line to a list
        lineSplit = line.strip().split()
        chr = lineSplit[0]
        start = lineSplit[1]
        end = lineSplit[2]
        band = lineSplit[3]
        ## this regex capture the p1 of p11, p11.1 or p12...
        arm = re.search('([pq]\d).+', band).group(1)

        # if this is the first time see this chr,
        # empty the armSet
        if chr not in chrSet:
            chrSet.add(chr)
            armSet = set()
            armSet.add(arm)
            arm_list = [chr, start, end, arm]
            genome_list.append(arm_list)
        else:
            if arm not in armSet:
                armSet.add(arm)
                arm_list = [chr, start, end, arm]
                ## append this to the genome_list
                genome_list.append(arm_list)
            else:
                ## keep the start of the arm as previous 
                new_start = genome_list[-1][1]
                ## change the previous end of the arm to current end
                new_end = end
                arm_list = [chr, new_start, new_end, arm]
                # mutate the last entry of arm_list
                genome_list[-1] = arm_list

ofile = open("/Users/Tammy/annotations/human/hg19_UCSC_genome/chrom_arm_list.txt", "w")
for arm_list in genome_list:
    ofile.write(arm_list[0] +"\t" + arm_list[1] + "\t" + arm_list[2] + "\t" + arm_list[3] + "\n")
    ## always remember to close the file!
ofile.close()

Let's look at the results.

$ head chrom_arm_list.txt 
chr1    0   84900000    p3
chr1    84900000    107200000   p2
chr1    107200000   125000000   p1
chr1    125000000   142600000   q1
chr1    142600000   185800000   q2
chr1    185800000   214500000   q3
chr1    214500000   249250621   q4
chr10   0   40200000    p1
chr10   40200000    52900000    q1
chr10   52900000    135534747   q2

There are many ways to achieve the same purpose. Just for some advanture, I used dictionary of dictionaries to record the data

import re
with open("/Users/Tammy/annotations/human/hg19_UCSC_genome/cytoBand.txt", "r") as f:
    # A dictionary of dictionaries
    # { chr1: {'p1': ['107200000', '125000000'], 'p2': ['84900000', '107200000'] ...}, chr2:{ 'p1': ['47800000', '93300000'],...} }
    genome_dict = {}
    # a set of chromosomes seen so far
    chrSet = set()
    for line in f:
        lineSplit = line.strip().split()
        chr = lineSplit[0]
        start = lineSplit[1]
        end = lineSplit[2]
        band = lineSplit[3]
        ## this regex capture the p1 of p11, p11.1 or p12
        arm = re.search('([pq]\d).+', band).group(1)

        ## if it is the first time see this chromosome
        if chr not in chrSet:
            chrSet.add(chr)
            # initiate an empty dictionary
            arm_dict = dict()
            arm_dict[arm] = [start, end]
            genome_dict[chr] = arm_dict
        else:
            # if this chromosome arm seen the first time
            if not arm_dict.get(arm):
                arm_dict[arm] = [start, end]
            else:
                new_start = arm_dict[arm][0]
                new_end = end
                arm_dict[arm] = [new_start, new_end]
            genome_dict[chr] = arm_dict

ofile = open("/Users/Tammy/annotations/human/hg19_UCSC_genome/chrom_arm_dict.txt", "w")
for key in genome_dict.keys():
    for key2 in genome_dict[key].keys():
        ofile.write(key + "\t" + genome_dict[key][key2][0] + "\t" + genome_dict[key][key2][1] + "\t" + key2 +"\n")
ofile.close()

Note that, There is no order of dictionary, so the output may not be sorted. You can sort by chr and start:
cat chrom_arm_dict.txt | sort -k1,1V -k2,2n

V is only for GNU sort which will sort chromsome in alpha-numeric order

If I compare the two results from different ways, they are identical:
cmp <(sort -k1,1V -k2,2n chrom_arm_dict.txt) <(sort -k1,1V -k2,2n chrom_arm_list.txt)

Thursday, October 22, 2015

Cytogenetic band explained

This is a review of the basic concept of cytoband in molecular/cell biology.

See the original link by Strachan, T. and Read, A.P. 1999. Human Molecular Genetics, 2nd ed. New York: John Wiley & Sons.

Each human chromosome has a short arm ("p" for "petit") and long arm ("q" for "queue"), separated by a centromere. The ends of the chromosome are called telomeres.

Each chromosome arm is divided into regions, or cytogenetic bands, that can be seen using a microscope and special stains. The cytogenetic bands are labeled p1, p2, p3, q1, q2, q3, etc., counting from the centromere out toward the telomeres. At higher resolutions, sub-bands can be seen within the bands. The sub-bands are also numbered from the centromere out toward the telomere.

For example, the cytogenetic map location of the CFTR gene is 7q31.2, which indicates it is on chromosome 7, q arm, band 3, sub-band 1, and sub-sub-band 2. The ends of the chromosomes are labeled ptel and qtel. For example, the notation 7qtel refers to the end of the long arm of chromosome 7.

How to get the cooridnates of the cytobands?

You can find it in the UCSC database:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz | gunzip

head cytoBand.txt  
chr1    0   2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
chr1    5400000 7200000 p36.31  gneg
chr1    7200000 9200000 p36.23  gpos25
chr1    9200000 12700000    p36.22  gneg
chr1    12700000    16200000    p36.21  gpos50
chr1    16200000    20400000    p36.13  gneg
chr1    20400000    23900000    p36.12  gpos25
chr1    23900000    28000000    p36.11  gneg
chr1    28000000    30200000    p35.3   gpos25

If you have a list of genes with cooridnates and want to see which band each gene reside in, you can use bedtools or bioconductor GRanges package to find out.

Someone has written scripts for this kind of task as well. see here

Tuesday, October 6, 2015

convert bam to fastq and remap it with bwa mem

Why we want to revert bam files back to fastq file?

Sometimes, you get a processed bam file from other resources, but you want to map it with another aligner. For example, the old BWA aln does not keep the information for soft-cliped reads which are useful for structural variant detection, the new BWA mem is more suitable for this. So, I want to dump bam to fastq first and then realign it with BWA MEM.

After googleling around, I found:

bedtools bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>
picard samtofastq
samtools samtools bam2fq

It turns out that it is not that straightforward.

Use samtools

Read this (howto) Revert a BAM file to FastQ format in the GATK forum back to 2013.

Note for advanced users If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz

The htscmd bamshuf and htscmd bam2fq are legacy code, now they are under samtools.

But the point is that you need to shuffle the reads in the bam file.
from Hengli, the author of samtools and bwa:

The right way to create paired fastq for bwa is:
htscmd bamshuf -uO in.bam tmp-prefix | htscmd bam2fq -as se.fq.gz - | bwa mem -p ref.fa - If your bam is coordinate sorted, it is important to use "bamshuf" to change the ordering; otherwise bwa will fail to infer insert size for a batch of reads coming from centromeres.

BWA -p flag

"If mates.fq file is absent and option -p is not set, this command regards input reads are single-end. If mates.fq is present, this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair. If -p is used, the command assumes the 2i-th and the (2i+1)-th read in reads.fq constitute a read pair (such input file is said to be interleaved). In this case, mates.fq is ignored. In the paired-end mode, the mem command will infer the read orientation and >the insert size distribution from a batch of reads.

Updated version

Bam2fq -s only helps when your bam contains singletons. If your bam contains both ends, that option has no effect

samtools bamshuf -uon 128 in.bam tmp-prefix | samtools bam2fq -s se.fq.gz - | bwa mem -p ref.fa -

Use bedtools

From the bcbio source code:

cmd = ("{samtools} sort -n -o -l 1 -@ {num_cores} -m {max_mem} {in_bam} {prefix1} "
                       "| {bedtools} bamtofastq -i /dev/stdin -fq /dev/stdout -fq2 /dev/stdout "
                       "| {bwa} mem -p -M -t {num_cores} -R '{rg_info}' -v 1 {ref_file} - | ")

from the bedtools mannual page:

BAM should be sorted by query name (samtools sort -n aln.bam aln.qsort) if creating paired FASTQ with this option.

That's why the command above uses samtools sort -n to sort by name. My bam files are huge (300Gb). This might be very time consuming.

Speedseq realign

In the speedseq pipeline, there is a script called bamtofastq.py

bamtofastq.py
author: Ira Hall (ihall@genome.wustl.edu) and Colby Chiang (cc2qe@virginia.edu)
version: $Revision: 0.0.1 $
description: Convert a coordinate sorted BAM file to FASTQ
             (ignores unpaired reads)

positional arguments:
  BAM                   Input BAM file(s). If absent then defaults to stdin.

optional arguments:
  -h, --help            show this help message and exit
  -r STR, --readgroup STR
                        Read group(s) to extract (comma separated)
  -n, --rename          Rename reads
  -S, --is_sam          Input is SAM format
  -H FILE, --header FILE
                        Write BAM header to file

Further reading:
questions about bam files
http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files

Check read group in the bam file

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

when align using bwa mem, one can add the read group info by:

bwa mem -R "@RG\tID:foo\tSM:bar"

some bam files contain reads from different read groups. one needs to convert the bam to different fastqs for each read group respectively and then realign each with bwa mem. Lastly, merge all the sorted bam together to get a realigned bam.

Check out HengLi's bwakit

read this post http://www.nxn.se/valent/2017/12/6/low-mapping-rate-6-converting-sorted-bam-to-fastq

samtools fastq requires the bam files sort by name

samtools collate is your friend

What I am using for my snakemake pipeline https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/multiRG/Snakefile

if not config["from_fastq"]:
    if config["realign"]:
        rule realign:
            input: get_input_files
            output: temp(["01aln_temp/{sample}.RG{id}.bam", "01aln_temp/{sample}.RG{id}.bam.bai"]), "00log/{sample}.RG{id}.align"
            log:
                bwa = "00log/{sample}.RG{id}.align"
            params:
                jobname = "{sample}_RG{id}",
                ## add read group for bwa mem mapping, change accordingly if you know PL:ILLUMINA, LB:library1 PI:200 etc...
                ## http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules
                rg = identify_read_groups,
                ID = identify_read_groups_ID
            message: "converting {input} file with read group {params.ID} to fastqs by bedtools and remapping with bwa mem"
            threads: 8
            shell:
                """
                samtools view -br {params.ID} {input} \
                | samtools sort -n -l 1 -@ 8 -m 2G -T 01aln_temp/{wildcards.sample}_{wildcards.id}.tmp \
                | bedtools bamtofastq -i /dev/stdin -fq /dev/stdout -fq2 /dev/stdout \
                | bwa mem -t 8 -p  -M -v 1 -R '{params.rg}' {config[ref_fa]} - 2> {log.bwa} \
                | samtools sort -m 2G -@ 8 -T {output[0]}.tmp -o {output[0]}
                samtools index {output[0]}

Diving into Genetics and Genomics

My github papge