Diving into Genetics and Genomics: October 2015

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, October 29, 2015

merge the sub-cytobands to major cytoband and get the coordinates

In my pervious blog post, I wrote How to get the cooridnates of the cytobands

Now, how about I want to get coordinates of the bigger band merging all the sub-bands.
ex. chr1 p36.33 p36.32, p36.31, p36.23, p36.22, p36.13, p36.12, p36.11....p31.1 to p3 chr1 0 84900000
This is a task that requires some efforts for text processing. First look at what the data look like:

head -30 cytoBand.txt  
chr1    0   2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
chr1    5400000 7200000 p36.31  gneg
chr1    7200000 9200000 p36.23  gpos25
chr1    9200000 12700000    p36.22  gneg
chr1    12700000    16200000    p36.21  gpos50
chr1    16200000    20400000    p36.13  gneg
chr1    20400000    23900000    p36.12  gpos25
chr1    23900000    28000000    p36.11  gneg
chr1    28000000    30200000    p35.3   gpos25
chr1    30200000    32400000    p35.2   gneg
chr1    32400000    34600000    p35.1   gpos25
chr1    34600000    40100000    p34.3   gneg
chr1    40100000    44100000    p34.2   gpos25
chr1    44100000    46800000    p34.1   gneg
chr1    46800000    50700000    p33 gpos75
chr1    50700000    56100000    p32.3   gneg
chr1    56100000    59000000    p32.2   gpos50
chr1    59000000    61300000    p32.1   gneg
chr1    61300000    68900000    p31.3   gpos50
chr1    68900000    69700000    p31.2   gneg
chr1    69700000    84900000    p31.1   gpos100
chr1    84900000    88400000    p22.3   gneg
chr1    88400000    92000000    p22.2   gpos75
chr1    92000000    94700000    p22.1   gneg
chr1    94700000    99700000    p21.3   gpos75
chr1    99700000    102200000   p21.2   gneg
chr1    102200000   107200000   p21.1   gpos100
chr1    107200000   111800000   p13.3   gneg
chr1    111800000   116100000   p13.2   gpos50

The coordinates are sorted in a way that larger cytoband is also sorted: p3, p2, p1, q1, q2... so we only need to keep the major chromosome band, and keep expanding the end coordinates until it hits a new major band. At the same time, track which chromosome is processed and which major band is processed.

A python script using a list of lists data structure can serve this purpose:

import re
with open("/Users/Tammy/annotations/human/hg19_UCSC_genome/cytoBand.txt", "r") as f:
    # A list of lists
    # { ['chr1', 'p1','107200000', '125000000'], ['chr1',  'p2','84900000', '107200000'] ... ['chr2', 'p1' , '47800000', '93300000'],...} }
    genome_list = []
    # a set of chromosomes seen so far
    chrSet = set()
    # a set of chromosome arms seen so far
    armSet = set()
    # loop over each line
    for line in f:
        # split each line to a list
        lineSplit = line.strip().split()
        chr = lineSplit[0]
        start = lineSplit[1]
        end = lineSplit[2]
        band = lineSplit[3]
        ## this regex capture the p1 of p11, p11.1 or p12...
        arm = re.search('([pq]\d).+', band).group(1)

        # if this is the first time see this chr,
        # empty the armSet
        if chr not in chrSet:
            chrSet.add(chr)
            armSet = set()
            armSet.add(arm)
            arm_list = [chr, start, end, arm]
            genome_list.append(arm_list)
        else:
            if arm not in armSet:
                armSet.add(arm)
                arm_list = [chr, start, end, arm]
                ## append this to the genome_list
                genome_list.append(arm_list)
            else:
                ## keep the start of the arm as previous 
                new_start = genome_list[-1][1]
                ## change the previous end of the arm to current end
                new_end = end
                arm_list = [chr, new_start, new_end, arm]
                # mutate the last entry of arm_list
                genome_list[-1] = arm_list

ofile = open("/Users/Tammy/annotations/human/hg19_UCSC_genome/chrom_arm_list.txt", "w")
for arm_list in genome_list:
    ofile.write(arm_list[0] +"\t" + arm_list[1] + "\t" + arm_list[2] + "\t" + arm_list[3] + "\n")
    ## always remember to close the file!
ofile.close()

Let's look at the results.

$ head chrom_arm_list.txt 
chr1    0   84900000    p3
chr1    84900000    107200000   p2
chr1    107200000   125000000   p1
chr1    125000000   142600000   q1
chr1    142600000   185800000   q2
chr1    185800000   214500000   q3
chr1    214500000   249250621   q4
chr10   0   40200000    p1
chr10   40200000    52900000    q1
chr10   52900000    135534747   q2

There are many ways to achieve the same purpose. Just for some advanture, I used dictionary of dictionaries to record the data

import re
with open("/Users/Tammy/annotations/human/hg19_UCSC_genome/cytoBand.txt", "r") as f:
    # A dictionary of dictionaries
    # { chr1: {'p1': ['107200000', '125000000'], 'p2': ['84900000', '107200000'] ...}, chr2:{ 'p1': ['47800000', '93300000'],...} }
    genome_dict = {}
    # a set of chromosomes seen so far
    chrSet = set()
    for line in f:
        lineSplit = line.strip().split()
        chr = lineSplit[0]
        start = lineSplit[1]
        end = lineSplit[2]
        band = lineSplit[3]
        ## this regex capture the p1 of p11, p11.1 or p12
        arm = re.search('([pq]\d).+', band).group(1)

        ## if it is the first time see this chromosome
        if chr not in chrSet:
            chrSet.add(chr)
            # initiate an empty dictionary
            arm_dict = dict()
            arm_dict[arm] = [start, end]
            genome_dict[chr] = arm_dict
        else:
            # if this chromosome arm seen the first time
            if not arm_dict.get(arm):
                arm_dict[arm] = [start, end]
            else:
                new_start = arm_dict[arm][0]
                new_end = end
                arm_dict[arm] = [new_start, new_end]
            genome_dict[chr] = arm_dict

ofile = open("/Users/Tammy/annotations/human/hg19_UCSC_genome/chrom_arm_dict.txt", "w")
for key in genome_dict.keys():
    for key2 in genome_dict[key].keys():
        ofile.write(key + "\t" + genome_dict[key][key2][0] + "\t" + genome_dict[key][key2][1] + "\t" + key2 +"\n")
ofile.close()

Note that, There is no order of dictionary, so the output may not be sorted. You can sort by chr and start:
cat chrom_arm_dict.txt | sort -k1,1V -k2,2n

V is only for GNU sort which will sort chromsome in alpha-numeric order

If I compare the two results from different ways, they are identical:
cmp <(sort -k1,1V -k2,2n chrom_arm_dict.txt) <(sort -k1,1V -k2,2n chrom_arm_list.txt)

Thursday, October 22, 2015

Cytogenetic band explained

This is a review of the basic concept of cytoband in molecular/cell biology.

See the original link by Strachan, T. and Read, A.P. 1999. Human Molecular Genetics, 2nd ed. New York: John Wiley & Sons.

Each human chromosome has a short arm ("p" for "petit") and long arm ("q" for "queue"), separated by a centromere. The ends of the chromosome are called telomeres.

Each chromosome arm is divided into regions, or cytogenetic bands, that can be seen using a microscope and special stains. The cytogenetic bands are labeled p1, p2, p3, q1, q2, q3, etc., counting from the centromere out toward the telomeres. At higher resolutions, sub-bands can be seen within the bands. The sub-bands are also numbered from the centromere out toward the telomere.

For example, the cytogenetic map location of the CFTR gene is 7q31.2, which indicates it is on chromosome 7, q arm, band 3, sub-band 1, and sub-sub-band 2. The ends of the chromosomes are labeled ptel and qtel. For example, the notation 7qtel refers to the end of the long arm of chromosome 7.

How to get the cooridnates of the cytobands?

You can find it in the UCSC database:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz | gunzip

head cytoBand.txt  
chr1    0   2300000 p36.33  gneg
chr1    2300000 5400000 p36.32  gpos25
chr1    5400000 7200000 p36.31  gneg
chr1    7200000 9200000 p36.23  gpos25
chr1    9200000 12700000    p36.22  gneg
chr1    12700000    16200000    p36.21  gpos50
chr1    16200000    20400000    p36.13  gneg
chr1    20400000    23900000    p36.12  gpos25
chr1    23900000    28000000    p36.11  gneg
chr1    28000000    30200000    p35.3   gpos25

If you have a list of genes with cooridnates and want to see which band each gene reside in, you can use bedtools or bioconductor GRanges package to find out.

Someone has written scripts for this kind of task as well. see here

Tuesday, October 6, 2015

convert bam to fastq and remap it with bwa mem

Why we want to revert bam files back to fastq file?

Sometimes, you get a processed bam file from other resources, but you want to map it with another aligner. For example, the old BWA aln does not keep the information for soft-cliped reads which are useful for structural variant detection, the new BWA mem is more suitable for this. So, I want to dump bam to fastq first and then realign it with BWA MEM.

After googleling around, I found:

bedtools bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>
picard samtofastq
samtools samtools bam2fq

It turns out that it is not that straightforward.

Use samtools

Read this (howto) Revert a BAM file to FastQ format in the GATK forum back to 2013.

Note for advanced users If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz

The htscmd bamshuf and htscmd bam2fq are legacy code, now they are under samtools.

But the point is that you need to shuffle the reads in the bam file.
from Hengli, the author of samtools and bwa:

The right way to create paired fastq for bwa is:
htscmd bamshuf -uO in.bam tmp-prefix | htscmd bam2fq -as se.fq.gz - | bwa mem -p ref.fa - If your bam is coordinate sorted, it is important to use "bamshuf" to change the ordering; otherwise bwa will fail to infer insert size for a batch of reads coming from centromeres.

BWA -p flag

"If mates.fq file is absent and option -p is not set, this command regards input reads are single-end. If mates.fq is present, this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair. If -p is used, the command assumes the 2i-th and the (2i+1)-th read in reads.fq constitute a read pair (such input file is said to be interleaved). In this case, mates.fq is ignored. In the paired-end mode, the mem command will infer the read orientation and >the insert size distribution from a batch of reads.

Updated version

Bam2fq -s only helps when your bam contains singletons. If your bam contains both ends, that option has no effect

samtools bamshuf -uon 128 in.bam tmp-prefix | samtools bam2fq -s se.fq.gz - | bwa mem -p ref.fa -

Use bedtools

From the bcbio source code:

cmd = ("{samtools} sort -n -o -l 1 -@ {num_cores} -m {max_mem} {in_bam} {prefix1} "
                       "| {bedtools} bamtofastq -i /dev/stdin -fq /dev/stdout -fq2 /dev/stdout "
                       "| {bwa} mem -p -M -t {num_cores} -R '{rg_info}' -v 1 {ref_file} - | ")

from the bedtools mannual page:

BAM should be sorted by query name (samtools sort -n aln.bam aln.qsort) if creating paired FASTQ with this option.

That's why the command above uses samtools sort -n to sort by name. My bam files are huge (300Gb). This might be very time consuming.

Speedseq realign

In the speedseq pipeline, there is a script called bamtofastq.py

bamtofastq.py
author: Ira Hall (ihall@genome.wustl.edu) and Colby Chiang (cc2qe@virginia.edu)
version: $Revision: 0.0.1 $
description: Convert a coordinate sorted BAM file to FASTQ
             (ignores unpaired reads)

positional arguments:
  BAM                   Input BAM file(s). If absent then defaults to stdin.

optional arguments:
  -h, --help            show this help message and exit
  -r STR, --readgroup STR
                        Read group(s) to extract (comma separated)
  -n, --rename          Rename reads
  -S, --is_sam          Input is SAM format
  -H FILE, --header FILE
                        Write BAM header to file

Further reading:
questions about bam files
http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files

Check read group in the bam file

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

when align using bwa mem, one can add the read group info by:

bwa mem -R "@RG\tID:foo\tSM:bar"

some bam files contain reads from different read groups. one needs to convert the bam to different fastqs for each read group respectively and then realign each with bwa mem. Lastly, merge all the sorted bam together to get a realigned bam.

Check out HengLi's bwakit

read this post http://www.nxn.se/valent/2017/12/6/low-mapping-rate-6-converting-sorted-bam-to-fastq

samtools fastq requires the bam files sort by name

samtools collate is your friend

What I am using for my snakemake pipeline https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/multiRG/Snakefile

if not config["from_fastq"]:
    if config["realign"]:
        rule realign:
            input: get_input_files
            output: temp(["01aln_temp/{sample}.RG{id}.bam", "01aln_temp/{sample}.RG{id}.bam.bai"]), "00log/{sample}.RG{id}.align"
            log:
                bwa = "00log/{sample}.RG{id}.align"
            params:
                jobname = "{sample}_RG{id}",
                ## add read group for bwa mem mapping, change accordingly if you know PL:ILLUMINA, LB:library1 PI:200 etc...
                ## http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules
                rg = identify_read_groups,
                ID = identify_read_groups_ID
            message: "converting {input} file with read group {params.ID} to fastqs by bedtools and remapping with bwa mem"
            threads: 8
            shell:
                """
                samtools view -br {params.ID} {input} \
                | samtools sort -n -l 1 -@ 8 -m 2G -T 01aln_temp/{wildcards.sample}_{wildcards.id}.tmp \
                | bedtools bamtofastq -i /dev/stdin -fq /dev/stdout -fq2 /dev/stdout \
                | bwa mem -t 8 -p  -M -v 1 -R '{params.rg}' {config[ref_fa]} - 2> {log.bwa} \
                | samtools sort -m 2G -@ 8 -T {output[0]}.tmp -o {output[0]}
                samtools index {output[0]}

Diving into Genetics and Genomics

My github papge