Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, June 18, 2013

count how many mapped reads in a bam file

I wanted to count how many mapped reads are in a ChIP-seq bam file downloaded from UCSC ECODE project. http://genome.ucsc.edu/ENCODE/downloads.html

A quick google search:
http://left.subtree.org/
and http://seqanswers.com/forums/showthread.php?t=891


The output from short read aligners like Bowtie and BWA is commonly stored in SAM/BAM format. When presented with one of these files a common first task is to calculate the total number of alignments (reads) captured in the file. In this post I show some examples for finding the total number of reads using samtools and directly from Java code. For the examples below, I use the HG00173.chrom11 BAM file from the 1000 genomes project which can be downloaded here.
First, we look at using the samtools command directly. One way to get the total number of alignments is to simply dump the entire SAM file and tell samtools to count instead of print (-c option):
1
2
$ samtools view -c HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322
If we’re only interested in counting the total number of mapped reads we can add the -F 4 flag. Alternativley, we can count only the unmapped reads with -f 4:
1
2
3
4
5
6
7
# Mapped reads only
$ samtools view -c -F 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5068340
# Unmapped reads only
$ samtools view -c -f 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
149982
To understand how this works we first need to inspect the SAM format. The SAM format includes a bitwise FLAG field described here. The -f/-F options to the samtools command allow us to query based on the presense/absence of bits in the FLAG field. So -f 4 only output alignments that are unmapped (flag 0×0004 is set) and -F 4 only output alignments that are not unmapped (i.e. flag 0×0004 is not set), hence these would only include mapped alignments.
An example for paired end reads you could do the following. To count the number of reads having both itself and it’s mate mapped:
1
2
$ samtools view -c -f 1 -F 12 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
4906035
The -f 1 switch only includes reads that are paired in sequencing and -F 12 only includes reads that are notunmapped (flag 0×0004 is not set) and where the mate is not unmapped (flag 0×0008 is not set). Here we add 0x0004 + 0x0008 = 12 and use the -F (bits not set), meaning you want to include all reads where neither flag 0×0004 or 0×0008 is set. For help understanding the values for the SAM FLAG field there’s a handy web tool here.
There’s also a nice command included in samtools called flagstat which computes various summary statistics. However, I wasn’t able to find much documentation describing the output and it’s not mentioned anywhere in the man page. This post examines the C code for the flagstat command which provides some insight into the output.
1
2
3
4
5
6
7
8
9
10
11
12
$ samtools flagstat HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322 + 0 in total (QC-passed reads + QC-failed reads)
273531 + 0 duplicates
5068340 + 0 mapped (97.13%:-nan%)
5205999 + 0 paired in sequencing
2603248 + 0 read1
2602751 + 0 read2
4881994 + 0 properly paired (93.78%:-nan%)
4906035 + 0 with itself and mate mapped
149982 + 0 singletons (2.88%:-nan%)
19869 + 0 with mate mapped to a different chr
15271 + 0 with mate mapped to a different chr (mapQ>=5)



So, samtools flagstat  is what I need.


tommy@tommy-ThinkPad-T420:~/Desktop$ time samtools flagstat Mcf7Ctcf.sorted.bam 
20769652 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
20769652 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

real 0m12.472s
user 0m11.237s
sys 0m0.384s



it is pretty fast.  with 12 seconds it finished counting a bam file with 20 million reads.

I used to use HTSeq python package http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
 to count the mapped reads, but it is much slower (several mins to count the same file) than samtools.


def aligned_counts(ifile1):
    '''count how many alignments are aligned back to genome, ifile1 is a sorted bam file'''
    import HTSeq
    sortedbamfile= HTSeq.BAM_Reader(ifile1)
    aligned_counts=0
    unaligned_counts=0
    for almnt in sortedbamfile:
        if almnt.aligned:
            aligned_counts+= 1
        else:
            unaligned_counts+=1
    print "number of aligned tags of %s is %d " % (ifile1, aligned_counts)
    print "number of unaligned tags of %s is %d "% (ifile1, unaligned_counts)
    return aligned_counts

Again, I am happy with the existing tools. Do not re-invent the wheels !





5 comments:

  1. samtools idxstats is also nice if the BAM file is indexed, it's almost instantaneous

    ReplyDelete
    Replies
    1. Thanks for the tip. I got to know it

      Delete
  2. All these are good. But I want to know if there is any way to get information about number of reads uniquely mapped.
    Actually I found a code in a nature publication that gives number of uniquely mapped reads, but that was for single end reads. Now I am handling different projects in which some have paired end and some have single end data. In many of my data sets I get very less number of uniquely mapped reads (~10% of the total reads after removing duplicates).
    So to check it manually I tried counting number of unique combination of chromosome name & start positions & end positions, by converting my bam file into bed. But both the strategies are producing different numbers of uniquely mapped reads.
    So I want to know if there is any good and reliable way to count the uniques in any supported format.
    I aligned reads using bowtie2.

    ReplyDelete
  3. Hi,

    I'm just wondering why read1 + read2 (5205999) is not equivalent to the total reads (line 2).

    ReplyDelete