Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Friday, May 17, 2013

Use tabix to extract subset of SAM, bed and VCF files

see here:
and here :
and here:

disclaimer: credits go to the original authors, refer to the links above for detailed
You can use tabix to extract subsets of the vcf files from the 1000genomes websites. Thanks to the fact that tabix uses a index file, you will be able to download only portions of the files, without having to download everything in local.
tabix -h 1:10000,200000

Using tabix

Tabix as described in the abstract of the paper:
Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables genome viewers to support huge data files and remote custom tracks over networks.
When I came across tabix a couple of years ago, I looked at the paper and also the man pageand while I felt it could be extremely useful, I didn't really appreciate the role it served. Now that I've seen what a colleague has achieved with it, I can see the light. Here I demonstrate with two simple examples some of the advantages of using tabix, namely speed and compression.

Testing tabix with SAM files

For the BAM file example, let's use one from The 1000 Genomes Project.
Now convert to SAM and index:
samtools view -h NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam > NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam
bgzip NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam
tabix -p sam NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz
Of note is the file sizes:
1717942076 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam
1624632294 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz
bgzip compression of a SAM file is smaller than compressing to BAM; I can still use samtools with the bgzipped file though not as quick as processing a BAM file.
Now let's try some examples with tabix and samtools (samtools can also return reads within coordinates see this page) for returning reads mapped within genomic coordinates:
tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-70000
SRR032291.13719341      163     11      60046   0       67M9S   =       60358   346     TAAATACACTGAAGCTGCCAAAACAATCTATCGTTTTGCCTACGTACTTATCAACTTCCTCATAGCAACATGGGAG    ?DEE?@BFDIHDGGHKHIHJHHF6H?<2HE?E>7:=3D7/CB=;.D?G@=)16.>@=+18+@(5@9########## X0:i:10 X1:i:0  XC:i:67 MD:Z:67 RG:Z:SRR032291  AM:i:0  NM:i:0  SM:i:0  MQ:i:0  XT:A:R  BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[rest of results snipped]
time tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-5000000 | wc
 550668 11647956 203417916
real    0m3.141s
user    0m3.249s
sys     0m1.524s
#testing samtools using test.bed, which contains the coordinates 11 60000 5000000
time samtools view -F 4 -L test.bed NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam | wc
 537327 11475465 200350980
real    0m30.318s
user    0m30.119s
sys     0m1.800s
Of note is the difference in the time taken to perform the calculations and the difference in the line numbers returned. The line difference is caused by tabix not taking into account the SAM/BAM flags, namely the unmapped flag. Though I don't show it here, I created a SAM file with one read with a SAM flag 4, which marks a read as being unmapped, but with chromosome coordinates and tabix still returned the read.

See here

Testing tabix with bed files

For the FANTOM5 project we have well over thousands of CAGE libraries. My aforementioned colleague prepared a bed file containing all the CAGE tags in all human samples. Despite having more than half a billion lines in this bed file, making a genomic inquiry took only seconds.
time tabix hg19.ctss.bed.gz chr11:60000-5000000 | wc
9577172 57463032 336697101
real    0m6.276s
user    0m7.997s
sys     0m3.092s
For the keen reader, have a look at the paper for tabix.


  1. the tbi file it downloaded was the index file it uses to find the variants

    It should of also streamed the section vcf file to STDOUT and you will need to capture this in a file

    tabix -h 1:233411980-245804116 > myfile.vcf

    should give you a vcf file

    Its best to use the -h option from tabix as this gives you the header which is important for interpreting the file correctly

  2. you can also use "samtools view yourbamfile.bam " with coords being optional quite similar to how a tabix indexed sam file would behave