Friday, May 17, 2013

Use tabix to extract subset of SAM, bed and VCF files

see here: http://davetang.org/muse/2013/02/22/using-tabix/
and here : http://www.biostars.org/p/43609/
and here: http://www.1000genomes.org/category/tabix

disclaimer: credits go to the original authors, refer to the links above for detailed
information.
-----------------------------------------------------------------------------------------------

You can use tabix to extract subsets of the vcf files from the 1000genomes websites. Thanks to the fact that tabix uses a index file, you will be able to download only portions of the files, without having to download everything in local.

Example:

tabix -h  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz 1:10000,200000

Using tabix

Posted on February 22, 2013 by Davo

Tabix as described in the abstract of the paper:

Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables genome viewers to support huge data files and remote custom tracks over networks.

When I came across tabix a couple of years ago, I looked at the paper and also the man pageand while I felt it could be extremely useful, I didn't really appreciate the role it served. Now that I've seen what a colleague has achieved with it, I can see the light. Here I demonstrate with two simple examples some of the advantages of using tabix, namely speed and compression.

Testing tabix with SAM files

For the BAM file example, let's use one from The 1000 Genomes Project.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA18553/alignment/NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam

Now convert to SAM and index:

samtools view -h NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam > NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam

bgzip NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam

tabix -p sam NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz

Of note is the file sizes:

1717942076 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam
1624632294 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz

bgzip compression of a SAM file is smaller than compressing to BAM; I can still use samtools with the bgzipped file though not as quick as processing a BAM file.

Now let's try some examples with tabix and samtools (samtools can also return reads within coordinates see this page) for returning reads mapped within genomic coordinates:

tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-70000

SRR032291.13719341      163     11      60046   0       67M9S   =       60358   346     TAAATACACTGAAGCTGCCAAAACAATCTATCGTTTTGCCTACGTACTTATCAACTTCCTCATAGCAACATGGGAG    ?DEE?@BFDIHDGGHKHIHJHHF6H?<2HE?E>7:=3D7/CB=;.D?G@=)16.>@=+18+@(5@9########## X0:i:10 X1:i:0  XC:i:67 MD:Z:67 RG:Z:SRR032291  AM:i:0  NM:i:0  SM:i:0  MQ:i:0  XT:A:R  BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

[rest of results snipped]

time tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-5000000 | wc

 550668 11647956 203417916

real    0m3.141s

user    0m3.249s

sys     0m1.524s

#testing samtools using test.bed, which contains the coordinates 11 60000 5000000

time samtools view -F 4 -L test.bed NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam | wc

 537327 11475465 200350980

real    0m30.318s

user    0m30.119s

sys     0m1.800s

Of note is the difference in the time taken to perform the calculations and the difference in the line numbers returned. The line difference is caused by tabix not taking into account the SAM/BAM flags, namely the unmapped flag. Though I don't show it here, I created a SAM file with one read with a SAM flag 4, which marks a read as being unmapped, but with chromosome coordinates and tabix still returned the read.

See here http://davetang.org/muse/2013/02/22/using-tabix/

and here: http://www.1000genomes.org/category/tabix

Testing tabix with bed files

For the FANTOM5 project we have well over thousands of CAGE libraries. My aforementioned colleague prepared a bed file containing all the CAGE tags in all human samples. Despite having more than half a billion lines in this bed file, making a genomic inquiry took only seconds.

time tabix hg19.ctss.bed.gz chr11:60000-5000000 | wc

9577172 57463032 336697101

real    0m6.276s

user    0m7.997s

sys     0m3.092s

For the keen reader, have a look at the paper for tabix.

4 comments:

tommyMay 17, 2013 at 11:38 AM
the tbi file it downloaded was the index file it uses to find the variants

It should of also streamed the section vcf file to STDOUT and you will need to capture this in a file

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz 1:233411980-245804116 > myfile.vcf

should give you a vcf file

Its best to use the -h option from tabix as this gives you the header which is important for interpreting the file correctly
ReplyDelete
Replies
zxcvbnAugust 25, 2015 at 1:40 PM
you can also use "samtools view yourbamfile.bam " with coords being optional quite similar to how a tabix indexed sam file would behave
ReplyDelete
Replies
MARGARET MAGOTHENovember 1, 2021 at 10:48 PM
All thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
Email: tdameritrade077@gmail.com
ReplyDelete
Replies

Add comment

Diving into Genetics and Genomics

My github papge