and here : http://www.biostars.org/p/43609/
and here: http://www.1000genomes.org/category/tabix
disclaimer: credits go to the original authors, refer to the links above for detailed
information.
-----------------------------------------------------------------------------------------------
You can use tabix to extract subsets of the vcf files from the 1000genomes websites. Thanks to the fact that tabix uses a index file, you will be able to download only portions of the files, without having to download everything in local.
Example:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz 1:10000,200000
Using tabix
Tabix as described in the abstract of the paper:
Tabix is the first generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access. Tabix is implemented as a free command-line tool as well as a library in C, Java, Perl and Python. It is particularly useful for manually examining local genomic features on the command line and enables genome viewers to support huge data files and remote custom tracks over networks.
When I came across tabix a couple of years ago, I looked at the paper and also the man pageand while I felt it could be extremely useful, I didn't really appreciate the role it served. Now that I've seen what a colleague has achieved with it, I can see the light. Here I demonstrate with two simple examples some of the advantages of using tabix, namely speed and compression.
Testing tabix with SAM files
For the BAM file example, let's use one from The 1000 Genomes Project.
1
| wget ftp : //ftp .1000genomes.ebi.ac.uk /vol1/ftp/data/NA18553/alignment/NA18553 .chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam |
Now convert to SAM and index:
1
2
3
| samtools view -h NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam > NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam bgzip NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam tabix -p sam NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz |
Of note is the file sizes:
1717942076 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam
1624632294 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz
1624632294 NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz
bgzip compression of a SAM file is smaller than compressing to BAM; I can still use samtools with the bgzipped file though not as quick as processing a BAM file.
Now let's try some examples with tabix and samtools (samtools can also return reads within coordinates see this page) for returning reads mapped within genomic coordinates:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
| tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-70000 SRR032291.13719341 163 11 60046 0 67M9S = 60358 346 TAAATACACTGAAGCTGCCAAAACAATCTATCGTTTTGCCTACGTACTTATCAACTTCCTCATAGCAACATGGGAG ?DEE?@BFDIHDGGHKHIHJHHF6H?<2HE?E>7:=3D7 /CB =;.D?G@=)16.>@=+18+@(5@9 ########## X0:i:10 X1:i:0 XC:i:67 MD:Z:67 RG:Z:SRR032291 AM:i:0 NM:i:0 SM:i:0 MQ:i:0 XT:A:R BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ [rest of results snipped] time tabix NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.sam.gz 11:60000-5000000 | wc 550668 11647956 203417916 real 0m3.141s user 0m3.249s sys 0m1.524s #testing samtools using test.bed, which contains the coordinates 11 60000 5000000 time samtools view -F 4 -L test .bed NA18553.chrom11.ILLUMINA.bwa.CHB.low_coverage.20120522.bam | wc 537327 11475465 200350980 real 0m30.318s user 0m30.119s sys 0m1.800s |
Of note is the difference in the time taken to perform the calculations and the difference in the line numbers returned. The line difference is caused by tabix not taking into account the SAM/BAM flags, namely the unmapped flag. Though I don't show it here, I created a SAM file with one read with a SAM flag 4, which marks a read as being unmapped, but with chromosome coordinates and tabix still returned the read.
See here http://davetang.org/muse/2013/02/22/using-tabix/
Testing tabix with bed files
For the FANTOM5 project we have well over thousands of CAGE libraries. My aforementioned colleague prepared a bed file containing all the CAGE tags in all human samples. Despite having more than half a billion lines in this bed file, making a genomic inquiry took only seconds.
1
2
3
4
5
6
| time tabix hg19.ctss.bed.gz chr11:60000-5000000 | wc 9577172 57463032 336697101 real 0m6.276s user 0m7.997s sys 0m3.092s |
For the keen reader, have a look at the paper for tabix.
the tbi file it downloaded was the index file it uses to find the variants
ReplyDeleteIt should of also streamed the section vcf file to STDOUT and you will need to capture this in a file
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz 1:233411980-245804116 > myfile.vcf
should give you a vcf file
Its best to use the -h option from tabix as this gives you the header which is important for interpreting the file correctly
you can also use "samtools view yourbamfile.bam " with coords being optional quite similar to how a tabix indexed sam file would behave
ReplyDeleteyes, I know that now :)
DeleteAll thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
ReplyDeleteEmail: tdameritrade077@gmail.com