Diving into Genetics and Genomics: February 2015

Tuesday, February 24, 2015

fetch genomic sequences from coordinates

One of the most common tasks in bioinformatics is to retrieve DNA sequences according to the given coordinates (in bed format or others). I had a post previously to describe this, and if you google around, you will find many other methods. I am going to compile all the methods that I found in this post.

Please refer to the posts in biostar:

https://www.biostars.org/p/19267/

https://www.biostars.org/p/3137/

https://www.biostars.org/p/56/

Method 1: Using DAS server
Have a look at this help page from UCSC genome browser http://genome.ucsc.edu/FAQ/FAQdownloads.html

in a web browser :

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:100000,100010

This XML file does not appear to have any style information associated with it. The document tree is shown below.

<DNA length="11">cactaagcaca</DNA>

</SEQUENCE>

</DASDNA>

The coordinates are 1 based !
One can also use the ensemble DAS server
http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=1:100000,100010



Segment:
1:100000-100010

100000 cactaagcaca                                                   100010


Ensembl © WTSI / EBI

The coordinates are 1 based too!
Other methods in a gist:
Make sure you know how the software works. Mistaking 0 or 1 based coordinates can be dangerous.
Concerning speed, I have not done a test yet, but if you have a fasta file in your local machine, it will be faster to retrieve sequences. samtools requires you to index the genome fasta first, it takes a while, but you only need to do it once. Similarly, one can use the pyfaidx package https://github.com/mdshw5/pyfaidx . For the cruzdb package, you can mirror the UCSC database to your local computer for faster accession.

Diving into Genetics and Genomics

My github papge

Tuesday, February 24, 2015

fetch genomic sequences from coordinates

Segment: 1:100000-100010

Labels

My Blog List