Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, February 24, 2015

fetch genomic sequences from coordinates

One of the most common tasks in bioinformatics is to retrieve DNA sequences according to the given coordinates (in bed format or others). I had a post previously to describe this, and if you google around, you will find many other methods. I am going to compile all the methods that I found in this post.

Please refer to the posts in biostar:


Method 1: Using DAS server
Have a look at this help page from UCSC genome browser http://genome.ucsc.edu/FAQ/FAQdownloads.html

in a web browser :

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<DASDNA>
<SEQUENCE id="chr1" start="100000" stop="100010" version="1.00">
<DNA length="11">cactaagcaca</DNA>
</SEQUENCE>
</DASDNA>
The coordinates are 1 based !
One can also use the ensemble DAS server
http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=1:100000,100010

Segment: 1:100000-100010

100000 cactaagcaca                                                   100010
Ensembl © WTSI / EBI
The coordinates are 1 based too!
Other methods in a gist:
Make sure you know how the software works. Mistaking 0 or 1 based coordinates can be dangerous.
Concerning speed, I have not done a test yet, but if you have a fasta file in your local machine, it will be faster to retrieve sequences. samtools requires you to index the genome fasta first, it takes a while, but you only need to do it once. Similarly, one can use the pyfaidx package https://github.com/mdshw5/pyfaidx .  For the cruzdb package, you can mirror the UCSC database to your local computer for faster accession.