Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, February 24, 2015

fetch genomic sequences from coordinates

One of the most common tasks in bioinformatics is to retrieve DNA sequences according to the given coordinates (in bed format or others). I had a post previously to describe this, and if you google around, you will find many other methods. I am going to compile all the methods that I found in this post.

Please refer to the posts in biostar:


Method 1: Using DAS server
Have a look at this help page from UCSC genome browser http://genome.ucsc.edu/FAQ/FAQdownloads.html

in a web browser :

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<DASDNA>
<SEQUENCE id="chr1" start="100000" stop="100010" version="1.00">
<DNA length="11">cactaagcaca</DNA>
</SEQUENCE>
</DASDNA>
The coordinates are 1 based !
One can also use the ensemble DAS server
http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=1:100000,100010

Segment: 1:100000-100010

100000 cactaagcaca                                                   100010
Ensembl © WTSI / EBI
The coordinates are 1 based too!
Other methods in a gist:
#! /usr/bin
# put the coordinates in a bed file
infile=$1
while read chr start end
do
samtools faidx ref.fasta $chr:$start-$end >> test.fa
done <$infile
# Use the cruzdb python package https://github.com/brentp/cruzdb
from cruzdb import sequence
print sequence.sequence(‘hg19’,’chr1’, 100000, 100010)
# cactaagcaca
# 1 based!
# download twoBitToFa here http://hgdownload.cse.ucsc.edu/admin/exe/
twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr1 -start=100000 -end=100010
cat test.fa
>chr1:100000-100010
actaagcaca
# Note that it is 0 based cooridnates system
# Use betools get fasta http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html
# you need a fasta file of the whole genome, and a bed file containing the cooridnates
bedtools getfasta -fi UCSC_hg19_genome.fa -bed input.bed -fo out.fa
cat out.fa
>chr1:100000-100010
actaagcaca
# it is also 0 based
# Use samtools faidx
# First, index your fasta file (only have to do this once)
samtools faidx reference.fa
# then extract the sequences you want
samtools faidx ref.fasta chr1:100000-100010
>chr1:100000-100010
cactaagcaca
# note that it is 1 based !
Make sure you know how the software works. Mistaking 0 or 1 based coordinates can be dangerous.
Concerning speed, I have not done a test yet, but if you have a fasta file in your local machine, it will be faster to retrieve sequences. samtools requires you to index the genome fasta first, it takes a while, but you only need to do it once. Similarly, one can use the pyfaidx package https://github.com/mdshw5/pyfaidx .  For the cruzdb package, you can mirror the UCSC database to your local computer for faster accession.



1 comment:

  1. Hii Tommy Tang

    Can you please provide a python script for plotting the RNA Seq data to find differentially expressed genes. I did this with cummeRbund package in R. But now I need a python script . Thank you !!

    ReplyDelete