Please refer to the posts in biostar:
Have a look at this help page from UCSC genome browser http://genome.ucsc.edu/FAQ/FAQdownloads.html
in a web browser :
This XML
file does not appear to have any style information associated with it. The
document tree is shown below.
<DASDNA>
<SEQUENCE id="chr1" start="100000" stop="100010" version="1.00">
<DNA length="11">cactaagcaca</DNA>
</SEQUENCE>
</DASDNA>
The coordinates are 1 based !One can also use the ensemble DAS server
http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=1:100000,100010
The coordinates are 1 based too!Segment: 1:100000-100010
100000 cactaagcaca 100010
Other methods in a gist:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#! /usr/bin | |
# put the coordinates in a bed file | |
infile=$1 | |
while read chr start end | |
do | |
samtools faidx ref.fasta $chr:$start-$end >> test.fa | |
done <$infile |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Use the cruzdb python package https://github.com/brentp/cruzdb | |
from cruzdb import sequence | |
print sequence.sequence(‘hg19’,’chr1’, 100000, 100010) | |
# cactaagcaca | |
# 1 based! |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# download twoBitToFa here http://hgdownload.cse.ucsc.edu/admin/exe/ | |
twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr1 -start=100000 -end=100010 | |
cat test.fa | |
>chr1:100000-100010 | |
actaagcaca | |
# Note that it is 0 based cooridnates system | |
# Use betools get fasta http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html | |
# you need a fasta file of the whole genome, and a bed file containing the cooridnates | |
bedtools getfasta -fi UCSC_hg19_genome.fa -bed input.bed -fo out.fa | |
cat out.fa | |
>chr1:100000-100010 | |
actaagcaca | |
# it is also 0 based | |
# Use samtools faidx | |
# First, index your fasta file (only have to do this once) | |
samtools faidx reference.fa | |
# then extract the sequences you want | |
samtools faidx ref.fasta chr1:100000-100010 | |
>chr1:100000-100010 | |
cactaagcaca | |
# note that it is 1 based ! |
Concerning speed, I have not done a test yet, but if you have a fasta file in your local machine, it will be faster to retrieve sequences. samtools requires you to index the genome fasta first, it takes a while, but you only need to do it once. Similarly, one can use the pyfaidx package https://github.com/mdshw5/pyfaidx . For the cruzdb package, you can mirror the UCSC database to your local computer for faster accession.
Hii Tommy Tang
ReplyDeleteCan you please provide a python script for plotting the RNA Seq data to find differentially expressed genes. I did this with cummeRbund package in R. But now I need a python script . Thank you !!