Diving into Genetics and Genomics: batch converting coordinates to sequences

Tuesday, April 30, 2013

batch converting coordinates to sequences

If you have a file which contains the coordinates of the chromosomes ( let's say you have a bed https://genome.ucsc.edu/FAQ/FAQformat.html#format1 file from a
ChIP-seq peak call) , how can you get the sequences of corresponding coordinates? ( you may need the sequences for motif analysis by MEME http://meme.nbcr.net/meme/intro.html )

1. a simple way is using the table browser in UCSC genome browser

Extracting sequence in batch from an assembly


	Question: "I have a lot of coordinates for an assembly and want to extract the corresponding sequences. What is the best way to proceed? Response: There are two ways to extract genomic sequence in batch from an assembly: A. Download the appropriate fasta files from our ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example, the sequence for human assembly hg17 can be found inftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/. You'll find instructions for obtaining our source programs and utilities here. Some programs that you may find useful are nibFrag and twoBitToFa, as well as other fa* programs. To obtain usage information about most programs, execute it without arguments. B. Use the Table browser to extract sequence. This is a convenient way to obtain small amounts of sequence. Create a custom track of the genomic coordinates in BED format and upload into the Genome Browser. Select the custom track in the Table browser, then select the "sequence" output format to retrieve data. We recommend that you save the file locally as gzip.

Question:
"I have a lot of coordinates for an assembly and want to extract the corresponding sequences. What is the best way to proceed?

Response:

There are two ways to extract genomic sequence in batch from an assembly:

A. Download the appropriate fasta files from our ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example, the sequence for human assembly hg17 can be found inftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/. You'll find instructions for obtaining our source programs and utilities here. Some programs that you may find useful are nibFrag and twoBitToFa, as well as other fa* programs. To obtain usage information about most programs, execute it without arguments.

B. Use the Table browser to extract sequence. This is a convenient way to obtain small amounts of sequence.

Create a custom track of the genomic coordinates in BED format and upload into the Genome Browser.
Select the custom track in the Table browser, then select the "sequence" output format to retrieve data. We recommend that you save the file locally as gzip.

Use bedtools

http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html

http://www.biostars.org/p/19267/

$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>

but you will need to download the whole genome sequences in a fasta format.

3. another option will be using python package pygr https://code.google.com/p/pygr/

>>> from pygr import worldbase
>>> worldbase.dir('Bio.Seq.Genome.HUMAN')
['Bio.Seq.Genome.HUMAN.hg17', 'Bio.Seq.Genome.HUMAN.hg18', 'Bio.Seq.Genome.HUMAN.hg19']

>>> hg17 = worldbase.Bio.Seq.Genome.HUMAN.hg17()

>>> len(hg17)
46
>>> hg17.keys()
['chr6_random', 'chr19_random', 'chr8_random', 'chrY', 'chrX', 'chr13', 'chr12', 'chr11', 'chr15_random', 'chr17', 'chr16', 'chr15', 'chr14', 'chr19', 'chr18', 'chrM', 'chr1_random', 'chr13_random', 'chr3_random', 'chr6_hla_hap2', 'chr9_random', 'chr22_random', 'chr10', 'chr4_random', 'chr18_random', 'chr2_random', 'chr22', 'chr20', 'chr21', 'chr10_random', 'chr6_hla_hap1', 'chr7', 'chr6', 'chr5', 'chr4', 'chr3', 'chr2', 'chr1', 'chr7_random', 'chrX_random', 'chr9', 'chr8', 'chr16_random', 'chr5_random', 'chr17_random', 'chr12_random']
>>> chr1 = hg17['chr1']
>>> len(chr1)
245522847
>>> s = chr1[100000000:100001000]
>>> len(s)
1000
>>> repr(s)
'chr1[100000000:100001000]'
>>> print s
ATTCAGGCAATGTTGACTTCATTAGTGTTGACAGAAAATAAGCATAAAAATGCAAAACATTGTTGGCTTAACCTGAATATACCTGCATTACCTATTATTAACCAGTTTGTGTTGCCAAAAGACTTATTCCTTGGCATTAAAATGGAGCACTTAAAAATATTTCTAAAAAGCAAATGCCCACACGCTGGTCTTTGCAGCAAATAAGGGTATTTATACTTTTAAAATATTTTAAGTCCATAATTGGATTAATATACACACCTTCTTATGTATAAGGAGTTCAGATCATATAAACACCGTACAATCCAAAAAACCCTACTGAGAATAAAACTAAATAGGCTTATGATAAGAAATACAGATATTCCCATGTATTTACAAATATCATAGACACACAAATTTGGTCAAATACTGTAAAGAAAGAAGAAGtacctgtacctctacctctacctctacctcctctacctcctctacctcctctacctcctctacctcctctacctcctcttcctctacctctacctctacctctacctctacccacggtctccctttccctctctttccacggtctccctctgatgccgagccgaagctggactgtactgctgccatctcggctcactgcaacctccctgcctgattctcctgcctcaacctgccgagtgcctgcgattgcaggcgcgcgccgccacgcctgactggttttcgtatttttttggtggagacggggtttcgctgtgttggccgggctggtctccagctcctaacctcgagtgatccgccagcctcggcctcccgaggtgccgggtttgcagaggagtctcattcactcagtgctcaatggtgcccaggctggagtgcagtggcgtgatctcggctcgctacaacctccacctcccagcagcctgccttggcctcccaaagtgccgagattgcagtctccgcccggctgccaccccatctgggaagtgaggagcatctctgcctggccgcccatcgtctg

you can easily pull out the sequences based on chromosome name, start, end.

" Again, note that all of these operations were more or less instantaneous (even over the network). These Pygr objects (chr1, s) are just abstract representations of the specified pieces of the human genome, so creating these objectsdoes not require Pygr to download all of human chromosome 1. Only when you force it to show a sequence string does Pygr obtain that data over the network – and then only the piece that you need."

4 comments:

tommyMay 18, 2013 at 4:54 PM
later I found Homer can do similar job.
http://biowhat.ucsd.edu/homer/ngs/homerTools.html

Extracting Genomic Sequences From FASTA Files
The extract command can be used to extract large numbers of specific genomic sequence. The first input file you need is a HOMER style peak file or a BED file with genomic locations. Next, you must have the genomic DNA sequences in one of two formats: (1) a directory of chr1.fa, chr2.fa FASTA files (can be masked file like *.fa.masked), or

(2) a single file FASTA file with all of the chromosomes concatonated in one file. The sequences are sent to stdout as a tab-delimited file, or as a FASTA formatted file if "-fa" is added to the end of the command. Save the output to a file by adding " > outputfile.txt" to the end of the command. The program is run like this:

homerTools extract [-fa]

i.e. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ > outputSequences.txt
Or, to get FASTA files back, i.e. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ -fa > outputSequence.fa
ReplyDelete
Replies
MaxMay 29, 2015 at 2:39 AM
Use a command like this:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit -bed=input.bed test.fa

or for a single region:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000

Requires the UCSC tool twoBitToFa, available from http://hgdownload.cse.ucsc.edu/admin/exe/
ReplyDelete
Replies
UnknownJune 27, 2018 at 10:44 AM
Clear answers, thank you!
ReplyDelete
Replies

Add comment

Diving into Genetics and Genomics

My github papge

Tuesday, April 30, 2013

batch converting coordinates to sequences

4 comments:

Labels

My Blog List