Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, April 30, 2013

batch converting coordinates to sequences

If you have a file which contains the coordinates of the chromosomes ( let's say you have a bed https://genome.ucsc.edu/FAQ/FAQformat.html#format1 file from a
ChIP-seq peak call) , how can you get the sequences of corresponding coordinates? ( you may need the sequences for motif analysis by MEME http://meme.nbcr.net/meme/intro.html )

1.  a simple way is using the table browser in UCSC genome browser
 Extracting sequence in batch from an assembly
Question: 
"I have a lot of coordinates for an assembly and want to extract the corresponding sequences. What is the best way to proceed?
Response:
There are two ways to extract genomic sequence in batch from an assembly:
A. Download the appropriate fasta files from our ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example, the sequence for human assembly hg17 can be found inftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/. You'll find instructions for obtaining our source programs and utilities here. Some programs that you may find useful are nibFrag and twoBitToFa, as well as other fa* programs. To obtain usage information about most programs, execute it without arguments.
B. Use the Table browser to extract sequence. This is a convenient way to obtain small amounts of sequence.
  1. Create a custom track of the genomic coordinates in BED format and upload into the Genome Browser.
  2. Select the custom track in the Table browser, then select the "sequence" output format to retrieve data. We recommend that you save the file locally as gzip.

2.
Use bedtools 


$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>
but you will need to download the whole genome sequences in a fasta format.


3.  another option will be using python package pygr https://code.google.com/p/pygr/
>>> from pygr import worldbase
>>> worldbase.dir('Bio.Seq.Genome.HUMAN')
['Bio.Seq.Genome.HUMAN.hg17', 'Bio.Seq.Genome.HUMAN.hg18', 'Bio.Seq.Genome.HUMAN.hg19']


>>> hg17 = worldbase.Bio.Seq.Genome.HUMAN.hg17()

>>> len(hg17)
46
>>> hg17.keys()
['chr6_random', 'chr19_random', 'chr8_random', 'chrY', 'chrX', 'chr13', 'chr12', 'chr11', 'chr15_random', 'chr17', 'chr16', 'chr15', 'chr14', 'chr19', 'chr18', 'chrM', 'chr1_random', 'chr13_random', 'chr3_random', 'chr6_hla_hap2', 'chr9_random', 'chr22_random', 'chr10', 'chr4_random', 'chr18_random', 'chr2_random', 'chr22', 'chr20', 'chr21', 'chr10_random', 'chr6_hla_hap1', 'chr7', 'chr6', 'chr5', 'chr4', 'chr3', 'chr2', 'chr1', 'chr7_random', 'chrX_random', 'chr9', 'chr8', 'chr16_random', 'chr5_random', 'chr17_random', 'chr12_random']
>>> chr1 = hg17['chr1']
>>> len(chr1)
245522847
>>> s = chr1[100000000:100001000]
>>> len(s)
1000
>>> repr(s)
'chr1[100000000:100001000]'
>>> print s
ATTCAGGCAATGTTGACTTCATTAGTGTTGACAGAAAATAAGCATAAAAATGCAAAACATTGTTGGCTTAACCTGAATATACCTGCATTACCTATTATTAACCAGTTTGTGTTGCCAAAAGACTTATTCCTTGGCATTAAAATGGAGCACTTAAAAATATTTCTAAAAAGCAAATGCCCACACGCTGGTCTTTGCAGCAAATAAGGGTATTTATACTTTTAAAATATTTTAAGTCCATAATTGGATTAATATACACACCTTCTTATGTATAAGGAGTTCAGATCATATAAACACCGTACAATCCAAAAAACCCTACTGAGAATAAAACTAAATAGGCTTATGATAAGAAATACAGATATTCCCATGTATTTACAAATATCATAGACACACAAATTTGGTCAAATACTGTAAAGAAAGAAGAAGtacctgtacctctacctctacctctacctcctctacctcctctacctcctctacctcctctacctcctctacctcctcttcctctacctctacctctacctctacctctacccacggtctccctttccctctctttccacggtctccctctgatgccgagccgaagctggactgtactgctgccatctcggctcactgcaacctccctgcctgattctcctgcctcaacctgccgagtgcctgcgattgcaggcgcgcgccgccacgcctgactggttttcgtatttttttggtggagacggggtttcgctgtgttggccgggctggtctccagctcctaacctcgagtgatccgccagcctcggcctcccgaggtgccgggtttgcagaggagtctcattcactcagtgctcaatggtgcccaggctggagtgcagtggcgtgatctcggctcgctacaacctccacctcccagcagcctgccttggcctcccaaagtgccgagattgcagtctccgcccggctgccaccccatctgggaagtgaggagcatctctgcctggccgcccatcgtctg

you can easily pull out the sequences based on chromosome name, start, end.

" Again, note that all of these operations were more or less instantaneous (even over the network). These Pygr objects (chr1s) are just abstract representations of the specified pieces of the human genome, so creating these objectsdoes not require Pygr to download all of human chromosome 1. Only when you force it to show a sequence string does Pygr obtain that data over the network – and then only the piece that you need."



4 comments:

  1. later I found Homer can do similar job.
    http://biowhat.ucsd.edu/homer/ngs/homerTools.html

    Extracting Genomic Sequences From FASTA Files
    The extract command can be used to extract large numbers of specific genomic sequence. The first input file you need is a HOMER style peak file or a BED file with genomic locations. Next, you must have the genomic DNA sequences in one of two formats: (1) a directory of chr1.fa, chr2.fa FASTA files (can be masked file like *.fa.masked), or

    (2) a single file FASTA file with all of the chromosomes concatonated in one file. The sequences are sent to stdout as a tab-delimited file, or as a FASTA formatted file if "-fa" is added to the end of the command. Save the output to a file by adding " > outputfile.txt" to the end of the command. The program is run like this:

    homerTools extract [-fa]

    i.e. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ > outputSequences.txt
    Or, to get FASTA files back, i.e. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ -fa > outputSequence.fa

    ReplyDelete
  2. Use a command like this:

    twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit -bed=input.bed test.fa

    or for a single region:

    twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000

    Requires the UCSC tool twoBitToFa, available from http://hgdownload.cse.ucsc.edu/admin/exe/

    ReplyDelete
    Replies
    1. Thank max for the tip. I knew this, and put it in my other post :) http://crazyhottommy.blogspot.com/2015/02/fetch-genomic-sequences-from-coordinates.html

      Delete