ChIP-seq peak call) , how can you get the sequences of corresponding coordinates? ( you may need the sequences for motif analysis by MEME http://meme.nbcr.net/meme/intro.html )
1. a simple way is using the table browser in UCSC genome browser
|
Question:
"I have a lot of coordinates for an assembly and want to extract the corresponding sequences. What is the best way to proceed?
Response:
There are two ways to extract genomic sequence in batch from an assembly:
A. Download the appropriate fasta files from our ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example, the sequence for human assembly hg17 can be found inftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/. You'll find instructions for obtaining our source programs and utilities here. Some programs that you may find useful are nibFrag and twoBitToFa, as well as other fa* programs. To obtain usage information about most programs, execute it without arguments.
B. Use the Table browser to extract sequence. This is a convenient way to obtain small amounts of sequence.
|
2.
Use bedtools
but you will need to download the whole genome sequences in a fasta format.
3. another option will be using python package pygr https://code.google.com/p/pygr/
>>> from pygr import worldbase >>> worldbase.dir('Bio.Seq.Genome.HUMAN') ['Bio.Seq.Genome.HUMAN.hg17', 'Bio.Seq.Genome.HUMAN.hg18', 'Bio.Seq.Genome.HUMAN.hg19']
>>> hg17 = worldbase.Bio.Seq.Genome.HUMAN.hg17()
>>> len(hg17) 46 >>> hg17.keys() ['chr6_random', 'chr19_random', 'chr8_random', 'chrY', 'chrX', 'chr13', 'chr12', 'chr11', 'chr15_random', 'chr17', 'chr16', 'chr15', 'chr14', 'chr19', 'chr18', 'chrM', 'chr1_random', 'chr13_random', 'chr3_random', 'chr6_hla_hap2', 'chr9_random', 'chr22_random', 'chr10', 'chr4_random', 'chr18_random', 'chr2_random', 'chr22', 'chr20', 'chr21', 'chr10_random', 'chr6_hla_hap1', 'chr7', 'chr6', 'chr5', 'chr4', 'chr3', 'chr2', 'chr1', 'chr7_random', 'chrX_random', 'chr9', 'chr8', 'chr16_random', 'chr5_random', 'chr17_random', 'chr12_random'] >>> chr1 = hg17['chr1'] >>> len(chr1) 245522847 >>> s = chr1[100000000:100001000] >>> len(s) 1000 >>> repr(s) 'chr1[100000000:100001000]' >>> print s ATTCAGGCAATGTTGACTTCATTAGTGTTGACAGAAAATAAGCATAAAAATGCAAAACATTGTTGGCTTAACCTGAATATACCTGCATTACCTATTATTAACCAGTTTGTGTTGCCAAAAGACTTATTCCTTGGCATTAAAATGGAGCACTTAAAAATATTTCTAAAAAGCAAATGCCCACACGCTGGTCTTTGCAGCAAATAAGGGTATTTATACTTTTAAAATATTTTAAGTCCATAATTGGATTAATATACACACCTTCTTATGTATAAGGAGTTCAGATCATATAAACACCGTACAATCCAAAAAACCCTACTGAGAATAAAACTAAATAGGCTTATGATAAGAAATACAGATATTCCCATGTATTTACAAATATCATAGACACACAAATTTGGTCAAATACTGTAAAGAAAGAAGAAGtacctgtacctctacctctacctctacctcctctacctcctctacctcctctacctcctctacctcctctacctcctcttcctctacctctacctctacctctacctctacccacggtctccctttccctctctttccacggtctccctctgatgccgagccgaagctggactgtactgctgccatctcggctcactgcaacctccctgcctgattctcctgcctcaacctgccgagtgcctgcgattgcaggcgcgcgccgccacgcctgactggttttcgtatttttttggtggagacggggtttcgctgtgttggccgggctggtctccagctcctaacctcgagtgatccgccagcctcggcctcccgaggtgccgggtttgcagaggagtctcattcactcagtgctcaatggtgcccaggctggagtgcagtggcgtgatctcggctcgctacaacctccacctcccagcagcctgccttggcctcccaaagtgccgagattgcagtctccgcccggctgccaccccatctgggaagtgaggagcatctctgcctggccgcccatcgtctg
you can easily pull out the sequences based on chromosome name, start, end.
" Again, note that all of these operations were more or less instantaneous (even over the network). These Pygr objects (chr1, s) are just abstract representations of the specified pieces of the human genome, so creating these objectsdoes not require Pygr to download all of human chromosome 1. Only when you force it to show a sequence string does Pygr obtain that data over the network – and then only the piece that you need."
later I found Homer can do similar job.
ReplyDeletehttp://biowhat.ucsd.edu/homer/ngs/homerTools.html
Extracting Genomic Sequences From FASTA Files
The extract command can be used to extract large numbers of specific genomic sequence. The first input file you need is a HOMER style peak file or a BED file with genomic locations. Next, you must have the genomic DNA sequences in one of two formats: (1) a directory of chr1.fa, chr2.fa FASTA files (can be masked file like *.fa.masked), or
(2) a single file FASTA file with all of the chromosomes concatonated in one file. The sequences are sent to stdout as a tab-delimited file, or as a FASTA formatted file if "-fa" is added to the end of the command. Save the output to a file by adding " > outputfile.txt" to the end of the command. The program is run like this:
homerTools extract [-fa]
i.e. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ > outputSequences.txt
Or, to get FASTA files back, i.e. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ -fa > outputSequence.fa
Use a command like this:
ReplyDeletetwoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit -bed=input.bed test.fa
or for a single region:
twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000
Requires the UCSC tool twoBitToFa, available from http://hgdownload.cse.ucsc.edu/admin/exe/
Thank max for the tip. I knew this, and put it in my other post :) http://crazyhottommy.blogspot.com/2015/02/fetch-genomic-sequences-from-coordinates.html
DeleteClear answers, thank you!
ReplyDelete