Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, April 30, 2015

get all the promoter sequences of human hg19 genome

One of my former friends (a biologist who does not know much about computer) wants me to help her to get all the promoter sequences from the human genome. It is a very simple task and I think all the biologists should know how to do it. There are many ways to do it, but I will show you how to do it using bioconductor.

## get all the promoter sequences for human hg19 genome
## Author: Ming Tang (Tommy)
## Date: 04/30/2015
## load the libraries
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
BSgenome.Hsapiens.UCSC.hg19
# or
Hsapiens
class(Hsapiens)
##BSgenome contains the DNA sequences
#############
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
Txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
Txdb
class(Txdb)
## Genomic features, contains the gene structure information
## all the genes, returns a GRange object
hg19_genes<- genes(Txdb)
hg19_genes
## you can also get all the exons by gene, returns a GRangelist object
exbg<- exonsBy(Txdb, by="gene")
exbg
############### from another database
library(Homo.sapiens)
ghs<- genes(Homo.sapiens)
ghs
############# hg19_genes and ghs are the same, the ENTREZID is the geneID
## get all the transcription start sites (TSSs)
tssgr<- resize( ghs, 1)
## how many bases you want to extract upstream and downstream of every TSSs
## in this case, I want the 1500 bases around TSS
upstream_pad<- 1000
downstream_pad<- 500
promoters<- promoters(tssgr, upstream_pad, downstream_pad)
head(width(promoters))
## in fact, promoters(ghs, upstream_pad, downstream_pad) gives the same result
## because the default site for calcuation is the start(ghs), which is the TSS
### get sequences for all the promoters (1500bp sequences in this case)
## it takes seconds
promoter_seq<- getSeq(Hsapiens, promoters)
## we can annotate the promoter sequences with gene names, symbols in addition to the entrez id.
annotation<- select(Homo.sapiens, key=names(promoter_seq), keytype="ENTREZID",
columns=c("SYMBOL", "ENSEMBL", "ENTREZID"))
dim(annotation)
## more rows than the promoter_seq
## Be aware that there are some duplicated rows for the same ENTREZID, because there are multiple ENSEMBL id for the
## same ENTREZID
## Let's only use the SYMBOL instead
annotation<- select(Homo.sapiens, key=names(promoter_seq), keytype="ENTREZID",
columns=c("SYMBOL", "ENTREZID"))
## Now, it is the one to one mapping
## write to a fasta file
writeXStringSet(promoter_seq, "promoters_1500.fa", append=FALSE,
compress=FALSE,compression_level=NA, format="fasta")
Using less I see all the sequences are there, to make sure the sequences are right, one can manually inspect the UCSC genome browser for several sequences.

I really do not want to dig in (google is your friend) to find a way to write the name using the SYMBOL rather than the ENTREZID....
You can convert the gene ids by many ways too.
I have two posts for that http://crazyhottommy.blogspot.com/2014/09/converting-gene-ids-using-bioconductor.html
and http://crazyhottommy.blogspot.com/2014/09/mapping-gene-ids-with-mygene.html
In addition, I prefer to prepare a bed file for all the promoters using bedtools slop (RefSeq table from UCSC, or from a GENCODE GTF file). Then, use bedtools to extract DNA sequences using bedtools getfasta. To me, it is more flexible on the command lines.
see my previous post here http://crazyhottommy.blogspot.com/2015/02/fetch-genomic-sequences-from-coordinates.html

4 comments:

  1. it's very useful to combine the R package BSgenome.Hsapiens.UCSC.hg19 and TxDb.Hsapiens.UCSC.hg19.knownGene . In fact, most of the bioconductor packages are designed to replace the small script in Linux.
    And I believe you are a Chinese , yes or right ?
    If yes, maybe we can talk more with each other

    ReplyDelete
  2. hi, very usefull, im trying to do the same thing but using maize (Zea mays)

    I want to get promoters and analyse some regulatory elements and plot them

    how do i do?

    greatings

    ReplyDelete
    Replies
    1. not sure if there is a BSgenome package for maize in bioconductor, you can search it.

      Delete