Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, April 30, 2015

get all the promoter sequences of human hg19 genome

One of my former friends (a biologist who does not know much about computer) wants me to help her to get all the promoter sequences from the human genome. It is a very simple task and I think all the biologists should know how to do it. There are many ways to do it, but I will show you how to do it using bioconductor.

## get all the promoter sequences for human hg19 genome
## Author: Ming Tang (Tommy)
## Date: 04/30/2015
## load the libraries
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
BSgenome.Hsapiens.UCSC.hg19
# or
Hsapiens
class(Hsapiens)
##BSgenome contains the DNA sequences
#############
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
Txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
Txdb
class(Txdb)
## Genomic features, contains the gene structure information
## all the genes, returns a GRange object
hg19_genes<- genes(Txdb)
hg19_genes
## you can also get all the exons by gene, returns a GRangelist object
exbg<- exonsBy(Txdb, by="gene")
exbg
############### from another database
library(Homo.sapiens)
ghs<- genes(Homo.sapiens)
ghs
############# hg19_genes and ghs are the same, the ENTREZID is the geneID
## get all the transcription start sites (TSSs)
tssgr<- resize( ghs, 1)
## how many bases you want to extract upstream and downstream of every TSSs
## in this case, I want the 1500 bases around TSS
upstream_pad<- 1000
downstream_pad<- 500
promoters<- promoters(tssgr, upstream_pad, downstream_pad)
head(width(promoters))
## in fact, promoters(ghs, upstream_pad, downstream_pad) gives the same result
## because the default site for calcuation is the start(ghs), which is the TSS
### get sequences for all the promoters (1500bp sequences in this case)
## it takes seconds
promoter_seq<- getSeq(Hsapiens, promoters)
## we can annotate the promoter sequences with gene names, symbols in addition to the entrez id.
annotation<- select(Homo.sapiens, key=names(promoter_seq), keytype="ENTREZID",
columns=c("SYMBOL", "ENSEMBL", "ENTREZID"))
dim(annotation)
## more rows than the promoter_seq
## Be aware that there are some duplicated rows for the same ENTREZID, because there are multiple ENSEMBL id for the
## same ENTREZID
## Let's only use the SYMBOL instead
annotation<- select(Homo.sapiens, key=names(promoter_seq), keytype="ENTREZID",
columns=c("SYMBOL", "ENTREZID"))
## Now, it is the one to one mapping
## write to a fasta file
writeXStringSet(promoter_seq, "promoters_1500.fa", append=FALSE,
compress=FALSE,compression_level=NA, format="fasta")
Using less I see all the sequences are there, to make sure the sequences are right, one can manually inspect the UCSC genome browser for several sequences.

I really do not want to dig in (google is your friend) to find a way to write the name using the SYMBOL rather than the ENTREZID....
You can convert the gene ids by many ways too.
I have two posts for that http://crazyhottommy.blogspot.com/2014/09/converting-gene-ids-using-bioconductor.html
and http://crazyhottommy.blogspot.com/2014/09/mapping-gene-ids-with-mygene.html
In addition, I prefer to prepare a bed file for all the promoters using bedtools slop (RefSeq table from UCSC, or from a GENCODE GTF file). Then, use bedtools to extract DNA sequences using bedtools getfasta. To me, it is more flexible on the command lines.
see my previous post here http://crazyhottommy.blogspot.com/2015/02/fetch-genomic-sequences-from-coordinates.html

Thursday, April 23, 2015

simulation of distribution of means draw from exponential distribution

"central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution." [1]

I am going to draw 40 numbers from exponential distribution [2] (you can do it for any distribution) for 1000 times and examine the distribution of the means. In R, you can do it by rexp(40, lambda),  where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations for this specific test.

see a gist below.

## Overview
# central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
# I am going to draw 40 numbers from exponential distribution for 1000 times, and calcuate the mean
# of each draw (we will have 1000 means), and through this simulation to test if the
# distribution of the means will be normal or not.
## start simulation
# number of simulation, sample size and lambda
nosim<- 1000
n<- 40
lambda<- 0.2
# make a matrix containing 1000 rows and 40 columns with simulated value
mat<- matrix(rexp(nosim * n, lambda), nosim)
# calculate the mean of the 40 numbers draw from exponential distritubtion
# for each row (total 1000 rows), and then take the mean
# we got 1000 means of the 1000 simulation
# now let's calculate the standard deviation
mean_expo<- apply(mat, 1, mean)
## the sample mean and compare it to the theoretical mean of the distribution
#sample mean
mean(mean_expo)
# 5.1
# theroretical mean
1/lambda
# 5.017575
# they are very close to each other.
## how variable the sample is (via variance) and compare it to the theoretical variance of the distribution?
# sample variance
var(mean_expo)
# 0.6469216
# thoretical variance
(1/lambda)^2 * 1/n
# 0.625
# they are very close to each other. you may get a different mean and variance for the sample, but if I set the seed. it should be reproducible
# Let's look at the distribution of the means by histogram
library(ggplot2)
g<- ggplot(data.frame(mean_expo)) + geom_histogram(aes(x=mean_expo, y=..density..),
fill="black", color="white")
g<- g + geom_vline(xintercept=5, color="red")
g<- g + geom_vline(xintercept= mean(mean_expo), color="green")
g<- g + stat_function(fun = dnorm, args= list(mean= 1/lambda, sd= 1/lambda * 1/sqrt(40)),
color="red")
g<- g + geom_density(aes(x= mean_expo,y=..density..), color="green")
g
# the distribution of means should be centered on 1/lambda = 1/0.2 = 5 (the theoretical center).
# the theoretical standard deviation should be sigma/sqrt(n).
# the red bell curve is the density function of Normal(1/lambda, 1/lambda * 1/sqrt(40)).
# the red vertical line is the theoretical mean (5).
# the green vertical line is the sample mean(5.036)
# the green bell curve is the density of the simulated sample means.
# we can see they are very close to each other.
## compare with the large number of exponential distribution
expo<- rexp(1000, lambda)
g<- ggplot(data.frame(expo)) + geom_histogram(aes(x=expo, y=..density..),
fill="black", color="white")
g<- g + geom_vline(xintercept=5, color="red")
g<- g + stat_function(fun = dnorm, args= list(mean= 1/lambda, sd= 1/lambda * 1/sqrt(40)),
color="red")
g<- g + geom_density(aes(x= expo,y=..density..), color="green")
g
# clearly, we see it is not normal distributed
view raw simulation.r hosted with ❤ by GitHub




[1] http://en.wikipedia.org/wiki/Central_limit_theorem
[2] http://en.wikipedia.org/wiki/Exponential_distribution

Monday, April 6, 2015

My first Software Carpentry workshop as an instructor

I just came back from the software-carpentry workshop held on April 2 and 3 at the University of Miami. The workshop link is here http://xuf12.github.io/2015-04-02-umiami/
and the Etherpad:https://etherpad.mozilla.org/2015-04-02-umiami

I taught Shell and R on the morning sessions and I enjoyed it very much. Most of the participants (biologists) are beginners for Shell and R, so I taught the very basics and keep my pace slow. It looks like they also enjoyed learning with me. It was my first time teaching to such a group of (around 15) people and I got myself refreshed on the basics.

The organizer Sawsan is a very nice lady and we enjoyed talking about teaching and (data) science. I met the other three instructors: Xu Fei, Matthew and Ashiwin. They are all wonderful instructors and I learned a lot from them. I also leaned a lot from the Git and SQL sessions. I think I should at least finish the online instructor training by Greg Wilson. I was in a job transition and missed the last session.  I wish I could attend more such workshops to refine my teaching skills.

Despite it was my first time teaching, I got encouraged by the students comments:


I still have a lot to improve though. I will need to speak better english (As a non-native speaker, it sometimes can be quite challenging). I am happy that I made most of the students understood :)