Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, June 23, 2015

RPKM/FPKM, TPM and raw counts for RNA-seq

There was a hot discussion that RPKM should not be used in the RNA-seq analysis:
blahah404  Woohoo - death to FPKM and RPKM!t.co/CyR3ht1dqi

However  has a different opinion, he thinks FPKM makes sense.
Why FPKM makes sense

what is RPKM or FPKM? see blog posts below:
update on 07/23/2015, see a youtube video  for great explanation of the three.
What is the FPKM? A review of RNA-Seq expression units
Counts vs. FPKMs in RNA-seq
Paper: RPKM measure is inconsistent among samples
FPKM/RPKM normalization caveat and upper quartile normalization
Finally a post from Lior Pachter
Estimating number of transcripts from RNA-Seq measurements (and why I believe in paywall)

I personally use raw counts and let DESeq2 to deal with normalization.

update on 07/08/2016.
Salmon and Kallisto outputs TPM, but if you want to convert counts to TPM, read this biostar post, and my post comparing salmon, kallisto and HTseq.

#' Convert counts to transcripts per million (TPM).
#'
#' Convert a numeric matrix of features (rows) and conditions (columns) with
#' raw feature counts to transcripts per million.
#'
#' Lior Pachter. Models for transcript quantification from RNA-Seq.
#' arXiv:1104.3889v2
#'
#' Wagner, et al. Measurement of mRNA abundance using RNA-seq data:
#' RPKM measure is inconsistent among samples. Theory Biosci. 24 July 2012.
#' doi:10.1007/s12064-012-0162-3
#'
#' @param counts A numeric matrix of raw feature counts i.e.
#' fragments assigned to each gene.
#' @param featureLength A numeric vector with feature lengths.
#' @param meanFragmentLength A numeric vector with mean fragment lengths.
#' @return tpm A numeric matrix normalized by library size and feature length.
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {
# Ensure valid arguments.
stopifnot(length(featureLength) == nrow(counts))
stopifnot(length(meanFragmentLength) == ncol(counts))
# Compute effective lengths of features in each library.
effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
featureLength - meanFragmentLength[i] + 1
}))
# Exclude genes with length less than the mean fragment length.
idx <- apply(effLen, 1, function(x) min(x) > 1)
counts <- counts[idx,]
effLen <- effLen[idx,]
featureLength <- featureLength[idx]
# Process one column at a time.
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
rate = log(counts[,i]) - log(effLen[,i])
denom = log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}))
# Copy the row and column names from the original matrix.
colnames(tpm) <- colnames(counts)
rownames(tpm) <- rownames(counts)
return(tpm)
}
view raw counts_to_tpm.R hosted with ❤ by GitHub

No comments:

Post a Comment