Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Monday, July 11, 2016

Comparing salmon, kalliso and STAR-HTseq RNA-seq quantification

Why I am testing again?

However, I just got several RNA-seq data to play with, and I think it is a good time-point for me to get my hands wet on those RNA-seq quantification tools (especially those alignment-free ones) and get a personal idea of how different tools perform. I am not doing bench-marking, as one should simulate the RNA-seq reads by e.g. polyester to have the ground truth.

UPDATE 08/22/16, a nice post by Tom Smith compared alignment dependent and independent RNA-seq quantification using simulated data: why you should use alignment independent quantification for RNA-seq.
Choosing alignment based tools (such as tophat, STAR, bowtie, HISAT) or alignment free ones depends on the purpose of your study. Salmon and kallisto requires the reads "pesudo-map" to the transcriptome, so one has to provide a fasta file containing all the transcripts you want to quantify. Therefore, if you want to find novel transcripts, you probably should go with the alignment based methods. It is also shown recently that Widespread intron retention diversifies most cancer transcriptomes. If you want to do similar things, you need to use mappers with the genome (not the transcriptiome) as a reference.
kallisto can output a pseudo-bam which can be useful for some people. Salmon will have the same functionality in the next release according to Rob.
Now, let's begin my analysis.
I skipped quality trimming for this analysis. As it was shown that trimming may not necessary to be a good thing:
Trimming of sequence reads alters RNA-Seq gene expression estimates
For assembly On the optimal trimming of high-throughput mRNA sequence data.
However, as a rule of thumb, one needs to always check reads quality of any sequencing data sets. e.g. RSeQC: An RNA-seq Quality Control Package

Testing Salmon for RNA-seq quantification

Download the binary(v0.6.0) by:
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.6.0/SalmonBeta-0.6.0_DebianSqueeze.tar.gz
tar xvzf SalmonBeta-0.6.0_DebianSqueeze.tar.gz
Read the documentation on how to use it here
To use Salmon in quasi-mapping-based mode, then you first have to build an Salmon index for your transcriptome.Ensemble release 75 is the latest version for GRCh37 (hg19). starting from release 76 it is GRCh38 (hg38)
Download the hg19 version of cDNA and non-coding RNA fasta:
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz  
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh37.75.ncrna.fa.gz
## merge together 

gunzip -c Homo_sapiens.GRCh37.75.cdna.all.fa.gz Homo_sapiens.GRCh37.75.ncrna.fa.gz > Homo_sapiens.GRCh37.75.cdna.ncrna.fa
First, you run the Salmon indexer:
Default index --- The quasi index has been made the default type. This means that it is no longer necessary to provide the --type option to the index command. The fmd index remains enabled, but may be removed in a future version.
salmon index -t Homo_sapiens.GRCh37.75.cdna.ncrna.fa -i Homo_sapiens.GRCh37.75_quasi_index 
It finished in a little over 10 mins.
quantify the transcript:
The RNA-seq library I am analyzing is single-end stranded, reads from reverse strand. so I specified -l SR. readsalmon doc for different library types. I especially like the figure representation below:
salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 3R_S18_L002_R1_001.fastq.gz) -o 3R_transcripts_quant

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 50R_S19_L002_R1_001.fastq.gz) -o 50R_transcripts_quant

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat WT_S17_L002_R1_001.fastq) -o WT_transcripts_quant
The quantification finishes within minutes!

quantification in gene level

Salmon can also give gene-level quantification as long as feed a gtf file. In addtion, I specify the fragment length to be 200 bp (default) and standard deviation of 20 (default 80). Details here. For paired-end RNA-seq, the fragment length distribution can be infered from the fastq files, but for single-end data, it needs to be specified.
wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 3R_S18_L002_R1_001.fastq.gz) -o 3R_transcripts_quant -g ~/annotations/Homo_sapiens.GRCh37.75.gtf --fldMean 200 --fldSD 20

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 50R_S19_L002_R1_001.fastq.gz) -o 50R_transcripts_quant -g ~/annotations/Homo_sapiens.GRCh37.75.gtf --fldMean 200 --fldSD 20

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat WT_S17_L002_R1_001.fastq.gz) -o WT_transcripts_quant -g ~/annotations/Homo_sapiens.GRCh37.75.gtf --fldMean 200 --fldSD 20
It is recommended using tximport to get the gene-level quantification. I asked the difference between tximport andsalmon quant -g. The developer of salmon @Rob Patro answered:
Main diffs I can think of (1) in R
(2) integrated with DESeq2
(3) Can derive multi-sample effective gene lengths

testing kallisto for quantification.

Build index first (I am using v0.43.0):
kallisto index -i Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx Homo_sapiens.GRCh37.75.cdna.ncrna.fa
It took me around 15 mins to build the index.
quantification:
-l, --fragment-length=DOUBLE Estimated average fragment length
-s, --sd=DOUBLE Estimated standard deviation of fragment length (default: value is estimated from the input data)
In the case of single-end reads, the -l option must be used to specify the average fragment length. Typical Illumina libraries produce fragment lengths ranging from 180–200 bp but it’s best to determine this from a library quantification with an instrument such as an Agilent Bioanalyzer.
see a question by James on the google group
Common values for single end reads are insert length 200 and sd 20. If you have any better information, like the person who prepped the library or better yet, data from bioanalyzer that will of course be better.
kallisto can take .gz files.
kallisto quant -t 10 -i ~/annotations/Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx -o 3R_kaliso_output --single -l 200 -s 20 3R_S18_L002_R1_001.fastq.gz

kallisto quant -t 10 -i ~/annotations/Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx -o 50R_kaliso_output --single -l 200 -s 20 50R_S19_L002_R1_001.fastq.gz

kallisto quant -t 10 -i ~/annotations/Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx -o WT_kaliso_output --single -l 200 -s 20 WT_S17_L002_R1_001.fastq.gz
Finished in ~6 mins. again, blazing fast as Salmon does.

counts versus TPM/RPKM/FPKM

I will need to quote from this blog post on explaning technical differences among RPKM, FPKM and TPM.
These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:
  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq
TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:
  1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
  2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
  3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.
So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.
When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.
countToTpm <- function(counts, effLen)
{
    rate <- log(counts) - log(effLen)
    denom <- log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
}

countToFpkm <- function(counts, effLen)
{
    N <- sum(counts)
    exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}

fpkmToTpm <- function(fpkm)
{
    exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}

countToEffCounts <- function(counts, len, effLen)
{
    counts * (len / effLen)
}

################################################################################
# An example
################################################################################
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)
countDf <- data.frame(count = cnts, length = lens)

# assume a mean(FLD) = 203.7
countDf$effLength <- countDf$length - 203.7 + 1
countDf$tpm <- with(countDf, countToTpm(count, effLength))
countDf$fpkm <- with(countDf, countToFpkm(count, effLength))
with(countDf, all.equal(tpm, fpkmToTpm(fpkm)))
countDf$effCounts <- with(countDf, countToEffCounts(count, length, effLength))
My colleage @samir processed the same data using STAR-HTseq piepline, and it gives the raw counts for each gene. He is very skepitcal about transcript level quantification and would focus on gene-level for now. What adds the complexity a bit is that he used gencode v19 as annotation and the gene name has a dot + digits in the end of each gene. e.g. ENSG00000000003.10 vs ENSG00000000003 in gtf files downloaded from ensemble. I will need to get rid of the digits in the end.
I will need to convert the raw counts from the STAR-HTseq pipeline to TPM for comparison as Salmon and kallistooutput TPM and estimated counts. Read the post: convert counts to TPMKamil Slowikowski wrote a function to convert counts to TPM, and the function involves an effective length of the features.

what is effective length of the feature?

It is quite mathematical, but the general idea is:
If we take the fragment length to be fixed, then the effective length is how many fragments can occur in the transcript. This turns out to be length - frag_len +1. The number of fragments coming from a transcript will be proportional to this number, regardless of whether you sequenced one or both ends of the fragment. In turn, when it comes to probabilistically assigning reads to transcripts the effective length plays a similar role again. Thus for short transcripts, there can be quite a difference between two fragment lengths. To go back to your example if you have transcript of length 310, your effective length is 10 (if fragment length is 300) or 160 (if fragment length is 150) in either case, which explains the discrepancy you see.
From @Rob
The effective length is computed by using the fragment length distribution to determine the effective number of positions that can be sampled on each transcript. You can think of this as convolving the fragment length distribution with the characteristic function (the function that simply takes a value of 1) over the transcript. For example if we observe fragments of length 50 --- 1000, a position more than 1000 bases from the end of the transcript will contribute a value of 1 to the effective length, while a position 150 bases will contribute a value of F(150), where F is the cumulative distribution function of the fragment length distribution. For single end data, where we can't learn an empirical FLD, we use a gaussian whose mean and standard deviation can be set with --fldMean and --fldSD respectively.
From Harold Pimentel's post above. He is in Lior Pachter's group.
Effective length refers to the number of possible start sites a feature could have generated a fragment of that particular length. In practice, the effective length is usually computed as: 
where uFDL is the mean of the fragment length distribution which was learned from the aligned read. If the abundance estimation method you’re using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.

R scripts to convert HTseq counts to TPM

library(dplyr)
setwd("/Volumes/mdarisngc03.mdanderson.edu/scratch/RNA-seq-26357338/")

list.files()

counts.3R<- read.table("STAR_3R-30390482_htseq.cnt", sep="\t", header=F, stringsAsFactors = F)
counts.50R<- read.table("STAR_50R-30393469_htseq.cnt", sep="\t", header=F, stringsAsFactors = F)
counts.WT<- read.table("STAR_WT-30393468_htseq.cnt", sep="\t", header=F, stringsAsFactors = F)

counts.df<- cbind(counts.WT, counts.3R, counts.50R)
counts.df<- counts.df[, c(1,2,4,6)]
names(counts.df)<- c("gene_name", "HTseq.WT", "HTseq.3R", "HTseq.50R")

## get rid of the digits in the end of the gene_name
counts.df<- counts.df %>% mutate(gene_name = gsub("\\.[0-9]+", "", gene_name)) 
To convert the raw counts from HTSeq (gencode v19 as annotation), I will need the length of each gene. Although one can compute the gene length from the gtf files, the gene-level output of Salmon has already computed it for me. I will just need to "borrow" from there.
WT.Salmon.gene.quant<- read.table("./WT-30393468/WT_transcripts_quant/quant.genes.sf", sep="\t",
                                  header=T, stringsAsFactors = F)
gene.length<- dplyr::select(WT.Salmon.gene.quant, c(Name, Length))

counts.df<- inner_join(counts.df, gene.length, by =c("gene_name"="Name")) 
counts_to_tpm function from https://www.biostars.org/p/171766/
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {

  # Ensure valid arguments.
  stopifnot(length(featureLength) == nrow(counts))
  stopifnot(length(meanFragmentLength) == ncol(counts))

  # Compute effective lengths of features in each library.
  effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
    featureLength - meanFragmentLength[i] + 1
  }))

  # Exclude genes with length less than the mean fragment length.
  idx <- apply(effLen, 1, function(x) min(x) > 1)
  counts <- counts[idx,]
  effLen <- effLen[idx,]
  featureLength <- featureLength[idx]

  # Process one column at a time.
  tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
    rate = log(counts[,i]) - log(effLen[,i])
    denom = log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
  }))

  # Copy the column names from the original matrix.
  colnames(tpm) <- colnames(counts)
  return(tpm)
}

featureLength<- counts.df$Length

## fragment length is 200bp 
meanFragmentLength<- c(200, 200, 200)

counts<- as.matrix(counts.df[,c(2,3,4)])
rownames(counts)<- counts.df$gene_name

TPM.from.HTSeq<- counts_to_tpm(counts, featureLength, meanFragmentLength)

Import salmon and kallisto transcript level

Use tximport to import salmon and kallisto transcript level quantification
#source("https://bioconductor.org/biocLite.R")
#biocLite("EnsDb.Hsapiens.v75")

#create a tx2gene.txt table
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

Tx.ensemble <- transcripts(edb,
          columns = c("tx_id", "gene_id", "gene_name"),
          return.type = "DataFrame")
nrow(Tx.ensemble)

tx2gene<- Tx.ensemble[,c(1,2)]

samples<- c("WT-30393468", "3R-30390482", "50R-30393469")
salmon.dir<- c("WT_transcripts_quant", "3R_transcripts_quant", "50R_transcripts_quant")
kallisto.dir<- c("WT_kaliso_output", "3R_kaliso_output", "50R_kaliso_output")

kallisto.files <- file.path(samples, kallisto.dir, "abundance.tsv")
names(kallisto.files)<- c("kallisto.WT", "kallisto.3R", "kallisto.50R")

salmon.files<- file.path(samples, salmon.dir, "quant.sf")
names(salmon.files)<- c("salmon.WT", "salmon.3R", "salmon.50R")

all(file.exists(kallisto.files))
all(file.exists(salmon.files))

library(tximport)
library(readr)

tx.kallisto <- tximport(kallisto.files, type = "kallisto", tx2gene = tx2gene, 
                        reader = read_tsv, countsFromAbundance = "lengthScaledTPM")
tx.salmon <- tximport(salmon.files, type = "salmon", tx2gene = tx2gene, 
                      reader = read_tsv, countsFromAbundance = "lengthScaledTPM")

compare STAR-HTseq, kallisto and salmon

## merge the counts table together from HTSeq, salmon and kallisto
head(counts.df)
salmon.counts<- tx.salmon$counts
salmon.counts<- as.data.frame(salmon.counts)
salmon.counts$gene_name<- rownames(salmon.counts)

kallisto.counts<- tx.kallisto$counts
kallisto.counts<- as.data.frame(kallisto.counts)
kallisto.counts$gene_name<- rownames(kallisto.counts)

HTSeq.salmon.counts<- inner_join(counts.df, salmon.counts)
HTSeq.salmon.kallisto.counts<- inner_join(HTSeq.salmon.counts, kallisto.counts) %>% dplyr::select(-Length)

### counts correlation for WT sample only

library(ggplot2)

ggplot(HTSeq.salmon.kallisto.counts,aes(x=log2(HTseq.WT+1), y=log2(salmon.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=15, y=20, label= "spearman cor = 0.86") + ggtitle("HTSeq counts versus salmon counts")

cor(log2(HTSeq.salmon.kallisto.counts$HTseq.WT+1), log2(HTSeq.salmon.kallisto.counts$salmon.WT+1), method="spearman")


ggplot(HTSeq.salmon.kallisto.counts,aes(x=log2(HTseq.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=15, y=20, label= "spearman cor = 0.83") + ggtitle("HTSeq counts versus kallisto counts") 

cor(log2(HTSeq.salmon.kallisto.counts$HTseq.WT+1), log2(HTSeq.salmon.kallisto.counts$kallisto.WT+1), method="spearman")

ggplot(HTSeq.salmon.kallisto.counts,aes(x=log2(salmon.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=15, y=20, label= "spearman cor = 0.94") + ggtitle("salmon counts versus kallisto counts") 

cor(log2(HTSeq.salmon.kallisto.counts$salmon.WT+1), log2(HTSeq.salmon.kallisto.counts$kallisto.WT+1), method="spearman")


### merge the TPM table together from HTSeq, salmon and kallisto
head(TPM.from.HTSeq)
TPM.from.HTSeq<- as.data.frame(TPM.from.HTSeq)
TPM.from.HTSeq$gene_name<- rownames(TPM.from.HTSeq)

salmon.TPM<- tx.salmon$abundance
salmon.TPM<- as.data.frame(salmon.TPM)
salmon.TPM$gene_name<- rownames(salmon.TPM)

kallisto.TPM<- tx.kallisto$abundance
kallisto.TPM<- as.data.frame(kallisto.TPM)
kallisto.TPM$gene_name<- rownames(kallisto.TPM)

HTSeq.salmon.TPM<- inner_join(TPM.from.HTSeq, salmon.TPM) 
## re-order the columns
HTSeq.salmon.kallisto.TPM<- inner_join(HTSeq.salmon.TPM, kallisto.TPM) %>% dplyr::select(c(4,1:3,5:10)) 

## plot TPM correlation
ggplot(HTSeq.salmon.kallisto.TPM,aes(x=log2(HTseq.WT+1), y=log2(salmon.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=12, y=15, label= "spearman cor = 0.84") + ggtitle("HTSeq TPM versus salmon TPM")

cor(log2(HTSeq.salmon.kallisto.TPM$HTseq.WT+1), log2(HTSeq.salmon.kallisto.TPM$salmon.WT+1), method="spearman")

ggplot(HTSeq.salmon.kallisto.TPM, aes(x=log2(HTseq.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=12, y=15, label= "spearman cor = 0.83") + ggtitle("HTSeq TPM versus kallisto TPM")

cor(log2(HTSeq.salmon.kallisto.TPM$HTseq.WT+1), log2(HTSeq.salmon.kallisto.TPM$kallisto.WT+1), method="spearman")

ggplot(HTSeq.salmon.kallisto.TPM,aes(x=log2(salmon.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=12, y=15, label= "spearman cor = 0.95") + ggtitle("salmon TPM versus kallisto TPM")

cor(log2(HTSeq.salmon.kallisto.TPM$salmon.WT+1), log2(HTSeq.salmon.kallisto.TPM$kallisto.WT+1), method="spearman")
6 5 4 3 2 1

Conclusions

Kallisto and Salmon seems to have very tight correlation. STAR-HTSeq based RNA-seq pipeline is a bit off when correlated with the other two. Note that I did not do adaptor and low qualit base trimming, STAR may discard some informative reads. In contrast, salmon (and kallisto?) is very robust to quality and adapter trimming. If you have other thoughs, please comment :) I am sorry that my analysis is not fully reproducible as I can not share the data now.
by Rob
One of the benefits of the quasi-mapping approach taken by sailfish is that it is rather robust to quality and adapter trimming. One way you can assess this is by looking at the mapping rate (i.e. the fraction of reads that are mapped) --- which appears in the comments of the main output file "quant.sf". You may be able to recover a small fraction of extra reads with quality / adapter trimming, but generally this is not necessary for Sailfish to map reads accurately for quantification purposes.

UPDATE 07/24/16 Robs suggested that "taking the aggregate feature length (i.e. the gene length) and then correcting it may not always yield the same result as correcting the feature length (i.e. the transcript lengths) and then aggregating them to the gene level." see comment below.
I then re-calculated TPM from counts using the method he suggested, and the correlation looks better, but spearman correlation does not improve.
get rid of the digits (gene version) in the end for the gene names (gencode v19)
cat STAR_WT-30393468_htseq.cnt| sed -E 's/\.[0-9]+//' > WT_htseq.cnt
transcript to gene mapping file:
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

Tx.ensemble <- transcripts(edb,
          columns = c("tx_id", "gene_id", "gene_name"),
          return.type = "DataFrame")
nrow(Tx.ensemble)

tx2gene<- Tx.ensemble[,c(1,2)]

write.table(tx2gene, "~/playground/gene-level-effective-length/tx2gene.tsv", col.names = F, row.names = F, sep="\t", quote=F)
add #! /bin/env python to the head of the python script Rob wrote.
#! /bin/env python
import argparse
import pandas as pd
import numpy as np
import sys

def main(args):
    gtable = pd.read_table(args.ginput).set_index('Name')
    ttable = pd.read_table(args.tinput).set_index('Name')
    tgmap = pd.read_table(args.tgmap, names=['t', 'g']).set_index('t')
    gene_lengths = {}
    j = 0

    # Map over all gene groups (a gene and its associated transcripts)
    for g, txps in tgmap.groupby('g').groups.iteritems():
        if j % 500 == 1:
            print("Processed {} genes".format(j))
        j += 1
        # The set of transcripts present in our salmon index
        tset = []
        for t in txps:
            if t in ttable.index:
                tset.append(t)
        # If at least one of the transcripts was present
        if len(tset) > 0:
            # The denominator is the sum of all TPMs
            totlen = ttable.loc[tset,'TPM'].sum()
            # Turn the relative TPMs into a proper partition of unity
            if totlen > 0:
                tpm_fracs = ttable.loc[tset, 'TPM'].values / ttable.loc[tset,'TPM'].sum()
            else:
                tpm_fracs = np.ones(len(tset)) / float(len(tset))
            # Compute the gene's effective length as the abundance-weight
            # sum of the transcript lengths
            elen = 0.0
            for i,t in enumerate(tset):
                elen += tpm_fracs[i] * ttable.loc[t, 'EffectiveLength']
            gene_lengths[g] = elen

    # Give the table an effective length field
    gtable['EffectiveLength'] = gtable.apply(lambda r : gene_lengths[r.name] if r.name in gene_lengths else 1.0, axis=1)
    # Write it to the output file
    gtable.to_csv(args.output, sep='\t', index_label='Name')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Compute gene-level effective lengths")
    parser.add_argument('--ginput', type=str, help='gene level input table')
    parser.add_argument('--tinput', type=str, help='transcript level input table')
    parser.add_argument('--tgmap', type=str, help='transcript -> gene mapping')
    parser.add_argument('--output', type=str, help='output table with extra column')

    args = parser.parse_args()
    main(args)
chmod u + x glef.py

/glef.py --ginput WT_htseq.cnt --tinput WT_quant.sf --tgmap tx2gene.tsv --output WT_htseq_effective_length.tsv
compare effecitve length derived from transcript level and salmon output gene level
WT.count.df<- read.table("~/playground/gene-level-effective-length/WT_htseq_effective_length.tsv",
                         sep ="\t", header=T, stringsAsFactors = F)

head(counts.df)

counts.effective.df<- left_join(counts.df, WT.count.df, by=c("gene_name"="Name"))

## scatter plot of effective length from gene level output of salmon and effective length derived from transcript level salmon output

plot(log2(counts.effective.df$EffectiveLength.x+1), log2(counts.effective.df$EffectiveLength.y+1), xlab= "log2 effective gene length from salmon gene level output", ylab="log2 effective gene length derived from transcript level")

abline(a=0, b=1, col="red")
## convert to TPM use transcript derived effective gene length
WT_count<- as.matrix(counts.effective.df[,2])

count2Tpm <- function(counts, effLen)
{
    rate <- log(counts) - log(effLen)
    denom <- log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
}

WT.TPM.from.HTSeq<- count2Tpm(WT_count, counts.effective.df$EffectiveLength.y)
WT.TPM.from.HTSeq<- cbind(as.data.frame(WT.TPM.from.HTSeq), gene_name=counts.effective.df$gene_name) 
colnames(WT.TPM.from.HTSeq)<- c("WT.htseq.TPM", "gene_name")

head(TPM.from.HTSeq)
dim(salmon.TPM)

WT.TPM.HTseq.salmon<- left_join(WT.TPM.from.HTSeq, salmon.TPM)
WT.TPM.HTseq.salmon.kallisto<- left_join(WT.TPM.HTseq.salmon, kallisto.TPM)

## HTSeq TPM versus salmon TPM
ggplot(WT.TPM.HTseq.salmon.kallisto,aes(x=log2(WT.htseq.TPM+1), y=log2(salmon.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=13, y=20, label= "spearman cor = 0.85") + ggtitle("HTseq TPM versus salmon TPM") 

cor(log2(WT.TPM.HTseq.salmon$WT.htseq.TPM+1), log2(WT.TPM.HTseq.salmon$salmon.WT+1), method="spearman")
## HTSeq TPM versus kallisto TPM

ggplot(WT.TPM.HTseq.salmon.kallisto,aes(x=log2(WT.htseq.TPM+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=13, y=20, label= "spearman cor = 0.80") + ggtitle("HTseq TPM versus kallisto TPM") 

cor(log2(WT.TPM.HTseq.salmon.kallisto$WT.htseq.TPM+1), log2(WT.TPM.HTseq.salmon.kallisto$kallisto.WT+1), method="spearman")

7 comments:

  1. I'm a bit surprised by the correlation between HTSeq and kallisto/salmon. I would have thought it would have been higher. Its hard to know which one is "right", however, since kalliston and salmon are both based on k-mer idea and so not as surprising that they would be more correlated than two algorithms using different methods. I would be curious to add one more different method, such as Cufflinks, or to compare with samples that are run on microarrays.

    ReplyDelete
    Replies
    1. I am a bit surprised as well. As I noted, it is not a benchmarking experiment, otherwise I have to simulate the RNAseq reads. Another variable is that the STAR-HTseq was done by my colleague. He used gencode v19 as annotation while I am using ensemble v75. I'd like to include cufflinks as well, but I have no experience with it. I heard that it takes very long time to map with Tophat.

      Delete
    2. One thing that strikes me about these differences is that there seems to be much more agreement between Salmon/Kallisto and STAR-HTseq in terms of read counts than in terms of TPM. While this may be due, in part, to the extra correlation added by the fact that raw counts do not account for feature "length", this also suggests potential differences in the effective length values being used to derive TPMs from counts.

      One potential confound here is that taking the aggregate feature length (i.e. the gene length) and then correcting it may not always yield the same result as correcting the feature length (i.e. the transcript lengths) and then aggregating them to the gene level. Consider, for example, a transcript shorter than the mean fragment length (which therefore goes uncorrected — at least using the model above) that contributes to a gene whose aggregate length is longer than the mean fragment length (and thus it is corrected). One way to try to mitigate this effect is to take the transcript-level effective lengths provided by e.g. salmon, and then compute gene level "effective" lengths directly. The way to do this is simply to take a weighted combination of the transcript effective lengths, where the weighting is given by the within-gene relative abundance (i.e. TPM). For example, consider a gene with three transcripts with tpms TPM_1, TPM_2, and TPM_3 and *effective length* l_1, l_2, l_3. Then, let S = TPM_1 + TPM_2 + TPM_3. You could compute the gene level *effective* length as [(TPM_1 / S) * l_1] + [(TPM_2 / S) * l_2] + [(TPM_3 / S) * l_3]. I would be curious how this affects the correlations you're seeing under the TPM measure. I'd be more than willing to throw together a quick python / R script to compute these values, but I probably wouldn't be able to get to it until the end of the week.

      This being said, it would be interesting to look at the specific genes where these methods most disagree, to understand what is the main driver of differences (at least in this data).

      Delete
    3. Hi Rob, thanks for your insights. I am aware of the point you raised: I just borrowed the gene-level effective length from salmon when convert HTSeq counts to TPM, which is not optimal. I would like to try your script (please take your time) to calculate the gene-level effective length from the transcript level.

      In general, kallisto and salmon TPM correlates well, but I do see many genes salmon have relative high TPM while kallisto detected 0 TPM and vice versa. It would be interesting to see why. I will have updates once I dive into more details. Thanks!

      Delete
    4. Hi Ming,

      Since I'm still much more efficient in Python than R, I've put together the type of script I mention above here (https://gist.github.com/rob-p/fe3e5469a96b462fc4d092c97edd41bf). It takes as input 3 tsv files, one is the transcript-level quant from salmon, on is the file with gene names and counts, and one is a transcript to gene mapping (where the txp name is in the first column and the gene name is in the second). It computes the gene-level effective lengths as I describe above, and then augments the input gene-level count file with an "EffectiveLength" column and writes the output to a specified file.

      For the purposes of your analysis, it might be easiest to just convert my Python + Pandas to R, but this script should at least let you get an idea about how much of the discrepancy between methods is coming from the counts vs. the effective lengths. Also, I make some assumptions: First, the transcript-to-gene file has no headers; second, in the gene count tsv file, the header for the column with the gene names is 'Name' --- otherwise, it can contain any other columns you want. Let me know if you have any trouble running this.

      Delete
    5. Hi Rob, thanks for the script. I updated the post. effective gene length from the salmon gene level output indeed is different from the effective gene length derived from transcript level effective length.
      The TPM correlation looks better (closer to y=x line), but spearman correlation does not improve.

      Delete
  2. PS Thanks for this analysis it is really nice to compare

    ReplyDelete