
Tuesday, July 19, 2016

several databases I want to share for RNA-seq data

Recently, several papers have focused on batch re-processing public RNA-seq data sets, either on local clusters or with remote cloud computing. I will put the links below:

1.  A cloud-based workflow to quantify transcript-expression levels in public cancer compendia. One can download the TCGA and CCLE RNA-seq data at the transcript level, in counts and TPM units.

It used GENCODE v24 as the reference and kallisto for quantification.

2.  Rapid and efficient analysis of 20,000 RNA-seq samples with Toil. Access the data here. Note that the data are mapped to the hg38 reference genome with GENCODE v23 as annotation. TCGA/ICGC/GTEx and CCLE data.


3.  intropolis is a list of exon-exon junctions found across 21,504 human RNA-seq samples on the Sequence Read Archive (SRA), derived from spliced read alignment to hg19 with Rail-RNA.


4. RESTful RNA-seq Analysis API. A simple RESTful API to access analysis results of all public RNA-seq data for nearly 200 species in the European Nucleotide Archive.


5. ExpressionAtlas Bioconductor package: this package is for searching for datasets in the EMBL-EBI Expression Atlas and downloading them into R for further analysis. Each Expression Atlas dataset is represented as a SimpleList object with one element per platform. Sequencing data are contained in a SummarizedExperiment object, while microarray data are contained in an ExpressionSet or MAList object. (A short usage sketch in R follows this list.)


6. Bgee, as suggested in the comments.

7. The Digital Expression Explorer (DEE) is a repository of digital gene expression profiles mined from public RNA-seq data sets. These data are obtained from the NCBI Sequence Read Archive.
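As a quick illustration of item 5, here is a minimal R sketch of querying Expression Atlas (the search term is just a hypothetical example; check the package vignette for the current function signatures):

library(ExpressionAtlas)

## search for experiments matching a keyword and a species
atlasRes <- searchAtlasExperiments(properties = "leukemia", species = "homo sapiens")

## download the first matching experiment as a SimpleList, one element per dataset
allExps <- getAtlasData(atlasRes$Accession[1])

## RNA-seq data come back as a SummarizedExperiment
rnaseqExp <- allExps[[1]]$rnaseq
head(assays(rnaseqExp)$counts)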





Monday, July 11, 2016

Comparing salmon, kallisto and STAR-HTSeq RNA-seq quantification

Why am I testing these again?

I just got several RNA-seq data sets to play with, and I think it is a good time for me to get my hands wet with these RNA-seq quantification tools (especially the alignment-free ones) and form a personal impression of how the different tools perform. I am not doing benchmarking, as one should simulate the RNA-seq reads, e.g. with polyester, to have a ground truth.
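For reference, here is a minimal sketch of what a polyester simulation looks like (not run here; transcripts.fa is a hypothetical fasta containing three transcripts):

library(polyester)

## fold changes between two groups, one row per transcript in the fasta
fold_changes <- matrix(c(1, 1, 2, 1, 1, 0.5), ncol = 2)

simulate_experiment("transcripts.fa",
                    reads_per_transcript = 300,
                    num_reps = c(3, 3),          ## 3 replicates per group
                    fold_changes = fold_changes,
                    outdir = "simulated_reads")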

UPDATE 08/22/16: a nice post by Tom Smith compared alignment-dependent and alignment-independent RNA-seq quantification using simulated data: why you should use alignment independent quantification for RNA-seq.
Choosing alignment-based tools (such as tophat, STAR, bowtie, HISAT) or alignment-free ones depends on the purpose of your study. Salmon and kallisto require the reads to "pseudo-map" to the transcriptome, so one has to provide a fasta file containing all the transcripts you want to quantify. Therefore, if you want to find novel transcripts, you should probably go with the alignment-based methods. It was also shown recently that Widespread intron retention diversifies most cancer transcriptomes. If you want to do similar things, you need to use mappers with the genome (not the transcriptome) as a reference.
kallisto can output a pseudo-BAM, which can be useful for some people. Salmon will have the same functionality in the next release, according to Rob.
Now, let's begin my analysis.
I skipped quality trimming for this analysis, as it has been shown that trimming is not necessarily a good thing:
Trimming of sequence reads alters RNA-Seq gene expression estimates
For assembly: On the optimal trimming of high-throughput mRNA sequence data.
However, as a rule of thumb, one should always check the read quality of any sequencing data set, e.g. with RSeQC: An RNA-seq Quality Control Package.

Testing Salmon for RNA-seq quantification

Download the binary (v0.6.0):
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.6.0/SalmonBeta-0.6.0_DebianSqueeze.tar.gz
tar xvzf SalmonBeta-0.6.0_DebianSqueeze.tar.gz
Read the documentation on how to use it here
To use Salmon in quasi-mapping-based mode, you first have to build a Salmon index for your transcriptome. Ensembl release 75 is the last release for GRCh37 (hg19); starting from release 76, it is GRCh38 (hg38).
Download the hg19 version of cDNA and non-coding RNA fasta:
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz  
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh37.75.ncrna.fa.gz
## merge together 

gunzip -c Homo_sapiens.GRCh37.75.cdna.all.fa.gz Homo_sapiens.GRCh37.75.ncrna.fa.gz > Homo_sapiens.GRCh37.75.cdna.ncrna.fa
First, you run the Salmon indexer:
Default index --- The quasi index has been made the default type. This means that it is no longer necessary to provide the --type option to the index command. The fmd index remains enabled, but may be removed in a future version.
salmon index -t Homo_sapiens.GRCh37.75.cdna.ncrna.fa -i Homo_sapiens.GRCh37.75_quasi_index 
It finished in a little over 10 mins.
quantify the transcripts:
The RNA-seq library I am analyzing is single-end and stranded, with reads from the reverse strand, so I specified -l SR. Read the salmon documentation for the different library types; it has a nice figure illustrating them.
salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 3R_S18_L002_R1_001.fastq.gz) -o 3R_transcripts_quant

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 50R_S19_L002_R1_001.fastq.gz) -o 50R_transcripts_quant

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat WT_S17_L002_R1_001.fastq.gz) -o WT_transcripts_quant
The quantification finishes within minutes!

quantification at the gene level

Salmon can also give gene-level quantification as long as you feed it a GTF file. In addition, I specify the mean fragment length as 200 bp (the default) and the standard deviation as 20 (default 80). Details here. For paired-end RNA-seq, the fragment length distribution can be inferred from the fastq files, but for single-end data it needs to be specified.
wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 3R_S18_L002_R1_001.fastq.gz) -o 3R_transcripts_quant -g ~/annotations/Homo_sapiens.GRCh37.75.gtf --fldMean 200 --fldSD 20

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat 50R_S19_L002_R1_001.fastq.gz) -o 50R_transcripts_quant -g ~/annotations/Homo_sapiens.GRCh37.75.gtf --fldMean 200 --fldSD 20

salmon quant -p 10 -i ~/annotations/Homo_sapiens.GRCh37.75_quasi_index -l SR -r <(zcat WT_S17_L002_R1_001.fastq.gz) -o WT_transcripts_quant -g ~/annotations/Homo_sapiens.GRCh37.75.gtf --fldMean 200 --fldSD 20
It is recommended to use tximport to get the gene-level quantification. I asked about the difference between tximport and salmon quant -g. The developer of salmon, @Rob Patro, answered:
Main diffs I can think of (1) in R
(2) integrated with DESeq2
(3) Can derive multi-sample effective gene lengths

Testing kallisto for quantification

Build the index first (I am using v0.43.0):
kallisto index -i Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx Homo_sapiens.GRCh37.75.cdna.ncrna.fa
It took me around 15 mins to build the index.
quantification:
-l, --fragment-length=DOUBLE Estimated average fragment length
-s, --sd=DOUBLE Estimated standard deviation of fragment length (default: value is estimated from the input data)
In the case of single-end reads, the -l option must be used to specify the average fragment length. Typical Illumina libraries produce fragment lengths ranging from 180–200 bp but it’s best to determine this from a library quantification with an instrument such as an Agilent Bioanalyzer.
See this question by James on the Google group:
Common values for single end reads are insert length 200 and sd 20. If you have any better information, like the person who prepped the library or better yet, data from bioanalyzer that will of course be better.
kallisto can take .gz files.
kallisto quant -t 10 -i ~/annotations/Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx -o 3R_kaliso_output --single -l 200 -s 20 3R_S18_L002_R1_001.fastq.gz

kallisto quant -t 10 -i ~/annotations/Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx -o 50R_kaliso_output --single -l 200 -s 20 50R_S19_L002_R1_001.fastq.gz

kallisto quant -t 10 -i ~/annotations/Homo_sapiens.GRCh37.75.cdna.ncrna.kalisto.idx -o WT_kaliso_output --single -l 200 -s 20 WT_S17_L002_R1_001.fastq.gz
Finished in ~6 mins. Again, blazingly fast, just like Salmon.

counts versus TPM/RPKM/FPKM

I will quote from this blog post explaining the technical differences among RPKM, FPKM and TPM.
These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:
  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq, where two reads can correspond to a single fragment, so FPKM counts fragments rather than reads.
TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:
  1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
  2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
  3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.
So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.
When you use TPM, the sum of all TPMs in each sample is the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sums of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.
The following R functions convert among these units (working in log space for numerical stability):
countToTpm <- function(counts, effLen)
{
    ## rate: log of reads per base of effective length
    rate <- log(counts) - log(effLen)
    ## denom: log of the sum of all rates, so that TPMs sum to 1e6
    denom <- log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
}

countToFpkm <- function(counts, effLen)
{
    ## N: total read count in the sample
    N <- sum(counts)
    exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}

fpkmToTpm <- function(fpkm)
{
    ## TPM is FPKM rescaled so that each sample sums to 1e6
    exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}

countToEffCounts <- function(counts, len, effLen)
{
    ## scale counts from effective length back to full feature length
    counts * (len / effLen)
}

################################################################################
# An example
################################################################################
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)
countDf <- data.frame(count = cnts, length = lens)

# assume a mean(FLD) = 203.7
countDf$effLength <- countDf$length - 203.7 + 1
countDf$tpm <- with(countDf, countToTpm(count, effLength))
countDf$fpkm <- with(countDf, countToFpkm(count, effLength))
with(countDf, all.equal(tpm, fpkmToTpm(fpkm)))
countDf$effCounts <- with(countDf, countToEffCounts(count, length, effLength))
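A quick sanity check of the TPM property mentioned above, using the toy countDf from this example: the TPM column sums to exactly one million, while the FPKM column generally does not.

sum(countDf$tpm)   ## exactly 1e6 by construction
sum(countDf$fpkm)  ## generally not 1e6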
My colleague @samir processed the same data using the STAR-HTSeq pipeline, which gives the raw counts for each gene. He is very skeptical about transcript-level quantification and would rather focus on the gene level for now. What adds a bit of complexity is that he used GENCODE v19 as annotation, where each gene name has a dot plus digits at the end, e.g. ENSG00000000003.10 versus ENSG00000000003 in the GTF files downloaded from Ensembl. I will need to get rid of the trailing digits.
I will need to convert the raw counts from the STAR-HTSeq pipeline to TPM for comparison, as Salmon and kallisto output TPM and estimated counts. Read the post: convert counts to TPM. Kamil Slowikowski wrote a function to convert counts to TPM, and the function involves an effective length for the features.

what is the effective length of a feature?

It is quite mathematical, but the general idea is:
If we take the fragment length to be fixed, then the effective length is how many fragments can occur in the transcript. This turns out to be length - frag_len + 1. The number of fragments coming from a transcript will be proportional to this number, regardless of whether you sequenced one or both ends of the fragment. In turn, when it comes to probabilistically assigning reads to transcripts, the effective length plays a similar role again. Thus for short transcripts, there can be quite a difference between two fragment lengths. To go back to your example: if you have a transcript of length 310, your effective length is 10 (if the fragment length is 300) or 160 (if the fragment length is 150), which explains the discrepancy you see.
From @Rob
The effective length is computed by using the fragment length distribution to determine the effective number of positions that can be sampled on each transcript. You can think of this as convolving the fragment length distribution with the characteristic function (the function that simply takes a value of 1) over the transcript. For example if we observe fragments of length 50 --- 1000, a position more than 1000 bases from the end of the transcript will contribute a value of 1 to the effective length, while a position 150 bases will contribute a value of F(150), where F is the cumulative distribution function of the fragment length distribution. For single end data, where we can't learn an empirical FLD, we use a gaussian whose mean and standard deviation can be set with --fldMean and --fldSD respectively.
From Harold Pimentel's post above. He is in Lior Pachter's group.
Effective length refers to the number of possible start sites a feature could have generated a fragment of that particular length. In practice, the effective length is usually computed as

    effectiveLength = featureLength - mean(FLD) + 1

where mean(FLD) is the mean of the fragment length distribution, which was learned from the aligned reads. If the abundance estimation method you’re using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.
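To tie the two descriptions together, here is a small R sketch of my own (not from the posts above) that computes the effective length both with the simple mean-shift formula and with the CDF-weighted sum Rob describes, assuming a Gaussian fragment length distribution:

## simple rule: length - mean(FLD) + 1
eff_len_simple <- function(tx_len, fld_mean) {
  pmax(tx_len - fld_mean + 1, 1)
}

## CDF-weighted version: each start position contributes F(d), where d is
## its distance from the transcript end and F is the FLD cumulative
## distribution function
eff_len_cdf <- function(tx_len, fld_mean = 200, fld_sd = 20) {
  d <- 1:tx_len
  sum(pnorm(d, mean = fld_mean, sd = fld_sd))
}

eff_len_simple(2000, 200)   ## 1801
eff_len_cdf(2000, 200, 20)  ## ~1800.5

The two agree well for transcripts much longer than the fragment length; the difference matters mostly for short transcripts.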

R scripts to convert HTseq counts to TPM

library(dplyr)
setwd("/Volumes/mdarisngc03.mdanderson.edu/scratch/RNA-seq-26357338/")

list.files()

counts.3R<- read.table("STAR_3R-30390482_htseq.cnt", sep="\t", header=F, stringsAsFactors = F)
counts.50R<- read.table("STAR_50R-30393469_htseq.cnt", sep="\t", header=F, stringsAsFactors = F)
counts.WT<- read.table("STAR_WT-30393468_htseq.cnt", sep="\t", header=F, stringsAsFactors = F)

counts.df<- cbind(counts.WT, counts.3R, counts.50R)
counts.df<- counts.df[, c(1,2,4,6)]
names(counts.df)<- c("gene_name", "HTseq.WT", "HTseq.3R", "HTseq.50R")

## get rid of the digits in the end of the gene_name
counts.df<- counts.df %>% mutate(gene_name = gsub("\\.[0-9]+", "", gene_name)) 
To convert the raw counts from HTSeq (gencode v19 as annotation), I will need the length of each gene. Although one can compute the gene length from the gtf files, the gene-level output of Salmon has already computed it for me. I will just need to "borrow" from there.
WT.Salmon.gene.quant<- read.table("./WT-30393468/WT_transcripts_quant/quant.genes.sf", sep="\t",
                                  header=T, stringsAsFactors = F)
gene.length<- dplyr::select(WT.Salmon.gene.quant, c(Name, Length))

counts.df<- inner_join(counts.df, gene.length, by =c("gene_name"="Name")) 
The counts_to_tpm function below is from https://www.biostars.org/p/171766/:
counts_to_tpm <- function(counts, featureLength, meanFragmentLength) {

  # Ensure valid arguments.
  stopifnot(length(featureLength) == nrow(counts))
  stopifnot(length(meanFragmentLength) == ncol(counts))

  # Compute effective lengths of features in each library.
  effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
    featureLength - meanFragmentLength[i] + 1
  }))

  # Exclude genes with length less than the mean fragment length.
  idx <- apply(effLen, 1, function(x) min(x) > 1)
  counts <- counts[idx,]
  effLen <- effLen[idx,]
  featureLength <- featureLength[idx]

  # Process one column at a time.
  tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
    rate = log(counts[,i]) - log(effLen[,i])
    denom = log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
  }))

  # Copy the column names from the original matrix.
  colnames(tpm) <- colnames(counts)
  return(tpm)
}

featureLength<- counts.df$Length

## fragment length is 200bp 
meanFragmentLength<- c(200, 200, 200)

counts<- as.matrix(counts.df[,c(2,3,4)])
rownames(counts)<- counts.df$gene_name

TPM.from.HTSeq<- counts_to_tpm(counts, featureLength, meanFragmentLength)

Import salmon and kallisto transcript-level quantification

Use tximport to import salmon and kallisto transcript level quantification
#source("https://bioconductor.org/biocLite.R")
#biocLite("EnsDb.Hsapiens.v75")

#create a tx2gene.txt table
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

Tx.ensemble <- transcripts(edb,
          columns = c("tx_id", "gene_id", "gene_name"),
          return.type = "DataFrame")
nrow(Tx.ensemble)

tx2gene<- Tx.ensemble[,c(1,2)]

samples<- c("WT-30393468", "3R-30390482", "50R-30393469")
salmon.dir<- c("WT_transcripts_quant", "3R_transcripts_quant", "50R_transcripts_quant")
kallisto.dir<- c("WT_kaliso_output", "3R_kaliso_output", "50R_kaliso_output")

kallisto.files <- file.path(samples, kallisto.dir, "abundance.tsv")
names(kallisto.files)<- c("kallisto.WT", "kallisto.3R", "kallisto.50R")

salmon.files<- file.path(samples, salmon.dir, "quant.sf")
names(salmon.files)<- c("salmon.WT", "salmon.3R", "salmon.50R")

all(file.exists(kallisto.files))
all(file.exists(salmon.files))

library(tximport)
library(readr)

tx.kallisto <- tximport(kallisto.files, type = "kallisto", tx2gene = tx2gene, 
                        reader = read_tsv, countsFromAbundance = "lengthScaledTPM")
tx.salmon <- tximport(salmon.files, type = "salmon", tx2gene = tx2gene, 
                      reader = read_tsv, countsFromAbundance = "lengthScaledTPM")
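tximport returns a list of matrices; a quick structural check (assuming the tximport version available at the time of writing):

names(tx.salmon)           ## "abundance" "counts" "length" "countsFromAbundance"
head(tx.salmon$abundance)  ## gene-level TPM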

compare STAR-HTseq, kallisto and salmon

## merge the counts table together from HTSeq, salmon and kallisto
head(counts.df)
salmon.counts<- tx.salmon$counts
salmon.counts<- as.data.frame(salmon.counts)
salmon.counts$gene_name<- rownames(salmon.counts)

kallisto.counts<- tx.kallisto$counts
kallisto.counts<- as.data.frame(kallisto.counts)
kallisto.counts$gene_name<- rownames(kallisto.counts)

HTSeq.salmon.counts<- inner_join(counts.df, salmon.counts)
HTSeq.salmon.kallisto.counts<- inner_join(HTSeq.salmon.counts, kallisto.counts) %>% dplyr::select(-Length)

### counts correlation for WT sample only

library(ggplot2)

ggplot(HTSeq.salmon.kallisto.counts,aes(x=log2(HTseq.WT+1), y=log2(salmon.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=15, y=20, label= "spearman cor = 0.86") + ggtitle("HTSeq counts versus salmon counts")

cor(log2(HTSeq.salmon.kallisto.counts$HTseq.WT+1), log2(HTSeq.salmon.kallisto.counts$salmon.WT+1), method="spearman")


ggplot(HTSeq.salmon.kallisto.counts,aes(x=log2(HTseq.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=15, y=20, label= "spearman cor = 0.83") + ggtitle("HTSeq counts versus kallisto counts") 

cor(log2(HTSeq.salmon.kallisto.counts$HTseq.WT+1), log2(HTSeq.salmon.kallisto.counts$kallisto.WT+1), method="spearman")

ggplot(HTSeq.salmon.kallisto.counts,aes(x=log2(salmon.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=15, y=20, label= "spearman cor = 0.94") + ggtitle("salmon counts versus kallisto counts") 

cor(log2(HTSeq.salmon.kallisto.counts$salmon.WT+1), log2(HTSeq.salmon.kallisto.counts$kallisto.WT+1), method="spearman")


### merge the TPM table together from HTSeq, salmon and kallisto
head(TPM.from.HTSeq)
TPM.from.HTSeq<- as.data.frame(TPM.from.HTSeq)
TPM.from.HTSeq$gene_name<- rownames(TPM.from.HTSeq)

salmon.TPM<- tx.salmon$abundance
salmon.TPM<- as.data.frame(salmon.TPM)
salmon.TPM$gene_name<- rownames(salmon.TPM)

kallisto.TPM<- tx.kallisto$abundance
kallisto.TPM<- as.data.frame(kallisto.TPM)
kallisto.TPM$gene_name<- rownames(kallisto.TPM)

HTSeq.salmon.TPM<- inner_join(TPM.from.HTSeq, salmon.TPM) 
## re-order the columns
HTSeq.salmon.kallisto.TPM<- inner_join(HTSeq.salmon.TPM, kallisto.TPM) %>% dplyr::select(c(4,1:3,5:10)) 

## plot TPM correlation
ggplot(HTSeq.salmon.kallisto.TPM,aes(x=log2(HTseq.WT+1), y=log2(salmon.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=12, y=15, label= "spearman cor = 0.84") + ggtitle("HTSeq TPM versus salmon TPM")

cor(log2(HTSeq.salmon.kallisto.TPM$HTseq.WT+1), log2(HTSeq.salmon.kallisto.TPM$salmon.WT+1), method="spearman")

ggplot(HTSeq.salmon.kallisto.TPM, aes(x=log2(HTseq.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=12, y=15, label= "spearman cor = 0.83") + ggtitle("HTSeq TPM versus kallisto TPM")

cor(log2(HTSeq.salmon.kallisto.TPM$HTseq.WT+1), log2(HTSeq.salmon.kallisto.TPM$kallisto.WT+1), method="spearman")

ggplot(HTSeq.salmon.kallisto.TPM,aes(x=log2(salmon.WT+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=12, y=15, label= "spearman cor = 0.95") + ggtitle("salmon TPM versus kallisto TPM")

cor(log2(HTSeq.salmon.kallisto.TPM$salmon.WT+1), log2(HTSeq.salmon.kallisto.TPM$kallisto.WT+1), method="spearman")

Conclusions

Kallisto and Salmon seem to have a very tight correlation. The STAR-HTSeq based RNA-seq pipeline is a bit off when correlated with the other two. Note that I did not do adapter and low-quality base trimming, so STAR may discard some informative reads. In contrast, salmon (and kallisto?) is very robust to quality and adapter trimming. If you have other thoughts, please comment :) I am sorry that my analysis is not fully reproducible, as I cannot share the data now.
by Rob
One of the benefits of the quasi-mapping approach taken by sailfish is that it is rather robust to quality and adapter trimming. One way you can assess this is by looking at the mapping rate (i.e. the fraction of reads that are mapped) --- which appears in the comments of the main output file "quant.sf". You may be able to recover a small fraction of extra reads with quality / adapter trimming, but generally this is not necessary for Sailfish to map reads accurately for quantification purposes.
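Following Rob's suggestion, a quick way to check the mapping rate is to scan the header comments of quant.sf (a sketch; the exact comment format may differ between salmon/sailfish versions):

## read the first header lines of quant.sf and look for the mapping rate
header <- readLines("WT_transcripts_quant/quant.sf", n = 20)
header[grepl("mapping rate", header, ignore.case = TRUE)]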

UPDATE 07/24/16: Rob suggested that "taking the aggregate feature length (i.e. the gene length) and then correcting it may not always yield the same result as correcting the feature length (i.e. the transcript lengths) and then aggregating them to the gene level." See the comment below.
I then re-calculated TPM from counts using the method he suggested; the correlation looks better, but the spearman correlation does not improve.
Get rid of the trailing digits (gene version) at the end of the gene names (GENCODE v19):
cat STAR_WT-30393468_htseq.cnt | sed -E 's/\.[0-9]+//' > WT_htseq.cnt
Transcript-to-gene mapping file:
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

Tx.ensemble <- transcripts(edb,
          columns = c("tx_id", "gene_id", "gene_name"),
          return.type = "DataFrame")
nrow(Tx.ensemble)

tx2gene<- Tx.ensemble[,c(1,2)]

write.table(tx2gene, "~/playground/gene-level-effective-length/tx2gene.tsv", col.names = F, row.names = F, sep="\t", quote=F)
add #!/usr/bin/env python to the head of the python script Rob wrote.
#!/usr/bin/env python
import argparse
import pandas as pd
import numpy as np
import sys

def main(args):
    gtable = pd.read_table(args.ginput).set_index('Name')
    ttable = pd.read_table(args.tinput).set_index('Name')
    tgmap = pd.read_table(args.tgmap, names=['t', 'g']).set_index('t')
    gene_lengths = {}
    j = 0

    # Map over all gene groups (a gene and its associated transcripts)
    for g, txps in tgmap.groupby('g').groups.iteritems():
        if j % 500 == 1:
            print("Processed {} genes".format(j))
        j += 1
        # The set of transcripts present in our salmon index
        tset = []
        for t in txps:
            if t in ttable.index:
                tset.append(t)
        # If at least one of the transcripts was present
        if len(tset) > 0:
            # The denominator is the sum of all TPMs
            totlen = ttable.loc[tset,'TPM'].sum()
            # Turn the relative TPMs into a proper partition of unity
            if totlen > 0:
                tpm_fracs = ttable.loc[tset, 'TPM'].values / ttable.loc[tset,'TPM'].sum()
            else:
                tpm_fracs = np.ones(len(tset)) / float(len(tset))
            # Compute the gene's effective length as the abundance-weight
            # sum of the transcript lengths
            elen = 0.0
            for i,t in enumerate(tset):
                elen += tpm_fracs[i] * ttable.loc[t, 'EffectiveLength']
            gene_lengths[g] = elen

    # Give the table an effective length field
    gtable['EffectiveLength'] = gtable.apply(lambda r : gene_lengths[r.name] if r.name in gene_lengths else 1.0, axis=1)
    # Write it to the output file
    gtable.to_csv(args.output, sep='\t', index_label='Name')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Compute gene-level effective lengths")
    parser.add_argument('--ginput', type=str, help='gene level input table')
    parser.add_argument('--tinput', type=str, help='transcript level input table')
    parser.add_argument('--tgmap', type=str, help='transcript -> gene mapping')
    parser.add_argument('--output', type=str, help='output table with extra column')

    args = parser.parse_args()
    main(args)
chmod u+x glef.py

./glef.py --ginput WT_htseq.cnt --tinput WT_quant.sf --tgmap tx2gene.tsv --output WT_htseq_effective_length.tsv
Compare the effective gene length derived from the transcript level with the effective length from salmon's gene-level output:
WT.count.df<- read.table("~/playground/gene-level-effective-length/WT_htseq_effective_length.tsv",
                         sep ="\t", header=T, stringsAsFactors = F)

head(counts.df)

counts.effective.df<- left_join(counts.df, WT.count.df, by=c("gene_name"="Name"))

## scatter plot of effective length from gene level output of salmon and effective length derived from transcript level salmon output

plot(log2(counts.effective.df$EffectiveLength.x+1), log2(counts.effective.df$EffectiveLength.y+1), xlab= "log2 effective gene length from salmon gene level output", ylab="log2 effective gene length derived from transcript level")

abline(a=0, b=1, col="red")
## convert to TPM using the transcript-derived effective gene length
WT_count<- as.matrix(counts.effective.df[,2])

count2Tpm <- function(counts, effLen)
{
    rate <- log(counts) - log(effLen)
    denom <- log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
}

WT.TPM.from.HTSeq<- count2Tpm(WT_count, counts.effective.df$EffectiveLength.y)
WT.TPM.from.HTSeq<- cbind(as.data.frame(WT.TPM.from.HTSeq), gene_name=counts.effective.df$gene_name) 
colnames(WT.TPM.from.HTSeq)<- c("WT.htseq.TPM", "gene_name")

head(TPM.from.HTSeq)
dim(salmon.TPM)

WT.TPM.HTseq.salmon<- left_join(WT.TPM.from.HTSeq, salmon.TPM)
WT.TPM.HTseq.salmon.kallisto<- left_join(WT.TPM.HTseq.salmon, kallisto.TPM)

## HTSeq TPM versus salmon TPM
ggplot(WT.TPM.HTseq.salmon.kallisto,aes(x=log2(WT.htseq.TPM+1), y=log2(salmon.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=13, y=20, label= "spearman cor = 0.85") + ggtitle("HTseq TPM versus salmon TPM") 

cor(log2(WT.TPM.HTseq.salmon$WT.htseq.TPM+1), log2(WT.TPM.HTseq.salmon$salmon.WT+1), method="spearman")
## HTSeq TPM versus kallisto TPM

ggplot(WT.TPM.HTseq.salmon.kallisto,aes(x=log2(WT.htseq.TPM+1), y=log2(kallisto.WT+1))) + geom_point() + geom_smooth(method="lm") + geom_abline(slope=1, intercept = 0, color="red") + annotate("text", x=13, y=20, label= "spearman cor = 0.80") + ggtitle("HTseq TPM versus kallisto TPM") 

cor(log2(WT.TPM.HTseq.salmon.kallisto$WT.htseq.TPM+1), log2(WT.TPM.HTseq.salmon.kallisto$kallisto.WT+1), method="spearman")