The script was copied from http://genomespot.blogspot.com/2016/12/msigdb-gene-sets-for-mouse.html by Mark Ziemann.
save the below script to convert_gmt_perline.sh
:
This function translate the human gene symbol to mouse gene symbol for each line.
#!/bin/bash
# author: Mark Ziemann
# modified by: Ming Tang
set -eu
set -o pipefail
line=$1
NAME_DESC=`echo $line | cut -d ' ' -f-2`
GENES=`echo $line | cut -d ' ' -f3- \
| tr ' ' '\n' | sort -uk 1b,1 \
| join -1 1 -2 1 - \
<(cut -f3,5 mouse2hum_biomart_ens87.txt \
| sed 1d | awk '$1!="" && $2!=""' \
| sort -uk 1b,1) | cut -d ' ' -f2 \
| sort -u | tr '\n' '\t' \
| sed 's/\t$/\n/'`
echo $NAME_DESC $GENES | tr ' ' '\t'
Downloaded from the same site. https://gist.github.com/crazyhottommy/dde3c10cc367df715ef5c7c5f353b5f5
less -S mouse2hum_biomart_ens87.txt | head | csvtk csv2md -t
Gene ID | Transcript ID | Human associated gene name | Human gene stable ID | Associated Gene Name |
---|---|---|---|---|
ENSMUSG00000064336 | ENSMUST00000082387 | mt-Tf | ||
ENSMUSG00000064337 | ENSMUST00000082388 | mt-Rnr1 | ||
ENSMUSG00000064338 | ENSMUST00000082389 | mt-Tv | ||
ENSMUSG00000064339 | ENSMUST00000082390 | mt-Rnr2 | ||
ENSMUSG00000064340 | ENSMUST00000082391 | mt-Tl1 | ||
ENSMUSG00000064341 | ENSMUST00000082392 | MT-ND1 | ENSG00000198888 | mt-Nd1 |
ENSMUSG00000064342 | ENSMUST00000082393 | mt-Ti | ||
ENSMUSG00000064343 | ENSMUST00000082394 | mt-Tq | ||
ENSMUSG00000064344 | ENSMUST00000082395 | mt-Tm |
download from http://software.broadinstitute.org/gsea/msigdb/collections.jsp#H
less -S h.all.v6.1.symbols.gmt
cat h.all.v6.1.symbols.gmt | parallel -k -j 5 './convert_gmt_perline.sh {}' > h.all.v6.1.symbols_mouse.gmt
As I mentioned in my previous blog post, one can run GSEA in two modes:
- input the raw expression level of all samples.
- pre-rank the genes and use this ranked gene list for GSEA.
Even if you go with option 1, GSEA internally rank the gene list by some metrics. The default is signal to noise ratio: (uA - uB)/sigmaA + sigmaB
.
As GSEA was designed for array data, for RNAseq data, what value (RPKM, rlog counts, or TPM?) to input to GSEA is still not well understood. see Q&A from GSEA. They recommend using the pre-rank method for RNAseq data.
To generate a pre-ranked gene list(for RNAseq, go from raw counts to DESeq2):
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = countmat,
colData = coldata,
design = ~ condition)
dds
dds <- DESeq(dds)
## I have 20 samples for each condition: knockout gene X, and WT gene X
## KO vs WT
dds$condition <- factor(dds$condition, levels = c("KO","WT"))
res <- results(dds, contrast=c("condition", "KO", "WT"))
resultsNames(dds)
res$ENSEMBL<- rownames(res)
# I used ENSEMBL gene id for the matrix, and annotate with gene symbols.
res_anno<- as.data.frame(res) %>% left_join(gene_symbols, c("ENSEMBL" = "gene_id"))
res_pre_rank<- sign(res_anno$log2FoldChange) * -log10(res_anno$pvalue)
# you will need the gene symbol, and suffix the file with rnk for GSEA to recognize
write_tsv(data.frame(Name = res_anno$gene_name, metric = res_pre_rank), "KO_vs_WT_pre_rank.rnk")
Now, you can load the rnk
file into GSEA and do the analysis.
We rank the genes by sign of the fold change times the p-value (we told DESeq2 to compare KO vs WT, if a fold change is positive, that means the gene is up-regulated in KO), so the genes on the top (or left) is the genes with higher expression value in the KO
group, while the genes on the bottom (or right) are the genes with lower expression value in KO
group.
How to read the figure ?
The X-axis is all your genes in the expriment (~ 20,000 in this case) pre-ranked by your metric. each black bar is the gene in this gene set(pathway). You have an idea where are the genes located in the pre-ranked list.
Enrichement Score is calcuated by some metric that ES is positive if the gene set is located in the top of the pre-ranked gene list. ES is negative if the gene set is located in the bottom of the pre-ranked gene list.
We see glycolisis is positively co-related with KO
.
IL6_JAK_SATA3 signaling is positively co-related with WT
.
This is wonderful Ming, great write-up!
ReplyDeleteThanks Mark! many of those are from your site :)
DeleteReally great...........Nice and easy explanation. Thanks @ Tommy Tang
DeleteHow do you deal with non-one-to-one gene mapping?
ReplyDeleteOnce someone converts to become a customer, several factors can affect the after-sales relationship. Your customer service, communications, and order fulfillment will determine customer satisfaction levels. Ultimately, if you get it right, you can encourage more customers to come back, delivering higher Customer Lifetime Value (CLV) for your business.
ReplyDeletehttps://ppcexpo.com/blog/awareness-consideration-conversion
Murree's natural beauty, characterized by lush green hills, dense forests, and breathtaking mountain vistas.
ReplyDeleteLuxury Cottages in Murree