Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Tuesday, December 13, 2022

15 tools/papers for multi-sample multi-group single-cell RNAseq differential expression analysis

 1/  [An Empirical Bayes Method for Differential Expression Analysis of Single Cells with Deep Generative Models](https://www.biorxiv.org/content/10.1101/2022.05.27.493625v1) scVI-DE

2/  [muscat](http://www.bioconductor.org/packages/release/bioc/html/muscat.html)

3/  [Confronting false discoveries in single-cell differential expression](https://www.nature.com/articles/s41467-021-25960-2) "These observations suggest that, in practice, pseudobulk approaches provide an excellent trade-off between speed and accuracy for single-cell DE analysis." One needs to consider biological replicates; pseudobulk works well.

4/  [Modelling group heteroscedasticity in single-cell RNA-seq pseudo-bulk data](https://www.biorxiv.org/content/10.1101/2022.09.12.507511v1)

5/  [BSDE: barycenter single-cell differential expression for case–control studies](https://academic.oup.com/bioinformatics/article/38/10/2765/6554192?login=false)

6/ [distinct](http://www.bioconductor.org/packages/release/bioc/html/distinct.html) Both muscat and distinct are from Mark Robinson's group.

7/ [nebula](https://github.com/lhe17/nebula) https://www.biorxiv.org/content/biorxiv/early/2020/09/25/2020.09.24.311662.full.pdf

8/  [Fast identification of differential distributions in single-cell RNA-sequencing data with waddR](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab226/6207964) https://github.com/goncalves-lab/waddR

9/ [CoCoA-diff: counterfactual inference for single-cell gene expression analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02438-4)

10/ [Bias, robustness and scalability in single-cell differential expression analysis](https://www.nature.com/articles/nmeth.4612) From Mark Robinson's group.

11/ [Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2599-6) "We observed that current methods designed for scRNAseq data do not tend to show better performance compared to methods designed for bulk RNAseq data."

12/  [Tree-based Correlation Screen and Visualization for Exploring Phenotype-Cell Type Association in Multiple Sample Single-Cell RNA-Sequencing Experiments](https://www.biorxiv.org/content/10.1101/2021.10.27.466024v1) TreeCorTreat is an open source R package that tackles this problem by using a tree-based correlation screen to analyze and visualize the association between phenotype and transcriptomic features and cell types at multiple cell type resolution levels.

13/ [Quantifying the effect of experimental perturbations in single-cell RNA-sequencing data using graph signal processing](https://www.biorxiv.org/content/10.1101/532846v3) read this thread https://twitter.com/krishnaswamylab/status/1328876444810960896?s=27

14/  [Causal identification of single-cell experimental perturbation effects with CINEMA-OT](https://www.biorxiv.org/content/10.1101/2022.07.31.502173v1)

github https://github.com/vandijklab/CINEMA-OT

15/ [IDEAS: individual level differential expression analysis for single-cell RNA-seq data](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02605-1)

Sunday, December 11, 2022

32 resources for (to-be) faculty on salary negotiation, grant writing, funding, and lab management

1/ [Tips for negotiating salary and startup for newly-hired tenure-track faculty](https://dynamicecology.wordpress.com/2017/03/01/tips-for-negotiating-salary-and-startup-for-newly-hired-tenure-track-faculty/)

2/  [Creating accessibility in academic negotiations](https://www.sciencedirect.com/science/article/pii/S0968000422002870?dgcid=authord)

3/ [Ten Simple Rules to becoming a principal investigator](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007448)

4/  [applying for a faculty position](http://effortreport.libsyn.com/15-applying-for-a-faculty-position) by Roger Peng.

5/ [A list of publicly available grant proposals in the biological sciences](https://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/)

6/ [open grant](https://www.ogrants.org/) find other people's grants.

7/  [Early Career Funding, Awards, and Other Funding](https://docs.google.com/spreadsheets/d/1H1aj--VUYr7eMFk_T7x0Oh985LqbyyscXg2wAAevDnU/edit#gid=0) 

8/  https://ecrcentral.org/resources

9/  [Funding schemes for postdoctoral fellowships](https://asntech.github.io/postdoc-funding-schemes/)

10/  [Postdoctoral Funding Opportunities by Johns Hopkins](https://research.jhu.edu/rdt/funding-opportunities/postdoctoral/)

11/  [Early Career Funding Opportunities by Johns Hopkins](https://research.jhu.edu/rdt/funding-opportunities/early-career/)

12/   [The CommKit](http://mitcommlab.mit.edu/broad/use-the-commkit/) is a collection of guides to successful communication in the biological sciences, written by the BRCL Fellows.

13/  [writing in sciences stanford online course](https://www.coursera.org/learn/sciwrite/)

14/ [Ten simple rules for structuring papers](http://www.biorxiv.org/content/early/2017/05/23/088278)

15/  [NIH grant podcasts](https://grants.nih.gov/news/virtual-learning/podcasts.htm)

16/  [NIH guide](https://www.niaid.nih.gov/grants-contracts/write-research-plan)

17/ [Thoughts on reviewing NIH proposals: What is the difference between a 2.0 and 3.0 in initial score?](http://mistressoftheanimals.scientopia.org/2018/02/10/thoughts-on-reviewing-nih-proposals-what-is-the-difference-between-a-2-0-and-3-0-in-initial-score/) a blog post.

18/  [how to write a K99](https://k99.sbamin.com/) by Samir Amin (my good buddy). Go and check out this treasure.

19/  [seeking the k99](https://timoast.github.io/blog/seeking-the-k99/) a blog post by Tim Stuart.

20/  [AuthorArranger: Conquer journal title pages in seconds](https://authorarranger.nci.nih.gov/#/)

21/  [typeset](https://www.typeset.io/) The quickest way to read and understand scientific literature

22/  [cocites](http://www.cocites.com/)

23/  [connected papers](https://www.connectedpapers.com/)

24/  [ZoteroBib](https://zbib.org/) is a free service that helps you quickly create a bibliography in any citation style.

25/  [How to craft a figure legend for scientific papers](https://blog.bioturing.com/2018/05/10/how-to-craft-a-figure-legend-for-scientific-papers/) 

26/  [Ten quick tips for making things findable](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008469)

27/  [Making experimental data tables in the life sciences more FAIR: a pragmatic approach](https://academic.oup.com/gigascience/article/9/12/giaa144/6034785)

28/  protocols: https://www.protocols.io/

29/  [electronic lab notebooks review by Harvard HMS](https://datamanagement.hms.harvard.edu/electronic-lab-notebooks)

30/ [Rspace](https://www.researchspace.com/) Next-gen Elab notebook.

31/  [How to grow a healthy lab](https://www.nature.com/collections/pmlcrkkyyq)  Nature collections

32/  [Bench Sci](https://www.benchsci.com/) Run Successful Experiments with the Right Antibody. Let our AI decode the literature to provide antibody usage data that's unbiased and experiment-specific

Wednesday, December 7, 2022

23 tools to work with (single-cell) TCR/BCR-seq immune repertoire data

1/  [immunarch](https://immunarch.com/index.html) 

2/ [scRepertoire](https://github.com/ncborcherding/scRepertoire) 

3/ [dandelion](https://sc-dandelion.readthedocs.io/en/latest/)  python package for analyzing single cell BCR/TCR data from 10x Genomics 5’ solution! 

4/ [TRUST4](https://www.nature.com/articles/s41592-021-01142-2) developed in Shirley Liu's group. Use it to extract TCR/BCR information from bulk RNAseq or 5' scRNAseq data.

5/  a dramatic speedup for one of the core computations for adaptive immune receptor repertoire (AIRR) analysis - the discovery and counting of receptors that overlap between repertoires! Check out  [CompAIRR](https://github.com/uio-bmi/compairr). With 10^4 repertoires of 10^5 sequences each, CompAIRR ran in 17 minutes while the fastest existing tool took 10 days, amounting to a ~1000x speedup

6/ [ClusTCR](https://svalkiers.github.io/clusTCR/): a Python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity;

7/ [GLIPH2](https://www.nature.com/articles/s41587-020-0505-4)

8/  [GIANA allows computationally-efficient TCR clustering and multi-disease repertoire classification by isometric transformation](https://www.nature.com/articles/s41467-021-25006-7) from Bo Li.

9/  [tcrdist3](https://github.com/kmayerb/tcrdist3) is a python API-enabled toolkit for analyzing T-cell receptor repertoires

10/ [TCRex](https://tcrex.biodatamining.be/): a web tool for the prediction of TCR–epitope recognition

11/  [ImRex](https://github.com/pmoris/ImRex) TCR-epitope recognition prediction using combined sequence input representation for convolutional neural networks.

12/  [NetTCR - 2.0](https://services.healthtech.dtu.dk/service.php?NetTCR-2.0) Sequence-based prediction of peptide-TCR binding

13/  [CellaRepertorium](https://github.com/amcdavid/CellaRepertorium)

14/  [enclone](https://10xgenomics.github.io/enclone/) from 10x. we should give this a try if we want to cluster TCR and BCR clonotypes.

15/  [migec](https://github.com/mikessh/migec): a RepSeq processing swiss-knife.

16/  [MiXCR](https://github.com/milaboratory/mixcr) is a universal software for fast and accurate analysis of T- and B- cell receptor repertoire sequencing data.

17/ [ImReP](https://sergheimangul.wordpress.com/imrep/) is a computational method for rapid and accurate profiling of the adaptive immune repertoire from regular RNA-Seq data.

18/ [TcellMatch](https://github.com/theislab/tcellmatch): Predicting T-cell to epitope specificity. TcellMatch is a collection of models to predict antigen specificity of **single T cells** based on CDR3 sequences and other single cell modalities, such as RNA counts and surface protein counts

19/ [scirpy](https://github.com/icbi-lab/scirpy): A scanpy extension for single-cell TCR analysis. 

20/  [Tessa](https://github.com/jcao89757/tessa) is a Bayesian model to integrate T cell receptor (TCR) sequence profiling with transcriptomes of T cells. Enabled by the recently developed single cell sequencing techniques, which provide both TCR sequences and RNA sequences of each T cell concurrently, Tessa maps the functional landscape of the TCR repertoire, and generates insights into understanding human immune response to diseases. 

21/ [DeepTCR](https://github.com/sidhomj/DeepTCR) Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data

https://twitter.com/John_Will_I_Am/status/1570837756787691527

https://www.science.org/doi/10.1126/sciadv.abq5089

22/  [Integrating T cell receptor sequences and transcriptional profiles by clonotype neighbor graph analysis (CoNGA)](https://www.nature.com/articles/s41587-021-00989-2)

23/ [Echidna: Integrated simulations of single-cell immune receptor repertoires and transcriptomes](https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac062/6687122?login=false)

Tuesday, November 29, 2022

7 links to deeply understand heatmap

Making a heatmap is an essential skill for a bioinformatician. Just count how many figures in a genomics or single-cell paper are heatmaps or heatmap variants.

But you probably do not fully understand heatmaps. Here are 7 reading resources to help!

1/  Mapping quantitative data to color https://www.nature.com/articles/nmeth.2134 

2/  Heat maps, from the Nature Methods column  https://www.nature.com/articles/nmeth.1902

3/  A tale of two heatmap functions https://rpubs.com/crazyhottommy/a-tale-of-two-heatmap-functions An old post by me.

4/  Heatmap demystified  https://rpubs.com/crazyhottommy/heatmap_demystified yet another post by me

5/  understand color mapping is key https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html#colors

6/ understand rasterization  https://jokergoo.github.io/2020/06/30/rasterization-in-complexheatmap/

7/  what happens when you have a huge matrix 20,000 rows/genes  x 50 columns to plot?  https://gdevailly.netlify.app/post/plotting-big-matrices-in-r/


I learned so much from Zuguang Gu. Thanks to him for the awesome ComplexHeatmap package https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html . It is my go-to tool for making heatmaps.

Monday, November 28, 2022

6 training resources for data management


* Best Practices for Biomedical Research Data Management https://learn.canvas.net/courses/1854

* Research Data Management Librarian Academy (https://rdmla.github.io/)

* DataONE Data Management Skillbuilding Hub  (https://dataoneorg.github.io/Education)

* Data Management Training Clearinghouse (https://dmtclearinghouse.esipfed.org/)

* Research data management open training materials Zenodo Community (https://zenodo.org/communities/dcc-rdm-training-materials)

* Consortium of European Social Science Data Archives (CESSDA) Training Resources (https://www.cessda.eu/Training-Resources)

Bonus:

Learn from TCGA: Collaborative Genomics Projects: A Comprehensive Guide https://www.sciencedirect.com/book/9780128021439/collaborative-genomics-projects-a-comprehensive-guide

Sunday, November 27, 2022

8 R/command line tools to deal with excel, tsv and csv files

 R packages:

* [readxl](https://readxl.tidyverse.org/)

* [tidyxl](https://github.com/nacnudus/tidyxl)

* [janitor](https://github.com/sfirke/janitor)


command line tools:

* [VisiData](https://www.visidata.org/) is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

* [csvkit](https://csvkit.readthedocs.io/en/latest/index.html#)

* [csvtk](https://bioinf.shenwei.me/csvtk/usage/) a cross-platform, efficient and practical CSV/TSV toolkit.

* [Miller](https://miller.readthedocs.io/en/latest/) is a command-line tool for querying, shaping, and reformatting data files in various formats including CSV, TSV, JSON, and JSON Lines.

* [eBay's TSV Utilities](https://opensource.ebay.com/tsv-utils/)

Tuesday, November 15, 2022

8 Resources to study Transcription factor binding, enhancers and histone modification distribution

 1. ENCODE https://www.encodeproject.org/

2. The International Human Epigenome Consortium (IHEC) epigenome data portal http://epigenomesportal.ca/ihec/index.html?as=1

3. Blueprint epigenome http://dcc.blueprint-epigenome.eu/#/home

4. EpiFactors http://epifactors.autosome.ru/ is a database for epigenetic factors, corresponding genes and products.

5. CistromeDB http://cistrome.org/db/#/ by Shirley Liu group

6. Remap https://remap2022.univ-amu.fr/ is a large scale integrative analysis of DNA-binding experiments for Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana transcriptional regulators.

7. ChIP-Atlas http://chip-atlas.org/  An integrative, comprehensive database to explore public Epigenetic dataset, including ChIP-Seq, DNase-Seq, ATAC-Seq, and Bisulfite-Seq data: ChIP-Atlas covers almost all public data archived in Sequence Read Archive of NCBI, EBI, and DDBJ with over 224,000 experiments.

8. Fantom5 https://fantom.gsc.riken.jp/5/

Sunday, November 13, 2022

7 Books for you to learn bioinformatics

1.  Data Analysis for the Life Sciences https://leanpub.com/dataanalysisforthelifesciences You can get it for free!

2. Practical Computing for Biologists https://practicalcomputing.org/ The first book I ever used to start learning computational biology.

3. A Primer for Computational Biology https://open.oregonstate.education/computationalbiology/

4. Computational Genomics with R  http://compgenomr.github.io/book/

5. The Biologist’s Guide to Computing https://book.biologistsguide2computing.com/en/stable

6. Bioinformatics Data Skills https://www.oreilly.com/library/view/bioinformatics-data-skills/9781449367480/ A must read to upgrade your bioinformatics skills once you know the basics.

7. Bioinformatics Workbook: A tutorial to help scientists design their projects and analyze their data. https://bioinformaticsworkbook.org/#gsc.tab=0

Thursday, November 10, 2022

7 FREE Books to learn data science

1. Data science: A first introduction https://datasciencebook.ca/

2. Introduction to Data Science http://rafalab.dfci.harvard.edu/dsbook/

3. Agile Data Science with R https://edwinth.github.io/ADSwR/index.html

4. Tidy Modeling with R https://www.tmwr.org/

5. Feature Engineering and Selection: A Practical Approach for Predictive Models https://bookdown.org/max/FES/

6. Another Book on Data Science https://www.anotherbookondatascience.com/ compare R and python side by side

7. Research Software Engineering with Python https://merely-useful.tech/py-rse/

Wednesday, November 9, 2022

12 resources to bookmark for reproducible computational research

1. a reproducible workflow. https://www.youtube.com/watch?v=s3JldKoA0zw This two minute video will change your mind on reproducible research 

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. A video by Dr.Keith A. Baggerly from MD Anderson [The Importance of Reproducible Research in High-Throughput Biology](https://www.youtube.com/watch?v=7gYIs7uYbMo) highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037 

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042  A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Monday, November 7, 2022

9 tools for interactive exploring single-cell RNAseq data

1. cellxgene https://github.com/chanzuckerberg/cellxgene

2. cellar https://github.com/euxhenh/cellar

3. scSVA: an interactive tool for big data visualization and exploration in single-cell omics https://www.biorxiv.org/content/10.1101/512582v1

4. ASAP: a web-based platform for the analysis and interactive visualization of single-cell RNA-seq data https://academic.oup.com/bioinformatics/article/33/19/3123/3852081?login=false

5. [iSEE](https://bioconductor.org/packages/release/bioc/html/iSEE.html) Provides functions for creating an interactive Shiny-based graphical user interface for exploring data stored in SummarizedExperiment objects, including row- and column-level metadata

6. [VISION](https://github.com/YosefLab/VISION) A high-throughput and unbiased module for interpreting scRNA-seq data.

7. [DISCO](http://immunesinglecell.org/): Deep Integration of Single-Cell Omics. Want to visualize millions of cells online and annotate cell types automatically? Try it!!! Make single cell easier and make life easier!

8. [TISCH](http://tisch.comp-genomics.org/) Tumor Immune Single-cell Hub (TISCH) is a scRNA-seq database focusing on tumor microenvironment (TME).

9. [CancerSCEM](https://ngdc.cncb.ac.cn/cancerscem) To date, CancerSCEM version 1.0 consists of 208 cancer samples across 28 studies and 20 human cancer types

8 links to BETTER understand principal component analysis (PCA)

8 links to BETTER understand principal component analysis (PCA):

1. https://divingintogeneticsandgenomics.rbind.io/post/pca-in-action/  PCA in action, my blog post to calculate SVD and PCA with #rstats 

2. https://www.youtube.com/watch?v=rYz83XPxiZo MIT 18.06 linear algebra lecture on SVD

3. https://peterbloem.nl/blog/pca-4 THE SINGULAR VALUE DECOMPOSITION (SVD)

4. http://rafalab.github.io/pages/harvardx.html High Dimension data analysis, week 2. 

5. https://towardsdatascience.com/why-pca-looks-triangular-a642daac721a why PCA looks triangular. 

6. https://www.nxn.se/valent/2017/6/12/how-to-read-pca-plots How to read PCA plots for single-cell data.

7. https://twitter.com/AedinCulhane/status/1007110262187544577 PCA horseshoe artifact

8. https://www.youtube.com/watch?v=_UVHneBUBW0  by Josh Starmer

Thursday, November 3, 2022

5 tools to visualize genomic datasets

 1. Karyoploter https://bernatgel.github.io/karyoploter_tutorial/Tutorial/PlotCoverage/PlotCoverage.html I used that to plot single-cell ATACseq tracks https://github.com/crazyhottommy/scATACutils/#plot-atacseq-tracks-for-each-cluster-of-cells, more examples https://rpubs.com/crazyhottommy/scATAC_tracks

2. plotgardener is a genomic data visualization package for R. Using `grid` graphics, `plotgardener` empowers users to programmatically and flexibly generate multi-panel figures 

https://github.com/PhanstielLab/plotgardener 

3. The goal of **g(r)osling** https://github.com/gosling-lang/grosling is to help you build interactive genomics visualizations with [Gosling](https://github.com/gosling-lang/gosling.js). This package uses [reticulate](https://rstudio.github.io/reticulate/) to provide an interface to the [Gos](https://github.com/gosling-lang/gos) Python package. https://github.com/gosling-lang/grosling

4.  Intervene: a tool for intersection and visualization of multiple gene or genomic region sets 

 https://bitbucket.org/CBGR/intervene/src/master/

 5. https://42basepairs.com/ saw it yesterday by @RobAboukhalil

Wednesday, November 2, 2022

8 links to bookmark for better data visualization

 Data visualization is a critical step in data analysis, 8 links to bookmark for better data visualization :

1. Nature Methods Points of View columns on data visualization  http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html the columns on color mapping and heatmaps are very nice.

2. Ten simple rules to colorize biological data visualization https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008259

3. data visualization resources https://sabahzero.github.io/dataviz/resources

4. Fundamentals of Data Visualization https://clauswilke.com/dataviz/ 

5. Data Visualization https://socviz.co/  by Kieran Healy. I've read books 4 and 5.

6. [R Graphics Cookbook](http://www.cookbook-r.com/Graphs/) by Winston Chang.

7. [ggplot2: Elegant Graphics for Data Analysis](https://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403) by Hadley Wickham.

8. https://www.data-to-viz.com/ help you to choose the right chart

Tuesday, November 1, 2022

6 links on workflow to make your life easier

 Bioinformatics analysis involves a lot of steps, 6 links on workflow to make your life easier:

1. hundreds of workflow tools and engines https://github.com/pditommaso/awesome-pipeline 

2. see also from the CWL wiki https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems

3. A review of bioinformatic pipeline frameworks https://academic.oup.com/bib/article/18/3/530/2562749

4. discussion on biostars https://www.biostars.org/p/115745/

5. two papers by Titus Brown [Ten simple rules and a template for creating workflows-as-applications](https://osf.io/preprints/8w5j3/)

6.  Streamlining Data-Intensive Biology With Workflow Systems https://dib-lab.github.io/2020-workflows-paper/

Friday, October 28, 2022

16 resources for re-analyzing public expression data.

1.  https://rnama.com/docs/search-evaluation  RNA meta Analysis has ~26,700 studies (5,717 RNA-Seq and 20,955 Microarray)

2.  [refine.bio](https://www.refine.bio/) will have harmonized over 60,000 gene expression experiments

3.  BioJupies https://maayanlab.cloud/biojupies/

4.  [Recount2-FANTOM](https://www.biorxiv.org/content/10.1101/659490v1) Recounting the FANTOM Cage Associated Transcriptome. Long non-coding RNAs.

5.  Recount3 https://rna.recount.bio/

6.  [dee2](http://dee2.io/) Digital Expression Explorer 2 (DEE2) is a repository of uniformly processed RNA-seq data mined from public data obtained from the NCBI Sequence Read Archive. By Mark Ziemann et al. Version 2 of dee.

7.  Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive https://www.biorxiv.org/content/10.1101/386441v1?rss=1

8.   [MetaSRA: normalized sample-specific metadata for the Sequence Read Archive](http://biorxiv.org/content/early/2016/11/30/090506)

9.   [ARCHS4: Massive Mining of Publicly Available RNA-seq Data from Human and Mouse](https://amp.pharm.mssm.edu/archs4/) ARCHS4 provides access to gene counts from HiSeq 2000, HiSeq 2500 and NextSeq 500 platforms for human and mouse experiments from GEO and SRA.

10.  [DEP-reads: Uniformly processed public RNA-Seq data](http://bioinformatics.sdstate.edu/reads/) Read counts data for 5,470 human and mouse datasets from ARCHS4 v6 and 12,670 datasets from DEE2 for 9 model organisms, by Steven Ge.

11.  [SRA-explorer](https://ewels.github.io/sra-explorer/) This tool aims to make datasets within the Sequence Read Archive more accessible. 

12.  [intropolis](https://github.com/nellore/intropolis) is a list of exon-exon junctions found across **21,504** human RNA-seq samples on the Sequence Read Archive (SRA) from spliced read alignment to hg19 with Rail-RNA.

13.   [batch recompute of ~20,000 RNA-seq samples from large sequencing projects such as TCGA, TARGET and GTEx](https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?host=https://toil.xenahubs.net). Used `hg38` and `gencode v21` as annotation.

14.   [A cloud-based workflow to quantify transcript-expression levels in public cancer compendia](http://biorxiv.org/content/early/2016/07/12/063552) used kallisto for TCGA/CCLE datasets and gencode v24 as annotation.

15.   [MiPanda](http://www.mipanda.org/) is an online resource for the interrogation and visualization of gene expression data from the myriad of publicly available cancer and normal next generation sequencing datasets.

16.   [Curation of over 10,000 transcriptomic studies to enable data reuse](https://www.biorxiv.org/content/10.1101/2020.07.13.201442v1)

Tuesday, October 25, 2022

10 courses to get you started with bioinformatics

1/ http://rafalab.dfci.harvard.edu/pages/harvardx.html by Rafa

2/ https://github.com/quinlan-lab/applied-computational-genomics#course-lecture-slides 

by Aaron Quinlan, the creator of bedtools and many other cool tools.


3/ https://www.bioinformaticsalgorithms.org/ You can find the video classes on Coursera 


4/ http://www.personal.psu.edu/iua1/courses/2014-BMMB-852.html by Istvan Albert, the creator of [biostars](https://www.biostars.org/).


5/  Introduction to Bioinformatics and Computational Biology https://liulab-dfci.github.io/bioinfo-combio/ by @XShirleyLiu 

glad to contribute a little myself.


6/ data carpentry workshops  https://datacarpentry.org/lessons/#genomics-workshop I am honored to serve as the curriculum committee chair 


7/ Computational Genomics: Applied Comparative Genomics https://github.com/schatzlab/appliedgenomics2018

8/ Introduction to Computational Biology https://biodatascience.github.io/compbio/  by Mike Love  @mikelove


9/ [MIT Computational Biology: Genomes, Networks, Evolution, Health - Fall 2018 - 6.047/6.878/HST.507](https://www.youtube.com/playlist?list=PLypiXJdtIca6GBQwDTo4bIEDV8F4RcAgt) by Manolis Kellis


10/ An introduction to Applied Bioinformatics http://readiab.org/introduction.html Very nice book with python code.

Sunday, October 23, 2022

5 websites to analyze GEO RNAseq data without a single line of code


4. GREIN : GEO RNA-seq experiments interactive navigator for re-analyzing GEO RNA-seq data https://hub.docker.com/r/ucbd2k/grein/

5. ImaGEO: Integrative Meta-Analysis of GEO Data https://imageo.genyo.es/

Bonus https://www.ebi.ac.uk/gxa/home more than GEO
one more Gemma https://gemma.msl.ubc.ca/home.html

Thursday, October 20, 2022

12 websites to learn computation and many others!

 1/  coursera https://www.coursera.org/ The first website I used. I took a data science Specialization https://www.coursera.org/specializations/jhu-data-science  and https://www.coursera.org/learn/bioinformatics

3/ Udacity https://www.udacity.com/ I took courses on R, ggplot2, GitHub and intro to ML 

4/ udemy https://www.udemy.com/ I took several python courses there.

5/ MIT OpenCourseWare https://ocw.mit.edu/  18.06 linear algebra and many others!

7/ YouTube channel 3blue1brown https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw will blow you away with crystal clear explanations. I watched the linear algebra series 

10/  Hubspot https://academy.hubspot.com/ this one is new to me

11/ EMBL-EBI training https://www.ebi.ac.uk/training/online/ bioinformatics courses

12/  skillup https://www.simplilearn.com/skillup-free-online-courses this is new to me as well

Wednesday, October 19, 2022

12 web tools to explore genomics data


1. cbioportal https://cbioportal.org explore genomic datasets at the tips of your fingers
2. xena https://xena.ucsc.edu, a UCSC effort. Everyone needs to learn how to use UCSC genome browser https://genome.ucsc.edu
3. depmap portal https://lnkd.in/et3uDeci Cancer Cell Line Encyclopedia
4. TCGA RNA fusion portal https://tumorfusions.org
5. https://lnkd.in/e3P7td-w
6. Tumor Immune Syngeneic MOuse (TISMO) database http://tismo.cistrome.org
7. PDX models https://lnkd.in/ezby9kns
8. https://lnkd.in/ev6EUkwf Tumor Immune Dysfunction and Exclusion
9. http://timer.cistrome.org TIMER is a comprehensive resource for systematical analysis of immune infiltrates across diverse cancer types
10. genePattern https://genepattern.org
11. https://lnkd.in/ekFBdjfW
12. draw mutation for a protein https://lnkd.in/em8GmHQM

Friday, July 29, 2022

How to make a transcript to gene mapping file

 I need a transcript-to-gene mapping file for Salmon. I am aware of the annotation Bioconductor packages that can do this job. However, I was working on a species that does not have its annotation in package format (I am going to use Drosophila as an example for this blog post), so I had to get the gtf file and make such a file from scratch.

Please read the specifications of those two file formats.

Download the Drosophila gtf file from Ensembl and the gff file from NCBI

Find the gff file at https://www.ncbi.nlm.nih.gov/genome/?term=drosophila+melanogaster
Find the gtf file at ftp://ftp.ensembl.org/pub/release-95/gtf/drosophila_melanogaster/

#gtf file
zless -S ~/Downloads/Drosophila_melanogaster.BDGP6.95.gtf.gz | grep -v "#" | cut -f3 | sort | uniq -c
## 160859 CDS
##    4 Selenocysteine
## 187373 exon
## 46299 five_prime_utr
## 17737 gene
## 30492 start_codon
## 33892 three_prime_utr
## 34767 transcript
#gff file
zless -S ~/Downloads/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff.gz| grep -v "#" | cut -f3 | sort | uniq -c
## 160949 CDS
##    1 RNase_MRP_RNA
##    2 RNase_P_RNA
##    2 SRP_RNA
##  584 antisense_RNA
## 187809 exon
## 17421 gene
## 2275 lnc_RNA
## 30480 mRNA
##  479 miRNA
## 5416 mobile_genetic_element
##   77 ncRNA
##  263 primary_transcript
##  308 pseudogene
##  134 rRNA
## 1870 region
##    1 sequence_feature
##   32 snRNA
##  289 snoRNA
##  319 tRNA
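The same tally can be done in Python for those who prefer to stay in one language. This is a minimal sketch, assuming a gzipped GTF or GFF file; `count_feature_types` is a name made up here. Note that `grep -v "#"` drops any line containing a `#` anywhere, whereas this version only skips lines that start with `#`, so the counts could differ slightly from the ones above if an attribute happens to contain a `#`.

```python
import gzip
from collections import Counter

def count_feature_types(path):
    """Count the values in column 3 (feature type) of a gzipped GTF/GFF file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # header/comment line or blank line
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts
```

Calling `count_feature_types(path).most_common()` on either downloaded file lists the feature types from most to least frequent.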

Use unix commands to make a transcript to gene mapping file from the gtf file

The feature types are quite different even though both are annotation files for the same species. The gtf file is relatively well formatted, so we can make a transcript to gene mapping file easily on the unix command line. On each transcript line, gene_id and transcript_id are the first two attributes in column 9, which is what the awk field numbers below rely on.

zless -S ~/Downloads/Drosophila_melanogaster.BDGP6.95.gtf.gz | grep -v "#" | awk '$3=="transcript"' | cut -f9 | tr -s ";" " " | awk '{print$4"\t"$2}' | sort | uniq |  sed 's/\"//g' | tee tx2gene_ensemble.tsv| head
## FBgn0013687  FBgn0013687
## FBtr0005088  FBgn0260439
## FBtr0006151  FBgn0000056
## FBtr0070000  FBgn0031081
## FBtr0070001  FBgn0052826
## FBtr0070002  FBgn0031085
## FBtr0070003  FBgn0062565
## FBtr0070006  FBgn0031089
## FBtr0070007  FBgn0031092
## FBtr0070008  FBgn0031094

Hmm… why does the first line have the same gene ID in both columns? A sanity check:

zless -S ~/Downloads/Drosophila_melanogaster.BDGP6.95.gtf.gz | grep "FBgn0013687" | less -S
## mitochondrion_genome FlyBase gene    14917   19524   .   +   .   gene_id "FBgn0013687"; gene_name "mt:ori"; gene_source "FlyBase"; gene_biotype "pseudogene";
## mitochondrion_genome FlyBase transcript  14917   19524   .   +   .   gene_id "FBgn0013687"; transcript_id "FBgn0013687"; gene_name "mt:ori"; gene_source "FlyBase"; gene_biotype "pseudogene"; transcript_source "FlyBase"; transcript_biotype "pseudogene";
## mitochondrion_genome FlyBase exon    14917   19524   .   +   .   gene_id "FBgn0013687"; transcript_id "FBgn0013687"; exon_number "1"; gene_name "mt:ori"; gene_source "FlyBase"; gene_biotype "pseudogene"; transcript_source "FlyBase"; transcript_biotype "pseudogene"; exon_id "FBgn0013687-E1";

Indeed, it is that way in the original gtf file: for this pseudogene the transcript_id is identical to the gene_id.
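If one prefers not to depend on field positions, the two IDs can be looked up by key instead. The following is a sketch of that idea and not the method used in this post; the function names are made up for illustration, and the input is assumed to be an Ensembl-style GTF where attributes are written as `key "value";`.

```python
import gzip
import re

GENE_RE = re.compile(r'\bgene_id "([^"]+)"')
TX_RE = re.compile(r'\btranscript_id "([^"]+)"')

def tx2gene_from_gtf_lines(lines):
    """Return sorted, de-duplicated (transcript_id, gene_id) pairs from GTF lines."""
    pairs = set()
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are needed
        gene = GENE_RE.search(fields[8])
        tx = TX_RE.search(fields[8])
        if gene and tx:
            pairs.add((tx.group(1), gene.group(1)))
    return sorted(pairs)

def write_tx2gene(gtf_gz_path, out_path):
    """Write a two-column TSV (transcript, gene) from a gzipped GTF file."""
    with gzip.open(gtf_gz_path, "rt") as gtf, open(out_path, "w") as out:
        for tx, gene in tx2gene_from_gtf_lines(gtf):
            out.write(f"{tx}\t{gene}\n")
```

Records such as mt:ori, where the transcript_id equals the gene_id, are kept as they appear in the GTF.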

Use gffutils to make a transcript to gene mapping file from the gff file

The gff file is not as consistently structured. One may still be able to use some unix tricks to get the tx2gene.tsv file from a gff file, but it can be rather awkward, especially for gff files from less well annotated species. Instead, let’s use gffutils, a python package, to do the same.

Install gffutils in the terminal:

source activate snakemake
conda install gffutils

Note: I am running python through RStudio. First read how to set the python path for reticulate at https://rstudio.github.io/reticulate/articles/versions.html and read more at https://cran.r-project.org/web/packages/reticulate/vignettes/versions.html

Somehow, I had to create a .Rprofile in the same folder as the .Rproj file with the following line to use my snakemake conda environment, which is python3:

Sys.setenv(PATH = paste("/anaconda3/envs/snakemake/bin/", Sys.getenv("PATH"), sep=":"))

library(reticulate)

# check which python I am using
py_discover_config()
## python:         /anaconda3/envs/snakemake/bin//python
## libpython:      /anaconda3/envs/snakemake/lib/libpython3.6m.dylib
## pythonhome:     /anaconda3/envs/snakemake:/anaconda3/envs/snakemake
## version:        3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
## numpy:          /anaconda3/envs/snakemake/lib/python3.6/site-packages/numpy
## numpy_version:  1.15.3
## 
## python versions found: 
##  /anaconda3/envs/snakemake/bin//python
##  /usr/bin/python
##  /anaconda3/envs/py27/bin/python
##  /anaconda3/envs/snakemake/bin/python
# these did not work for me...
# use_condaenv("snakemake", required = TRUE)
# use_python("/anaconda3/envs/snakemake/bin/python")
import sys
print(sys.version)
## 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:05:31) 
## [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
import gffutils
import itertools
import os
os.listdir()
db = gffutils.create_db("GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.gff.gz", ":memory:", force=True, merge_strategy="merge", id_spec={'gene': 'Dbxref'})
list(db.featuretypes())
# one can do it for one type of features, say mRNA
for mRNA in itertools.islice(db.features_of_type('mRNA'), 10):
        print(mRNA['transcript_id'][0], mRNA['gene'][0])
        #print(mRNA.attributes.items())
        
## but I then have to do the same for lnc_RNA and others.        
## instead, loop over all features in the database
## NM_001103384.3 CG17636
## NM_001258513.2 CG17636
## NM_001258512.2 CG17636
## NM_001297796.1 RhoGAP1A
## NM_001297795.1 RhoGAP1A
## NM_001103385.2 RhoGAP1A
## NM_001103386.2 RhoGAP1A
## NM_001169155.1 RhoGAP1A
## NM_001297797.1 RhoGAP1A
## NM_001297801.1 tyn
# collect unique (transcript, gene) pairs; a set keeps the membership test fast
tx_and_gene = set()
with open("tx2gene_NCBI.tsv", "w") as f:
        for feature in db.all_features():
                transcript = feature.attributes.get('transcript_id', [None])[0]
                gene = feature.attributes.get('gene', [None])[0]
                if gene and transcript and (transcript, gene) not in tx_and_gene:
                        tx_and_gene.add((transcript, gene))
                        f.write(transcript + "\t" + gene + "\n")

These lines of code are not hard to write; it takes more time to read the package documentation and understand how to use the package. One problem with bioinFORMATics is that there are so many different file formats. To make things worse, even within the gff file format, many files do not follow the exact specification. You can get a taste of that at http://daler.github.io/gffutils/examples.html.
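As a last sanity check before handing either mapping file to Salmon, it is cheap to confirm that the file really has two tab-separated columns and that no transcript maps to more than one gene. The sketch below assumes the two-column layout written above; `check_tx2gene` is a hypothetical helper name. It also reports rows where both columns are identical, like the `FBgn0013687` line seen in the Ensembl file.

```python
from collections import defaultdict

def check_tx2gene(path):
    """Return (multi_gene, self_mapped) for a two-column transcript/gene TSV."""
    genes_by_tx = defaultdict(set)
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"line {line_number}: expected 2 columns, got {len(fields)}"
                )
            tx, gene = fields
            genes_by_tx[tx].add(gene)
    # transcripts assigned to more than one gene
    multi_gene = {tx: sorted(g) for tx, g in genes_by_tx.items() if len(g) > 1}
    # rows where the transcript ID and gene ID are the same string
    self_mapped = sorted(tx for tx, g in genes_by_tx.items() if tx in g)
    return multi_gene, self_mapped
```

An empty `multi_gene` dictionary means every transcript has a single gene; anything in `self_mapped` is worth looking up in the original annotation file, as was done for mt:ori.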