Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Friday, September 13, 2019

My opinionated selection of books/urls for bioinformatics/data science curriculum

There was a paper on this topic: A New Online Computational Biology Curriculum.
I am going to provide a biased list below (I have read most, if not all, of the books). I say it is biased because you will see that many of the R books are by Hadley Wickham. I now use the tidyverse most of the time.


I suggest that people who want to learn bioinformatics start by learning Unix commands. Unix is powerful and omnipresent in high-performance computing settings (cloud computing etc.). You cannot survive without it.
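As a taste of why, here is the kind of one-liner work the Unix shell shines at. The `counts.tsv` file below is a made-up toy table, purely for illustration:

```shell
# toy two-column table: gene name and read count (made-up data)
printf 'gene1\t10\ngene2\t25\ngene3\t7\n' > counts.tsv

wc -l < counts.tsv                    # how many records
sort -k2,2nr counts.tsv | head -n 1   # the gene with the highest count
awk '$2 > 8' counts.tsv | cut -f1     # genes with more than 8 reads
```

Chaining small tools with pipes is the core idea; the very same pattern scales to multi-gigabyte fastq/BED files on a cluster.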

Computational biology

R programming

  • R for data science by Garrett Grolemund and Hadley Wickham.
  • Advanced R by Hadley Wickham.
  • R packages by Hadley Wickham. If you want to transition from an R user to an R developer, writing an R package will get you started.

Stats (R focused)

Python programming

Machine learning


Those two books do not teach you how to make figures programmatically (although the book by Claus was generated with R Markdown, and the code for all of its figures is available). They teach you which kinds of figures are informative and pleasing to the eye. From data to viz is a website that guides you to the right graph for your data.
I am still using R/ggplot2 for visualization.
What’s your favorite book that I have missed? Comment below!

Tuesday, September 3, 2019

How to upload files to GEO


1. create account

Go to NCBI GEO and create a user ID and password. My username is research_guru;
I used my Google account.

2. fill in the xls sheet

Download the metadata xls sheet from GEO.
## bgzip the fastqs

cd 01seq
find *fastq | parallel bgzip
md5sum *fastq.gz > fastq_md5.txt

# file names, copy to excel
cat fastq_md5.txt | awk '{print $2}'

# md5sums, copy to excel
cat fastq_md5.txt | awk '{print $1}'

cd ../07bigwig
#get the md5sum

md5sum *bw > bigwig_md5.txt

# sample name, copy to excel
cat bigwig_md5.txt | awk '{print $2}'

# md5, copy to excel
cat bigwig_md5.txt | awk '{print $1}'

cd ../08peak_macs1

md5sum *macs1_peaks.bed > peaks_md5.txt
# copy to excel
cat peaks_md5.txt | awk '{print $2}'

3. gather everything into one folder

cd ..
mkdir research_guru_KMT2D_ChIPseq

cd research_guru_KMT2D_ChIPseq

## fill in the xls sheet and save it in this folder
This is the most time-consuming/tedious step.
A soft link does not work for me, so I use hard links (ln without -s):
ln /rsrch2/genomic_med/krai/hunain_histone_reseq/snakemake_ChIPseq_pipeline_downsample/07bigwig/*bw .

ln /rsrch2/genomic_med/krai/hunain_histone_reseq/snakemake_ChIPseq_pipeline_downsample/08peak_macs1/*macs1_peaks.bed .

4. upload to GEO

# inside the folder: research_guru_KMT2D_ChIPseq

## connect to the GEO FTP server, then type in the user name and the password.
## note: this is NOT your GEO account user name;
## everyone uses the same `geo` account and the shared password below.

#name: geo
#password: 33%9uyj_fCh?M16H

ftp> prompt n
Interactive mode off.

ftp> cd fasp

# make a folder in the ftp site
ftp> mkdir research_guru_ChIPseq

ftp> cd research_guru_ChIPseq

#upload all the files
ftp> mput *

5. tell NCBI you uploaded the files

After the file transfer is complete, e-mail GEO with the following information:
- your GEO account username;
- the names of the directory and files deposited;
- the public release date (required; up to 3 years from now, see the FAQ).

Side notes

For paired-end sequencing data, the xls sheet requires you to fill in the average insert size and its standard deviation.
Picard CollectInsertSizeMetrics can do this job.
time java -jar /scratch/genomic_med/apps/picard/picard-tools-2.13.2/picard.jar CollectInsertSizeMetrics I=4-Mll4-RasG12D-1646-2-cd45_S40_L006.sorted.bam  H=4-Mll4-RasG12D-1646-2-cd45_S40_L006_insert.pdf  O=4-Mll4-RasG12D-1646-2-cd45_S40_L006_insert.txt

# finish in ~5mins
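To avoid copying the mean and standard deviation by hand, you can parse them out of the metrics file. This is a sketch that assumes the usual picard metrics layout (a header row containing `MEAN_INSERT_SIZE` and `STANDARD_DEVIATION` followed by one data row); the mock file below stands in for the real CollectInsertSizeMetrics output, and its values are made up:

```shell
# mock insert_metrics.txt mimicking the picard metrics layout (made-up values)
cat > insert_metrics.txt <<'EOF'
## htsjdk.samtools.metrics.StringHeader
## METRICS CLASS	picard.analysis.InsertSizeMetrics
MEDIAN_INSERT_SIZE	MEAN_INSERT_SIZE	STANDARD_DEVIATION
210	215.3	42.7
EOF

# look up the MEAN_INSERT_SIZE and STANDARD_DEVIATION columns by name,
# so the snippet does not depend on their positions in the header
awk -F'\t' '
  /MEAN_INSERT_SIZE/ {
      for (i = 1; i <= NF; i++) {
          if ($i == "MEAN_INSERT_SIZE")   m = i
          if ($i == "STANDARD_DEVIATION") s = i
      }
      next
  }
  m { print "mean:", $m, "sd:", $s; exit }
' insert_metrics.txt
```

Point the same awk at the real `*_insert.txt` file and paste the two numbers into the sheet.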

Tuesday, January 2, 2018

cancer gene census copy number

Set up

library(here)
library(knitr)

## here() starts at /Users/mtang1/projects/mixing_histology_lung_cancer
root.dir<- here()
opts_knit$set(root.dir = root.dir)

read in the data

I am going to check the COSMIC database on cancer genes. Specifically, I want to know which cancer-related genes are found amplified and which are deleted.
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
cancer_gene_census<- read_csv("data/COSMIC_Cancer_gene_Census/cancer_gene_census.csv", col_names = T)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Entrez GeneId` = col_integer(),
##   Tier = col_integer()
## )
## See spec(...) for full column specifications.

tidy the data with janitor::clean_names(), dplyr::unnest

The column names have spaces, and many columns contain multiple strings separated by ,.
## dplyr after 0.7.0 uses pull() to get one column out as a vector; I was using .$

#cancer_gene_census %>% pull(`Gene Symbol`) %>% unique()
#cancer_gene_census %>% distinct(`Gene Symbol`)

## extract genes that are amplified or deleted in cancers. in the `mutation types` column, search A for amplification and D for large deletion.

## use janitor::clean_names() to clean the column names with space.
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% tabyl(mutation_type)
##    mutation_type   n      percent valid_percent
## 1              N   1 0.0008257638  0.0008264463
## 2              A   8 0.0066061107  0.0066115702
## 3              D   8 0.0066061107  0.0066115702
## 4              F 139 0.1147811726  0.1148760331
## 5            Mis  68 0.0561519405  0.0561983471
## 6              N 147 0.1213872832  0.1214876033
## 7              O  37 0.0305532618  0.0305785124
## 8              S  76 0.0627580512  0.0628099174
## 9              T  19 0.0156895128  0.0157024793
## 10             A  18 0.0148637490  0.0148760331
## 11             D  35 0.0289017341  0.0289256198
## 12             F  32 0.0264244426  0.0264462810
## 13        F; Mis   1 0.0008257638  0.0008264463
## 14             M   3 0.0024772915  0.0024793388
## 15           Mis 240 0.1981833196  0.1983471074
## 16        Mis. N   1 0.0008257638  0.0008264463
## 17             N  24 0.0198183320  0.0198347107
## 18             O   7 0.0057803468  0.0057851240
## 19  Promoter Mis   1 0.0008257638  0.0008264463
## 20             S   2 0.0016515277  0.0016528926
## 21             T 343 0.2832369942  0.2834710744
## 22          <NA>   1 0.0008257638            NA
It turns out that the mutation_types column is messier than you think… There are multiple entries of A and D in different rows, and the stray spaces in the column are the devil.

trim the spaces with stringr::str_trim()

# trim the white space
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% tabyl(mutation_type)
##   mutation_type  n   percent
## 1             A 26 0.3768116
## 2             D 43 0.6231884
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% count(mutation_type)
## # A tibble: 2 x 2
##   mutation_type     n
##           <chr> <int>
## 1             A    26
## 2             D    43

Sanity check

According to the website, there should be 40 deletions and 24 amplifications, while I am getting 43 and 26, respectively.
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A") 
## # A tibble: 26 x 21
##    gene_symbol
##          <chr>
##  1        AKT2
##  2        AKT3
##  3         ALK
##  4       CCNE1
##  5      DROSHA
##  6        EGFR
##  7       ERBB2
##  8         ERG
##  9        FLT4
## 10        GRM3
## # ... with 16 more rows, and 20 more variables: name <chr>,
## #   entrez_geneid <int>, genome_location <chr>, tier <int>,
## #   hallmark <chr>, chr_band <chr>, somatic <chr>, germline <chr>,
## #   tumour_types_somatic <chr>, tumour_types_germline <chr>,
## #   cancer_syndrome <chr>, tissue_type <chr>, molecular_genetics <chr>,
## #   role_in_cancer <chr>, mutation_types <chr>,
## #   translocation_partner <chr>, other_germline_mut <chr>,
## #   other_syndrome <chr>, synonyms <chr>, mutation_type <chr>
I checked by eye: AKT3 and GRM3 are missing from the table online, but when I checked the downloaded table, both AKT3 and GRM3 have Amplification in the mutation types column. I am a bit confused, but for now I will stick with the downloaded table.
It teaches me an important lesson:
  1. Do not trust data (blindly) from any resource, even COSMIC.
  2. Data are always messy. Tidying data is the big job.
  3. Be careful with the data and do sanity checks. Browsing the data with your own eyes may reveal unexpected things.

Are oncogenes always amplified and tumor suppressor genes always deleted?

In cancer, we tend to think oncogenes are amplified and tumor suppressor genes are deleted to promote tumor progression. Is it true in this setting?
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% 
        mutate(role = strsplit(as.character(role_in_cancer), ",")) %>%
        unnest(role) %>%
        mutate(role = str_trim(role)) %>%
        filter(role == "TSG" | role == "oncogene") %>%
        count(role, mutation_type) %>% kable()

what are those genes?

cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% 
        mutate(role = strsplit(as.character(role_in_cancer), ",")) %>%
        unnest(role) %>%
        mutate(role = str_trim(role)) %>%
        filter(role == "TSG" | role == "oncogene") %>%
        filter((role == "TSG" & mutation_type == "A") | (role == "oncogene" & mutation_type == "D")) %>%
        select(gene_symbol, role_in_cancer, role, mutation_types, mutation_type) %>%
        kable(format = "html", booktabs = T, caption = "TSG and oncogene") %>%
        kable_styling(latex_options = c("striped", "hold_position"),
                full_width = F)
TSG and oncogene

gene_symbol  role_in_cancer          role      mutation_types       mutation_type
APOBEC3B     oncogene, TSG           oncogene  D                    D
BIRC3        oncogene, TSG, fusion   oncogene  D, F, N, T, Mis      D
GPC3         oncogene, TSG           oncogene  D, Mis, N, F, S      D
KDM6A        oncogene, TSG           oncogene  D, N, F, S           D
MAP2K4       oncogene, TSG           oncogene  D, Mis, N            D
NKX2-1       oncogene, TSG           TSG       A                    A
NTRK1        oncogene, TSG, fusion   TSG       T, A                 A
PAX5         oncogene, TSG, fusion   oncogene  T, Mis, D, F, S      D
WT1          oncogene, TSG, fusion   oncogene  D, Mis, N, F, S, T   D
The majority of the genes in the output can function as either oncogene or tumor suppressor gene (TSG), which is not totally surprising. Cancer is such a complex disease that the function of a gene depends on the context (e.g. the cancer type). Interestingly, DROSHA is the only TSG that is amplified.

Sunday, December 31, 2017

The End of 2017

At the end of last year, I wrote a post summarizing 2016. Now it is time to write the same for 2017! How time flies!

Last year, I wrote:

For the coming 2017, I should be:
1. busy with Phoebe.
2. writing 1-2 papers.
3. writing a book chapter on ChIP-seq for the Biostar Handbook. It should come out in mid-2017.
4. writing a small R package for practice.
5. learning a piano song.

Looking back, it seems I only accomplished 1 and 3 :) I do have two first-author papers in the writing stage, but I have not finished them yet. I hope I can get them out in 2018.

The book chapter on ChIP-seq was published here. If you want a PDF of my chapter, you can download it.

I still have not had a chance to write an R package, which has been on my list for a long time. The coming 2018 is a good time for me to get my feet wet. Our epigenomic project was selected by the Data Science Road-Trip program!! I received the confirmation at the end of 2017. I look forward to learning more R and machine learning for 2 weeks, and the plan is to turn the work into an R package. I saw this opportunity from a tweet; many thanks! The development of R packages is becoming easier with Hadley Wickham's usethis and many other tools.

I failed the piano goal totally... I did not get any time to practice. My wife is occupied with Phoebe and does not have time to teach me either.

I have some other achievements that I think I need to celebrate though:)

I am proud to say that I am an official instructor for Data Carpentry! I attended the instructor training at UC Davis hosted by Dr. Titus Brown. I am excited to be part of this welcoming community.

I am also excited to be a Genomics Advisory Committee member of Data Carpentry and a maintainer of the Wrangling Genomics course materials.

I got to practice some teaching skills learned from the instructor training in a ChIP-seq lab, which is part of the GS01 1143 Introduction to Bioinformatics course.

Together with Dr. Kunal Rai, I authored a book chapter on Computational Analysis of Epigenetic Modifications in Melanoma. It should be out in 2018.

The other thing I want to mention is that I wrote several NGS processing workflows using Snakemake, a Python-based workflow engine. It was my first time writing something serious in Python, and I like it a lot.

The most complex workflow I have written is the mutation calling pipeline (find it here).
It follows the GATK best practices and incorporates two somatic callers: MuTect and Lancet. In my current lab with Dr. Jianjun Zhang and Dr. Andrew Futreal, I deal with multi-region and longitudinal tumor samples from the same patient. In the pipeline, I implemented copy-number analysis and PyClone clonal architecture analysis. The Snakefile is over 1000 lines; I need to think about modularizing it.

Of course, my ChIP-seq Snakemake processing pipeline is used to uniformly process thousands of data sets generated in Dr. Kunal Rai's lab. I am happy that many wet-lab members are learning computation and are using it themselves.

In addition to the DNA-seq and ChIP-seq pipelines, I developed several other workflows:

I will need to better document all the workflows.

For the coming 2018:

1. I need to get out at least 2 papers. I enjoy public service such as my involvement with Data Carpentry, but I know that if I do not have papers, my academic future is doomed.

2. That does not mean I will spend less time on teaching. In fact, I plan to bring at least one Data Carpentry workshop to the Genomic Medicine department at MD Anderson Cancer Center.

3. Finish that R package of course!

4. I am expecting my second child, Noah, in April 2018 (we survived Hurricane Harvey 2017!!). I know it will be even busier with 2 kids :) I love my first kid Phoebe; she is now 17 months old. The joy she has brought me cannot be exchanged for anything else. I endeavor to be a better researcher, but first I need to be a better husband and father.

Looking forward to a brand new 2018! Good luck everyone.

Wednesday, November 22, 2017

run gistic2 with sequenza segmentation output from whole exome sequencing

Convert sequenza output to gistic input

GISTIC was designed for SNP6 array data, but I have seen many papers use it for whole exome sequencing data as well.
I have the segment files from sequenza and want to convert them to GISTIC input.

Input format for GISTIC:

segment file:
(1) Sample (sample name)
(2) Chromosome (chromosome number)
(3) Start Position (segment start position, in bases)
(4) End Position (segment end position, in bases)
(5) Num markers (number of markers in segment)
(6) Seg.CN (log2 of copy number, minus 1)
  1. The conversion is log2 (logarithm base 2) of the copy number, minus 1, so that copy number 2 maps to 0.
  2. Every segment start and end in the segments file should appear in the markers file, not the other way around.
  3. When the copy number is 0 (a homozygous deletion of both copies), you can't do log2(0) - 1; just put in a small number, e.g. -5.

markers file:
(1) Marker Name
(2) Chromosome
(3) Marker Position (in bases)

Note: gistic2 does not require a markers file anymore.
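The Seg.CN rule above (log2 of the copy number minus 1, with a floor for homozygous deletions) can be sketched in awk; the copy-number values piped in below are made up for illustration:

```shell
# absolute copy numbers 0, 1, 2, 4 -> GISTIC Seg.CN values
printf '0\n1\n2\n4\n' | awk '
  {
      cn = $1
      # log2(CN) - 1; awk has no log2(), so use log(cn)/log(2).
      # CN 0 would be -Inf, so floor it at a small number like -5.
      seg = (cn == 0) ? -5 : log(cn) / log(2) - 1
      print cn, seg
  }
'
```

The normal diploid state (CN 2) comes out as 0, a single-copy loss as -1, and a homozygous deletion gets the -5 floor.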

output of sequenza

sequenza outputs a segment file. The segmentation was done with the copynumber Bioconductor package.
The *segments.txt file has 13 columns:
"chromosome" "start.pos" "end.pos" "Bf" "N.BAF" "sd.BAF" "depth.ratio" "N.ratio" "sd.ratio" "CNt" "A" "B" "LPP"
We only need the chromosome, start.pos, end.pos, N.BAF and depth.ratio columns.
The depth.ratio column is the GC-content normalized ratio; a depth ratio of 1 means a copy number of 2 (the same as the normal blood control in my case).
UPDATED 12/17/2017 (see a comment below): it is not log2(2^depth.ratio) - 1; rather,
to convert to GISTIC input, I have to do log2(2 * depth.ratio) - 1.

UPDATED 01/03/2018
I have a bunch of segment files in the same folder.
Add the sample name in the final column and do the log2 math in R.
library(tidyverse)

seg_files<- list.files(".", pattern = "*segments.txt", full.names = F)

seg_dat_list <- lapply(seg_files, function(f) {
        dat<- read_tsv(f, col_names = T, col_types = cols(.default = col_character()))
        sample<- gsub("_vs_.*segments.txt", "", f)
        dat$sample<- sample
        dat
})

## bind all the per-sample data frames into one
seg_dat<-, seg_dat_list)

gistic_input<- seg_dat %>% select(sample, chromosome, start.pos, end.pos, N.BAF, depth.ratio) %>% mutate(depth.ratio = as.numeric(depth.ratio)) %>% mutate(depth.ratio = log2(2 * depth.ratio) -1)

write_tsv(gistic_input, "all_segments.txt")
Back to bash:

## marker file:

cat all_segments.txt | sed '1d' | cut -f2,3 > markers.txt
cat all_segments.txt | sed '1d' | cut -f2,4 >> markers.txt

## sort the files by chromosome, take the unique ones and number the markers.

cat markers.txt | sort -V -k1,1 -k2,2nr | uniq | nl > markers_gistic.txt

Run gistic

Modify the gistic2 script a bit, e.g. change the MCR_ROOT folder path:
## set MCR environment and launch GISTIC executable

## NOTE: change the line below if you have installed the Matlab MCR in an alternative location

echo Setting Matlab MCR root to $MCR_ROOT

## set up environment variables

## launch GISTIC executable
./gp_gistic2_from_seg $@

I removed ./ from the last line, since I have put all the executables in my PATH.
mkdir gistic_out
gistic2 -b gistic_out -seg all_segments.txt -refgene /scratch/genomic_med/apps/gistic/refgenefiles/hg19.mat -mk markers_gistic.txt  -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme