Sunday, December 31, 2017

The End of 2017

At the end of last year, I wrote a post summarizing 2016. Now it is time to do the same for 2017. How time flies!

Last year, I wrote:

For the coming 2017, I should be:
1. busy with Phoebe.
2. writing 1-2 papers.
3. writing a book chapter on ChIP-seq for the Biostar Handbook. It should come out in mid-2017.
4. writing a small R package for practice.
5. learning a piano song.

Looking back, it seems I only accomplished 1 and 3 :) I do have two first-author papers in the writing stage, but I have not finished them yet. I hope I can get them out in 2018.

The book chapter on ChIP-seq was published here. If you want a PDF of my chapter, you can download it from https://drive.google.com/open?id=1UxB0uhsoWlPvymukP3em8v4vBDI-CKwK

I still have not gotten a chance to write an R package, which has been on my list for a long time. The coming 2018 is a good time for me to get my feet wet. Our epigenomic project was selected by the Data Science Road-Trip program!! I received the confirmation at the end of 2017. I look forward to learning more R and machine learning for two weeks, and the plan is to turn the work into an R package. Many thanks to the colleague whose tweet pointed me to this opportunity. Developing an R package is becoming easier with Hadley Wickham's usethis and many other tools.
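For reference, here is a minimal sketch of how usethis bootstraps a package skeleton (the package name and path are hypothetical; the functions are from the usethis package):

library(usethis)

# create a package skeleton in a new directory (hypothetical path)
create_package("~/projects/mypkg")

# then, from inside the new project:
use_git()                      # put the package under version control
use_mit_license("Tommy Tang")  # add a LICENSE file
use_r("my_function")           # create R/my_function.R for the first function
use_testthat()                 # set up the testthat unit-testing framework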

I failed #5 totally... I did not get any time to practice the piano. My wife is occupied with Phoebe and does not have time to teach me either.

I do have some other achievements worth celebrating though :)

I am proud to say that I am an official instructor for Data Carpentry! I went to the instructor training at UC Davis hosted by Dr. Titus Brown. I am excited to be part of this welcoming community.

I am also excited to be a Genomics Advisory Committee member of Data Carpentry and a maintainer of the Wrangling Genomics course materials.

I got to practice some of the teaching skills I learned from the instructor training in a ChIP-seq lab, which is part of the GS01 1143 Introduction to Bioinformatics course.

Together with Dr. Kunal Rai, I authored a book chapter on Computational Analysis of Epigenetic Modifications in Melanoma. It should be out in 2018.

The other thing I want to mention is that I wrote several NGS processing workflows using Snakemake, a Python-based workflow engine. It was my first time writing something serious in Python, and I like it a lot.

The most complex workflow I have written is the mutation-calling pipeline (find it here). I follow GATK best practices and incorporate two somatic callers: MuTect and Lancet. In my current lab with Dr. Jianjun Zhang and Dr. Andrew Futreal, I deal with multi-region and longitudinal tumor samples from the same patient. In the pipeline, I implemented copy-number analysis and PyClone clonal-architecture analysis. The Snakefile is over 1000 lines, so I need to think about modularizing it.

Of course, my ChIP-seq Snakemake processing pipeline is used to uniformly process thousands of data sets generated in Dr. Kunal Rai's lab. I am happy that many wet-lab members are learning computation and are using it by themselves.

In addition to DNA-seq and ChIP-seq, I developed several other workflows:
RNA-seq
ATAC-seq
RRBS-seq

I will need to better document all the workflows.

For the coming 2018:
1. I need to get out at least two papers. I enjoy community service such as my involvement with Data Carpentry, but I know that if I do not have papers, my academic future is doomed.

2. That does not mean I will spend less time on teaching. In fact, I plan to bring at least one Data Carpentry workshop to the Genomic Medicine department at MD Anderson Cancer Center.

3. Finish that R package, of course!

4. I am expecting my second child, Noah, in April 2018 (we survived Hurricane Harvey in 2017!!). I know it will be even busier with two kids :) I love my first kid Phoebe; she is now 17 months old. The joy she has brought me is not exchangeable for anything else. I endeavor to be a better researcher, but first I need to be a better husband and father.

Looking forward to a brand new 2018! Good luck everyone.

Wednesday, December 20, 2017

Merge enhancer-promoter interaction bedpe files and recursive functions in R

I have downloaded the enhancer-promoter interaction data from the paper Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines, and I want to create a super-set of EP interactions across all cell lines. It turns out not to be that easy...

Tool to use: https://github.com/billgreenwald/pgltools

# only for ENCODE and Roadmap data
mkdir Encoderoadmap_EP_interaction
cd Encoderoadmap_EP_interaction

wget http://yiplab.cse.cuhk.edu.hk/jeme/encoderoadmap_lasso.zip

Have a look at the file:

cd /home/mtang1/projects/CCLE_project/data/enhancer_promoter_interaction/Encoderoadmap_EP_interaction/encoderoadmap_lasso
head encoderoadmap_lasso.1.csv 
chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49
chrX:100046800-100048000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.37
chrX:100128800-100130000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.39
chrX:99749000-99751000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.47
chrX:99851000-99853000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99854000-99856200,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.68
chrX:99858800-99859600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.59
chrX:99863600-99865600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99866800-99867800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
chrX:99868400-99868600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55

### reformat to pgl format to use pgltools
### the $ sign is tricky to handle in sed, so tr it to : first
### extend 2kb around the TSS

cat encoderoadmap_lasso.1.csv \
    | tr "$" ":" \
    | sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' \
    | tr ":,-" "\t" \
    | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' \
    | pgltools formatbedpe \
    | sed 's/[ \t]*$//' > encoderoadmap_lasso.bedpe

csv2bedpe.sh:

#! /bin/bash

set -eu
set -o pipefail

oprefix=$(basename "$1" ".csv")

cat "$1" \
    | tr "$" ":" \
    | sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' \
    | tr ":,-" "\t" \
    | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' \
    | pgltools formatbedpe \
    | sed 's/[ \t]*$//' \
    | awk '{print $0"\tinteraction_"NR"\t.\t.\t."}' > ${oprefix}.bedpe

merge 127 bedpe files:

see billgreenwald/pgltools#3

find *csv | parallel -j 4 ./csv2bedpe.sh {}
mkdir bedpe
mv *bedpe bedpe/
cd bedpe

ls | head

encoderoadmap_lasso.100.bedpe
encoderoadmap_lasso.101.bedpe
encoderoadmap_lasso.102.bedpe
encoderoadmap_lasso.103.bedpe
encoderoadmap_lasso.104.bedpe
encoderoadmap_lasso.105.bedpe
encoderoadmap_lasso.106.bedpe
encoderoadmap_lasso.107.bedpe
encoderoadmap_lasso.108.bedpe

## takes a very long time, let me try something else... 3 million rows!
## take the unique rows! see here:
## https://github.com/billgreenwald/pgltools/issues/3
## using pypy can speed it up
conda install -c ostrokach pypy

# the command-line tool will figure out whether pypy is installed or not.
cat *bedpe | cut -f1-6 | sort | uniq | pgltools sort | pgltools merge -stdInA > ENCODE_merged_interaction.bedpe

I am also trying to split the bedpe files by chromosome and merge per chromosome to speed things up (see the R sketch after the recursive merge below).

Alternative solution: merge the regions using the InteractionSet package.

Follow my previous blog post: http://crazyhottommy.blogspot.com/2016/03/breakpoints-clustering-for-structural.html

!!! Warning: if you merge all the bedpe files together, that is more than 3 million lines, and a naive merge requires a lot of RAM (more than 35G); I think the graph step uses most of the memory.

Instead, do it the recursive way below. This helps only when you expect to see many regions merged between samples.

library(rtracklayer)
library(InteractionSet)
library(RBGL)
library(here)

merge_bedpe<- function(gi){
  # self-overlap: find all pairs of interactions that overlap each other
  out<- findOverlaps(gi)
  # build a graph with interactions as nodes and overlaps as edges
  g <- ftM2graphNEL(as.matrix(out), W=NULL, V=NULL, edgemode="directed")
  edgemode(g) <- "undirected"
  # each connected component is one cluster of mutually overlapping interactions
  connections <- connectedComp(g)
  cluster <- integer(length(gi))
  for (i in seq_along(connections)) {
    cluster[as.integer(connections[[i]])] <- i
  }
  # collapse each cluster into a single bounding-box interaction
  boundingBox(gi, cluster)
}


bedpe_dir<- "data/enhancer_promoter_interaction/Encoderoadmap_EP_interaction/encoderoadmap_lasso/bedpe"

list.files(bedpe_dir, pattern="bedpe")

recursive_merge<- function(n){
  if (n == 1){
    # base case: import and merge the first file
    bedpe<- import(here(bedpe_dir, "encoderoadmap_lasso.1.bedpe"), format = "bedpe")
    gi<- GInteractions(first(bedpe), second(bedpe), mode= "strict")
    merge_bedpe(gi)
  } else {
    # recursive case: merge file n with the merged result of files 1..(n-1)
    bedpe_n<- import(here(bedpe_dir, paste0("encoderoadmap_lasso.", n, ".bedpe")), format = "bedpe")
    gi_n<- GInteractions(first(bedpe_n), second(bedpe_n), mode= "strict")
    gi_n_minus_1<- recursive_merge(n-1)
    merge_bedpe(c(gi_n, gi_n_minus_1))
  }
}

# test with the first one or two files
recursive_merge(1)
recursive_merge(2)

## 127 files in total; it takes a long time, but it will not crash your computer
## even when many pairs overlap between samples.

merged_encode_bedpe<- recursive_merge(127)
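As mentioned above, splitting by chromosome is another way to keep the overlap graph small. Below is a minimal sketch, assuming the merge_bedpe() helper above and cis-only (same-chromosome) interactions; merge_by_chrom is a hypothetical name, not part of the original pipeline:

merge_by_chrom<- function(gi){
  # chromosome of the first anchor; for cis EP interactions both anchors share it
  chroms<- unique(as.character(seqnames(anchors(gi)$first)))
  merged<- lapply(chroms, function(chr){
    keep<- as.logical(seqnames(anchors(gi)$first) == chr)
    # each per-chromosome merge builds a much smaller graph
    merge_bedpe(gi[keep])
  })
  # concatenate the per-chromosome results back into one GInteractions object
  do.call(c, merged)
}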

The trick is the gi_n_minus_1<- recursive_merge(n-1) call inside the function body: the function calls itself recursively.

This trick is similar to calculating the factorial of n in Python:

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)
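The same base-case-plus-recursive-call pattern in R (factorial_n is my own name, chosen to avoid masking base::factorial):

factorial_n<- function(n){
  if (n == 0){
    return(1)
  } else {
    return(n * factorial_n(n - 1))
  }
}

factorial_n(5)
#> [1] 120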