I have download the enhancer-promoter interaction data from this paper Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines
and I want to create a super-set of EP interaction across all cell lines. It turns out to be not that easy...
tools to use: https://github.com/billgreenwald/pgltools
#only for encode and roadmap data
mkdir Encoderoadmap_EP_interaction
cd Encoderoadmap_EP_interaction
wget http://yiplab.cse.cuhk.edu.hk/jeme/encoderoadmap_lasso.zip
have a look of the file
cd /home/mtang1/projects/CCLE_project/data/enhancer_promoter_interaction/Encoderoadmap_EP_interaction/encoderoadmap_lasso
head encoderoadmap_lasso.1.csv
chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49
chrX:100046800-100048000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.37
chrX:100128800-100130000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.39
chrX:99749000-99751000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.47
chrX:99851000-99853000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99854000-99856200,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.68
chrX:99858800-99859600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.59
chrX:99863600-99865600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99866800-99867800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
chrX:99868400-99868600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
### reformat to pgl format to use the pgltools
### $ sign is tricky to play with in grep, tr to : first
## extend 2kb on the tss
cat encoderoadmap_lasso.1.csv | tr "$" ":" | sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' | tr ":,-" "\t" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' | pgltools formatbedpe | sed 's/[ \t]*$//'> encoderoadmap_lasso.bedpe
csv2bedpe.sh
:
#! /bin/bash
set -eu
set -o pipefail
oprefix=$(basename $1 ".csv")
cat $1 | tr "$" ":" | sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' | tr ":,-" "\t" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' | pgltools formatbedpe | sed 's/[ \t]*$//' | awk '{print $0"\tinteraction_"NR"\t.\t.\t."}' > ${oprefix}.bedpem
find *csv | parallel -j 4 ./csv2bedpe.sh {}
mkdir bedpe
mv *bedpe bedpe/
cd bedpe
ls | head
encoderoadmap_lasso.100.bedpe
encoderoadmap_lasso.101.bedpe
encoderoadmap_lasso.102.bedpe
encoderoadmap_lasso.103.bedpe
encoderoadmap_lasso.104.bedpe
encoderoadmap_lasso.105.bedpe
encoderoadmap_lasso.106.bedpe
encoderoadmap_lasso.107.bedpe
encoderoadmap_lasso.108.bedpe
## takes very long time, let me try something else..., 3 million rows!
## take the unique rows! see here
## https://github.com/billgreenwald/pgltools/issues/3
## using pypy can speed up
conda install -c ostrokach pypy
# the command line tool will figure out pypy is installed or not.
cat *bedpe | cut -f1-6 | sort | uniq | pgltools sort | pgltools merge -stdInA > ENCODE_merged_interaction.bedpe
I am trying splitting by chromosome and merge bedpe by chromsome to speed up.
follow my previous blog post http://crazyhottommy.blogspot.com/2016/03/breakpoints-clustering-for-structural.html
!!! warning if merge all bedpe togeter, it is more than 3 million lines, and a naive merging will require a lot of RAM (more than 35G), the graph is using a lot of memories I think.
instead, do the recursive way below. This helps only when you expect to see many regions that are merged between samples.
library(rtracklayer)
library(InteractionSet)
library(here)
merge_bedpe<- function(gi){
out<- findOverlaps(gi)
library(RBGL)
g <- ftM2graphNEL(as.matrix(out), W=NULL, V=NULL, edgemode="directed")
edgemode(g) <- "undirected"
connections <- connectedComp(g)
cluster <- integer(length(gi))
for (i in seq_along(connections)) {
cluster[as.integer(connections[[i]])] <- i
}
boundingBox(gi, cluster)
}
list.files("data/enhancer_promoter_interaction/Encoderoadmap_EP_interaction/encoderoadmap_lasso/bedpe", pattern="bedpe")
bedpe_dir<- "data/enhancer_promoter_interaction/Encoderoadmap_EP_interaction/encoderoadmap_lasso/bedpe"
library(rtracklayer)
recursive_merge<- function(n){
if (n == 1){
bedpe<- import(here(bedpe_dir, "encoderoadmap_lasso.1.bedpe"), format = "bedpe")
gi<- GInteractions(first(bedpe), second(bedpe), mode= "strict")
merge_bedpe(gi)
} else
{
bedpe_n<- import(here(bedpe_dir, paste0("encoderoadmap_lasso.", n, ".bedpe")), format = "bedpe")
gi_n<- GInteractions(first(bedpe_n), second(bedpe_n), mode= "strict")
gi_n_minus_1<- recursive_merge(n-1)
merge_bedpe(c(gi_n, gi_n_minus_1))
}
}
recursive_merge(1)
recursive_merge(2)
## total 127 files, it takes long, but will not crash your computer if you expect to see many overlapping pairs between samples.
merged_encode_bedpe<- recursive_merge(127)
The trick is the gi_n_minus_1<- recursive_merge(n-1)
inside the function, it is calling the function itself recursively.
This trick is similar in python to calculate the factorial of n:
def factorial(n):
if n == 0:
return 1
else:
return n * factorial(n - 1)
No comments:
Post a Comment