When structural variants are represented in bedpe
format, each structural variant is represented as two linked breakpoints. Each breakpoint is represented as a genomic interval (GRanges). It is very similar to representations of Genomic interactions (data from Hi-C, ChIA-PET):
e.g.
chr1 100 200 chr2 300 400
one of the common task is to merge the breakpoints when they are overlapping.
———|———|———————–|———-|—–
—–|———|————————-|—————|—
merged to
—–|————-|———————|—————|—
It is not a trivial problem as stated by Aaron Qunlan in this
post
It is the so called “breakpoint clustering” problem. I am going to use InteractionSet
bioconductor package to solve this problem.
Please also check
clusterPairs
function in the
diffHic package.
see
answer from the author of
InteractionSet
. Note that
InteractionSet
is still under development. Please refer to the tutorial for more
usages. Install it using devtools:
install_github("LTLA/InteractionSet").
Make some dummy GRanges.
library(InteractionSet)
all.regions <- GRanges(rep("chrA",8),
IRanges(c(1,6,2,9,5,2,15,20), c(3,10,4,12,7,4,18,23)))
index.1 <- c(1,3,5,7)
index.2 <- c(2,4,6,8)
Using mode
=strict
when constructing the GInteraction object or using the swapAnchors
method is to ensure that the first anchor index is always less than the second anchor index for each interaction. This eliminates redundant permutations of anchor regions and ensures that an interaction between regions #1 and #2 is treated the same as an interaction between regions #2 and #1. Obviously, this assumes that redundant permutations are uninteresting.
gi <- GInteractions(index.1, index.2, all.regions, mode ="strict")
gi
## StrictGInteractions object with 4 interactions and 0 metadata columns:
## seqnames1 ranges1 seqnames2 ranges2
## <Rle> <IRanges> <Rle> <IRanges>
## [1] chrA [ 1, 3] --- chrA [ 6, 10]
## [2] chrA [ 2, 4] --- chrA [ 9, 12]
## [3] chrA [ 2, 4] --- chrA [ 5, 7]
## [4] chrA [15, 18] --- chrA [20, 23]
## -------
## regions: 8 ranges and 0 metadata columns
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The first three pairs of interactions can be merged. Note that two anchors of each pair are overlapping. First, use findOverlaps
for GInteraction object to find out which pairs with two anchors are overlapping. (2D overlapping)
out<- findOverlaps(gi)
out
## Hits object with 8 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 1 2
## [3] 1 3
## [4] 2 1
## [5] 2 2
## [6] 3 1
## [7] 3 3
## [8] 4 4
## -------
## queryLength: 4 / subjectLength: 4
This will identify all pairs of interactions in gi that have two-dimensional overlaps with each other, i.e., both anchor regions in one interaction overlap with corresponding anchor regions in the other interaction. Using graph algorithm to cluster the anchors (I myself is by no means an algorithm person :)). We then do:
library(RBGL)
g <- ftM2graphNEL(as.matrix(out), W=NULL, V=NULL, edgemode="directed")
library(Rgraphviz)
plot(g)
edgemode(g) <- "undirected"
plot(g)
connections <- connectedComp(g)
To identify groups of interactions that overlap at least one other interaction in the same group (i.e., “single-linkage clusters” of interactions that have overlapping areas in the two-dimensional interaction space). The clusters can be explicitly identified using:
cluster <- integer(length(gi))
for (i in seq_along(connections)) {
cluster[as.integer(connections[[i]])] <- i
}
We can then identify the bounding box (merge the overlapping interactions) for each cluster of interactions with:
boundingBox(gi, cluster)
## GInteractions object with 2 interactions and 0 metadata columns:
## seqnames1 ranges1 seqnames2 ranges2
## <Rle> <IRanges> <Rle> <IRanges>
## 1 chrA [ 1, 4] --- chrA [ 5, 12]
## 2 chrA [15, 18] --- chrA [20, 23]
## -------
## regions: 4 ranges and 0 metadata columns
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
overlap with one anchor and get the other anchor
I have of.interest
overlaps with the second
anchor, and I want to return the first
anchor. how can I do it?
e.g.
first (promoters) second
——|————-|———————-|————|——– gi
|------------| of.interest (enhancers)
of.interest <- GRanges(c("chrA","chrA"), IRanges(start=c(2, 5),end=c(8,8)))
hits1 <- findOverlaps(gi, of.interest, use.region="first")
anchors(gi[queryHits(hits1),], type="first")
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrA [1, 3] *
## [2] chrA [2, 4] *
## [3] chrA [2, 4] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
anchors(gi[queryHits(hits1),], type="second")
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrA [6, 10] *
## [2] chrA [9, 12] *
## [3] chrA [5, 7] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
of.interest[subjectHits(hits1)]
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrA [2, 8] *
## [2] chrA [2, 8] *
## [3] chrA [2, 8] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
hits2 <- findOverlaps(gi, of.interest, use.region="second")
anchors(gi[queryHits(hits2),], type="second")
## GRanges object with 4 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrA [6, 10] *
## [2] chrA [6, 10] *
## [3] chrA [5, 7] *
## [4] chrA [5, 7] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
anchors(gi[queryHits(hits2),], type="first")
## GRanges object with 4 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrA [1, 3] *
## [2] chrA [1, 3] *
## [3] chrA [2, 4] *
## [4] chrA [2, 4] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
of.interest[subjectHits(hits2)]
## GRanges object with 4 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrA [2, 8] *
## [2] chrA [5, 8] *
## [3] chrA [2, 8] *
## [4] chrA [5, 8] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If you’re willing to sacrifice some information, you might consider using the linearize method:
gi$index <- seq_along(gi)
linearize(gi, of.interest[1])
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | index
## <Rle> <IRanges> <Rle> | <integer>
## [1] chrA [1, 10] * | 1
## [2] chrA [9, 12] * | 2
## [3] chrA [2, 7] * | 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
linearize(gi, of.interest[2])
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | index
## <Rle> <IRanges> <Rle> | <integer>
## [1] chrA [1, 3] * | 1
## [2] chrA [2, 4] * | 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
This will return a GRanges containing the “other” anchor region for all interactions that overlap each entry in of.interest (you can figure out what those interactions were by looking at index in the returned object). linearize also provides options for handling “internal” interactions where both anchor regions overlap the specified entry of of.interest; by default, the union of the two anchor regions is returned for such interactions, but they can be removed entirely by setting internal=FALSE.