ChIP-seq-upset
Ming Tang
January 25, 2016
I am going to demonstrate how to make an UpSet plot for ChIP-seq peaks using the UpSetR package. After running a peak caller on 3 samples, you may want to know how many peaks are shared by two samples and how many by all three. People often draw a Venn diagram for this kind of problem, but with more than 3 samples a Venn diagram becomes hard to read.
# install it if you have not
# install.packages("UpSetR")
library(UpSetR)
dummy<- data.frame(peak=c("peak_1", "peak_1", "peak_1","peak_2","peak_2","peak_3","peak_3","peak_4"),
sample=c("sample_1","sample_2","sample_3","sample_2","sample_3","sample_2","sample_3","sample_1"))
dummy
## peak sample
## 1 peak_1 sample_1
## 2 peak_1 sample_2
## 3 peak_1 sample_3
## 4 peak_2 sample_2
## 5 peak_2 sample_3
## 6 peak_3 sample_2
## 7 peak_3 sample_3
## 8 peak_4 sample_1
To get the format that UpSetR accepts, we have to “spread” the sample column into multiple columns, one per sample. This is the same as converting long-format data to wide format. See here as well.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
peaks<- dummy %>% mutate(value =1) %>% spread(sample, value, fill =0 )
peaks
## peak sample_1 sample_2 sample_3
## 1 peak_1 1 1 1
## 2 peak_2 0 1 1
## 3 peak_3 0 1 1
## 4 peak_4 1 0 0
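If you would rather not depend on tidyr for this step, the same long-to-wide reshaping can be sketched in base R with table(). This is only an illustration on the toy data above; the names counts, membership and peaks_base are made up for the example and are not part of UpSetR.

```r
# Same toy data as above (long format: one row per peak/sample pair)
dummy <- data.frame(
  peak   = c("peak_1", "peak_1", "peak_1", "peak_2",
             "peak_2", "peak_3", "peak_3", "peak_4"),
  sample = c("sample_1", "sample_2", "sample_3", "sample_2",
             "sample_3", "sample_2", "sample_3", "sample_1"),
  stringsAsFactors = FALSE
)

# Cross-tabulate peaks against samples; each cell counts peak/sample pairs
counts <- table(dummy$peak, dummy$sample)

# Turn the table into a data frame and cap counts at 1, so that
# duplicated rows in the input would still give a 0/1 membership matrix
membership <- as.data.frame.matrix(counts)
membership[membership > 0] <- 1L

# Move the peak names from the row names into an ordinary column
peaks_base <- data.frame(peak = rownames(membership), membership,
                         row.names = NULL, stringsAsFactors = FALSE)
peaks_base
```

The result has the same layout as peaks: one row per peak and one 0/1 column per sample.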
I noticed that a local data frame (the result of tbl_df()) has to be converted back to a plain data.frame for UpSetR to work. In this example, I did not convert peaks to a local data frame with tbl_df(peaks), so it can be passed to upset() as it is.
upset(peaks, order.by = "freq")
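To sanity-check the bar heights in the plot, the intersection sizes can also be tallied by hand. The sketch below is base R only and assumes the plot counts each peak in exactly one combination of samples (the exclusive intersections); sample_cols, combo and intersection_sizes are names invented for the illustration.

```r
# Wide-format toy data, as produced by the spread() step above
peaks <- data.frame(
  peak     = c("peak_1", "peak_2", "peak_3", "peak_4"),
  sample_1 = c(1, 0, 0, 1),
  sample_2 = c(1, 1, 1, 0),
  sample_3 = c(1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Every column except the peak identifier is treated as a set
sample_cols <- setdiff(names(peaks), "peak")

# Label each peak with the exact combination of samples it occurs in
combo <- apply(peaks[, sample_cols] == 1, 1, function(present) {
  paste(sample_cols[present], collapse = " & ")
})

# Count peaks per combination, largest first (like order.by = "freq")
intersection_sizes <- sort(table(combo), decreasing = TRUE)
intersection_sizes
```

For the toy data this gives two peaks found only in sample_2 and sample_3, one peak found in all three samples, and one peak found only in sample_1.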
How do I make files for genomics data? If I have, let's say, three samples and 10 genes, where the samples are in columns and the genes in rows, how do I turn this into input for an UpSet plot?