There was a lot of discussion on this topic in biostars:
https://www.biostars.org/p/172566/
https://www.biostars.org/p/13516/
Two tools can be used: the glorious bedtools intersect
and the under-appreciated bedmap
.
I found bedmap
is more flexiable for my need. I have 48 peak super-enhancer bed files generated by ROSE
and want to find peaks that are present in a proportion of the samples.
Have a look of one file:
head M028-P008_C_vs_M028-P008_G_macs1_peaks_Gateway_SuperEnhancers.bed
chr5 135307797 135469385 30_MACS_peak_11809_lociStitched 2 .
chr7 47748031 47878196 18_MACS_peak_13755_lociStitched 3 .
chr2 20723436 20836114 24_MACS_peak_7567_lociStitched 1 .
chr7 129901924 130004106 13_MACS_peak_14530_lociStitched 4 .
chr1 204232472 204333056 16_MACS_peak_1819_lociStitched 14 .
chr20 55350354 55450040 14_MACS_peak_9456_lociStitched 20 .
chr20 55486991 55585604 18_MACS_peak_9466_lociStitched 13 .
chr7 31443763 31540165 14_MACS_peak_13378_lociStitched 7 .
chr15 68865245 68960466 19_MACS_peak_5648_lociStitched 16 .
chr7 47491363 47585935 9_MACS_peak_13733_lociStitched 6 .
Following the link above:
cat *.bed | sort-bed - | bedmap --count --echo --delim '\t' - | head
2 chr1 838816 879282 4_MACS_peak_3_lociStitched 581 .
21 chr1 859816 976303 11_MACS_peak_2_lociStitched 138 .
10 chr1 890960 919997 4_MACS_peak_5_lociStitched 1163 .
20 chr1 893593 963946 7_MACS_peak_7_lociStitched 33 .
10 chr1 893728 919995 3_MACS_peak_1_lociStitched 1071 .
20 chr1 893728 978529 11_MACS_peak_7_lociStitched 278 .
9 chr1 893749 917890 4_MACS_peak_5_lociStitched 1112 .
20 chr1 893831 976376 9_MACS_peak_6_lociStitched 77 .
20 chr1 894077 969891 14_MACS_peak_8_lociStitched 357 .
20 chr1 913405 963993 4_MACS_peak_7_lociStitched 276 .
Note in this case, the <ref-file>
and the <map-file>
is the same.
The first column is the number of peaks that region overlaps. So, really the number of that peak overlap with others should be column 1 -1
because a region itself will overlap with itself.
what's the maximum number of overlapping peaks?
cat *.bed | sort-bed - | bedmap --count --echo --delim '\t' - | cut -f1 | sort -k1,1nr | uniq -c | head
1 115
1 99
1 98
2 96
1 94
1 91
1 88
1 84
1 80
1 79
I only have 48 samples, so I expect to see the biggest number of 48? No, because a peak in one file can be really big and it can overlap with multiple peaks in another file e.g.
sample 1: |-----------------------|
sample 2: |----| |----| |-----|
we only have 2 samples in this example, but the single peak in sample2 has 3 peaks overlap with it. The other problem is that if sample1 has two peaks that overlap with each other, it will be counted as well:
sample 1: |-----------------------|
|-------|
sample 2: |----| |----| |-----|
For the large peak in sample1, it can have 4 peaks overlap with it.
The peaks from ROSE
has an id like4_MACS_peak_3_lociStitched
which can be shared by mulitple samples, although it is less likely than the peak id from macs14
: MACS_peak_1
. The peak calls from macs2
adds the sample name: my_sample_macs2_peaks1
.
To make the most from bedops
, one needs to add a unique id for each region of each sample.
save the following script as add_name.sh
:
set -e
set -u
set -o pipefail
file=$1
fileName=$(echo $1 | sed -E 's/_C_vs_.+//')
cat $file | awk -v id="$fileName" '$4=id"_peak"NR' OFS="\t" > ${fileName}_add_name_peaks.bed
my file names are like M028-P008_C_vs_M028-P008_G_macs1_peaks_Gateway_SuperEnhancers.bed
this line of code:fileName=$(echo $1 | sed -E 's/_C_vs_.+//')
will get M028-P008
as my sample name. change that line accordingly to your own file names.
awk -v id="$fileName" '$4=id"_peak"NR'
will change column 4 to a name like: M028-P008_peak1, M028-P008_peak2 ...
change column4 for all samples:
chmod u+x add_names.sh
find *bed | parallel -j 6 './add_name.sh {}'
check the peaks from the same file after altering column 4:
chr5 135307797 135469385 M028-P008_peak1 2 .
chr7 47748031 47878196 M028-P008_peak2 3 .
chr2 20723436 20836114 M028-P008_peak3 1 .
chr7 129901924 130004106 M028-P008_peak4 4 .
chr1 204232472 204333056 M028-P008_peak5 14 .
chr20 55350354 55450040 M028-P008_peak6 20 .
chr20 55486991 55585604 M028-P008_peak7 13 .
chr7 31443763 31540165 M028-P008_peak8 7 .
chr15 68865245 68960466 M028-P008_peak9 16 .
chr7 47491363 47585935 M028-P008_peak10 6 .
we appended the sample name to the column4.
mkdir clean_bed
# move all the newly generated file into this folder
mv *add_name_peaks.bed clean_bed
cd clean_bed
use bedmap again. This time, we use --echo-map-id-uniq
option
cat *add_name_peaks.bed | sort-bed - | bedmap --echo --echo-map-id-uniq - | awk -F"|" '(split($2, a, ";") >=48)' | wc -l
429
## this is the same as
cat *add_name_peaks.bed | sort-bed - | bedmap --count --echo --delim '\t' - | awk '$1 >=48' | wc -l
429
# but now we have more information
# depending on how many columns your bed file has, I splited column 7
cat *add_name_peaks.bed | sort-bed - | bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | head
The last column now has the id of the peaks from each sample separated by ;
.
Now, if you really wants to get the number of overlapping peaks in a sample wise manner, you need to parse the last column extracting unique sample ids.
one example:
chr1 9292931 9510959 2549-STC1_peak4 17 . 2173-STC1_peak134;2173-STC1_peak173;2357-STC1_peak5;2379-STC1_peak487;2379-STC1_peak58;2379-STC1_peak757;2549-STC1_peak4;2559-STC1_peak327;2559-STC1_peak926;2698-STC1_peak435;2761-STC1_peak728;2761-STC1_peak8;2765-STC1_peak53;2812-STC1_peak146;2812-STC1_peak257;2812-STC1_peak774;M-12_peak776;M-48_peak11;M-52_peak3;M-57_peak371;M-57_peak525;M-57_peak559;M-61_peak747;M-84_peak77;M-940_peak571;M-96_peak116;M-96_peak663;M-A2058_peak269;M-G361_peak260;M-G361_peak90;M-HS934_peak203;M-WM115_peak300;M-WM165_peak14;M-WM266_peak36;M-WM266_peak663;M028-P008_peak707;M035-P010_peak212;M137-P008_peak620;M137-P008_peak797;M263-P011_peak327;M275-P008_peak177;M275-P008_peak813;M305-P009_peak644;M306-P008_peak377;M306-P008_peak552;M399-P010_peak104;M399-P010_peak480;M409-P011_peak290;M642-P008_peak9;M762-P008_peak470;M762-P008_peak679;M822-P010_peak235;M822-P010_peak553;M827-P011_peak261;M852-P008_peak291;M852-P008_peak428
remove the suffix _peak4
etc for column7 (requires gawk
)
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
cat *add_name_peaks.bed | sort-bed - | bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | head -1 | gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g", $7); $7=a; print}'
chr1 2089095 2255972 M-48_peak15 24 . 2173-STC1;2173-STC1;2357-STC1;2379-STC1;2379-STC1;2549-STC1;2559-STC1;2559-STC1;2761-STC1;2761-STC1;2765-STC1;2792-STC1;2792-STC1;2812-STC1;2812-STC1;M-22;M-48;M-52;M-52;M-57;M-57;M-61;M-61;M-84;M-940;M-940;M-96;M-96;M-A2058;M-A2058;M-G361;M-G361;M-HS934;M-HS934;M-WM165;M-WM266;M-WM266;M028-P008;M035-P010;M137-P008;M263-P011;M263-P011;M275-P008;M306-P008;M357-P010;M357-P010;M399-P010;M399-P010;M527-P010;M527-P010;M642-P008;M721-P010;M721-P010;M749-P010;M749-P010;M762-P008;M762-P008;M822-P010;M822-P010;M852-P008
you see the last column is now only containing the sample names.
The awk commands becoming awkward. it maybe a good time to switch to python/R for more text manipulation. let me stick with unix commands for now :)
The next step is to get the unique sample count from column 7.
Final touch up for all files
cat *add_name_peaks.bed | sort-bed - | bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g", $7); $7=a; print}' > merged_super.bed
while read line;do cnt=`echo "$line" | cut -f7 | tr ";" "\n" | sort -u | wc -l`;echo -e "$line\t$cnt";done < merged_super.bed > merged_super_clean.bed
or the R way:
library(stringr)
library(tidyverse)
bed<- read_tsv("merged_super.bed", col_names = F)
bed %>% mutate(X8 = unlist(lapply(str_split(X7, ";"), function(x) length(unique(x))))) %>% write.table("merged_super_clean.bed", quote =F, sep = "\t", row.names =F, col.names =F)
you can change the cut-off based on how many you want to validate.
cat merged_super_clean.bed| awk '$8 >= 40' | sort-bed - | bedtools merge -i - | wc -l
35
cat merged_super_clean.bed| awk '$8 >= 40' | sort-bed - | bedtools merge -i - > melanoma_super_40_sample.bed
since I may do this for multiple times, I put the shell scripts together require bedtools and bedops in your path.
add unique name to the peak file to the fourth column put the bed files in to the same folder and do
find *sorted.bed | parallel '../../../scripts/add_uinque_name_for_peak.sh {} _macs1_peaks_Gateway_SuperEnhancers.sorted.bed'
add_uinque_name_for_peak.sh
:
#! /bin/bash
set -euo pipefail
file=$1
#the pattern you want to remove from the end of the file name
pattern=$2
sampleName=${file%${pattern}}
## add unique names for each peak for each sample
cat $file | cut -f1-3| awk -v id="$sampleName" '$4=id"_peak"NR' OFS="\t" > ${sampleName}_add_name_peaks.bed
add the number of samples each peak present to the sixth column.
get_peaks_present_in_a_subset_of_samples.sh
:
#! /bin/bash
set -euo pipefail
# see https://www.unix.com/unix-for-beginners-questions-and-answers/270526-awk-count-unique-element-array.html
cat *add_name_peaks.bed \
| sort-bed - \
| bedmap --echo --echo-map-id-uniq --delim '\t' - \
| gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g", $5); $5=a; print}' \
| awk '{split(x,C); n=split($5,F,/;/); for(i in F) if(C[F[i]]++) n--; print $0 "\t" n}'
The output can be futher filtered
by e.g awk '$6 >=10'
to filter down the peaks present in at least 10 samples. The output then can be piped
to bedtools merge -i
to merge to get your final peak set.
Because the super-enhancer size is usually big, ~100kb. we need to report all genes that overlap with it using addFlankGeneInfo
.
library(ChIPseeker)
library(rtracklayer)
bed<- import("melanoma_super_40_sample.bed", format= "BED")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## see https://github.com/GuangchuangYu/ChIPseeker/issues/55
#
super_anno <- annotatePeak(bed, tssRegion=c(-3000, 3000),
TxDb=txdb, annoDb="org.Hs.eg.db", level = "gene", overlap = "TSS",
addFlankGeneInfo = TRUE, flankDistance = 5000)
super_anno<-super_anno %>%
as.data.frame()%>%
dplyr::select(seqnames,start, end, geneId, distanceToTSS, SYMBOL, flank_geneIds, flank_gene_distances) %>%
mutate(flank_geneIds =strsplit(as.character(flank_geneIds), ";")) %>%
mutate(flank_gene_distances =strsplit(as.character(flank_gene_distances), ";")) %>%
unnest(flank_geneIds, flank_gene_distances)
## get the gene symbol for the flank genes.
super_flank_gene_symbols<- AnnotationDbi::select(org.Hs.eg.db, keys=super_anno$flank_geneIds,
columns="SYMBOL", keytype="ENTREZID") %>%
dplyr::rename(flank_symbol= SYMBOL)
super_anno<- cbind(super_anno, super_flank_gene_symbols)
write.table(super_anno, "melanoma_super_40samples_anno.tsv", quote =F, sep = "\t", row.names =F, col.names = T)
Amazing Post good artical your site
ReplyDeleteEbay Pakistan | Ebay in Pakistan | Ebay Products in Pakistan