Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Wednesday, May 16, 2018

get the peaks that shared in multiple samples

get the peaks that shared in multiple samples

There was a lot of discussion on this topic in biostars:

https://www.biostars.org/p/172566/
https://www.biostars.org/p/13516/

Two tools can be used: the glorious bedtools intersect and the under-appreciated bedmap.

I found bedmap is more flexiable for my need. I have 48 peak super-enhancer bed files generated by ROSE and want to find peaks that are present in a proportion of the samples.

Have a look of one file:

head M028-P008_C_vs_M028-P008_G_macs1_peaks_Gateway_SuperEnhancers.bed
chr5	135307797	135469385	30_MACS_peak_11809_lociStitched	2	.
chr7	47748031	47878196	18_MACS_peak_13755_lociStitched	3	.
chr2	20723436	20836114	24_MACS_peak_7567_lociStitched	1	.
chr7	129901924	130004106	13_MACS_peak_14530_lociStitched	4	.
chr1	204232472	204333056	16_MACS_peak_1819_lociStitched	14	.
chr20	55350354	55450040	14_MACS_peak_9456_lociStitched	20	.
chr20	55486991	55585604	18_MACS_peak_9466_lociStitched	13	.
chr7	31443763	31540165	14_MACS_peak_13378_lociStitched	7	.
chr15	68865245	68960466	19_MACS_peak_5648_lociStitched	16	.
chr7	47491363	47585935	9_MACS_peak_13733_lociStitched	6	.

Following the link above:

cat *.bed | sort-bed -  | bedmap --count --echo --delim '\t' - | head
2	chr1	838816	879282	4_MACS_peak_3_lociStitched	581	.
21	chr1	859816	976303	11_MACS_peak_2_lociStitched	138	.
10	chr1	890960	919997	4_MACS_peak_5_lociStitched	1163	.
20	chr1	893593	963946	7_MACS_peak_7_lociStitched	33	.
10	chr1	893728	919995	3_MACS_peak_1_lociStitched	1071	.
20	chr1	893728	978529	11_MACS_peak_7_lociStitched	278	.
9	chr1	893749	917890	4_MACS_peak_5_lociStitched	1112	.
20	chr1	893831	976376	9_MACS_peak_6_lociStitched	77	.
20	chr1	894077	969891	14_MACS_peak_8_lociStitched	357	.
20	chr1	913405	963993	4_MACS_peak_7_lociStitched	276	.

Note in this case, the <ref-file> and the <map-file> is the same.

The first column is the number of peaks that region overlaps. So, really the number of that peak overlap with others should be column 1 -1 because a region itself will overlap with itself.

what's the maximum number of overlapping peaks?

cat *.bed | sort-bed -  | bedmap --count --echo --delim '\t' - | cut -f1 | sort -k1,1nr | uniq -c | head
      1 115
      1 99
      1 98
      2 96
      1 94
      1 91
      1 88
      1 84
      1 80
      1 79

I only have 48 samples, so I expect to see the biggest number of 48? No, because a peak in one file can be really big and it can overlap with multiple peaks in another file e.g.

sample 1:     |-----------------------|
sample 2:   |----|    |----|      |-----|

we only have 2 samples in this example, but the single peak in sample2 has 3 peaks overlap with it. The other problem is that if sample1 has two peaks that overlap with each other, it will be counted as well:

sample 1:     |-----------------------|
                                |-------|

sample 2:   |----|    |----|      |-----|

For the large peak in sample1, it can have 4 peaks overlap with it.

add a unique id for each sample in column4

The peaks from ROSE has an id like4_MACS_peak_3_lociStitched which can be shared by mulitple samples, although it is less likely than the peak id from macs14 : MACS_peak_1. The peak calls from macs2 adds the sample name: my_sample_macs2_peaks1.

To make the most from bedops, one needs to add a unique id for each region of each sample.

save the following script as add_name.sh:

set -e
set -u
set -o pipefail

file=$1

fileName=$(echo $1 | sed -E 's/_C_vs_.+//')

cat $file | awk -v id="$fileName" '$4=id"_peak"NR'  OFS="\t" > ${fileName}_add_name_peaks.bed

my file names are like M028-P008_C_vs_M028-P008_G_macs1_peaks_Gateway_SuperEnhancers.bed
this line of code:fileName=$(echo $1 | sed -E 's/_C_vs_.+//') will get M028-P008 as my sample name. change that line accordingly to your own file names.

awk -v id="$fileName" '$4=id"_peak"NR' will change column 4 to a name like: M028-P008_peak1, M028-P008_peak2 ...

change column4 for all samples:

chmod u+x add_names.sh

find *bed | parallel -j 6 './add_name.sh {}'

check the peaks from the same file after altering column 4:

chr5	135307797	135469385	M028-P008_peak1	2	.
chr7	47748031	47878196	M028-P008_peak2	3	.
chr2	20723436	20836114	M028-P008_peak3	1	.
chr7	129901924	130004106	M028-P008_peak4	4	.
chr1	204232472	204333056	M028-P008_peak5	14	.
chr20	55350354	55450040	M028-P008_peak6	20	.
chr20	55486991	55585604	M028-P008_peak7	13	.
chr7	31443763	31540165	M028-P008_peak8	7	.
chr15	68865245	68960466	M028-P008_peak9	16	.
chr7	47491363	47585935	M028-P008_peak10	6	.

we appended the sample name to the column4.

mkdir clean_bed
# move all the newly generated file into this folder
mv *add_name_peaks.bed clean_bed

cd clean_bed

use bedmap again. This time, we use --echo-map-id-uniq option


cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq - | awk -F"|" '(split($2, a, ";") >=48)' | wc -l
429

## this is the same as 

cat *add_name_peaks.bed | sort-bed -  | bedmap --count --echo --delim '\t' - | awk '$1 >=48' | wc -l
429

# but now we have more information
# depending on how many columns your bed file has, I splited column 7
cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | head

The last column now has the id of the peaks from each sample separated by ;.

Now, if you really wants to get the number of overlapping peaks in a sample wise manner, you need to parse the last column extracting unique sample ids.

one example:

chr1	9292931	9510959	2549-STC1_peak4	17	.	2173-STC1_peak134;2173-STC1_peak173;2357-STC1_peak5;2379-STC1_peak487;2379-STC1_peak58;2379-STC1_peak757;2549-STC1_peak4;2559-STC1_peak327;2559-STC1_peak926;2698-STC1_peak435;2761-STC1_peak728;2761-STC1_peak8;2765-STC1_peak53;2812-STC1_peak146;2812-STC1_peak257;2812-STC1_peak774;M-12_peak776;M-48_peak11;M-52_peak3;M-57_peak371;M-57_peak525;M-57_peak559;M-61_peak747;M-84_peak77;M-940_peak571;M-96_peak116;M-96_peak663;M-A2058_peak269;M-G361_peak260;M-G361_peak90;M-HS934_peak203;M-WM115_peak300;M-WM165_peak14;M-WM266_peak36;M-WM266_peak663;M028-P008_peak707;M035-P010_peak212;M137-P008_peak620;M137-P008_peak797;M263-P011_peak327;M275-P008_peak177;M275-P008_peak813;M305-P009_peak644;M306-P008_peak377;M306-P008_peak552;M399-P010_peak104;M399-P010_peak480;M409-P011_peak290;M642-P008_peak9;M762-P008_peak470;M762-P008_peak679;M822-P010_peak235;M822-P010_peak553;M827-P011_peak261;M852-P008_peak291;M852-P008_peak428

remove the suffix _peak4 etc for column7 (requires gawk) https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | head -1 | gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g",  $7); $7=a; print}'
chr1	2089095	2255972	M-48_peak15	24	.	2173-STC1;2173-STC1;2357-STC1;2379-STC1;2379-STC1;2549-STC1;2559-STC1;2559-STC1;2761-STC1;2761-STC1;2765-STC1;2792-STC1;2792-STC1;2812-STC1;2812-STC1;M-22;M-48;M-52;M-52;M-57;M-57;M-61;M-61;M-84;M-940;M-940;M-96;M-96;M-A2058;M-A2058;M-G361;M-G361;M-HS934;M-HS934;M-WM165;M-WM266;M-WM266;M028-P008;M035-P010;M137-P008;M263-P011;M263-P011;M275-P008;M306-P008;M357-P010;M357-P010;M399-P010;M399-P010;M527-P010;M527-P010;M642-P008;M721-P010;M721-P010;M749-P010;M749-P010;M762-P008;M762-P008;M822-P010;M822-P010;M852-P008

you see the last column is now only containing the sample names.

The awk commands becoming awkward. it maybe a good time to switch to python/R for more text manipulation. let me stick with unix commands for now :)

The next step is to get the unique sample count from column 7.

Final touch up for all files

cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g",  $7); $7=a; print}' > merged_super.bed 

while read line;do cnt=`echo "$line" | cut -f7 | tr ";" "\n" | sort -u | wc -l`;echo -e "$line\t$cnt";done < merged_super.bed > merged_super_clean.bed

or the R way:

library(stringr)
library(tidyverse)

bed<- read_tsv("merged_super.bed", col_names = F)

bed %>% mutate(X8 = unlist(lapply(str_split(X7, ";"), function(x) length(unique(x))))) %>% write.table("merged_super_clean.bed", quote =F, sep = "\t", row.names =F, col.names =F)

select super-enhancers present in 40 out of 48 samples

you can change the cut-off based on how many you want to validate.

cat merged_super_clean.bed| awk '$8 >= 40' | sort-bed - | bedtools merge -i - | wc -l

35

cat merged_super_clean.bed| awk '$8 >= 40' | sort-bed - | bedtools merge -i - > melanoma_super_40_sample.bed

write shell scripts for the whole process.

since I may do this for multiple times, I put the shell scripts together require bedtools and bedops in your path.

add unique name to the peak file to the fourth column put the bed files in to the same folder and do

find *sorted.bed | parallel '../../../scripts/add_uinque_name_for_peak.sh {} _macs1_peaks_Gateway_SuperEnhancers.sorted.bed'

add_uinque_name_for_peak.sh:

#! /bin/bash

set -euo pipefail

file=$1
#the pattern you want to remove from the end of the file name
pattern=$2

sampleName=${file%${pattern}}

## add unique names for each peak for each sample

cat $file | cut -f1-3|  awk -v id="$sampleName" '$4=id"_peak"NR'  OFS="\t" > ${sampleName}_add_name_peaks.bed

add the number of samples each peak present to the sixth column.

get_peaks_present_in_a_subset_of_samples.sh:

#! /bin/bash

set -euo pipefail

# see https://www.unix.com/unix-for-beginners-questions-and-answers/270526-awk-count-unique-element-array.html

cat  *add_name_peaks.bed \
| sort-bed - \
| bedmap --echo --echo-map-id-uniq --delim '\t' - \
| gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g",  $5); $5=a; print}' \
| awk '{split(x,C); n=split($5,F,/;/); for(i in F) if(C[F[i]]++) n--; print $0 "\t" n}'

The output can be futher filtered by e.g awk '$6 >=10' to filter down the peaks present in at least 10 samples. The output then can be piped to bedtools merge -i to merge to get your final peak set.

annotate peaks with ChIPseeker

Because the super-enhancer size is usually big, ~100kb. we need to report all genes that overlap with it using addFlankGeneInfo.

see YuLab-SMU/ChIPseeker#55

library(ChIPseeker)
library(rtracklayer)

bed<- import("melanoma_super_40_sample.bed", format= "BED")

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## see https://github.com/GuangchuangYu/ChIPseeker/issues/55
#
super_anno <- annotatePeak(bed, tssRegion=c(-3000, 3000), 
                         TxDb=txdb, annoDb="org.Hs.eg.db", level = "gene", overlap = "TSS",
                          addFlankGeneInfo = TRUE,  flankDistance = 5000)

super_anno<-super_anno %>% 
  as.data.frame()%>% 
  dplyr::select(seqnames,start, end, geneId, distanceToTSS, SYMBOL, flank_geneIds, flank_gene_distances) %>%
  mutate(flank_geneIds =strsplit(as.character(flank_geneIds), ";")) %>% 
  mutate(flank_gene_distances =strsplit(as.character(flank_gene_distances), ";")) %>%
  unnest(flank_geneIds, flank_gene_distances) 

## get the gene symbol for the flank genes.
super_flank_gene_symbols<- AnnotationDbi::select(org.Hs.eg.db, keys=super_anno$flank_geneIds, 
                                    columns="SYMBOL", keytype="ENTREZID") %>%
  dplyr::rename(flank_symbol= SYMBOL)

super_anno<- cbind(super_anno, super_flank_gene_symbols)


write.table(super_anno,  "melanoma_super_40samples_anno.tsv", quote =F, sep = "\t", row.names =F, col.names = T)

1 comment: