I wrote up a short tutorial for
PCA, MDS, k-means, Hierarchical clustering and heatmap for microarray data
and put it on Rpubs.please follow the link https://rpubs.com/crazyhottommy/PCA_MDS
Happy learning!
A wet-dry hybrid biologist's take on genetics and genomics. Mostly is about Linux, R, python, reproducible research, open science and NGS. Grab my book to transform yourself to a computational biologist https://divingintogeneticsandgenomics.ck.page/
${parameter:-defaultValue} | Get default shell variables value |
${parameter:=defaultValue} | Set default shell variables value |
${parameter:?”Error Message”} | Display an error message if parameter is not set |
${#var} | Find the length of the string |
${var%pattern} | Remove from shortest rear (end) pattern |
${var%%pattern} | Remove from longest rear (end) pattern |
${var:num1:num2} | Substring |
${var#pattern} | Remove from shortest front pattern |
${var##pattern} | Remove from longest front pattern |
${var/pattern/string} | Find and replace (only replace first occurrence) |
${var//pattern/string} | Find and replace all occurrences |
command file part_file.result
file
and print it out:file=foo.txt
echo "$file"
foo.txt
txt
to pdf
. one of the commonly known ways is to use thebasename
built-in function:echo "$(basename $file .txt).pdf"
foo.pdf
${string/pattern/replacement}
${string//pattern/replacement}
Following syntax replaces with the replacement string,
Following syntax replaces with the replacement string,
only when the pattern matches at the end of the given $string.
${string/%pattern/replacement}
sed
. see my previous blog post hereecho "${file/txt/pdf}"
foo.pdf
# a more complex exmaple
file_1=foo.txt.foo.txt
echo "${file_1//foo/bar}"
bar.txt.bar.txt
echo "${file_1/foo/bar}"
bar.txt.foo.txt
echo "${file_1/#foo/bar}"
bar.txt.foo.txt
echo "${file_1/%txt/pdf}"
foo.txt.foo.pdf
${string%substring} will delete the shortest match of substring
from back${string%%substring}
will delete the longest match of substring from backecho "${file_1%txt*}pdf"
foo.txt.foo.pdf
echo "${file_1%%txt*}pdf"
foo.pdf
{string#substring}
will delete the shortest match of substring from the begining{string##substring}
will delete the longest match of substring from the beginingecho "bar${file_1#foo*}"
bar.txt.foo.txt
echo "bar${file_1##foo*}.pdf"
bar.pdf
${string:position}
Extract substring from $string at $position
${string:position:length}
$length of characters substring from $string starting from $position
echo "${file_1:4}"
txt.foo.txt
echo "${file_1:4:7}"
txt.foo
echo "${#file_1}"
15
sudo pip -H install MACS2
wget http://www.biomedcentral.com/content/supplementary/1471-2105-13-199-s2.gz
install.packages("~/NCIS/1471-2105-13-199-s2.gz", repos = NULL, type="source")
library(NCIS); ?NCIS
--ratio
flag:macs2 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1 --ration 1.4
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdPk.narrowPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep2.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam.bai
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam.bai
diff_ChIP_test
is the project main folder and it contains four sub-folders: data
, doc
, results
and scripts
. Data are downloaded into the data
folder, scripts are in the scripts
folder and the output from the scripts are in the results
folder.While I can't speak to much of the first paragraph of your message, I wanted to let you know that when it comes to histone modification analysis, using the --call-supeaks option in the command line has proven to be a phenomenally successful way to use MACS to handle modifications (right now, I'm working on H3K4me1) which have both sharp defined peaks but also very broad "peaks" which are more like domains of H3K4me1 signals clustered together. This obviously created many situations, in part because MACS will automatically combine peaks which are separated by a distance of 10bp or less into one called peak, where artefactually broad regions were called (like 30kb peaks... yeah right!).So many people have argued, including in a recent review in Nat Immunology about ChIP-seq and peak-calling, that modifications with broad distribution (even H3K27Ac can be broad in our hands) should not be analyzed with MACS; using various biological end-points as testing I have found that this is not true and that MACS outperforms ZINBA (specifically designed for broad histone modifications) WHEN the subpeaks function is called for detection of histone peaks with broad, narrow, or both types of signal distribution.
--broad
When this flag is on, MACS will try to composite broad regions in BED12 ( a gene-model-like format ) by putting nearby highly enriched regions into a broad region with loose cutoff. The broad region is controlled by another cutoff through --broad-cutoff. The maximum length of broad region length is 4 times of d from MACS. DEFAULT: False
macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n panc1H3k27acRep1 --outdir panc1H3k27acRep1_with_model_broad
macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n panc1H3k27acRep1 --outdir panc1H3k27acRep1_without_model_broad --nomodel --extsize 146
$ Rscript NAME_model.r
d
If the d is not small ~ < 2*tag size (for those tag size < 50bp), and the model image in PDF shows clean bimodal shape, d may be good. And several bp differences on d shouldn't affect the peak detection on general transcription factor ChIP-seq much.However, for Pol2 or histone marks, things may be different. Pol2 is moving so it's not appropriate to say there is a fixed fragment size. I don't know the correct answer. For histone mark ChIP-seq, since they would have a underlying characteristic 147bp resolution for a nucleosome size, you can simply skip model building and use "--shiftsize 74 --nomodel" instead. Also if you want, you can try other software like SICER and NPS.
macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam -g hs -q 0.01 -n panc1H3k27acRep1_regular --call-summit -B --outdir panc1H3k27acRep1_without_model_regular --nomodel --extsize 146
bamCoverage -b ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam --normalizeTo1x 2451960000 --missingDataAsZero yes --binSize 10 -o panc1_H3k27acRep1_deeptool_normalized.bw
NAME_peaks.gappedPeak is in BED12+3 format which contains both the broad region and narrow peaks. The 5th column is 10*-log10qvalue, to be more compatible to show grey levels on UCSC browser. Tht 7th is the start of the first narrow peak in the region, and the 8th column is the end. The 9th column should be RGB color key, however, we keep 0 here to use the default color, so change it if you want. The 10th column tells how many blocks including the starting 1bp and ending 1bp of broad regions. The 11th column shows the length of each blocks, and 12th for the starts of each blocks. 13th: fold-change, 14th: -log10pvalue, 15th: -log10qvalue. The file can be loaded directly to UCSC genome browser.
zcat ../data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak.gz| wc -l
macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n panc1H3k27acRep1 --outdir panc1H3k27acRep1_without_model_broad --nomodel --extsize 146
bedtools intersect -a panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak -b panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak -wa | cut -f1-3 | sort | uniq | wc -l
45228bedtools intersect -a panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak -b panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak -wb | cut -f1-3 | sort | uniq | wc -l
61507bedtools intersect -a panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak -b ../data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak -wa | cut -f1-3 | sort | uniq | wc -l
35722bedtools intersect -a panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak -b ../data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak -wa | cut -f1-3 | sort | uniq | wc -l
28943cat panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak | awk '$9 >2 && $7 >2' | wc -l
41136cat panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak | awk '$9 >2 && $7 >2' | wc -l
35487--ratio
flag. NCIS library needs the ChIP bam and input control bam file as two arguments. I need to write a R script and excute on the command line by using Rscript
:## calculate the scaling factor using NCIS for ChIP-seq data
# see links https://groups.google.com/forum/#!searchin/macs-announcement/NCIS/macs-announcement/0EF4cQF09FI/2-zlu2rqfOkJ
# http://searchvoidstar.tumblr.com/post/52594053877/adding-a-custom-normalization-to-macs
# Ming Tang 06/15/2015
library(NCIS)
library(ShortRead)
options(echo=TRUE) # set to FALSE if you not want see commands in output
args <- commandArgs(trailingOnly = TRUE)
print(args)
# trailingOnly=TRUE means that only your arguments are returned, check:
# print(commandsArgs(trailingOnly=FALSE))
ChIP_bam<- args[1]
input_control_bam<- args[2]
# NCIS usese the Aligned Reads object from the shortRead package, however, it is recommended
# to use GenomicAignments package to read in the bam files
# ga_ChIP<- readGAlignments(ChIP_bam)
# ga_input<-readGAlignments(input_control_bam)
# However, the resulting GenomicAlignment object is not recognized by NCIS.
# I have to use the legacy readAligned function from the ShortRead package.
# it takes around 15mins to finish
ga_ChIP<- readAligned(ChIP_bam, type="BAM")
ga_input<-readAligned(input_control_bam, type="BAM")
res<- NCIS(ga_ChIP, ga_input, data.type="AlignedRead")
res
res$est
res$r.seq.depth
NCIS_scaling.r
and on terminal:Rscript ChIP.bam control.bam
--ratio
is also for ChIP/control. --ratio RATIO When set, use a custom scaling ratio of ChIP/control
(e.g. calculated using NCIS) for linear scaling.
DEFAULT: ingore
Rscript NCIS_scaling.r ../data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam ../data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam
Rscript NCIS_scaling.r ../data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam ../data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam
Rscript NCIS_scaling.r ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam
Rscript NCIS_scaling.r ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam
file_name | scaling factor | seq depth ratio |
---|---|---|
Mcf7Rep1 | 1.761996 | 2.091474 |
Mcf7Rep2 | 1.709105 | 1.917619 |
panc1Rep1 | 0.5861799 | 1.110217 |
panc1Rep2 | 0.5160342 | 0.8233171 |
ls -sh data/*
4.0K data/README.md
1.3G data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam
6.2M data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam.bai
1.2G data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam
6.1M data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam.bai
468K data/wgEncodeSydhHistoneMcf7H3k27acUcdPk.narrowPeak.gz
773M data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam
5.9M data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam.bai
718M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam
6.1M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam.bai
522M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam
5.9M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam.bai
4.4M data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak
678M data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam
6.1M data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam.bai
macs2 callpeak -t data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam -c data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Mcf7H3k27acUcdAlnRep1_ratio --outdir results/Mcf7H3k27acUcdAlnRep1_ratio --nomodel --extsize 146 --ratio 1.761996
macs2 callpeak -t data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam -c data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Mcf7H3k27acUcdAlnRep2_ratio --outdir results/Mcf7H3k27acUcdAlnRep2_ratio --nomodel --extsize 146 --ratio 1.709105
macs2 callpeak -t data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Panc17H3k27acUcdAlnRep1_ratio --outdir results/Panc1H3k27acUcdAlnRep1_ratio --nomodel --extsize 146 --ratio 0.5861799
macs2 callpeak -t data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam -c data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Panc17H3k27acUcdAlnRep2_ratio --outdir results/Panc1H3k27acUcdAlnRep2_ratio --nomodel --extsize 146 --ratio 0.5861799
#! /bin/bash
set -e
set -u
set -o pipefail -o errexit -o nounset
# we loop for the ChIP bam files
for bam in ../data/*H3k27ac*bam
do
# strip out only the meaningful filename to be used for output
file_name=$(echo "$bam" | sed -E "s/..\/data\/wg.+Histone(.+)(H3k27ac.+).bam/\1\2/")
# need to retain the ../data/ path. it could be simply: sed -E "s/H3k27ac/Input/" if
# every bam file has a input control
input_control=$(echo "$bam" | sed -E "s/(wg.+Histone)(.+)(H3k27ac.+).bam/\1\2InputUcdAlnRep1.bam/")
echo "processing ${file_name} bam file"
echo "the input control file is ${input_control}"
echo "calling peaks with macs2"
macs2 callpeak -t "$bam" -c "${input_control}" --broad -g hs --broad-cutoff 0.1 -n "${file_name}" --outdir ../results/"${file_name}" --nomodel --extsize 146
done
+
operator to function in sed, one needs to turn on the -E
flag for extended regular expression on macOS, or -r
on GNU sed.sed 's/\([a-z]*\).*/\1'
sed -E 's/([a-z]*).*/\1'
results
folder:Mcf7H3k27acUcdAlnRep1
,Mcf7H3k27acUcdAlnRep2
,Panc1H3k27acUcdAlnRep1
and Panc1H3k27acUcdAlnRep2
file_name | with --ratio | without --ratio | overlapping |
---|---|---|---|
Mcf7Rep1 | 25744 | 18572 | 18572 |
Mcf7Rep2 | 17564 | 12257 | 12256 |
panc1Rep1 | 348968 | 66288 | 66282 |
panc1Rep2 | 71033 | 54312 | 54229 |
Functional genomics experiments based on next-gen sequencing (e.g. ChIP-seq, MNase-seq, DNase-seq, FAIRE-seq) that measure biochemical activity of various elements in the genome often produce artifact signal in certain regions of the genome. It is important to keep track of and filter artifact regions that tend to show artificially high signal (excessive unstructured anomalous reads mapping). Below is a list of comprehensive empirical blacklists identified by the ENCODE and modENCODE consortia. Note that these blacklists were empirically derived from large compendia of data using a combination of automated heuristics and manual curation. These blacklists are applicable to functional genomic data based on short-read sequencing (20-100bp reads). These are not directly applicable to RNA-seq or any other transcriptome data types. The blacklisted regions typically appear u niquely mappable so simple mappability filters do not remove them. These regions are often found at specific types of repeats such as centromeres, telomeres and satellite repeats. It is especially important to remove these regions that computing measures of similarity such as Pearson correlation between genome-wide tracks that are especially affected by outliers.
data
folder by wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz
broad_peaks_all
inside the results
folder, and copy all the broad peaks to this folder.ls ./broad_peaks_all
Mcf7H3k27acUcdAlnRep1_peaks.broadPeak
Panc1H3k27acUcdAlnRep1_peaks.broadPeak
Mcf7H3k27acUcdAlnRep2_peaks.broadPeak
Panc1H3k27acUcdAlnRep2_peaks.broadPeak
#! /bin/bash
## filter the broad peaks from MACS output against the blacklist regions
## bedtools intersect -v
set -e
set -u
set -o pipefail -o errexit -o nounset
for peak in ../results/broad_peaks_all/*broadPeak
do
file_name=$(basename $peak .broadPeak)
bedtools intersect -a "$peak" -b ../data/wgEncodeDacMapabilityConsensusExcludable.bed -v \
> ../results/broad_peaks_all/"${file_name}".filtered.bed
done
filter_blacklist.sh
and run it at terminal in the script
folder ./fiter_blacklist.sh
.chmod u+x filter_blacklist.sh
wc -l /broad_peaks_all/*
18572 Mcf7H3k27acUcdAlnRep1_peaks.broadPeak
18548 Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed
12257 Mcf7H3k27acUcdAlnRep2_peaks.broadPeak
12239 Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed
66288 Panc1H3k27acUcdAlnRep1_peaks.broadPeak
66248 Panc1H3k27acUcdAlnRep1_peaks.filtered.bed
54312 Panc1H3k27acUcdAlnRep2_peaks.broadPeak
54235 Panc1H3k27acUcdAlnRep2_peaks.filtered.bed
rm *broadPeak
bedtools intersect -a Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed -b Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed -wa | cut -f1-3 | sort | uniq > Mcf7Rep1_peaks.bed
bedtools intersect -a Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed -b Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed -wb | cut -f1-3 | sort | uniq > Mcf7Rep2_peaks.bed
bedtools intersect -a Panc1H3k27acUcdAlnRep1_peaks.filtered.bed -b Panc1H3k27acUcdAlnRep2_peaks.filtered.bed -wa | cut -f1-3 | sort | uniq > Panc1Rep1_peaks.bed
bedtools intersect -a Panc1H3k27acUcdAlnRep1_peaks.filtered.bed -b Panc1H3k27acUcdAlnRep2_peaks.filtered.bed -wb | cut -f1-3 | sort | uniq > Panc1Rep2_peaks.bed
wc -l *
18548 Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed
12239 Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed
9679 Mcf7Rep1_peaks.bed
11319 Mcf7Rep2_peaks.bed
66248 Panc1H3k27acUcdAlnRep1_peaks.filtered.bed
54235 Panc1H3k27acUcdAlnRep2_peaks.filtered.bed
35218 Panc1Rep1_peaks.bed
44358 Panc1Rep2_peaks.bed
rm *filtered*
cat *bed | sort -k1,1 -k2,2n | bedtools merge | tee merge.bed | wc -l