Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Wednesday, May 16, 2018

get the peaks that shared in multiple samples

get the peaks that shared in multiple samples

There was a lot of discussion on this topic in biostars:

https://www.biostars.org/p/172566/
https://www.biostars.org/p/13516/

Two tools can be used: the glorious bedtools intersect and the under-appreciated bedmap.

I found bedmap is more flexiable for my need. I have 48 peak super-enhancer bed files generated by ROSE and want to find peaks that are present in a proportion of the samples.

Have a look of one file:

head M028-P008_C_vs_M028-P008_G_macs1_peaks_Gateway_SuperEnhancers.bed
chr5	135307797	135469385	30_MACS_peak_11809_lociStitched	2	.
chr7	47748031	47878196	18_MACS_peak_13755_lociStitched	3	.
chr2	20723436	20836114	24_MACS_peak_7567_lociStitched	1	.
chr7	129901924	130004106	13_MACS_peak_14530_lociStitched	4	.
chr1	204232472	204333056	16_MACS_peak_1819_lociStitched	14	.
chr20	55350354	55450040	14_MACS_peak_9456_lociStitched	20	.
chr20	55486991	55585604	18_MACS_peak_9466_lociStitched	13	.
chr7	31443763	31540165	14_MACS_peak_13378_lociStitched	7	.
chr15	68865245	68960466	19_MACS_peak_5648_lociStitched	16	.
chr7	47491363	47585935	9_MACS_peak_13733_lociStitched	6	.

Following the link above:

cat *.bed | sort-bed -  | bedmap --count --echo --delim '\t' - | head
2	chr1	838816	879282	4_MACS_peak_3_lociStitched	581	.
21	chr1	859816	976303	11_MACS_peak_2_lociStitched	138	.
10	chr1	890960	919997	4_MACS_peak_5_lociStitched	1163	.
20	chr1	893593	963946	7_MACS_peak_7_lociStitched	33	.
10	chr1	893728	919995	3_MACS_peak_1_lociStitched	1071	.
20	chr1	893728	978529	11_MACS_peak_7_lociStitched	278	.
9	chr1	893749	917890	4_MACS_peak_5_lociStitched	1112	.
20	chr1	893831	976376	9_MACS_peak_6_lociStitched	77	.
20	chr1	894077	969891	14_MACS_peak_8_lociStitched	357	.
20	chr1	913405	963993	4_MACS_peak_7_lociStitched	276	.

Note in this case, the <ref-file> and the <map-file> is the same.

The first column is the number of peaks that region overlaps. So, really the number of that peak overlap with others should be column 1 -1 because a region itself will overlap with itself.

what's the maximum number of overlapping peaks?

cat *.bed | sort-bed -  | bedmap --count --echo --delim '\t' - | cut -f1 | sort -k1,1nr | uniq -c | head
      1 115
      1 99
      1 98
      2 96
      1 94
      1 91
      1 88
      1 84
      1 80
      1 79

I only have 48 samples, so I expect to see the biggest number of 48? No, because a peak in one file can be really big and it can overlap with multiple peaks in another file e.g.

sample 1:     |-----------------------|
sample 2:   |----|    |----|      |-----|

we only have 2 samples in this example, but the single peak in sample2 has 3 peaks overlap with it. The other problem is that if sample1 has two peaks that overlap with each other, it will be counted as well:

sample 1:     |-----------------------|
                                |-------|

sample 2:   |----|    |----|      |-----|

For the large peak in sample1, it can have 4 peaks overlap with it.

add a unique id for each sample in column4

The peaks from ROSE has an id like4_MACS_peak_3_lociStitched which can be shared by mulitple samples, although it is less likely than the peak id from macs14 : MACS_peak_1. The peak calls from macs2 adds the sample name: my_sample_macs2_peaks1.

To make the most from bedops, one needs to add a unique id for each region of each sample.

save the following script as add_name.sh:

set -e
set -u
set -o pipefail

file=$1

fileName=$(echo $1 | sed -E 's/_C_vs_.+//')

cat $file | awk -v id="$fileName" '$4=id"_peak"NR'  OFS="\t" > ${fileName}_add_name_peaks.bed

my file names are like M028-P008_C_vs_M028-P008_G_macs1_peaks_Gateway_SuperEnhancers.bed
this line of code:fileName=$(echo $1 | sed -E 's/_C_vs_.+//') will get M028-P008 as my sample name. change that line accordingly to your own file names.

awk -v id="$fileName" '$4=id"_peak"NR' will change column 4 to a name like: M028-P008_peak1, M028-P008_peak2 ...

change column4 for all samples:

chmod u+x add_names.sh

find *bed | parallel -j 6 './add_name.sh {}'

check the peaks from the same file after altering column 4:

chr5	135307797	135469385	M028-P008_peak1	2	.
chr7	47748031	47878196	M028-P008_peak2	3	.
chr2	20723436	20836114	M028-P008_peak3	1	.
chr7	129901924	130004106	M028-P008_peak4	4	.
chr1	204232472	204333056	M028-P008_peak5	14	.
chr20	55350354	55450040	M028-P008_peak6	20	.
chr20	55486991	55585604	M028-P008_peak7	13	.
chr7	31443763	31540165	M028-P008_peak8	7	.
chr15	68865245	68960466	M028-P008_peak9	16	.
chr7	47491363	47585935	M028-P008_peak10	6	.

we appended the sample name to the column4.

mkdir clean_bed
# move all the newly generated file into this folder
mv *add_name_peaks.bed clean_bed

cd clean_bed

use bedmap again. This time, we use --echo-map-id-uniq option


cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq - | awk -F"|" '(split($2, a, ";") >=48)' | wc -l
429

## this is the same as 

cat *add_name_peaks.bed | sort-bed -  | bedmap --count --echo --delim '\t' - | awk '$1 >=48' | wc -l
429

# but now we have more information
# depending on how many columns your bed file has, I splited column 7
cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | head

The last column now has the id of the peaks from each sample separated by ;.

Now, if you really wants to get the number of overlapping peaks in a sample wise manner, you need to parse the last column extracting unique sample ids.

one example:

chr1	9292931	9510959	2549-STC1_peak4	17	.	2173-STC1_peak134;2173-STC1_peak173;2357-STC1_peak5;2379-STC1_peak487;2379-STC1_peak58;2379-STC1_peak757;2549-STC1_peak4;2559-STC1_peak327;2559-STC1_peak926;2698-STC1_peak435;2761-STC1_peak728;2761-STC1_peak8;2765-STC1_peak53;2812-STC1_peak146;2812-STC1_peak257;2812-STC1_peak774;M-12_peak776;M-48_peak11;M-52_peak3;M-57_peak371;M-57_peak525;M-57_peak559;M-61_peak747;M-84_peak77;M-940_peak571;M-96_peak116;M-96_peak663;M-A2058_peak269;M-G361_peak260;M-G361_peak90;M-HS934_peak203;M-WM115_peak300;M-WM165_peak14;M-WM266_peak36;M-WM266_peak663;M028-P008_peak707;M035-P010_peak212;M137-P008_peak620;M137-P008_peak797;M263-P011_peak327;M275-P008_peak177;M275-P008_peak813;M305-P009_peak644;M306-P008_peak377;M306-P008_peak552;M399-P010_peak104;M399-P010_peak480;M409-P011_peak290;M642-P008_peak9;M762-P008_peak470;M762-P008_peak679;M822-P010_peak235;M822-P010_peak553;M827-P011_peak261;M852-P008_peak291;M852-P008_peak428

remove the suffix _peak4 etc for column7 (requires gawk) https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | head -1 | gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g",  $7); $7=a; print}'
chr1	2089095	2255972	M-48_peak15	24	.	2173-STC1;2173-STC1;2357-STC1;2379-STC1;2379-STC1;2549-STC1;2559-STC1;2559-STC1;2761-STC1;2761-STC1;2765-STC1;2792-STC1;2792-STC1;2812-STC1;2812-STC1;M-22;M-48;M-52;M-52;M-57;M-57;M-61;M-61;M-84;M-940;M-940;M-96;M-96;M-A2058;M-A2058;M-G361;M-G361;M-HS934;M-HS934;M-WM165;M-WM266;M-WM266;M028-P008;M035-P010;M137-P008;M263-P011;M263-P011;M275-P008;M306-P008;M357-P010;M357-P010;M399-P010;M399-P010;M527-P010;M527-P010;M642-P008;M721-P010;M721-P010;M749-P010;M749-P010;M762-P008;M762-P008;M822-P010;M822-P010;M852-P008

you see the last column is now only containing the sample names.

The awk commands becoming awkward. it maybe a good time to switch to python/R for more text manipulation. let me stick with unix commands for now :)

The next step is to get the unique sample count from column 7.

Final touch up for all files

cat  *add_name_peaks.bed | sort-bed - |  bedmap --echo --echo-map-id-uniq --delim '\t' - | awk '(split($7, a, ";") >=48)' | gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g",  $7); $7=a; print}' > merged_super.bed 

while read line;do cnt=`echo "$line" | cut -f7 | tr ";" "\n" | sort -u | wc -l`;echo -e "$line\t$cnt";done < merged_super.bed > merged_super_clean.bed

or the R way:

library(stringr)
library(tidyverse)

bed<- read_tsv("merged_super.bed", col_names = F)

bed %>% mutate(X8 = unlist(lapply(str_split(X7, ";"), function(x) length(unique(x))))) %>% write.table("merged_super_clean.bed", quote =F, sep = "\t", row.names =F, col.names =F)

select super-enhancers present in 40 out of 48 samples

you can change the cut-off based on how many you want to validate.

cat merged_super_clean.bed| awk '$8 >= 40' | sort-bed - | bedtools merge -i - | wc -l

35

cat merged_super_clean.bed| awk '$8 >= 40' | sort-bed - | bedtools merge -i - > melanoma_super_40_sample.bed

write shell scripts for the whole process.

since I may do this for multiple times, I put the shell scripts together require bedtools and bedops in your path.

add unique name to the peak file to the fourth column put the bed files in to the same folder and do

find *sorted.bed | parallel '../../../scripts/add_uinque_name_for_peak.sh {} _macs1_peaks_Gateway_SuperEnhancers.sorted.bed'

add_uinque_name_for_peak.sh:

#! /bin/bash

set -euo pipefail

file=$1
#the pattern you want to remove from the end of the file name
pattern=$2

sampleName=${file%${pattern}}

## add unique names for each peak for each sample

cat $file | cut -f1-3|  awk -v id="$sampleName" '$4=id"_peak"NR'  OFS="\t" > ${sampleName}_add_name_peaks.bed

add the number of samples each peak present to the sixth column.

get_peaks_present_in_a_subset_of_samples.sh:

#! /bin/bash

set -euo pipefail

# see https://www.unix.com/unix-for-beginners-questions-and-answers/270526-awk-count-unique-element-array.html

cat  *add_name_peaks.bed \
| sort-bed - \
| bedmap --echo --echo-map-id-uniq --delim '\t' - \
| gawk -v OFS="\t" '{a=gensub(/_peak[0-9]+/, "", "g",  $5); $5=a; print}' \
| awk '{split(x,C); n=split($5,F,/;/); for(i in F) if(C[F[i]]++) n--; print $0 "\t" n}'

The output can be futher filtered by e.g awk '$6 >=10' to filter down the peaks present in at least 10 samples. The output then can be piped to bedtools merge -i to merge to get your final peak set.

annotate peaks with ChIPseeker

Because the super-enhancer size is usually big, ~100kb. we need to report all genes that overlap with it using addFlankGeneInfo.

see YuLab-SMU/ChIPseeker#55

library(ChIPseeker)
library(rtracklayer)

bed<- import("melanoma_super_40_sample.bed", format= "BED")

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## see https://github.com/GuangchuangYu/ChIPseeker/issues/55
#
super_anno <- annotatePeak(bed, tssRegion=c(-3000, 3000), 
                         TxDb=txdb, annoDb="org.Hs.eg.db", level = "gene", overlap = "TSS",
                          addFlankGeneInfo = TRUE,  flankDistance = 5000)

super_anno<-super_anno %>% 
  as.data.frame()%>% 
  dplyr::select(seqnames,start, end, geneId, distanceToTSS, SYMBOL, flank_geneIds, flank_gene_distances) %>%
  mutate(flank_geneIds =strsplit(as.character(flank_geneIds), ";")) %>% 
  mutate(flank_gene_distances =strsplit(as.character(flank_gene_distances), ";")) %>%
  unnest(flank_geneIds, flank_gene_distances) 

## get the gene symbol for the flank genes.
super_flank_gene_symbols<- AnnotationDbi::select(org.Hs.eg.db, keys=super_anno$flank_geneIds, 
                                    columns="SYMBOL", keytype="ENTREZID") %>%
  dplyr::rename(flank_symbol= SYMBOL)

super_anno<- cbind(super_anno, super_flank_gene_symbols)


write.table(super_anno,  "melanoma_super_40samples_anno.tsv", quote =F, sep = "\t", row.names =F, col.names = T)

Sunday, March 25, 2018

Three gotchas when using R for Genomic data analysis

Three gotchas when using R for Genomic data analysis

During my daily work with R for genomic data analysis, I encountered several instances that R gives me some (bad) surprises.

1. The devil 1 and 0 coordinate system

read detail here https://github.com/crazyhottommy/DNA-seq-analysis#tips-and-lessons-learned-during-my-dna-seq-data-analysis-journey

some files such as bed file is 0 based. Two genomic regions:

chr1    0    1000
chr1    1000    2000

when you import that bed file into R using rtracklayer::import(), it will become

chr1     1    1000
chr1    1001    2000

The function converts it to 1 based internally (R is 1 based unlike python).

The problem is that when you read the bed file with read.table and use GenomicRanges::makeGRangesFromDataFrame() to convert it to a GRanges object, do not forget to add 1 to the start before doing it.

Similarily, when you write a GRanges object to disk using rtracklayer::export, you do not need to worry, R will convert it back to 0 based in file.

However, if you make a dataframe out of the GRanges object, you need to remember do start -1 before writing to a file.

2. read_tsv column types

If you use read_tsv from readr, it will use the first 1000 rows to determine the column types (integer, charater etc). For genomic data, however, especially for the chromosome column, you may or may not have chr prepending.

1    0    1000    
1    1000    2000
.
.
.
X
Y
MT

you may fail to read rows for chromosome X, Y and MT. (To make things worse, UCSC uses chrM rather than chrMT...)

The solution is that read in all the data as characters.

library(readr)
challenge2 <- read_tsv("my.bed", 
  col_types = cols(.default = col_character())
)

see http://r4ds.had.co.nz/data-import.html

3. Scientific notation for genomic coordinates

This is kind of related to 2. 1200000 will be written as 1.2e6 in a dataframe if R thinks it is an integer. So, you will need to read in the columns all as characters, or if you convert the character to numeric and wants to write to a file, add options(scipen=500) on the top of your script.

The scientific notation can not be disabled in write_csv: tidyverse/readr#671

one more gotcha with colnames

Base R will change the - to . for the column names. e.g. TCGA=06-ABCD will be changed to TCGA.06.ABCD. This can cause troubles when you use the colnames to match samples. readr will maintain the -.

Tuesday, February 6, 2018

convert a human gmt file to mouse for GSEA

A function to convert

The script was copied from http://genomespot.blogspot.com/2016/12/msigdb-gene-sets-for-mouse.html by Mark Ziemann.

save the below script to convert_gmt_perline.sh:

This function translate the human gene symbol to mouse gene symbol for each line.

#!/bin/bash

# author: Mark Ziemann
# modified by: Ming Tang

set -eu
set -o pipefail


line=$1
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u | tr '\n' '\t' \
  | sed 's/\t$/\n/'`

  echo $NAME_DESC $GENES | tr ' ' '\t'

A mouse to human gene mapping file

Downloaded from the same site. https://gist.github.com/crazyhottommy/dde3c10cc367df715ef5c7c5f353b5f5

less -S mouse2hum_biomart_ens87.txt | head | csvtk csv2md -t

Gene ID Transcript ID Human associated gene name Human gene stable ID Associated Gene Name
ENSMUSG00000064336 ENSMUST00000082387 mt-Tf
ENSMUSG00000064337 ENSMUST00000082388 mt-Rnr1
ENSMUSG00000064338 ENSMUST00000082389 mt-Tv
ENSMUSG00000064339 ENSMUST00000082390 mt-Rnr2
ENSMUSG00000064340 ENSMUST00000082391 mt-Tl1
ENSMUSG00000064341 ENSMUST00000082392 MT-ND1 ENSG00000198888 mt-Nd1
ENSMUSG00000064342 ENSMUST00000082393 mt-Ti
ENSMUSG00000064343 ENSMUST00000082394 mt-Tq
ENSMUSG00000064344 ENSMUST00000082395 mt-Tm

Download the human msigDB sets and convert

download from http://software.broadinstitute.org/gsea/msigdb/collections.jsp#H

less -S h.all.v6.1.symbols.gmt
cat h.all.v6.1.symbols.gmt | parallel -k -j 5 './convert_gmt_perline.sh {}' > h.all.v6.1.symbols_mouse.gmt

use GSEA pre-rank

As I mentioned in my previous blog post, one can run GSEA in two modes:

  1. input the raw expression level of all samples.
  2. pre-rank the genes and use this ranked gene list for GSEA.

Even if you go with option 1, GSEA internally rank the gene list by some metrics. The default is signal to noise ratio: (uA - uB)/sigmaA + sigmaB.

As GSEA was designed for array data, for RNAseq data, what value (RPKM, rlog counts, or TPM?) to input to GSEA is still not well understood. see Q&A from GSEA. They recommend using the pre-rank method for RNAseq data.

To generate a pre-ranked gene list(for RNAseq, go from raw counts to DESeq2):

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = countmat,
                              colData = coldata,
                              design = ~ condition)
dds
dds <- DESeq(dds)


## I have 20 samples for each condition: knockout gene X, and WT gene X
## KO vs WT
dds$condition <- factor(dds$condition, levels = c("KO","WT"))

res <- results(dds, contrast=c("condition", "KO", "WT"))
resultsNames(dds)

res$ENSEMBL<- rownames(res)

# I used ENSEMBL gene id for the matrix, and annotate with gene symbols.
res_anno<- as.data.frame(res) %>% left_join(gene_symbols, c("ENSEMBL" = "gene_id"))

res_pre_rank<- sign(res_anno$log2FoldChange) * -log10(res_anno$pvalue)

# you will need the gene symbol, and suffix the file with rnk for GSEA to recognize
write_tsv(data.frame(Name = res_anno$gene_name, metric = res_pre_rank), "KO_vs_WT_pre_rank.rnk")

Now, you can load the rnk file into GSEA and do the analysis.

Intepretation of the GSEA figure

We rank the genes by sign of the fold change times the p-value (we told DESeq2 to compare KO vs WT, if a fold change is positive, that means the gene is up-regulated in KO), so the genes on the top (or left) is the genes with higher expression value in the KO group, while the genes on the bottom (or right) are the genes with lower expression value in KO group.

How to read the figure ?

The X-axis is all your genes in the expriment (~ 20,000 in this case) pre-ranked by your metric. each black bar is the gene in this gene set(pathway). You have an idea where are the genes located in the pre-ranked list.

Enrichement Score is calcuated by some metric that ES is positive if the gene set is located in the top of the pre-ranked gene list. ES is negative if the gene set is located in the bottom of the pre-ranked gene list.

We see glycolisis is positively co-related with KO.

IL6_JAK_SATA3 signaling is positively co-related with WT.

Wednesday, January 24, 2018

ATACseq contamination of mycoplasma DNA

ATACseq contamination of mycoplasma DNA

I was doing ATACseq analysis for my labmates, and found the mapping rate is very low (~5%) to human genome. I went to talk with Elias who performed the experiment and confirmed that it is human cancer cell lines.

One of the problem that I knew is that it could be mycoplasma contaminations. I still remember the days when I cultured MCF7 breast cancer cell line during my PhD, I saw some black particles in the dish. later, I got to know it was mycoplasma (bacterial). The problem is that once you have it in the lab, they are very hard to get rid of. We had to use plasmocin to treat the cells.

To verify my postulations, I did one experiment below.

Download the mycoplasma genomes

This paper Mycoplasma contamination in the 1000 Genomes Project has looked into the mycoplasma contamination in 1000Genome project. They used 33 mycoplasma genomes:

Additional File 1
Mycoplasma Genomes Used
All the Mycoplasma genomes on FTP site ftp.ncbi.nih.gov files genomes/Bacteria/Mycoplasma * were
down loaded from (30 files, 24 November 2011) and incorporated into a Bowtie EBWT database and a
colorspace database.
Table S1: Thirty species of Mycoplasma whose Genomes were used
Genome fasta description
gi|148377268|ref|NC 009497.1|
gi|291319937|ref|NC 013948.1|
gi|193082772|ref|NC 011025.1|
gi|339320528|ref|NC 015725.1|
gi|313678134|ref|NC 014760.1|
gi|83319253|ref|NC 007633.1|
gi|240047135|ref|NC 012806.1|
gi|294155300|ref|NC 014014.1|
gi|308189587|ref|NC 014552.1|
gi|319776738|ref|NC 014921.1|
gi|294660180|ref|NC 004829.2|
gi|108885074|ref|NC 000908.2|
gi|321309518|ref|NC 014970.1|
gi|269114774|ref|NC 013511.1|
gi|54019969|ref|NC 006360.1|
gi|72080342|ref|NC 007332.1|
gi|71893359|ref|NC 007295.1|
gi|304372805|ref|NC 014448.1|
gi|313664890|ref|NC 014751.1|
gi|47458835|ref|NC 006908.1|
gi|330370665|ref|NC 015407.1|
gi|331703020|ref|NC 015431.1|
gi|127763381|ref|NC 005364.2|
gi|26553452|ref|NC 004432.1|
gi|13507739|ref|NC 000912.1|
gi|15828471|ref|NC 002771.1|
gi|344204770|ref|NC 015946.1|
gi|325972867|ref|NC 015155.1|
gi|325989358|ref|NC 015153.1|
gi|71894025|ref|NC 007294.1|

It turns out NCBI has archived those data to https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#oldcontent

The content of most of the old directories on the ftp://ftp.ncbi.nlm.nih.gov/genomes/ site, and the content previously at ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ is no longer being updated. Many old directories from these two areas were moved to archival subdirectories within the /genomes/ area on 2 December 2015. More old directories will be moved to the archive in 2017. Details of what FTP directories and files were moved are as follows.

    All directories and files from ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank
    The following directories from ftp://ftp.ncbi.nlm.nih.gov/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/
        Aedes_aegypti
        Anopheles_gambiae
        Arabidopsis_lyrata
        Arabidopsis_thaliana
        ASSEMBLY_BACTERIA
        Bacteria
        Bacteria_DRAFT
        Branchiostoma_floridae
        Caenorhabditis_elegans
        Chloroplasts
        CLUSTERS
        Drosophila_melanogaster
        Drosophila_pseudoobscura
        Fungi
        Medicago_truncatula
        MITOCHONDRIA
        Physcomitrella_patens
        PLANTS
        Plasmids
        Populus_trichocarpa
        Protozoa
        Sorghum_bicolor

Download the fasta file for all species

## need to specify the full path for Mycoplasma_* folder
mkdir mycoplasma_index
cd mycoplasma_index

wget -r --include-directories="genomes/archive/old_refseq/Bacteria/Mycoplasma_*" --no-parent -nH --cut-dir=5 -A "*fna" ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/

ls 

NC_000908.fna  NC_007332.fna  NC_014751.fna  NC_016638.fna  NC_018077.fna  NC_018495.fna  NC_021283.fna
NC_000912.fna  NC_007633.fna  NC_014760.fna  NC_016807.fna  NC_018149.fna  NC_018496.fna  NC_021831.fna
NC_002771.fna  NC_009497.fna  NC_014921.fna  NC_016829.fna  NC_018406.fna  NC_018497.fna  NC_022575.fna
NC_004432.fna  NC_011025.fna  NC_014970.fna  NC_017502.fna  NC_018407.fna  NC_018498.fna  NC_022807.fna
NC_004829.fna  NC_012806.fna  NC_015153.fna  NC_017503.fna  NC_018408.fna  NC_019552.fna  NC_023030.fna
NC_005364.fna  NC_013511.fna  NC_015155.fna  NC_017504.fna  NC_018409.fna  NC_019949.fna  NC_023062.fna
NC_006360.fna  NC_013948.fna  NC_015407.fna  NC_017509.fna  NC_018410.fna  NC_020076.fna
NC_006908.fna  NC_014014.fna  NC_015431.fna  NC_017519.fna  NC_018411.fna  NC_021002.fna
NC_007294.fna  NC_014448.fna  NC_015725.fna  NC_017520.fna  NC_018412.fna  NC_021025.fna
NC_007295.fna  NC_014552.fna  NC_015946.fna  NC_017521.fna  NC_018413.fna  NC_021083.fna

## there are 66 files
ls -1 | wc -l
66

build bowtie2 index

cat *fna > mycoplasma.fa

## small genomes, should finish within minutes
bowtie2-build mycoplasma.fa mycoplasma

mapping reads to Mycoplasma Genomes

I then run through my ATACseq snakemake pipeline, changing the config.yaml for the reference genome.

After re-aligning with the mycoplasma genome, I checked the mapping rate.

Astonishing mapping rate !!

cd 00log
cat *align | grep overall

55.81% overall alignment rate
89.93% overall alignment rate
52.83% overall alignment rate
84.31% overall alignment rate
87.33% overall alignment rate

Lessons learned

For cultured cells, it is very common to have mycoplasma contamination. It is critical to treat the cells with antibiotics, otherwise, you sequencing experiments will have a lot of bacterial sequences.

Not only for ATACseq, other sequencings such as RNAseq, WES, WGS etc can be affected as well.

I am thinking to add checking mycoplasma contamination to my snakemake pipelines. fastq_screen seems to be a good one https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

Tuesday, January 2, 2018

cancer gene census copy number

Set up

library(knitr)
library(here)
## here() starts at /Users/mtang1/projects/mixing_histology_lung_cancer
root.dir<- here()
opts_knit$set(root.dir = root.dir)

read in the data

I am going to check the COSMIC database on cancer genes. Specifically, I want to know which cancer-related genes are found amplified and which are deleted.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(janitor)
library(stringr)
cancer_gene_census<- read_csv("data/COSMIC_Cancer_gene_Census/cancer_gene_census.csv", col_names = T)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Entrez GeneId` = col_integer(),
##   Tier = col_integer()
## )
## See spec(...) for full column specifications.

tidy the data with janitor::clean_names(), dplyr::unnest

The column names have spaces, and in many columns there are mulitple strings separated by ,.
## dplyr after 0.7.0 use pull to get one column out as a vector, I was using .$
# https://stackoverflow.com/questions/27149306/dplyrselect-one-column-and-output-as-vector

#cancer_gene_census %>% pull(`Gene Symbol`) %>% unique()
#cancer_gene_census %>% distinct(`Gene Symbol`)

## extract genes that are amplified or deleted in cancers. in the `mutation types` column search A for amplificaton and D for large deletion.

## use janitor::clean_names() to clean the column names with space.
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% tabyl(mutation_type)
##    mutation_type   n      percent valid_percent
## 1              N   1 0.0008257638  0.0008264463
## 2              A   8 0.0066061107  0.0066115702
## 3              D   8 0.0066061107  0.0066115702
## 4              F 139 0.1147811726  0.1148760331
## 5            Mis  68 0.0561519405  0.0561983471
## 6              N 147 0.1213872832  0.1214876033
## 7              O  37 0.0305532618  0.0305785124
## 8              S  76 0.0627580512  0.0628099174
## 9              T  19 0.0156895128  0.0157024793
## 10             A  18 0.0148637490  0.0148760331
## 11             D  35 0.0289017341  0.0289256198
## 12             F  32 0.0264244426  0.0264462810
## 13        F; Mis   1 0.0008257638  0.0008264463
## 14             M   3 0.0024772915  0.0024793388
## 15           Mis 240 0.1981833196  0.1983471074
## 16        Mis. N   1 0.0008257638  0.0008264463
## 17             N  24 0.0198183320  0.0198347107
## 18             O   7 0.0057803468  0.0057851240
## 19  Promoter Mis   1 0.0008257638  0.0008264463
## 20             S   2 0.0016515277  0.0016528926
## 21             T 343 0.2832369942  0.2834710744
## 22          <NA>   1 0.0008257638            NA
It turns out that single mutation_types column is more messy than you think… Multiple entries of A and D in different rows. spaces in the column are the devil.

trim the spaces with stringr::str_trim()

# trim the white space
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% tabyl(mutation_type)
##   mutation_type  n   percent
## 1             A 26 0.3768116
## 2             D 43 0.6231884
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% count(mutation_type)
## # A tibble: 2 x 2
##   mutation_type     n
##           <chr> <int>
## 1             A    26
## 2             D    43

Sanity check

according to the website http://cancer.sanger.ac.uk/cosmic/census?genome=37#cl_sub_tables there are should be 40 deletions and 24 amplifications while I am getting 43 and 26, respectively.
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A") 
## # A tibble: 26 x 21
##    gene_symbol
##          <chr>
##  1        AKT2
##  2        AKT3
##  3         ALK
##  4       CCNE1
##  5      DROSHA
##  6        EGFR
##  7       ERBB2
##  8         ERG
##  9        FLT4
## 10        GRM3
## # ... with 16 more rows, and 20 more variables: name <chr>,
## #   entrez_geneid <int>, genome_location <chr>, tier <int>,
## #   hallmark <chr>, chr_band <chr>, somatic <chr>, germline <chr>,
## #   tumour_types_somatic <chr>, tumour_types_germline <chr>,
## #   cancer_syndrome <chr>, tissue_type <chr>, molecular_genetics <chr>,
## #   role_in_cancer <chr>, mutation_types <chr>,
## #   translocation_partner <chr>, other_germline_mut <chr>,
## #   other_syndrome <chr>, synonyms <chr>, mutation_type <chr>
I checked by eyes, AKT3 and GRM3 are missing from the table online http://cancer.sanger.ac.uk/cosmic/census/tables?name=amp but when I checked the downloaded table, both AKT3 and GRM3 has Amplification in the mutation types column. I am bit confused, but for now I will stick to the downloaded table.
It teaches me an important lesson.
  1. Do not trust data (blindly) from any resource even COSMIC.
  2. Data are always messy. Tidying data is the big job.
  3. Becareful with the data, check with sanity. Browse the data with your eyes may reveal some unexpected things.

oncogenes are amplified and tumor suppressor genes are always deleted?

In cancer, we tend to think oncogenes are amplified and tumor suppressor genes are deleted to promote tumor progression. Is it true in this setting?
#devtools::install_github("haozhu233/kableExtra")
library(kableExtra)
library(knitr)
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% 
        mutate(role = strsplit(as.character(role_in_cancer), ",")) %>%
        unnest(role) %>%
        mutate(role = str_trim(role)) %>%
        filter(role == "TSG" | role == "oncogene") %>%
        count(role, mutation_type) %>% kable()
rolemutation_typen
oncogeneA25
oncogeneD7
TSGA3
TSGD42

what are those genes?

cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% 
        mutate(role = strsplit(as.character(role_in_cancer), ",")) %>%
        unnest(role) %>%
        mutate(role = str_trim(role)) %>%
        filter(role == "TSG" | role == "oncogene") %>%
        filter((role == "TSG" & mutation_type == "A") | (role == "oncogene" & mutation_type == "D")) %>%
        select(gene_symbol, role_in_cancer, role, mutation_types, mutation_type) %>%
        kable(format = "html", booktabs = T, caption = "TSG and oncogene") %>%
        kable_styling(latex_options = c("striped", "hold_position"),
                full_width = F)
TSG and oncogene
gene_symbolrole_in_cancerrolemutation_typesmutation_type
APOBEC3Boncogene, TSGoncogeneDD
BIRC3oncogene, TSG, fusiononcogeneD, F, N, T, MisD
DROSHATSGTSGA, Mis, N, FA
GPC3oncogene, TSGoncogeneD, Mis, N, F, SD
KDM6Aoncogene, TSGoncogeneD, N, F, SD
MAP2K4oncogene, TSGoncogeneD, Mis, ND
NKX2-1oncogene, TSGTSGAA
NTRK1oncogene, TSG, fusionTSGT, AA
PAX5oncogene, TSG, fusiononcogeneT, Mis, D, F, SD
WT1oncogene, TSG, fusiononcogeneD, Mis, N, F, S, TD
majority of the genes from the output can function as either oncogene or tumor suppressor genes (TSG), which is not totally suprising. Cancer is such a complex disease that a function of the gene is dependent on the context (e.g. cancer types). Interestingly, DROSHA is the only TSG and is amplified.