Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Wednesday, January 24, 2018

ATACseq contamination of mycoplasma DNA

ATACseq contamination of mycoplasma DNA

I was doing ATACseq analysis for my labmates, and found the mapping rate is very low (~5%) to human genome. I went to talk with Elias who performed the experiment and confirmed that it is human cancer cell lines.

One of the problem that I knew is that it could be mycoplasma contaminations. I still remember the days when I cultured MCF7 breast cancer cell line during my PhD, I saw some black particles in the dish. later, I got to know it was mycoplasma (bacterial). The problem is that once you have it in the lab, they are very hard to get rid of. We had to use plasmocin to treat the cells.

To verify my postulations, I did one experiment below.

Download the mycoplasma genomes

This paper Mycoplasma contamination in the 1000 Genomes Project has looked into the mycoplasma contamination in 1000Genome project. They used 33 mycoplasma genomes:

Additional File 1
Mycoplasma Genomes Used
All the Mycoplasma genomes on FTP site ftp.ncbi.nih.gov files genomes/Bacteria/Mycoplasma * were
down loaded from (30 files, 24 November 2011) and incorporated into a Bowtie EBWT database and a
colorspace database.
Table S1: Thirty species of Mycoplasma whose Genomes were used
Genome fasta description
gi|148377268|ref|NC 009497.1|
gi|291319937|ref|NC 013948.1|
gi|193082772|ref|NC 011025.1|
gi|339320528|ref|NC 015725.1|
gi|313678134|ref|NC 014760.1|
gi|83319253|ref|NC 007633.1|
gi|240047135|ref|NC 012806.1|
gi|294155300|ref|NC 014014.1|
gi|308189587|ref|NC 014552.1|
gi|319776738|ref|NC 014921.1|
gi|294660180|ref|NC 004829.2|
gi|108885074|ref|NC 000908.2|
gi|321309518|ref|NC 014970.1|
gi|269114774|ref|NC 013511.1|
gi|54019969|ref|NC 006360.1|
gi|72080342|ref|NC 007332.1|
gi|71893359|ref|NC 007295.1|
gi|304372805|ref|NC 014448.1|
gi|313664890|ref|NC 014751.1|
gi|47458835|ref|NC 006908.1|
gi|330370665|ref|NC 015407.1|
gi|331703020|ref|NC 015431.1|
gi|127763381|ref|NC 005364.2|
gi|26553452|ref|NC 004432.1|
gi|13507739|ref|NC 000912.1|
gi|15828471|ref|NC 002771.1|
gi|344204770|ref|NC 015946.1|
gi|325972867|ref|NC 015155.1|
gi|325989358|ref|NC 015153.1|
gi|71894025|ref|NC 007294.1|

It turns out NCBI has archived those data to https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#oldcontent

The content of most of the old directories on the ftp://ftp.ncbi.nlm.nih.gov/genomes/ site, and the content previously at ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ is no longer being updated. Many old directories from these two areas were moved to archival subdirectories within the /genomes/ area on 2 December 2015. More old directories will be moved to the archive in 2017. Details of what FTP directories and files were moved are as follows.

    All directories and files from ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank
    The following directories from ftp://ftp.ncbi.nlm.nih.gov/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/
        Aedes_aegypti
        Anopheles_gambiae
        Arabidopsis_lyrata
        Arabidopsis_thaliana
        ASSEMBLY_BACTERIA
        Bacteria
        Bacteria_DRAFT
        Branchiostoma_floridae
        Caenorhabditis_elegans
        Chloroplasts
        CLUSTERS
        Drosophila_melanogaster
        Drosophila_pseudoobscura
        Fungi
        Medicago_truncatula
        MITOCHONDRIA
        Physcomitrella_patens
        PLANTS
        Plasmids
        Populus_trichocarpa
        Protozoa
        Sorghum_bicolor

Download the fasta file for all species

## need to specify the full path for Mycoplasma_* folder
mkdir mycoplasma_index
cd mycoplasma_index

wget -r --include-directories="genomes/archive/old_refseq/Bacteria/Mycoplasma_*" --no-parent -nH --cut-dir=5 -A "*fna" ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/

ls 

NC_000908.fna  NC_007332.fna  NC_014751.fna  NC_016638.fna  NC_018077.fna  NC_018495.fna  NC_021283.fna
NC_000912.fna  NC_007633.fna  NC_014760.fna  NC_016807.fna  NC_018149.fna  NC_018496.fna  NC_021831.fna
NC_002771.fna  NC_009497.fna  NC_014921.fna  NC_016829.fna  NC_018406.fna  NC_018497.fna  NC_022575.fna
NC_004432.fna  NC_011025.fna  NC_014970.fna  NC_017502.fna  NC_018407.fna  NC_018498.fna  NC_022807.fna
NC_004829.fna  NC_012806.fna  NC_015153.fna  NC_017503.fna  NC_018408.fna  NC_019552.fna  NC_023030.fna
NC_005364.fna  NC_013511.fna  NC_015155.fna  NC_017504.fna  NC_018409.fna  NC_019949.fna  NC_023062.fna
NC_006360.fna  NC_013948.fna  NC_015407.fna  NC_017509.fna  NC_018410.fna  NC_020076.fna
NC_006908.fna  NC_014014.fna  NC_015431.fna  NC_017519.fna  NC_018411.fna  NC_021002.fna
NC_007294.fna  NC_014448.fna  NC_015725.fna  NC_017520.fna  NC_018412.fna  NC_021025.fna
NC_007295.fna  NC_014552.fna  NC_015946.fna  NC_017521.fna  NC_018413.fna  NC_021083.fna

## there are 66 files
ls -1 | wc -l
66

build bowtie2 index

cat *fna > mycoplasma.fa

## small genomes, should finish within minutes
bowtie2-build mycoplasma.fa mycoplasma

mapping reads to Mycoplasma Genomes

I then run through my ATACseq snakemake pipeline, changing the config.yaml for the reference genome.

After re-aligning with the mycoplasma genome, I checked the mapping rate.

Astonishing mapping rate !!

cd 00log
cat *align | grep overall

55.81% overall alignment rate
89.93% overall alignment rate
52.83% overall alignment rate
84.31% overall alignment rate
87.33% overall alignment rate

Lessons learned

For cultured cells, it is very common to have mycoplasma contamination. It is critical to treat the cells with antibiotics, otherwise, you sequencing experiments will have a lot of bacterial sequences.

Not only for ATACseq, other sequencings such as RNAseq, WES, WGS etc can be affected as well.

I am thinking to add checking mycoplasma contamination to my snakemake pipelines. fastq_screen seems to be a good one https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

Tuesday, January 2, 2018

cancer gene census copy number

Set up

library(knitr)
library(here)
## here() starts at /Users/mtang1/projects/mixing_histology_lung_cancer
root.dir<- here()
opts_knit$set(root.dir = root.dir)

read in the data

I am going to check the COSMIC database on cancer genes. Specifically, I want to know which cancer-related genes are found amplified and which are deleted.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(janitor)
library(stringr)
cancer_gene_census<- read_csv("data/COSMIC_Cancer_gene_Census/cancer_gene_census.csv", col_names = T)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Entrez GeneId` = col_integer(),
##   Tier = col_integer()
## )
## See spec(...) for full column specifications.

tidy the data with janitor::clean_names(), dplyr::unnest

The column names have spaces, and in many columns there are mulitple strings separated by ,.
## dplyr after 0.7.0 use pull to get one column out as a vector, I was using .$
# https://stackoverflow.com/questions/27149306/dplyrselect-one-column-and-output-as-vector

#cancer_gene_census %>% pull(`Gene Symbol`) %>% unique()
#cancer_gene_census %>% distinct(`Gene Symbol`)

## extract genes that are amplified or deleted in cancers. in the `mutation types` column search A for amplificaton and D for large deletion.

## use janitor::clean_names() to clean the column names with space.
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% tabyl(mutation_type)
##    mutation_type   n      percent valid_percent
## 1              N   1 0.0008257638  0.0008264463
## 2              A   8 0.0066061107  0.0066115702
## 3              D   8 0.0066061107  0.0066115702
## 4              F 139 0.1147811726  0.1148760331
## 5            Mis  68 0.0561519405  0.0561983471
## 6              N 147 0.1213872832  0.1214876033
## 7              O  37 0.0305532618  0.0305785124
## 8              S  76 0.0627580512  0.0628099174
## 9              T  19 0.0156895128  0.0157024793
## 10             A  18 0.0148637490  0.0148760331
## 11             D  35 0.0289017341  0.0289256198
## 12             F  32 0.0264244426  0.0264462810
## 13        F; Mis   1 0.0008257638  0.0008264463
## 14             M   3 0.0024772915  0.0024793388
## 15           Mis 240 0.1981833196  0.1983471074
## 16        Mis. N   1 0.0008257638  0.0008264463
## 17             N  24 0.0198183320  0.0198347107
## 18             O   7 0.0057803468  0.0057851240
## 19  Promoter Mis   1 0.0008257638  0.0008264463
## 20             S   2 0.0016515277  0.0016528926
## 21             T 343 0.2832369942  0.2834710744
## 22          <NA>   1 0.0008257638            NA
It turns out that single mutation_types column is more messy than you think… Multiple entries of A and D in different rows. spaces in the column are the devil.

trim the spaces with stringr::str_trim()

# trim the white space
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% tabyl(mutation_type)
##   mutation_type  n   percent
## 1             A 26 0.3768116
## 2             D 43 0.6231884
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% count(mutation_type)
## # A tibble: 2 x 2
##   mutation_type     n
##           <chr> <int>
## 1             A    26
## 2             D    43

Sanity check

according to the website http://cancer.sanger.ac.uk/cosmic/census?genome=37#cl_sub_tables there are should be 40 deletions and 24 amplifications while I am getting 43 and 26, respectively.
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A") 
## # A tibble: 26 x 21
##    gene_symbol
##          <chr>
##  1        AKT2
##  2        AKT3
##  3         ALK
##  4       CCNE1
##  5      DROSHA
##  6        EGFR
##  7       ERBB2
##  8         ERG
##  9        FLT4
## 10        GRM3
## # ... with 16 more rows, and 20 more variables: name <chr>,
## #   entrez_geneid <int>, genome_location <chr>, tier <int>,
## #   hallmark <chr>, chr_band <chr>, somatic <chr>, germline <chr>,
## #   tumour_types_somatic <chr>, tumour_types_germline <chr>,
## #   cancer_syndrome <chr>, tissue_type <chr>, molecular_genetics <chr>,
## #   role_in_cancer <chr>, mutation_types <chr>,
## #   translocation_partner <chr>, other_germline_mut <chr>,
## #   other_syndrome <chr>, synonyms <chr>, mutation_type <chr>
I checked by eyes, AKT3 and GRM3 are missing from the table online http://cancer.sanger.ac.uk/cosmic/census/tables?name=amp but when I checked the downloaded table, both AKT3 and GRM3 has Amplification in the mutation types column. I am bit confused, but for now I will stick to the downloaded table.
It teaches me an important lesson.
  1. Do not trust data (blindly) from any resource even COSMIC.
  2. Data are always messy. Tidying data is the big job.
  3. Becareful with the data, check with sanity. Browse the data with your eyes may reveal some unexpected things.

oncogenes are amplified and tumor suppressor genes are always deleted?

In cancer, we tend to think oncogenes are amplified and tumor suppressor genes are deleted to promote tumor progression. Is it true in this setting?
#devtools::install_github("haozhu233/kableExtra")
library(kableExtra)
library(knitr)
cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% 
        mutate(role = strsplit(as.character(role_in_cancer), ",")) %>%
        unnest(role) %>%
        mutate(role = str_trim(role)) %>%
        filter(role == "TSG" | role == "oncogene") %>%
        count(role, mutation_type) %>% kable()
rolemutation_typen
oncogeneA25
oncogeneD7
TSGA3
TSGD42

what are those genes?

cancer_gene_census %>% 
        clean_names() %>%
        mutate(mutation_type =strsplit(as.character(mutation_types), ",")) %>% 
        unnest(mutation_type) %>% 
        mutate(mutation_type = str_trim(mutation_type)) %>%
        filter(mutation_type == "A" | mutation_type == "D") %>% 
        mutate(role = strsplit(as.character(role_in_cancer), ",")) %>%
        unnest(role) %>%
        mutate(role = str_trim(role)) %>%
        filter(role == "TSG" | role == "oncogene") %>%
        filter((role == "TSG" & mutation_type == "A") | (role == "oncogene" & mutation_type == "D")) %>%
        select(gene_symbol, role_in_cancer, role, mutation_types, mutation_type) %>%
        kable(format = "html", booktabs = T, caption = "TSG and oncogene") %>%
        kable_styling(latex_options = c("striped", "hold_position"),
                full_width = F)
TSG and oncogene
gene_symbolrole_in_cancerrolemutation_typesmutation_type
APOBEC3Boncogene, TSGoncogeneDD
BIRC3oncogene, TSG, fusiononcogeneD, F, N, T, MisD
DROSHATSGTSGA, Mis, N, FA
GPC3oncogene, TSGoncogeneD, Mis, N, F, SD
KDM6Aoncogene, TSGoncogeneD, N, F, SD
MAP2K4oncogene, TSGoncogeneD, Mis, ND
NKX2-1oncogene, TSGTSGAA
NTRK1oncogene, TSG, fusionTSGT, AA
PAX5oncogene, TSG, fusiononcogeneT, Mis, D, F, SD
WT1oncogene, TSG, fusiononcogeneD, Mis, N, F, S, TD
majority of the genes from the output can function as either oncogene or tumor suppressor genes (TSG), which is not totally suprising. Cancer is such a complex disease that a function of the gene is dependent on the context (e.g. cancer types). Interestingly, DROSHA is the only TSG and is amplified.