Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Wednesday, January 24, 2018

ATACseq contamination of mycoplasma DNA

I was doing ATACseq analysis for my labmates and found that the mapping rate to the human genome was very low (~5%). I talked with Elias, who performed the experiment, and he confirmed that the samples were human cancer cell lines.

One possible cause I knew of was mycoplasma contamination. I still remember that when I cultured the MCF7 breast cancer cell line during my PhD, I saw some black particles in the dish. Later, I learned that they were mycoplasma (bacteria). The problem is that once you have mycoplasma in the lab, it is very hard to get rid of. We had to use plasmocin to treat the cells.

To test this hypothesis, I did the experiment below.

Download the mycoplasma genomes

The paper "Mycoplasma contamination in the 1000 Genomes Project" looked into mycoplasma contamination in the 1000 Genomes Project. The authors used 30 mycoplasma genomes:

Additional File 1
Mycoplasma Genomes Used
All the Mycoplasma genomes on the FTP site ftp.ncbi.nih.gov (files genomes/Bacteria/Mycoplasma_*) were
downloaded (30 files, 24 November 2011) and incorporated into a Bowtie EBWT database and a
colorspace database.
Table S1: Thirty species of Mycoplasma whose Genomes were used
Genome fasta description
gi|148377268|ref|NC_009497.1|
gi|291319937|ref|NC_013948.1|
gi|193082772|ref|NC_011025.1|
gi|339320528|ref|NC_015725.1|
gi|313678134|ref|NC_014760.1|
gi|83319253|ref|NC_007633.1|
gi|240047135|ref|NC_012806.1|
gi|294155300|ref|NC_014014.1|
gi|308189587|ref|NC_014552.1|
gi|319776738|ref|NC_014921.1|
gi|294660180|ref|NC_004829.2|
gi|108885074|ref|NC_000908.2|
gi|321309518|ref|NC_014970.1|
gi|269114774|ref|NC_013511.1|
gi|54019969|ref|NC_006360.1|
gi|72080342|ref|NC_007332.1|
gi|71893359|ref|NC_007295.1|
gi|304372805|ref|NC_014448.1|
gi|313664890|ref|NC_014751.1|
gi|47458835|ref|NC_006908.1|
gi|330370665|ref|NC_015407.1|
gi|331703020|ref|NC_015431.1|
gi|127763381|ref|NC_005364.2|
gi|26553452|ref|NC_004432.1|
gi|13507739|ref|NC_000912.1|
gi|15828471|ref|NC_002771.1|
gi|344204770|ref|NC_015946.1|
gi|325972867|ref|NC_015155.1|
gi|325989358|ref|NC_015153.1|
gi|71894025|ref|NC_007294.1|

It turns out that NCBI has since archived those data. The FTP FAQ at https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#oldcontent explains:

The content of most of the old directories on the ftp://ftp.ncbi.nlm.nih.gov/genomes/ site, and the content previously at ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ is no longer being updated. Many old directories from these two areas were moved to archival subdirectories within the /genomes/ area on 2 December 2015. More old directories will be moved to the archive in 2017. Details of what FTP directories and files were moved are as follows.

    All directories and files from ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank
    The following directories from ftp://ftp.ncbi.nlm.nih.gov/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/
        Aedes_aegypti
        Anopheles_gambiae
        Arabidopsis_lyrata
        Arabidopsis_thaliana
        ASSEMBLY_BACTERIA
        Bacteria
        Bacteria_DRAFT
        Branchiostoma_floridae
        Caenorhabditis_elegans
        Chloroplasts
        CLUSTERS
        Drosophila_melanogaster
        Drosophila_pseudoobscura
        Fungi
        Medicago_truncatula
        MITOCHONDRIA
        Physcomitrella_patens
        PLANTS
        Plasmids
        Populus_trichocarpa
        Protozoa
        Sorghum_bicolor

Download the fasta file for all species

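## --include-directories needs the full path to the Mycoplasma_* folders
mkdir mycoplasma_index
cd mycoplasma_index

## -nH and --cut-dirs=5 drop the host name and the five leading directory levels, so the .fna files land in the current folder
wget -r --include-directories="genomes/archive/old_refseq/Bacteria/Mycoplasma_*" --no-parent -nH --cut-dirs=5 -A "*.fna" ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/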

ls 

NC_000908.fna  NC_007332.fna  NC_014751.fna  NC_016638.fna  NC_018077.fna  NC_018495.fna  NC_021283.fna
NC_000912.fna  NC_007633.fna  NC_014760.fna  NC_016807.fna  NC_018149.fna  NC_018496.fna  NC_021831.fna
NC_002771.fna  NC_009497.fna  NC_014921.fna  NC_016829.fna  NC_018406.fna  NC_018497.fna  NC_022575.fna
NC_004432.fna  NC_011025.fna  NC_014970.fna  NC_017502.fna  NC_018407.fna  NC_018498.fna  NC_022807.fna
NC_004829.fna  NC_012806.fna  NC_015153.fna  NC_017503.fna  NC_018408.fna  NC_019552.fna  NC_023030.fna
NC_005364.fna  NC_013511.fna  NC_015155.fna  NC_017504.fna  NC_018409.fna  NC_019949.fna  NC_023062.fna
NC_006360.fna  NC_013948.fna  NC_015407.fna  NC_017509.fna  NC_018410.fna  NC_020076.fna
NC_006908.fna  NC_014014.fna  NC_015431.fna  NC_017519.fna  NC_018411.fna  NC_021002.fna
NC_007294.fna  NC_014448.fna  NC_015725.fna  NC_017520.fna  NC_018412.fna  NC_021025.fna
NC_007295.fna  NC_014552.fna  NC_015946.fna  NC_017521.fna  NC_018413.fna  NC_021083.fna

## there are 66 files
ls -1 | wc -l
66
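
The download contains more genomes than the 30 used in the paper, so it can be worth checking that none of the paper's accessions are missing. Below is a minimal sketch, assuming the accessions from Table S1 have been saved one per line (for example NC_000908.2) in a file; the file name paper_accessions.txt and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every accession in a list that has no matching
# <accession>.fna file in a directory. Version suffixes (.1, .2) are dropped
# because the archived files are named without them (e.g. NC_000908.fna).
missing_accessions() {
    local list_file=$1 fasta_dir=$2 acc
    # "|| [ -n "$acc" ]" keeps the last line even without a trailing newline
    while read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue
        acc=${acc%%.*}
        [ -f "$fasta_dir/$acc.fna" ] || echo "$acc"
    done < "$list_file"
}
```

Calling `missing_accessions paper_accessions.txt .` inside mycoplasma_index prints nothing when every listed genome was downloaded.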

Build the bowtie2 index

cat *fna > mycoplasma.fa

## small genomes, should finish within minutes
bowtie2-build mycoplasma.fa mycoplasma
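
A quick sanity check before and after building can catch an empty concatenation or an interrupted build. This is only a sketch: the two function names are made up for illustration, and the suffix list assumes a small (non-large) bowtie2 index, which writes six .bt2 files.

```shell
#!/usr/bin/env bash
# Count the sequence records (header lines) in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Return success only if all six files of a small bowtie2 index exist
# for the given basename (e.g. "mycoplasma").
index_complete() {
    local base=$1 suffix
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -f "$base.$suffix.bt2" ]; then
            echo "missing $base.$suffix.bt2" >&2
            return 1
        fi
    done
}
```

For example, `count_fasta_records mycoplasma.fa` reports how many sequences went into the index, and `index_complete mycoplasma && echo "index looks complete"` confirms the build finished.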

Mapping reads to the mycoplasma genomes

I then ran the samples through my ATACseq snakemake pipeline, changing the reference genome in config.yaml to the mycoplasma index.
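
For readers without the pipeline, the same check can be run with bowtie2 directly. The wrapper below is a hedged sketch, not the pipeline's actual rule: the function name, the thread count, and the 00log/<sample>.align log naming are assumptions chosen to mirror the log folder used in this post, and it assumes paired-end reads.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: align one paired-end sample to the mycoplasma index
# and keep only bowtie2's alignment summary.
align_to_mycoplasma() {
    local sample=$1 r1=$2 r2=$3 index=${4:-mycoplasma}
    mkdir -p 00log
    # bowtie2 prints its summary to stderr, so stderr goes to the log file.
    # The SAM output is discarded because only the mapping rate matters here.
    bowtie2 -p 4 -x "$index" -1 "$r1" -2 "$r2" -S /dev/null 2> "00log/${sample}.align"
}
```

A call such as `align_to_mycoplasma sampleA sampleA_R1.fastq.gz sampleA_R2.fastq.gz` (file names are placeholders) leaves the summary in 00log/sampleA.align.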

After re-aligning to the mycoplasma genomes, I checked the mapping rate.

Astonishing mapping rate!!

cd 00log
cat *align | grep overall

55.81% overall alignment rate
89.93% overall alignment rate
52.83% overall alignment rate
84.31% overall alignment rate
87.33% overall alignment rate
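
The grep output above does not say which sample each rate belongs to. The sketch below prints the log file name next to each rate and flags samples above a cutoff. The function name is hypothetical, and the 5% cutoff in the usage example is an arbitrary illustration, not a validated threshold.

```shell
#!/usr/bin/env bash
# Print "<log file> <rate> <status>" (tab separated) for each bowtie2 log.
# First argument: cutoff in percent; remaining arguments: log files.
flag_contaminated() {
    local threshold=${1:-5}
    shift
    awk -v t="$threshold" '
        /overall alignment rate/ {
            rate = $1
            sub(/%/, "", rate)                      # "89.93%" -> "89.93"
            status = (rate + 0 > t + 0) ? "CONTAMINATED?" : "ok"
            printf "%s\t%s%%\t%s\n", FILENAME, rate, status
        }' "$@"
}
```

Run it from the pipeline folder as `flag_contaminated 5 00log/*align`.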

Lessons learned

For cultured cells, mycoplasma contamination is very common. It is critical to treat the cells with antibiotics; otherwise, your sequencing experiments will contain a lot of bacterial sequences.

This affects not only ATACseq: other sequencing assays such as RNAseq, WES and WGS can be affected as well.

I am thinking of adding a mycoplasma contamination check to my snakemake pipelines. fastq_screen seems to be a good option: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
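
As a rough idea of what such a check could look like, the sketch below writes a small fastq_screen configuration listing a human and a mycoplasma bowtie2 index. The function name and all paths are placeholders, and this is not yet part of my pipeline; check the fastq_screen documentation for the full configuration options.

```shell
#!/usr/bin/env bash
# Write a minimal fastq_screen configuration with two bowtie2 databases.
# Each DATABASE line is: DATABASE <tab> name <tab> index basename.
# Both index paths are placeholders: point them at your own index basenames.
write_screen_conf() {
    local human_index=$1 myco_index=$2 out=${3:-fastq_screen.conf}
    printf 'DATABASE\tHuman\t%s\n'      "$human_index" >  "$out"
    printf 'DATABASE\tMycoplasma\t%s\n' "$myco_index"  >> "$out"
}

# Example (all paths are hypothetical):
#   write_screen_conf /path/to/human_index/genome /path/to/mycoplasma_index/mycoplasma
#   fastq_screen --conf fastq_screen.conf --aligner bowtie2 sample_R1.fastq.gz
```

fastq_screen then reports the fraction of reads hitting each database, so a contaminated sample should stand out as a large mycoplasma fraction.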

3 comments:

  1. Have you thought about using minimap2? It has a Python API (mappy) which you could use to iterate over a fastq file and quickly identify reads which map to one of the potential contaminant genomes.
    Replies
    1. thanks for the information. I have not used minimap2 yet, fastq_screen seems to be doing OK for me.
  2. Hello, friend! Could you show the changes that were made in config.yaml. I am trying to decontaminate a tryp genome, but I get lost in this procedure. Thx!