I was doing ATACseq analysis for my labmates and found that the mapping rate to the human genome was very low (~5%). I talked with Elias, who performed the experiment, and he confirmed that the samples were human cancer cell lines.
One problem I knew of is mycoplasma contamination. I still remember that when I cultured the MCF7 breast cancer cell line during my PhD, I saw some black particles in the dish. Later, I learned that they were mycoplasma (bacteria). The problem is that once you have it in the lab, it is very hard to get rid of. We had to use plasmocin to treat the cells.
To test this hypothesis, I ran the experiment below.
The paper Mycoplasma contamination in the 1000 Genomes Project looked into mycoplasma contamination in the 1000 Genomes Project. They used 30 mycoplasma genomes:
Additional File 1
Mycoplasma Genomes Used
All the Mycoplasma genomes on the FTP site ftp.ncbi.nih.gov (files genomes/Bacteria/Mycoplasma_*) were downloaded (30 files, 24 November 2011) and incorporated into a Bowtie EBWT database and a colorspace database.
Table S1: Thirty species of Mycoplasma whose Genomes were used
Genome fasta description
gi|148377268|ref|NC_009497.1|
gi|291319937|ref|NC_013948.1|
gi|193082772|ref|NC_011025.1|
gi|339320528|ref|NC_015725.1|
gi|313678134|ref|NC_014760.1|
gi|83319253|ref|NC_007633.1|
gi|240047135|ref|NC_012806.1|
gi|294155300|ref|NC_014014.1|
gi|308189587|ref|NC_014552.1|
gi|319776738|ref|NC_014921.1|
gi|294660180|ref|NC_004829.2|
gi|108885074|ref|NC_000908.2|
gi|321309518|ref|NC_014970.1|
gi|269114774|ref|NC_013511.1|
gi|54019969|ref|NC_006360.1|
gi|72080342|ref|NC_007332.1|
gi|71893359|ref|NC_007295.1|
gi|304372805|ref|NC_014448.1|
gi|313664890|ref|NC_014751.1|
gi|47458835|ref|NC_006908.1|
gi|330370665|ref|NC_015407.1|
gi|331703020|ref|NC_015431.1|
gi|127763381|ref|NC_005364.2|
gi|26553452|ref|NC_004432.1|
gi|13507739|ref|NC_000912.1|
gi|15828471|ref|NC_002771.1|
gi|344204770|ref|NC_015946.1|
gi|325972867|ref|NC_015155.1|
gi|325989358|ref|NC_015153.1|
gi|71894025|ref|NC_007294.1|
It turns out NCBI has archived those data, as described at https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#oldcontent:
The content of most of the old directories on the ftp://ftp.ncbi.nlm.nih.gov/genomes/ site, and the content previously at ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ is no longer being updated. Many old directories from these two areas were moved to archival subdirectories within the /genomes/ area on 2 December 2015. More old directories will be moved to the archive in 2017. Details of what FTP directories and files were moved are as follows.
All directories and files from ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank
The following directories from ftp://ftp.ncbi.nlm.nih.gov/genomes/ were archived to ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/
Aedes_aegypti
Anopheles_gambiae
Arabidopsis_lyrata
Arabidopsis_thaliana
ASSEMBLY_BACTERIA
Bacteria
Bacteria_DRAFT
Branchiostoma_floridae
Caenorhabditis_elegans
Chloroplasts
CLUSTERS
Drosophila_melanogaster
Drosophila_pseudoobscura
Fungi
Medicago_truncatula
MITOCHONDRIA
Physcomitrella_patens
PLANTS
Plasmids
Populus_trichocarpa
Protozoa
Sorghum_bicolor
## need to specify the full path for the Mycoplasma_* folders
mkdir mycoplasma_index
cd mycoplasma_index
## --cut-dirs=5 drops genomes/archive/old_refseq/Bacteria/Mycoplasma_*/ so the .fna files land in the current directory
wget -r --include-directories="genomes/archive/old_refseq/Bacteria/Mycoplasma_*" --no-parent -nH --cut-dirs=5 -A "*fna" ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/
ls
NC_000908.fna NC_007332.fna NC_014751.fna NC_016638.fna NC_018077.fna NC_018495.fna NC_021283.fna
NC_000912.fna NC_007633.fna NC_014760.fna NC_016807.fna NC_018149.fna NC_018496.fna NC_021831.fna
NC_002771.fna NC_009497.fna NC_014921.fna NC_016829.fna NC_018406.fna NC_018497.fna NC_022575.fna
NC_004432.fna NC_011025.fna NC_014970.fna NC_017502.fna NC_018407.fna NC_018498.fna NC_022807.fna
NC_004829.fna NC_012806.fna NC_015153.fna NC_017503.fna NC_018408.fna NC_019552.fna NC_023030.fna
NC_005364.fna NC_013511.fna NC_015155.fna NC_017504.fna NC_018409.fna NC_019949.fna NC_023062.fna
NC_006360.fna NC_013948.fna NC_015407.fna NC_017509.fna NC_018410.fna NC_020076.fna
NC_006908.fna NC_014014.fna NC_015431.fna NC_017519.fna NC_018411.fna NC_021002.fna
NC_007294.fna NC_014448.fna NC_015725.fna NC_017520.fna NC_018412.fna NC_021025.fna
NC_007295.fna NC_014552.fna NC_015946.fna NC_017521.fna NC_018413.fna NC_021083.fna
## there are 66 files
ls -1 | wc -l
66
cat *fna > mycoplasma.fa
## small genomes, should finish within minutes
bowtie2-build mycoplasma.fa mycoplasma
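The download contains more genomes than the 30 listed in the paper, so it may be worth confirming that every accession from Table S1 is actually present. A minimal sketch, assuming the header lines from the table have been saved to a text file (the function names and that file are my own, not part of the paper or of NCBI's tooling):

```shell
# Sketch: report which accessions from a list of gi|...|ref|NC_xxxxxx.v| headers
# have no matching <accession>.fna file in a directory.

# Print the unversioned accession (e.g. NC_009497) for each header line in file $1.
accessions_from_headers() {
    # Field 4 of the pipe-delimited header is the versioned accession.
    # A space after "NC" (common when copying from a PDF) is turned back into "_".
    awk -F'|' 'NF >= 4 {
        acc = $4
        sub(/^NC[ _]/, "NC_", acc)   # normalise "NC 009497.1" to "NC_009497.1"
        sub(/\.[0-9]+$/, "", acc)    # drop the version suffix
        print acc
    }' "$1"
}

# Print the accessions from header file $1 that have no .fna file in directory $2.
missing_accessions() {
    accessions_from_headers "$1" | while read -r acc; do
        [ -e "$2/$acc.fna" ] || echo "$acc"
    done
}
```

For example, `missing_accessions paper_headers.txt .` run inside mycoplasma_index would print nothing if all listed genomes were downloaded (paper_headers.txt is a hypothetical file name).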
I then ran the samples through my ATACseq snakemake pipeline again, changing the reference genome in config.yaml to the mycoplasma index.
After re-aligning to the mycoplasma genomes, I checked the mapping rates.
Astonishing mapping rates!
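If you do not have a pipeline at hand, a quick check can be done by aligning a small subset of reads directly with bowtie2. The following is only a sketch under my own assumptions: the sample file names are hypothetical, 100000 read pairs is an arbitrary subset size, and the first reads in a FASTQ file are not necessarily representative of the whole run.

```shell
# Sketch: align a small subset of reads against the mycoplasma index.

# Print the first $2 records of the gzipped FASTQ file $1 (4 lines per record).
first_n_reads() {
    gzip -dc "$1" | head -n "$((4 * $2))"
}

# Hypothetical file names; adjust to your own samples.
r1=sample_R1.fastq.gz
r2=sample_R2.fastq.gz

# Only run when bowtie2 and the input files are available.
if command -v bowtie2 >/dev/null 2>&1 && [ -e "$r1" ] && [ -e "$r2" ]; then
    first_n_reads "$r1" 100000 > sub_R1.fastq
    first_n_reads "$r2" 100000 > sub_R2.fastq
    # bowtie2 writes its alignment summary to stderr; the SAM output is discarded.
    bowtie2 -p 4 -x mycoplasma -1 sub_R1.fastq -2 sub_R2.fastq -S /dev/null 2> sub.align
    grep overall sub.align
fi
```

Here `-x mycoplasma` assumes the command is run from the directory that holds the index built with bowtie2-build.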
cd 00log
cat *align | grep overall
55.81% overall alignment rate
89.93% overall alignment rate
52.83% overall alignment rate
84.31% overall alignment rate
87.33% overall alignment rate
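The plain grep output does not say which sample each rate belongs to. A small sketch that pairs each log file with its rate and flags samples above a cutoff could look like this; the function names and the cutoff are my own choices, and the only assumption about the logs is that they contain bowtie2's "overall alignment rate" line.

```shell
# Sketch: summarise bowtie2 logs as "<log file><TAB><overall alignment rate>".

# Print one tab-separated line per log file given as an argument.
alignment_rates() {
    for log in "$@"; do
        awk -v name="$log" '/overall alignment rate/ { print name "\t" $1 }' "$log"
    done
}

# Print only the logs whose rate is at or above the percentage given as $1.
# The cutoff is arbitrary and should be chosen by the user.
flag_contaminated() {
    threshold=$1
    shift
    alignment_rates "$@" | awk -F'\t' -v t="$threshold" '{
        rate = $2
        sub(/%/, "", rate)              # "55.81%" becomes "55.81"
        if (rate + 0 >= t + 0) print    # force a numeric comparison
    }'
}
```

For example, `flag_contaminated 10 *align` run inside 00log would list the samples in which at least 10% of reads aligned to the mycoplasma genomes.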
For cultured cells, mycoplasma contamination is very common. It is critical to treat the cells with antibiotics; otherwise, your sequencing experiments will contain a lot of bacterial sequences.
This is not limited to ATACseq: other sequencing assays such as RNAseq, WES and WGS can be affected as well.
I am thinking of adding a mycoplasma contamination check to my snakemake pipelines. fastq_screen seems to be a good option: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
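As a rough idea of what such a check might look like, the sketch below writes a minimal fastq_screen configuration with two databases and runs it on one file. The index paths and the sample file name are placeholders, the helper function is my own, and the config layout and the --conf and --aligner options are as I understand them from the fastq_screen documentation, so please check `fastq_screen --help` for your version.

```shell
# Sketch: screen reads against a human and a mycoplasma bowtie2 index with fastq_screen.

# Write a minimal config to file $1 with the human index basename $2
# and the mycoplasma index basename $3 (tab-separated DATABASE lines).
write_screen_conf() {
    {
        printf 'DATABASE\t%s\t%s\n' Human "$2"
        printf 'DATABASE\t%s\t%s\n' Mycoplasma "$3"
    } > "$1"
}

# Placeholder paths; replace with the basenames of your own bowtie2 indexes.
write_screen_conf fastq_screen.conf /path/to/human/index_basename "$PWD/mycoplasma_index/mycoplasma"

# Only run when fastq_screen is installed and the (hypothetical) input file exists.
if command -v fastq_screen >/dev/null 2>&1 && [ -e sample_R1.fastq.gz ]; then
    fastq_screen --conf fastq_screen.conf --aligner bowtie2 sample_R1.fastq.gz
fi
```

Screening against both genomes at once has the advantage of showing, per sample, how the reads split between human and mycoplasma.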