An error
I was using TEQC to do quality control of my WES bam files aligned by bwa-mem. My data are paired end, so a function
reads2pairs
is called to make the paired-end reads to be a single fragment. I then get this error:> readpairs <- reads2pairs(reads)
Error in reads2pairs(reads) : read pair IDs do not seem to be unique
I asked in the bioconductor support site and went to the source code of that function. It turns out that
TEQC
was complaining about the not uniquely mapped reads.How does BWA-MEM determine and mark the not uniquely mapped reads?
google directed me to biostars again with no surprising. Please read this three posts:
https://www.biostars.org/p/76893/
https://www.biostars.org/p/167892/
https://www.biostars.org/p/76893/
https://www.biostars.org/p/167892/
one extract multi-mapped reads by looking for mapq < 23 and/or the XA flag on the reads
samtools view -q 40 file.bam
samtools view -q 1 file.bam
samtools view -q 23 file.bam
BWA added tags (Tags starting with ‘X’ are specific to BWA):
Tag | Meaning |
---|---|
NM | Edit distance |
MD | Mismatching positions/bases |
AS | Alignment score |
BC | Barcode sequence |
X0 | Number of best hits |
X1 | Number of suboptimal hits found by BWA |
XN | Number of ambiguous bases in the referenece |
XM | Number of mismatches in the alignment |
XO | Number of gap opens |
XG | Number of gap extentions |
XT | Type: Unique/Repeat/N/Mate-sw |
XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
XS | Suboptimal alignment score |
XF | Support from forward/reverse alignment |
XE | Number of supporting seeds |
Checking my own bam files
samtools view my.bam | grep "XA:" | head -2
E00514:124:H2FC7CCXY:1:1221:16741:50076 97 1 69019 0 151M = 69298 430 CTCCTTCTCTTCTTCAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTAT @@CC?=D@E@=F@>F>?FG==?FAGG?F?FFGA<=?>GFAGF?@=G>?=F?>E@>F=G>>>E?=>>;>D@E@;<EE<E=CAE<>;<D@<<<==D@ED@<D@D@E>E<<=D@E@E>=>C@DD@E=CD@<DD@<>=><>G>>G>>>=;;?;;> XA:Z:15,-102463184,151M,0;19,+110607,151M,1; MC:Z:151M BD:Z:NNNNNNNONONNONNONNOONNNOOONONOONONNNGNNNOONNNOONNOONNNOOOMONFNNONNNNNONONNOOOMOOOOONNNNOONGGGNOOOLOOONONOOOONNONOOONNNONNOONOONNNNNNNNGNNOMNNMNGGGGNMNN MD:Z:151 BI:Z:QQPQQQQPOPQQQQQQQQQQQQQQRQQRQQQQRQQQMQQQRRQQQRQQQRQQQQQQRQQQKQQQQQQQQQPQQQRRQQQQQRQQQQQQQQMMMQQRRPQRQPQPQRQRQQQPQRQQQQQPQQRQQQQQQQQQQQMQQRQQQQQMMMMQQQQ NM:i:0 MQ:i:0 AS:i:151 XS:i:151 RG:Z:F17042956987-KY290-2-WES
E00514:124:H2FC7CCXY:1:2219:12855:17641 81 1 69039 0 150M 13 83828251 0 AACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTT =;C?FEAFAFGF<>>>=FF@DD=<@D=<;E<<?D@C@E=<<?D;<;<<E<E;=@DC@D<E@D=<==E<>>=><E@EDD<E=E=E@D<>>F>G@CF>=F>FGAF=GF@?FG===><=@E??E??????F==??F==?FF@EE=<=>D@B@@ XA:Z:15,+102463165,150M,0;19,-110627,150M,1; BD:Z:NOONOOONNNOONNFNNOOONNNOONNOONNNNNOMONGNNNNNNNNOONONOOOMOOOOONNNNOONFFFNNOOLOONONONOOONNNOOONNNNOONOOOOOONNNOONNFNNOMNNMNFFFFNMNNNNNONOONNNNNNOONMMNNN MD:Z:150 BI:Z:QRRQQRQPQQRQQQKQQQRQQQQQQQQRQQQQQPQPQQMQQQQQQQQRQQQQQQQPQRRRQQQQQQRQKKKQQRQPQQQQQQQRQRQQQQQRQQQQRQQRRQRQQQQQQQQQKQQQOQQOQKKKKQOQQQQQQQQQQPQQPPQQQPOQQQ NM:i:0 AS:i:150 XS:i:150 RG:Z:F17042956987-KY290-2-WES
Indeed, most of the reads with
XA:
tag has a quality score of 0 (fifth column).samtools view -q 1 my.bam | wc -l
7787794
samtools view my.bam | grep -v "XA:" | wc -l
7972777
## not exactly the same number, what's wrong?
samtools view my.bam | grep "XA:" | awk '{print $5}' | sort | uniq -c
201878 0
463 1
688 10
666 11
677 12
693 13
271 14
777 15
281 16
414 17
564 18
1429 19
192 2
327 20
3772 21
1674 22
742 23
1543 24
3106 25
368 26
6223 27
498 28
514 29
760 3
830 30
1526 31
954 32
3726 33
367 34
343 35
379 36
641 37
150 38
1082 39
442 4
3058 40
673 41
866 42
2570 43
1285 44
6374 45
6885 46
7669 47
22571 48
17666 49
611 5
16128 50
13824 51
9864 52
6277 53
4328 54
2568 55
1440 56
4547 57
1462 58
1888 59
1171 6
169162 60
2 61
3 62
1 64
5 65
1 67
2 69
1611 7
86 70
2234 8
4013 9
so, many reads with
XA
tag with mapping quality scores > 0 !!retain only uniquely mapped reads
samtools view -h my.bam | awk '$17 !~ /XA:/|| $1 ~ /^@/' | samtools view -bS - > my_unique.bam
Stumbled upon your post while looking for uniq vs mutli-mapped reads. Good to know about MAPQ ~0 for most of alternative hits, XA. Your last comand to retain uniquely mapped reads should need some refinement.
ReplyDelete```
samtools view -@ 6 -q 1 -F 4 -F 256 -h mybam.aln.dup.realn.recal.rp.bam | grep -v -E -e '\bXA:Z:' -e '\bSA:Z:' | samtools view -b -T /projects/verhaak-lab/DogWGSReference/CanFam3_1.fa - > mybam.aln.dup.realn.recal.rp.bam.uniqmapped.bam
```
Details at https://gist.github.com/sbamin/27acf13f2a28161efbf89a273559bca4
That `awk '$17 !~ /XA:/` part can give you incorrect hits as I doubt if SAM format and/or bwa mem aligner adheres to column-based tagging of XA,XT, etc. flags. Besides, columns may shift in appearance if additional tools like GATK best practices are used (markdup, bqsr, etc.) to generate alignment records.
ReplyDeleteFound a much faster and stable code using sambamba: https://bioinformatics.stackexchange.com/a/4542/2978
ReplyDeleteHave not checked this blog for long...there are many spams for the comments, I now switched to blogdown :) Thanks for your comments.
Deletemy all dog training needs got fulfilled by installing this dog whistle app, highly recommend it.
ReplyDeletecheap mont blanc Pen uk, combining elegant style and cutting-edge technology, a variety of styles of cheap mont blanc bonheur Pen, the pointer walks between your exclusive taste style.
ReplyDeleteHave you attempted this new app ESO The Pricechecker Apk : that is certainly exceptional.
ReplyDelete