Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, August 10, 2017

bwa-mem multi-mapping reads

An error

I was using TEQC to do quality control of my WES bam files aligned by bwa-mem. My data are paired end, so a function reads2pairs is called to make the paired-end reads to be a single fragment. I then get this error:
> readpairs <- reads2pairs(reads)
Error in reads2pairs(reads) : read pair IDs do not seem to be unique
asked in the bioconductor support site and went to the source code of that function. It turns out that TEQC was complaining about the not uniquely mapped reads.

How does BWA-MEM determine and mark the not uniquely mapped reads?

google directed me to biostars again with no surprising. Please read this three posts:
https://www.biostars.org/p/76893/
https://www.biostars.org/p/167892/
one extract multi-mapped reads by looking for mapq < 23 and/or the XA flag on the reads
samtools view -q 40 file.bam
samtools view -q 1 file.bam
samtools view -q 23 file.bam
BWA added tags (Tags starting with ‘X’ are specific to BWA):
TagMeaning
NMEdit distance
MDMismatching positions/bases
ASAlignment score
BCBarcode sequence
X0Number of best hits
X1Number of suboptimal hits found by BWA
XNNumber of ambiguous bases in the referenece
XMNumber of mismatches in the alignment
XONumber of gap opens
XGNumber of gap extentions
XTType: Unique/Repeat/N/Mate-sw
XAAlternative hits; format: (chr,pos,CIGAR,NM;)*
XSSuboptimal alignment score
XFSupport from forward/reverse alignment
XENumber of supporting seeds

Checking my own bam files

samtools view my.bam | grep "XA:" | head -2

E00514:124:H2FC7CCXY:1:1221:16741:50076 97 1 69019 0 151M = 69298 430 CTCCTTCTCTTCTTCAAGGTAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTAT @@CC?=D@E@=F@>F>?FG==?FAGG?F?FFGA<=?>GFAGF?@=G>?=F?>E@>F=G>>>E?=>>;>D@E@;<EE<E=CAE<>;<D@<<<==D@ED@<D@D@E>E<<=D@E@E>=>C@DD@E=CD@<DD@<>=><>G>>G>>>=;;?;;> XA:Z:15,-102463184,151M,0;19,+110607,151M,1; MC:Z:151M BD:Z:NNNNNNNONONNONNONNOONNNOOONONOONONNNGNNNOONNNOONNOONNNOOOMONFNNONNNNNONONNOOOMOOOOONNNNOONGGGNOOOLOOONONOOOONNONOOONNNONNOONOONNNNNNNNGNNOMNNMNGGGGNMNN MD:Z:151 BI:Z:QQPQQQQPOPQQQQQQQQQQQQQQRQQRQQQQRQQQMQQQRRQQQRQQQRQQQQQQRQQQKQQQQQQQQQPQQQRRQQQQQRQQQQQQQQMMMQQRRPQRQPQPQRQRQQQPQRQQQQQPQQRQQQQQQQQQQQMQQRQQQQQMMMMQQQQ NM:i:0 MQ:i:0 AS:i:151 XS:i:151 RG:Z:F17042956987-KY290-2-WES
E00514:124:H2FC7CCXY:1:2219:12855:17641 81 1 69039 0 150M 13 83828251 0 AACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTT =;C?FEAFAFGF<>>>=FF@DD=<@D=<;E<<?D@C@E=<<?D;<;<<E<E;=@DC@D<E@D=<==E<>>=><E@EDD<E=E=E@D<>>F>G@CF>=F>FGAF=GF@?FG===><=@E??E??????F==??F==?FF@EE=<=>D@B@@ XA:Z:15,+102463165,150M,0;19,-110627,150M,1; BD:Z:NOONOOONNNOONNFNNOOONNNOONNOONNNNNOMONGNNNNNNNNOONONOOOMOOOOONNNNOONFFFNNOOLOONONONOOONNNOOONNNNOONOOOOOONNNOONNFNNOMNNMNFFFFNMNNNNNONOONNNNNNOONMMNNN MD:Z:150 BI:Z:QRRQQRQPQQRQQQKQQQRQQQQQQQQRQQQQQPQPQQMQQQQQQQQRQQQQQQQPQRRRQQQQQQRQKKKQQRQPQQQQQQQRQRQQQQQRQQQQRQQRRQRQQQQQQQQQKQQQOQQOQKKKKQOQQQQQQQQQQPQQPPQQQPOQQQ NM:i:0 AS:i:150 XS:i:150 RG:Z:F17042956987-KY290-2-WES
Indeed, most of the reads with XA: tag has a quality score of 0 (fifth column).
samtools view -q 1 my.bam | wc -l
7787794
samtools view my.bam | grep -v "XA:" | wc -l
7972777

## not exactly the same number, what's wrong?
samtools view my.bam | grep  "XA:" | awk '{print $5}' | sort | uniq -c 
201878 0
    463 1
    688 10
    666 11
    677 12
    693 13
    271 14
    777 15
    281 16
    414 17
    564 18
   1429 19
    192 2
    327 20
   3772 21
   1674 22
    742 23
   1543 24
   3106 25
    368 26
   6223 27
    498 28
    514 29
    760 3
    830 30
   1526 31
    954 32
   3726 33
    367 34
    343 35
    379 36
    641 37
    150 38
   1082 39
    442 4
   3058 40
    673 41
    866 42
   2570 43
   1285 44
   6374 45
   6885 46
   7669 47
  22571 48
  17666 49
    611 5
  16128 50
  13824 51
   9864 52
   6277 53
   4328 54
   2568 55
   1440 56
   4547 57
   1462 58
   1888 59
   1171 6
 169162 60
      2 61
      3 62
      1 64
      5 65
      1 67
      2 69
   1611 7
     86 70
   2234 8
   4013 9
so, many reads with XA tag with mapping quality scores > 0 !!

retain only uniquely mapped reads

samtools view -h my.bam | awk '$17 !~ /XA:/|| $1 ~ /^@/' | samtools view -bS - > my_unique.bam

Friday, August 4, 2017

dangerous rm command

rm command is very dangerous because after you remove something, you can not recover it. There is no trash bin in the unix system. If you have some raw data (e.g fastq files), you'd better make them safe by changing the file permissions. in an empty directory, make a folder foo:
mkdir test
cd test
mkdir foo
cd foo
touch {1..4}.fastqs
ls
1.fastqs  2.fastqs  3.fastqs  4.fastqs
cd ..
let's first make the foo directory unwritable
ls -l
drwxr-x--- 2 krai genomic_med   512 Aug  4 22:27 foo

ls -l foo
total 0
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 1.fastqs
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 2.fastqs
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 3.fastqs
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 4.fastqs

#remove the write privilege for the foo folder
chmod u-w foo
ls -l 
dr-xr-x--- 2 krai genomic_med 512 Aug  4 22:31 foo

# the files inside the foo folder does not change
ls -l foo
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 1.fastqs
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 2.fastqs
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 3.fastqs
-rw-r----- 1 krai genomic_med 0 Aug  4 22:31 4.fastqs

# now you can not remove the foo folder:
rm -rf foo
rm: cannot remove `foo/2.fastqs': Permission denied
rm: cannot remove `foo/1.fastqs': Permission denied
rm: cannot remove `foo/4.fastqs': Permission denied
rm: cannot remove `foo/3.fastqs': Permission denied

# rm -rf foo/*
rm: cannot remove `foo/1.fastqs': Permission denied
rm: cannot remove `foo/2.fastqs': Permission denied
rm: cannot remove `foo/3.fastqs': Permission denied
rm: cannot remove `foo/4.fastqs': Permission denied
let's make the fastq files unwritable, but change the foo folder back to default:
chmod u+w foo
ls -l 
drwxr-x--- 2 krai genomic_med 512 Aug  4 22:31 foo

chmod u-w foo/*fastqs
ls -l foo
-r--r----- 1 krai genomic_med 0 Aug  4 22:31 1.fastqs
-r--r----- 1 krai genomic_med 0 Aug  4 22:31 2.fastqs
-r--r----- 1 krai genomic_med 0 Aug  4 22:31 3.fastqs
-r--r----- 1 krai genomic_med 0 Aug  4 22:31 4.fastqs

# let's try to remove the fastqs
rm foo/*fastqs
rm: remove write-protected regular empty file `foo/1.fastqs'? 
The unix system asks to confirm deletion of the file. let's remove by force:
rm -rf foo/*fastqs
# the system even did not ask!
ls foo/
# nothing!
The files are removed! You can not recover them. see a post here https://unix.stackexchange.com/questions/48579/why-can-rm-remove-read-only-files
Any attempt to access a file's data requires read permission. Any attempt to modify a file's data requires write permission. Any attempt to execute a file (a program or a script) requires execute permission...
Because directories are not used in the same way as regular files, the permissions work slightly (but only slightly) differently. An attempt to list the files in a directory requires read permission for the directory, but not on the files within. An attempt to add a file to a directory, delete a file from a directory, or to rename a file, all require write permission for the directory, but (perhaps surprisingly) not for the files within. Execute permission doesn't apply to directories (a directory can't also be a program). But that permission bit is reused for directories for other purposes.
Execute permission is needed on a directory to be able to cd into it (that is, to make some directory your current working directory).
Execute is needed on a directory to access the "inode" information of the files within. You need this to search a directory to read the inodes of the files within. For this reason the execute permission on a directory is often called search permission instead.
Conclusion: All rm needs is write+execute permission on the parent directory. The permissions of the file itself are irrelevant.