 # Blue Collar Bioinformatics

One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.
We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.
My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.
That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.
The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 `from` `Bio ``import` `pairwise2` `def` `trim_adaptor(seq, adaptor, num_errors, right_side``=``True``):` `    ``gap_char ``=` `'-'` `    ``exact_pos ``=` `str``(seq).find(adaptor)` `    ``if` `exact_pos >``=` `0``:` `        ``seq_region ``=` `str``(seq[exact_pos:exact_pos``+``len``(adaptor)])` `        ``adapt_region ``=` `adaptor` `    ``else``:` `        ``seq_a, adaptor_a, score, start, end ``=` `pairwise2.align.localms(``str``(seq),` `                ``str``(adaptor), ``5.0``, ``-``4.0``, ``-``9.0``, ``-``0.5``,` `                ``one_alignment_only``=``True``, gap_char``=``gap_char)[``0``]` `        ``adapt_region ``=` `adaptor_a[start:end]` `        ``seq_region ``=` `seq_a[start:end]` `    ``matches ``=` `sum``((``1` `if` `s ``=``=` `adapt_region[i] ``else` `0``) ``for` `i, s ``in` `            ``enumerate``(seq_region))` `    ``# too many errors -- no trimming` `    ``if` `(``len``(adaptor) ``-` `matches) > num_errors:` `        ``return` `seq` `    ``# remove the adaptor sequence and return the result` `    ``else``:` `        ``return` `_remove_adaptor(seq, seq_region.replace(gap_char, ""),` `                ``right_side)`
If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute `right_side` allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 `def` `_remove_adaptor(seq, region, right_side``=``True``):` `    ``"""Remove an adaptor region and all sequence to the right or left.` `    ``"""` `    ``if` `right_side:` `        ``try``:` `            ``pos ``=` `seq.find(region)` `        ``# handle Biopython SeqRecords` `        ``except` `AttributeError:` `            ``pos ``=` `seq.seq.find(region)` `        ``return` `seq[:pos]` `    ``else``:` `        ``try``:` `            ``pos ``=` `seq.rfind(region)` `        ``# handle Biopython SeqRecords` `        ``except` `AttributeError:` `            ``pos ``=` `seq.seq.rfind(region)` `        ``return` `seq[pos``+``len``(region):]`
Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 `from` `__future__ ``import` `with_statement` `import` `sys` `import` `os` `from` `Bio ``import` `SeqIO` `from` `adaptor_trim ``import` `trim_adaptor` `def` `main(in_file):` `    ``adaptor_seq ``=` `"TCGTATGCCGTCTTCTGC"` `    ``num_errors ``=` `2` `    ``base, ext ``=` `os.path.splitext(in_file)` `    ``out_file ``=` `"%s-trimmed.fa"` `%` `(base)` `   `  `    ``with ``open``(in_file) as in_handle:` `        ``with ``open``(out_file, ``"w"``) as out_handle:` `            ``for` `rec ``in` `SeqIO.parse(in_handle, ``"fastq"``):` `                ``trim ``=` `trim_adaptor(rec.seq, adaptor_seq, num_errors)` `                ``if` `len``(trim) > ``0` `and` `len``(trim) < ``len``(rec):` `                    ``rec.letter_annotations ``=` `{}` `                    ``rec.seq ``=` `trim` `                    ``SeqIO.write([rec], out_handle, ``"fasta"``)`
You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance,NovoAlign appears to have similar alignment based trimming capabilities.