Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Showing posts with label FASTq. Show all posts
Showing posts with label FASTq. Show all posts

Tuesday, October 6, 2015

convert bam to fastq and remap it with bwa mem

Why we want to revert bam files back to fastq file?

Sometimes, you get a processed bam file from other resources, but you want to map it with another aligner. For example, the old BWA aln does not keep the information for soft-cliped reads which are useful for structural variant detection, the new BWA mem is more suitable for this. So, I want to dump bam to fastq first and then realign it with BWA MEM.
After googleling around, I found:
It turns out that it is not that straightforward.

Use samtools

Read this (howto) Revert a BAM file to FastQ format in the GATK forum back to 2013.
Note for advanced users If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):
htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz
The htscmd bamshuf and htscmd bam2fq are legacy code, now they are under samtools.
But the point is that you need to shuffle the reads in the bam file.
from Hengli, the author of samtools and bwa:
The right way to create paired fastq for bwa is:
htscmd bamshuf -uO in.bam tmp-prefix | htscmd bam2fq -as se.fq.gz - | bwa mem -p ref.fa - If your bam is coordinate sorted, it is important to use "bamshuf" to change the ordering; otherwise bwa will fail to infer insert size for a batch of reads coming from centromeres.
BWA -p flag
"If mates.fq file is absent and option -p is not set, this command regards input reads are single-end. If mates.fq is present, this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair. If -p is used, the command assumes the 2i-th and the (2i+1)-th read in reads.fq constitute a read pair (such input file is said to be interleaved). In this case, mates.fq is ignored. In the paired-end mode, the mem command will infer the read orientation and >the insert size distribution from a batch of reads.
Updated version
Bam2fq -s only helps when your bam contains singletons. If your bam contains both ends, that option has no effect
samtools bamshuf -uon 128 in.bam tmp-prefix | samtools bam2fq -s se.fq.gz - | bwa mem -p ref.fa -

Use bedtools

From the bcbio source code:
cmd = ("{samtools} sort -n -o -l 1 -@ {num_cores} -m {max_mem} {in_bam} {prefix1} "
                       "| {bedtools} bamtofastq -i /dev/stdin -fq /dev/stdout -fq2 /dev/stdout "
                       "| {bwa} mem -p -M -t {num_cores} -R '{rg_info}' -v 1 {ref_file} - | ")
from the bedtools mannual page:
BAM should be sorted by query name (samtools sort -n aln.bam aln.qsort) if creating paired FASTQ with this option.
That's why the command above uses samtools sort -n to sort by name. My bam files are huge (300Gb). This might be very time consuming.

Speedseq realign

In the speedseq pipeline, there is a script called bamtofastq.py
bamtofastq.py
author: Ira Hall (ihall@genome.wustl.edu) and Colby Chiang (cc2qe@virginia.edu)
version: $Revision: 0.0.1 $
description: Convert a coordinate sorted BAM file to FASTQ
             (ignores unpaired reads)

positional arguments:
  BAM                   Input BAM file(s). If absent then defaults to stdin.

optional arguments:
  -h, --help            show this help message and exit
  -r STR, --readgroup STR
                        Read group(s) to extract (comma separated)
  -n, --rename          Rename reads
  -S, --is_sam          Input is SAM format
  -H FILE, --header FILE
                        Write BAM header to file
Check read group in the bam file
$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...
when align using bwa mem, one can add the read group info by:
bwa mem -R "@RG\tID:foo\tSM:bar"

some bam files contain reads from different read groups. one needs to convert the bam to different fastqs for each read group respectively and then realign each with bwa mem. Lastly, merge all the sorted bam together to get a realigned bam.

Check out HengLi's bwakit


samtools fastq requires the bam files sort by name
samtools collate is your friend

if not config["from_fastq"]:
    if config["realign"]:
        rule realign:
            input: get_input_files
            output: temp(["01aln_temp/{sample}.RG{id}.bam", "01aln_temp/{sample}.RG{id}.bam.bai"]), "00log/{sample}.RG{id}.align"
            log:
                bwa = "00log/{sample}.RG{id}.align"
            params:
                jobname = "{sample}_RG{id}",
                ## add read group for bwa mem mapping, change accordingly if you know PL:ILLUMINA, LB:library1 PI:200 etc...
                ## http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules
                rg = identify_read_groups,
                ID = identify_read_groups_ID
            message: "converting {input} file with read group {params.ID} to fastqs by bedtools and remapping with bwa mem"
            threads: 8
            shell:
                """
                samtools view -br {params.ID} {input} \
                | samtools sort -n -l 1 -@ 8 -m 2G -T 01aln_temp/{wildcards.sample}_{wildcards.id}.tmp \
                | bedtools bamtofastq -i /dev/stdin -fq /dev/stdout -fq2 /dev/stdout \
                | bwa mem -t 8 -p  -M -v 1 -R '{params.rg}' {config[ref_fa]} - 2> {log.bwa} \
                | samtools sort -m 2G -@ 8 -T {output[0]}.tmp -o {output[0]}
                samtools index {output[0]} 

Thursday, August 14, 2014

rename a bunch of files with bash by regular expression

I am at the MSU NGS 2014 course. My colleague wanted to follow the khmer protocol with his own data, but one of the steps has to use a certain file name convention.

In the protocol it requires fastq files listed as:   *001_R1.fastq.gz
001 is the replicate number, it can be 002 or 003 or any number of replicates you have. ( for RNA-seq, sequence as many as biological samples as possible !)
R1 is the pair-end reads 1, it can be R2

What he has is something like:
1_egg_r1_01_sub.fastq.gz

1 is the stage of the egg. He sequenced  4 eggs, so he has 1_egg, 2_egg., 3_egg and 4_egg
r1 is the pair-end reads 1
01 is the first replicates. He has two replicates for each egg.

Basically, he wants to rename these files to the khmer convention.

This problem gets down to writing a regular expression.
To recapture the problem, I made some dummy files:

mkdir foo && cd foo

I have a txt file contains the names of the file:
foo$ cat files.txt 
1egg_r1_01_sub.fastq.gz
1egg_r2_01_sub.fastq.gz
1egg_r1_02_sub.fastq.gz
1egg_r2_02_sub.fastq.gz
2egg_r1_01_sub.fastq.gz
2egg_r2_01_sub.fastq.gz
2egg_r1_02_sub.fastq.gz
2egg_r2_02_sub.fastq.gz
3egg_r1_01_sub.fastq.gz
3egg_r2_01_sub.fastq.gz
3egg_r1_02_sub.fastq.gz
3egg_r2_02_sub.fastq.gz
4egg_r1_01_sub.fastq.gz
4egg_r2_01_sub.fastq.gz
4egg_r1_02_sub.fastq.gz
4egg_r2_02_sub.fastq.gz

Now I want to make dummy files with the names in this file.
one can make the dummy files in a fly also.

=====update on 08/26/14======
one can use the {} expansion to create the dummy files

tommy@tommy-ThinkPad-T420[foo] touch {1,2,3,4}_r{1,2}_0{1,2}_sub.fastq.gz
tommy@tommy-ThinkPad-T420[foo] ls                                     [ 3:45PM]
1_r1_01_sub.fastq.gz  2_r2_01_sub.fastq.gz  4_r1_01_sub.fastq.gz
1_r1_02_sub.fastq.gz  2_r2_02_sub.fastq.gz  4_r1_02_sub.fastq.gz
1_r2_01_sub.fastq.gz  3_r1_01_sub.fastq.gz  4_r2_01_sub.fastq.gz
1_r2_02_sub.fastq.gz  3_r1_02_sub.fastq.gz  4_r2_02_sub.fastq.gz
2_r1_01_sub.fastq.gz  3_r2_01_sub.fastq.gz
2_r1_02_sub.fastq.gz  3_r2_02_sub.fastq.gz




========================
The difference of make_dummy_file.sh and make_dummy_file_1.sh is that I specified shebang line in the make_dummy_file.sh script to tell the bash that it is a bash script, to invoke it: ./make_dummy_file.sh files.txt

In contrast, to invoke the other two which I did not specify the shebang: bash make_dummy_file_1.sh bash make_dummy_file_2.sh

Rename the files with regular expression by either using sed or rename command

the rename command use the perl regular expression. use \ to escape $.
the sed command need to escape the () which are used to capture the back reference
before:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1_egg_r1_01_sub.fastq.gz 2_egg_r1_01_sub.fastq.gz 3_egg_r1_01_sub.fastq.gz 4_egg_r1_01_sub.fastq.gz copy make_dummy_file_1.sh 1_egg_r1_02_sub.fastq.gz 2_egg_r1_02_sub.fastq.gz 3_egg_r1_02_sub.fastq.gz 4_egg_r1_02_sub.fastq.gz dummy make_dummy_file_2.sh 1_egg_r2_01_sub.fastq.gz 2_egg_r2_01_sub.fastq.gz 3_egg_r2_01_sub.fastq.gz 4_egg_r2_01_sub.fastq.gz files.txt rename.sh 1_egg_r2_02_sub.fastq.gz 2_egg_r2_02_sub.fastq.gz 3_egg_r2_02_sub.fastq.gz 4_egg_r2_02_sub.fastq.gz make_dummy_file.sh rename_one_liner.sh 

after:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1egg_R1_001.fastq.gz 2egg_R1_001.fastq.gz 3egg_R1_001.fastq.gz 4egg_R1_001.fastq.gz copy make_dummy_file_1.sh 1egg_R1_002.fastq.gz 2egg_R1_002.fastq.gz 3egg_R1_002.fastq.gz 4egg_R1_002.fastq.gz dummy make_dummy_file_2.sh 1egg_R2_001.fastq.gz 2egg_R2_001.fastq.gz 3egg_R2_001.fastq.gz 4egg_R2_001.fastq.gz files.txt rename.sh 1egg_R2_002.fastq.gz 2egg_R2_002.fastq.gz 3egg_R2_002.fastq.gz 4egg_R2_002.fastq.gz make_dummy_file.sh rename_one_liner.sh

References: http://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions

http://stackoverflow.com/questions/10929453/bash-scripting-read-file-line-by-line
https://www.cs.tut.fi/~jkorpela/perl/regexp.html

Thursday, June 26, 2014

quality control of your fastq file! my first post after I got my PhD!

I passed my defense on Tuesday, and now I am a PhD! what a long process, it took me six years. I've been enjoying the PhD experience though.

OK, let me go back to this post. things to remember:
1) You must quality control your fastq files first by fastqc. Trim the adapters and low quality bases by trimmomatic.
2) The fastq files dumped from the SRA GEO repository are usually the raw sequence, you need to quality control them.
3) google, google before you ask anyone else!

Let's start.
inspecting the first several reads of the fastq (this is a ChIP-seq data), I found the quality is coded as phred33

@SRR866627_nutlin_p53.1 HWI-ST571:161:D0YP4ACXX:3:1101:1437:2055 length=51
NCGAAAGACTGCTGGCCGACGTCGAGGTCCCGATTGTCGGCGTCGGCGGCA
+SRR866627_nutlin_p53.1 HWI-ST571:161:D0YP4ACXX:3:1101:1437:2055 length=51
#1:AABDDH<?CF<CFHH@D:??FAEE6?FHDAA=@CF@4@EBA88@;575
@SRR866627_nutlin_p53.2 HWI-ST571:161:D0YP4ACXX:3:1101:1423:2163 length=51
GAAATACTGTCTCTACTAAAAATACAAAAAGTAGCCAGGCGTCGTGGTGCA
+SRR866627_nutlin_p53.2 HWI-ST571:161:D0YP4ACXX:3:1101:1423:2163 length=51
B@CFFFFFHHHHGIJJJJIJJJIJJJJIJJIEGHIJJJJIJIIIHIE=FGC

To know different quality coding see here:
  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126
  0........................26...31.......40                                
                           -5....0........9.............................40 
                                 0........9.............................40 
                                    3.....9.............................40 
  0.2......................26...31.........41                              

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

then I used fastqc to check the quality:





the quality for each base is very good, but the perbase sequence content does not look good, and the adapters are not trimmed as indicated by the overrepresented sequences (highlighted part is part of the Truseq adapter index1).

Then I used Trimmomatic to trim the adapters and the low quality bases.
$ trimmomatic SE  in.fastq in_trimmed.fastq ILLUMINACLIP:Truseq_adaptor.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

the Truseq_adaptor.fa file contains the adapter sequences. see a post here.

$ cat Truseq_adaptor.fa
>TruSeq_Universal_Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>TruSeq_Adapter_Index_1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_3
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_4
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_5
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_6
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_7
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_8
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_9
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_10
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_11
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_12
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG
after that, I checked the with fastqc again:


The perbase quality was improved and the per base GC content became more normal (should be 25% percent for each A,T,C,G base, it looks like the sequences are more A/T rich). More importantly, no overrepresent sequences were found anymore. Now the fastq file is ready for subsequent mapping by bowtie.

sometime ago, I quality controlled a RNA-seq data, and found found many wired things and asked on biostar.





Take home messages:
1)
"The weird per-base content is extremely normal and is typically referred to as the "random" hexamer effect, since things aren't actually amplified in a uniform manner. Most people will have this and it's nothing to worry about. BTW, don't let anyone talk you into trimming that off, it's the legitimate sequence."

2)
"Duplicates are expected in RNAseq. Firstly, your samples are probably full of rRNAs, even if you performed ribodepletion. Secondly, any highly expressed gene is going to have apparent PCR duplication due solely to it being highly expressed (there is a ceiling on the coverage you can get before running into this, it's pretty high with paired-end reads but not infinite)."

3)Several posts for whether the duplicates need to be removed or not.
https://www.biostars.org/p/55648/
https://www.biostars.org/p/66831/
https://www.biostars.org/p/14283/

Currently, I do not remove them. That's all for today!






Tuesday, March 18, 2014

qseq to fastq conversion

I was looking at some old ChIP-seq data with raw file in qseq format. I want to convert them to fastq file for mapping with bowtie.

A quick google I found:
http://www.biostars.org/p/6682/

the qseqtofastq C++ program http://www.dna.bio.keio.ac.jp/~krisp/qseq2fastq/ has to be compiled by scons, and I encountered some compiling problem. I gave it up and used this java program http://sourceforge.net/projects/snpeff/files/qseq2fastq.jar/download

Usage:

zcat myfile.qseq.tgz | java -jar qseqtofastq.jar -phred64 > myfile.fastq

it took me around 1 hour to finish the conversion of a 600MB tgz file on the computing cluster (single cpu 4GB ram).

I did have a problem at the end:

Exception in thread "main" java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 8
        at ca.mcgill.mcb.pcingola.Qseq2Fastq.main(Qseq2Fastq.java:51)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 8
        at ca.mcgill.mcb.pcingola.Qseq2Fastq.main(Qseq2Fastq.java:43)


I counted the lines of the fastq file (divided by 4) and the original qseq file  , and they are equal.  So, I went ahead and mapped the fastq file with bowtie.



Tuesday, June 25, 2013

use sed to print out every n lines, split pair-end FASTQ file



tommy@tommy-ThinkPad-T420:~$ cat test.txt 
1
2
3
4
5
6
7
8
9
10
11
12

#print out the first two lines of every 4 lines. -n flag suppress all of the other lines and only print the line 
you specified. -e option tells sed to accept multiple p (print) command.


tommy@tommy-ThinkPad-T420:~$ sed -ne '1~4p;2~4p' test.txt 
1
2
5
6
9
10


This trick would be useful if you have a pair-end FASTq file and want to split it into two files.

see here:

and here http://www.biostars.org/p/19446/ two reads in one fastq

from SRA file:

How to extract paired-end reads from SRA files

SRA(NCBI) stores all the sequencing run as single "sra" or "lite.sra" file. You may want separate files if you want to use the data from paired-end sequencing. When I run SRA toolkit's "fastq-dump" utility on paired-end sequencing SRA files, sometimes I get only one files where all the mate-pairs are stored in one file rather than two or three files.
The solution for the problem is to always run fastq-dump with "--split-3" option. If the experiment is single-end sequencing, only one fastq file will be generated. If it is paired-end sequencing, there may be two or three fastq files.
Two files (with suffix "_1" and "_2") are matched mate-pair read file where as the third one (without any suffix) contains all the reads that do not have any mate-paires (or SRA couldn't resolve mate-paires for them).

Hope my experiences with NCBI SRA data handling help the readership.

Tuesday, June 11, 2013

FASTA/Q file manipulation tool readfq

it is developed by Heng Li in princeton.
https://github.com/lh3/readfq

it is super fast and I will have a try.


Readfq is a collection of routines for parsing the FASTA/FASTQ format. It
seamlessly parses both FASTA and multi-line FASTQ with a simple interface.

Readfq is first implemented in a single C header file and then ported to Lua,
Perl and Python as a single function less than 50 lines. For users of scripting
languages, I encourage to copy-and-paste the function instead of using readfq
as a library. It is always good to avoid unnecessary library dependencies.

Readfq also strives for efficiency. The C implementation is among the fastest
(if not the fastest). The Python and Perl implementations are several to tens
of times faster than the official Bio* implementations. If you can speed up
readfq further, please let me know. I am not good at optimizing programs in
scripting languages. Thank you.

As to licensing, the C implementation is distributed under the MIT license.
Implementations in other languages are released without a license. Just copy
and paste. You do not need to acknowledge me. The following shows a brief
example for each programming language:


  # Perl
  my @aux = undef; # this is for keeping intermediate data
  while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) { print "$seq\n"; }


  # Python: generator function
  for name, seq, qual in readfq(sys.stdin): print seq


  -- Lua: closure
  for name, seq, qual in readfq(io.stdin) do print seq end

  /* Go */
  package main

  import (
    "fmt"
    "bufio"
    "github.com/drio/drio.go/bio/fasta"
  )

  func main() {
    var fqr fasta.FqReader
    fqr.Reader = bufio.NewReader(os.Stdin)
    for r, done := fqr.Iter(); !done; r, done = fqr.Iter() {
      fmt.Println(r.Seq)
    }
  }

  /* C */
  #include <zlib.h>
  #include <stdio.h>
  #include "kseq.h"
  KSEQ_INIT(gzFile, gzread)

  int main() {
    gzFile fp;
    kseq_t *seq;
    fp = gzdopen(fileno(stdin), "r");
    seq = kseq_init(fp);
    while (kseq_read(seq) >= 0) puts(seq->seq.s);
    kseq_destroy(seq);
    gzclose(fp);
    return 0;
  }


Some naive benchmarks. To convert a FASTQ containing 25 million 100bp reads to FASTA,
FASTX-Toolkit (parsing 4-line FASTQ only) takes 325.0 CPU seconds and EMBOSS' seqret
247.8 seconds. My seqtk, which uses the kseq.h library, finishes the task in 24.6
seconds, 10X faster. For retrieving 25k sequences by name from the same FASTQ,
BioPython takes 963 seconds, while readfq.py takes 136 seconds; BioPerl takes more
than 40 minutes (killed), while readfq.pl 273 seconds. Seqtk takes 29 seconds.

Thursday, May 16, 2013

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file


see post here:

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats very quickly on next generation sequencing reads. To use, simply cat the fastq file (or gunzip -c) and pipe that to this awk command:


cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:
  1. The total number of reads (99,115).
  2. The number of unique reads (60,567).
  3. The frequency of unique reads as a proportion of the total (61%).
  4. The most abundant sequence (useful for finding adapters, linkers, etc).
  5. The number of times that sequence is present (354).
  6. The frequency of that sequence as a proportion of the total number of reads (0.35%).
If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:

for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done

This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.

Check out FASTQC for more extensive quality assessment.

Thursday, April 4, 2013

Cool use of Unix paste with NGS sequences:FASTQ to FASTA


http://thegenomefactory.blogspot.com.au/search?updated-max=2012-09-08T12:11:00%2B10:00&max-results=7

Cool use of Unix paste with NGS sequences

While browsing SeqAnswers.com  today I came across a post where Uwe Appelt provided a couple of lines of Unix shell wizadry to solve some problem. What attracted my attention was the following:
paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq
Now, I've done a reasonable amount of shell one-liners in my life, but I'd never seen this before.  I've used the paste command a couple of times, but clearly its potential power did not sink in! Here is the man page description for paste:
Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.
So what's happening here? Well, in Unix, the "-" character means to use STDIN instead of a filename. Here Uwe is providing paste with four filenames, each of which is the same stdin filehandle. So lines 1..4 of input.fq are put onto one line (with tab separator), and lines 5..8 on the next line and so on. Now, our stream has the four lines of FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in my example. Once that's all done, we need to put it back into the standard 4-line FASTQ format, which is as simple as converting the tabs "\t" back to newlines "\n" with the tr command.

Example 1: FASTQ to FASTA



A common thing to do is convert FASTQ to FASTA, and we don't always have our favourite tool or script to to this when we aren't on our own servers:


paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa
  1. paste converts the input FASTQ into a 4-column file
  2. cut command extracts out just column 1 (the ID) and column 2 (the sequence)
  3. sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
  4. tr conversts the 2 columns back into 2 lines

And because the shell command above uses a pipe connecting four commands (paste, cut, sed, tr) the operating system will run them all in parallel, which will make it run faster assuming your disk I/O can keep up. 

Example 2: Removing redundant FASTQ ID in line 3

The third line in the FASTQ format is somewhat redundant - it is usually a duplicate of the first line, except with "+" instead of "@" to denote that a quality string is coming next rather than an ID. Most parsers ignore it, and happily accept a blank ID after the "+", which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to conver them, here's how we can do it with our new paste trick:

paste -d ' ' - - - - | sed 's/ +[^ ]*/ +/' | tr " " "\n"
  1. paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character
  2. sed finds and replaces the "+DUPE_ID" line with just a "+"
  3. tr conversts the 4 columns back into 4 lines

That's it for today, hope you learnt something, because I certainly did.