Diving into Genetics and Genomics: Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

Thursday, May 16, 2013

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

see post here:

http://gettinggeneticsdone.blogspot.com/2012/04/awk-command-to-count-total-unique-and.html

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats very quickly on next generation sequencing reads. To use, simply cat the fastq file (or gunzip -c) and pipe that to this awk command:

cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'

The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:

99115 60567 61.1078 ACCTCAGGA 354 0.357161

This is telling you:

The total number of reads (99,115).
The number of unique reads (60,567).
The frequency of unique reads as a proportion of the total (61%).
The most abundant sequence (useful for finding adapters, linkers, etc).
The number of times that sequence is present (354).
The frequency of that sequence as a proportion of the total number of reads (0.35%).

If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:

for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done

This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.

Check out FASTQC for more extensive quality assessment.

1 comment:

tommyMay 16, 2013 at 8:12 AM
SAMStat is a decent tool for sam bam file, like FASTQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (it is from Babraham bioinformatics) but for mapping stats.
ReplyDelete
Replies

Add comment

Diving into Genetics and Genomics

My github papge

Thursday, May 16, 2013

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

see post here:

Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file

1 comment:

Labels

My Blog List