see post here:
Awk Command to Count Total, Unique, and the Most Abundant Read in a FASTQ file
I was reading through a paper on comparative ChIP-Seq when I found this awk gem that lets you get some very basic stats very quickly on next generation sequencing reads. To use, simply cat the fastq file (or gunzip -c) and pipe that to this awk command:
cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:
99115 60567 61.1078 ACCTCAGGA 354 0.357161
This is telling you:
- The total number of reads (99,115).
- The number of unique reads (60,567).
- The frequency of unique reads as a proportion of the total (61%).
- The most abundant sequence (useful for finding adapters, linkers, etc).
- The number of times that sequence is present (354).
- The frequency of that sequence as a proportion of the total number of reads (0.35%).
If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:
for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done
This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.
Check out FASTQC for more extensive quality assessment.
SAMStat is a decent tool for sam bam file, like FASTQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (it is from Babraham bioinformatics) but for mapping stats.
ReplyDelete