Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Tuesday, June 25, 2013

Use sed to print every n lines and split a paired-end FASTQ file



tommy@tommy-ThinkPad-T420:~$ cat test.txt 
1
2
3
4
5
6
7
8
9
10
11
12

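# Print the first two lines of every four lines. The -n flag suppresses sed's default output, so only the lines
you explicitly print appear. The -e option supplies the script, in which the two p (print) commands are separated by a semicolon. The first~step address form (for example 1~4) is a GNU sed extension.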


tommy@tommy-ThinkPad-T420:~$ sed -ne '1~4p;2~4p' test.txt 
1
2
5
6
9
10
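
The first~step address is a GNU sed extension, so it may not work with other sed implementations. As a rough portable sketch, the same selection can be written with awk by testing the line number modulo 4. The temporary file and the variable names here are only for illustration.

```shell
# Recreate the 12-line test file in a temporary location.
tmpfile=$(mktemp)
seq 1 12 > "$tmpfile"

# NR is awk's current line number; keep the lines whose position
# within each group of four is 1 or 2.
every_four=$(awk 'NR % 4 == 1 || NR % 4 == 2' "$tmpfile")
echo "$every_four"
```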


This trick is useful if you have a paired-end FASTQ file and want to split it into two files.
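
As a minimal sketch of how that could look, assume the file is interleaved: each record takes exactly four lines and the two mates of a pair alternate, so the pattern repeats every eight lines. The file names and the reads below are made up for illustration, and the first~step addresses require GNU sed.

```shell
# Build a tiny interleaved FASTQ with two read pairs (made-up data).
workdir=$(mktemp -d)
cat > "$workdir/interleaved.fastq" <<'EOF'
@pair1/1
ACGT
+
IIII
@pair1/2
TTGA
+
IIII
@pair2/1
GGCC
+
IIII
@pair2/2
CCAA
+
IIII
EOF

# Lines 1-4 of every 8-line block are the first mate,
# lines 5-8 are the second mate.
sed -n '1~8p;2~8p;3~8p;4~8p' "$workdir/interleaved.fastq" > "$workdir/reads_1.fastq"
sed -n '5~8p;6~8p;7~8p;8~8p' "$workdir/interleaved.fastq" > "$workdir/reads_2.fastq"
```

If the records are wrapped over more than four lines, or the mates are not strictly alternating, this line-counting approach will not split the file correctly.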

see here:

and here: http://www.biostars.org/p/19446/ (two reads in one FASTQ)

from SRA file:

How to extract paired-end reads from SRA files

SRA (NCBI) stores each sequencing run as a single "sra" or "lite.sra" file. To use data from paired-end sequencing, you may want the mates in separate files. When I run the SRA Toolkit's "fastq-dump" utility on paired-end SRA files, I sometimes get a single file in which all the mate pairs are stored together, rather than two or three files.
The solution is to always run fastq-dump with the "--split-3" option. If the experiment is single-end sequencing, only one FASTQ file is generated. If it is paired-end sequencing, there may be two or three FASTQ files.
Two files (with the suffixes "_1" and "_2") contain the matched mate-pair reads, whereas the third (without any suffix) contains all the reads that do not have a mate (or for which SRA could not resolve the mate).
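
A call along those lines might look like the sketch below. The accession is only a placeholder to be replaced with your own run, and the guard around the command is an addition for illustration so that the snippet does nothing harmful when the SRA Toolkit is not installed.

```shell
# Placeholder accession; substitute the run you actually want.
accession="SRR000001"

if command -v fastq-dump >/dev/null 2>&1; then
    # --split-3 writes the mates to separate files and puts
    # reads without a mate into a third file.
    fastq-dump --split-3 "$accession"
else
    echo "fastq-dump not found; install the SRA Toolkit first" >&2
fi
```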

Hope my experiences with NCBI SRA data handling help the readership.
