Diving into Genetics and Genomics: speed up grep

Wednesday, October 22, 2014

speed up grep

I had a list of gene names in a text file: around 500 genes, one gene name per line. I wanted to use it to filter the GTF file from Ensembl, Homo_sapines.GRCh37.74gtf.gz

The GTF file contains 2,244,857 lines. I used grep to filter it, but it took a very long time (~1 hour).

What I used:

zcat Homo_sapines.GRCh37.74gtf.gz | grep -f gene_names.txt -w > my_genes.gtf

I searched online and found several Stack Overflow posts on speeding up grep:
http://stackoverflow.com/questions/14602963/faster-grep-function-for-big-27gb-files
http://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up
http://stackoverflow.com/questions/9066609/fastest-possible-grep

Options to speed it up:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep (equivalent to grep -F) because you're searching for fixed strings, not regular expressions.

I then used:

zcat Homo_sapines.GRCh37.74gtf.gz | LC_ALL=C fgrep -f gene_names.txt -w > my_genes.gtf

It runs much faster!
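
To check that the faster command selects the same lines as the original one, here is a small sketch on toy data. The file names, gene IDs and coordinates below are made up for illustration and are not from the real Ensembl file. It uses gzip -dc (equivalent to zcat for .gz files) and grep -F (the current spelling of fgrep).

```shell
#!/bin/sh
# Toy stand-ins for the real inputs; every name and value here is hypothetical.
workdir=$(mktemp -d)

# Gene list: one gene name per line.
printf 'TP53\nBRCA1\n' > "$workdir/gene_names.txt"

# Three GTF-like lines; TP53BP1 is a decoy that -w should reject.
{
  printf '17\ttoy\tgene\t100\t200\t.\t-\t.\tgene_id "G1"; gene_name "TP53";\n'
  printf '17\ttoy\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_name "BRCA1";\n'
  printf '15\ttoy\tgene\t500\t600\t.\t+\t.\tgene_id "G3"; gene_name "TP53BP1";\n'
} | gzip > "$workdir/toy.gtf.gz"

# Original approach: each pattern is treated as a regular expression.
gzip -dc "$workdir/toy.gtf.gz" \
  | grep -w -f "$workdir/gene_names.txt" > "$workdir/slow.gtf"

# Faster approach: C locale plus fixed-string matching.
gzip -dc "$workdir/toy.gtf.gz" \
  | LC_ALL=C grep -F -w -f "$workdir/gene_names.txt" > "$workdir/fast.gtf"

# Both commands should select exactly the same lines.
if cmp -s "$workdir/slow.gtf" "$workdir/fast.gtf"; then
  echo "identical output"
fi
```

Because of -w, the TP53BP1 line is skipped even though it contains the string TP53. To compare run times on the real file, prefix each pipeline with time.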
