Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Sunday, March 22, 2015

benchmarking for shuf vs fast_sample

Sometimes we want to randomly select a subset of lines from a text file. The easiest way is to use the shuf command, which is part of GNU coreutils. On a Mac, you can install it with Homebrew:
brew install coreutils
Homebrew installs the GNU tools with a g prefix, so you need to invoke it as gshuf. See https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/
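If the same script has to run on both Linux (where the command is shuf) and macOS with Homebrew coreutils (where it is gshuf), one option is to pick whichever name is installed. This is only a minimal sketch; the function name find_shuf and the variable SHUF are made up for illustration:

```shell
# find_shuf: print the name under which GNU shuf is installed.
# (find_shuf is a hypothetical helper name, not part of coreutils.)
find_shuf() {
    if command -v shuf >/dev/null 2>&1; then
        echo shuf
    elif command -v gshuf >/dev/null 2>&1; then
        echo gshuf
    else
        echo "GNU shuf not found; install coreutils first" >&2
        return 1
    fi
}

# Pick 3 random lines from the numbers 1..100 with whichever name was found
SHUF=$(find_shuf) && seq 1 100 | "$SHUF" -n 3
```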
I also came across a tool called fast_sample that can do the same thing:
https://github.com/earino/fast_sample
I benchmarked the two tools against each other.

# create a test file
$time seq 1 10000000 > ten_million.txt
seq 1 10000000 > ten_million.txt 3.51s user 0.13s system 99% cpu 3.663 total
# it is a "big" file, 109M in size
$ls -lh ten_million.txt
-rw-r--r-- 1 Tammy staff 109M Mar 22 20:49 ten_million.txt
$man gshuf
# randomly select 1000 lines from it
$time gshuf -n 1000 ten_million.txt > /dev/null
gshuf -n 1000 ten_million.txt > /dev/null 0.79s user 0.03s system 96% cpu 0.853 total
# clone fast_sample and symlink the executable into /usr/local/bin
# so that I can invoke it from anywhere
git clone https://github.com/earino/fast_sample
ln -s /Users/Tammy/github_repos/fast_sample/fast_sample /usr/local/bin
$time fast_sample -n 1000 ten_million.txt > /dev/null
fast_sample -n 1000 ten_million.txt > /dev/null 4.70s user 0.04s system 99% cpu 4.770 total
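The take-home message from this benchmark is that standard Unix tools can sometimes beat tools you write yourself in speed, and possibly in memory efficiency, although only time was measured here. Sampling 1000 lines took gshuf 0.85 s in total versus 4.77 s for fast_sample, which makes gshuf roughly 5.6 times faster on this file.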
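The benchmark samples a fixed number of lines, but the same tool can also pull out a fraction of a file if you first count the lines. Here is a rough sketch of that idea; the function name sample_fraction is hypothetical, and it assumes GNU shuf is available as either shuf or gshuf:

```shell
# sample_fraction FRACTION FILE
# Print a random sample holding roughly FRACTION (e.g. 0.01 for 1%)
# of the lines in FILE. (sample_fraction is a made-up name for this sketch.)
sample_fraction() {
    fraction=$1
    file=$2
    # Use shuf if present, otherwise assume gshuf (Homebrew coreutils on macOS)
    if command -v shuf >/dev/null 2>&1; then tool=shuf; else tool=gshuf; fi
    # Count the lines in the file
    total=$(wc -l < "$file")
    # Turn the fraction into a whole number of lines (truncated)
    n=$(awk -v t="$total" -v f="$fraction" 'BEGIN { printf "%d", t * f }')
    "$tool" -n "$n" "$file"
}
```

For the test file above, sample_fraction 0.01 ten_million.txt > /dev/null would ask for about 1% of the lines. Note that this reads the file twice, once for wc and once for the sampling itself.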
