Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Friday, March 25, 2016

The most powerful unix commands I have learned so far: find + parallel

I have been using unix since 2013, yet I still learn new unix tricks almost every day. The most powerful commands I have learned so far are find, xargs and parallel.

Please check the GNU parallel page for documentation.
parallel has changed the way I do repetitive work. Now, I use fewer and fewer for loops.
Use case 1: I have 100 folders with names starting with H3K4me3. Inside each folder there are 5 .gz files that I want to cat together. The usual way to do it:
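#!/bin/bash

for dir in H3K4me3*/
do
    dir=${dir%/}    # the glob leaves a trailing slash; strip it before building the output name
    # cat and "cd .." only run if the cd into the folder succeeded
    cd "$dir" && { cat *H3K4me3.bed.gz > "${dir}_merged.gz"; cd ..; }
done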
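Note that cat works directly on .gz files: concatenated gzip files form a valid gzip file that decompresses to the concatenation of the originals.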
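To check the claim about cat and gzip files, here is a small sketch that can be run anywhere. The file names and BED records are made up for illustration, and everything happens in a throwaway directory from mktemp.

```shell
#!/bin/bash
# Hypothetical demo files in a temporary directory; no real data is touched.
workdir=$(mktemp -d)

printf 'chr1\t100\t200\n' | gzip -c > "$workdir/rep1.bed.gz"
printf 'chr2\t300\t400\n' | gzip -c > "$workdir/rep2.bed.gz"

# Plain byte-level concatenation: nothing is decompressed or recompressed here.
cat "$workdir/rep1.bed.gz" "$workdir/rep2.bed.gz" > "$workdir/merged.bed.gz"

# Decompressing the merged file gives the records of both inputs, in order.
merged_lines=$(gunzip -c "$workdir/merged.bed.gz" | wc -l | tr -d ' ')
first_chrom=$(gunzip -c "$workdir/merged.bed.gz" | head -n 1 | cut -f 1)
echo "lines in merged file: $merged_lines, first chromosome: $first_chrom"

rm -rf "$workdir"
```

The merged file is usually a little larger than one produced by decompressing and recompressing everything together, since each input keeps its own gzip header.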
The parallel way:
ls -d H3K4me3* | parallel 'find {} -name "*H3K4me3*bed.gz" | xargs cat > {}_H3K4me3.bed.gz'
Using parallel, I can take full advantage of the multi-core nodes on the computing cluster, so it is much faster.
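On a machine where GNU parallel is not installed, a similar effect can be sketched with find and xargs alone. This is a different technique from the parallel one-liner above: it relies on the -P option of xargs (a GNU and BSD extension, not POSIX) to run several jobs at once. The sample folders, file names and the limit of 2 jobs are assumptions made for the demo.

```shell
#!/bin/bash
# Throwaway demo layout; folder and file names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for d in H3K4me3_sampleA H3K4me3_sampleB; do
    mkdir -p "$d"
    for i in 1 2 3; do
        printf '%s\tpart%s\n' "$d" "$i" | gzip -c > "$d/part${i}_H3K4me3.bed.gz"
    done
done

# One job per folder, at most 2 jobs running at the same time (-P 2).
# The folder name reaches the inner shell as $1 instead of being pasted
# into the command string, so unusual characters in names are handled safely.
find . -mindepth 1 -maxdepth 1 -type d -name 'H3K4me3*' -print0 |
    xargs -0 -n 1 -P 2 sh -c '
        dir=${1#./}
        find "$dir" -name "*H3K4me3*bed.gz" -print0 | xargs -0 cat > "${dir}_merged.bed.gz"
    ' sh

merged_count=$(ls H3K4me3_*_merged.bed.gz | wc -l | tr -d ' ')
lines_a=$(gunzip -c H3K4me3_sampleA_merged.bed.gz | wc -l | tr -d ' ')
echo "merged files: $merged_count, lines in sampleA: $lines_a"

cd / && rm -rf "$workdir"
```

The order in which find returns the parts is not guaranteed, so if the order of records matters the file list would need to be sorted before cat.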
Use case 2: I have 100 folders (50 folder names start with H3K4me3, 50 start with H3K4me), and each folder has multiple levels of sub-folders. I want to delete some bam files in the 50 folders whose names start with H3K4me3, but I do not know which sub-folders the bam files are in.
I do not really know a way to do it without using find. My solution would be:
ls -d H3K4me3* | parallel 'find {} -name "*bam"' | parallel rm {}
Piping the output of one parallel into a second parallel is the magic of this solution. Unix commands are elegant and efficient!
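Because rm cannot be undone, one cautious variation is to list the matches first and delete only after checking the list. The sketch below uses find on its own, with -type f and -delete (GNU and BSD extensions), rather than the two-parallel pipeline. The folder layout is invented for the demo, including the H3K4me1 folder that stands in for "a folder that must not be touched".

```shell
#!/bin/bash
# Throwaway demo layout; all folder and file names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p H3K4me3_sampleA/run1/aligned H3K4me3_sampleB/run2 H3K4me1_sampleC/run1
touch H3K4me3_sampleA/run1/aligned/a.bam H3K4me3_sampleB/run2/b.bam
touch H3K4me3_sampleA/run1/aligned/a.bed H3K4me1_sampleC/run1/c.bam

# Step 1: preview. Nothing is removed; this only prints what would match.
find H3K4me3* -type f -name '*.bam' -print

# Step 2: delete exactly the same set of files.
find H3K4me3* -type f -name '*.bam' -delete

# Count what is left: bam files in the target folders, bam files elsewhere,
# and non-bam files in the target folders.
bam_left_in_target=$(find H3K4me3* -type f -name '*.bam' | wc -l | tr -d ' ')
bam_left_elsewhere=$(find H3K4me1* -type f -name '*.bam' | wc -l | tr -d ' ')
bed_left=$(find H3K4me3* -type f -name '*.bed' | wc -l | tr -d ' ')
echo "bam in target: $bam_left_in_target, bam elsewhere: $bam_left_elsewhere, bed kept: $bed_left"

cd / && rm -rf "$workdir"
```

Using '*.bam' with the dot is slightly stricter than "*bam", which would also match a name such as mybam.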

Edit on 04/04/2016:

With greater power comes greater responsibility. When you have too many files to process,
it is good to limit how many jobs parallel runs at once with -j, and to pass --noswap so that it does not start new jobs while the machine is swapping.
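As a rough sketch of how the two options fit into the use case 1 command: the limit of 2 jobs, the sample folders and the output names below are assumptions for the demo. The script only calls parallel if a GNU parallel binary is found on the PATH, and otherwise does nothing.

```shell
#!/bin/bash
# Throwaway demo layout; folder and file names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for d in H3K4me3_sampleA H3K4me3_sampleB H3K4me3_sampleC; do
    mkdir -p "$d"
    printf '%s\tpeak\n' "$d" | gzip -c > "$d/part1_H3K4me3.bed.gz"
done

max_jobs=2          # assumed cap; pick a value that suits the node you are on
ran_parallel=no
merged_count=0

# Only run if the parallel on the PATH is GNU parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # -j caps the number of simultaneous jobs;
    # --noswap holds back new jobs while the machine is swapping in and out.
    printf '%s\n' H3K4me3*/ | sed 's:/$::' |
        parallel -j "$max_jobs" --noswap \
            'find {} -name "*H3K4me3*bed.gz" | xargs cat > {}_merged.bed.gz'
    ran_parallel=yes
    merged_count=$(ls H3K4me3_*_merged.bed.gz 2>/dev/null | wc -l | tr -d ' ')
else
    echo "GNU parallel not found; nothing was run"
fi
echo "ran parallel: $ran_parallel, merged files: $merged_count"

cd / && rm -rf "$workdir"
```

Adding --dry-run to the parallel call prints the commands it would run without executing them, which is a cheap way to check the quoting before a large run.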

8 comments:

  2. Nice intro to parallel...

    In your bash script in use case #1, you're missing a 'cd ..' in the loop it seems.

    For use case #2, could your 'double parallel' be replaced with this?
    find H3K4me3* -name "*bam" -delete

    Nico Stransky

    Replies
    1. Thanks Nico, I edited #1 accordingly. For case #2, it should be find H3K4me3* -name "*bam" -exec rm -rf {} \;

    2. You are correct in the case of directories. For simple files, '-delete' works.
      Nico

    3. Because you have -name "*bam" in the command, I assumed you were only looking to delete files; -delete will work in that case. To ensure that 'find' only returns files, you can add '-type f'.
      Nico
