Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Friday, March 25, 2016

The most powerful unix commands I have learned so far: find + parallel

I have been using unix since 2013, yet I still learn new unix tricks almost every day. The most powerful commands I have learned so far are find, xargs and parallel.

Please check the GNU parallel page for documentation.
parallel has changed the way I do repetitive work. Now I use fewer and fewer for loops.
Use case 1: I have 100 folders with names starting with H3K4me3. Inside each folder, I have 5 .gz files that I want to cat together. The usual way to do it:
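#!/bin/bash

for dir in H3K4me3*/
do
    dir=${dir%/}              # drop the trailing slash left by the */ glob
    cd "$dir" || continue     # skip anything we cannot enter
    # merge all matching .gz files in this folder into one file
    cat *H3K4me3.bed.gz > "${dir}_merged.gz"
    cd ..
done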
Note that cat works directly on .gz files: gzip files concatenated this way still form a valid gzip file.
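A quick way to convince yourself of this is to try it on two tiny files. The sketch below is only a check in a scratch directory; the file names and contents are made up for the demo.

```shell
# Scratch directory with two tiny gzip files (names and contents are made up).
tmp=$(mktemp -d)
printf 'chr1\t100\t200\n' | gzip > "$tmp/rep1.bed.gz"
printf 'chr2\t300\t400\n' | gzip > "$tmp/rep2.bed.gz"

# Plain cat, no decompression step in between.
cat "$tmp/rep1.bed.gz" "$tmp/rep2.bed.gz" > "$tmp/merged.bed.gz"

# gzip reads the result as one stream that contains both records.
gzip -dc "$tmp/merged.bed.gz"

# Count the records that came back out.
merged_lines=$(gzip -dc "$tmp/merged.bed.gz" | grep -c '^chr')
echo "records in merged file: $merged_lines"
```

Both records should come back out of the merged file, in the order the inputs were given to cat.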
The parallel way:
ls -d H3K4me3* | parallel 'find {} -name "*H3K4me3*bed.gz" | xargs cat > {}_H3K4me3.bed.gz'
Using parallel, I can take full advantage of the multi-core nodes on the computing cluster, so it is much faster.
Use case 2: I have 100 folders (50 folder names start with H3K4me3, 50 start with H3K4me), and each folder has multiple levels of sub-folders. I want to delete bam files in the 50 folders whose names start with H3K4me3, but I do not know which sub-folders the bam files are in.
I do not really know a way to do it without using find. My solution would be:
ls -d H3K4me3* | parallel 'find {} -name "*bam"' | parallel rm {}
Piping through two parallel calls is the magic of this solution. Unix commands are elegant and efficient!!
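Since rm cannot be undone, it can help to look at what find matches before deleting anything. Below is a minimal sketch of that two-step habit, using find and xargs only. It runs on a made-up scratch tree (all folder and file names are placeholders), and the -type f, -print0 and -0 options are additions that are not in the one-liner above.

```shell
# Scratch tree standing in for the real folders (all names are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p H3K4me3_sampleA/align/run1 H3K4me3_sampleB H3K4me1_sampleC/align
touch H3K4me3_sampleA/align/run1/a.bam H3K4me3_sampleB/b.bam
touch H3K4me3_sampleA/align/keep.bed H3K4me1_sampleC/align/c.bam

# Step 1: only list the matches. -type f skips directories whose name ends in "bam".
find H3K4me3* -type f -name "*bam"

# Step 2: the same search, now handing the paths to rm.
# -print0 and -0 keep file names with spaces or newlines intact.
find H3K4me3* -type f -name "*bam" -print0 | xargs -0 rm -f

# What is left: the .bed file and the bam file outside the H3K4me3 folders.
find . -type f | sort
```

Only the bam files under the H3K4me3 folders are removed; everything else in the tree stays where it was.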

Edit on 04/04/2016:

With greater power comes greater responsibility. When you have too many files to process,
it is good to limit parallel to a fixed number of simultaneous jobs with -j, and to add --noswap so that it does not start new jobs while the machine is swapping.
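As a rough sketch, here is use case 1 again with both options added. It assumes GNU parallel is installed. The cap of 4 jobs and the demo folder names are placeholders, so pick a number that suits your node.

```shell
# Scratch folders standing in for the real data (all names are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3; do
    mkdir "H3K4me3_sample$i"
    for rep in 1 2; do
        echo "sample$i rep$rep" | gzip > "H3K4me3_sample$i/rep${rep}_H3K4me3.bed.gz"
    done
done

# Assumed cap: at most 4 jobs at a time.
max_jobs=4

# -j caps the number of simultaneous jobs; --noswap holds back new jobs
# while the machine is both swapping in and swapping out.
ls -d H3K4me3* | parallel -j "$max_jobs" --noswap \
    'find {} -name "*H3K4me3*bed.gz" | xargs cat > {}_H3K4me3.bed.gz'

# Each merged file should hold the two lines from its folder.
gzip -dc H3K4me3_sample1_H3K4me3.bed.gz
```

On a shared node, a modest -j value leaves CPUs free for other users, at the cost of a longer total run time.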

7 comments:

  2. Nice intro to parallel...

    In your bash script in use case #1, you're missing a 'cd ..' in the loop it seems.

    For use case #2, could your 'double parallel' be replaced with this?
    find H3K4me3* -name "*bam" -delete

    Nico Stransky

    Replies
    1. Thanks Nico, I edited #1 accordingly. For case #2, it should be find H3K4me3 -name "*bam" -exec rm -rf {} \;

    2. You are correct in the case of directories. For simple files, '-delete' works.
      Nico

    3. Because you have -name "*bam" in the command, I assumed you were only looking to delete files. -delete will work in that case. To ensure that 'find' only returns files, you can add '-type f'.
      Nico
