
Friday, February 24, 2017

use bash associative arrays to generate commands

The problem

I have been running PyClone recently to determine the clonality of our collected tumor samples. It needs tumor purity (estimated from sequenza) as input. I have 24 samples, and I want to generate PyClone commands for all of them.

The solution

I usually generate commands with bash and then use another bash wrapper to generate PBS files on the HPC (thanks to @SBAmin). Now, for each patient I have two samples. How should I generate the commands? I am still better at R and Unix than Python, so I used an associative array in bash.
First, generate a file containing the tumor purity for each tumor:
head -6 all_tumor_purity_no_header.txt
0.69    Pa25T1
0.26    Pa25T2
0.49    Pa26T1
0.37    Pa26T2
0.9 Pa27T1
0.92    Pa27T2
This bash script uses an associative array to hold the tumor purities. Read more at http://www.artificialworlds.net/blog/2012/10/17/bash-associative-array-examples/
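If bash associative arrays are new to you, here is a minimal standalone sketch of the syntax used below (the sample names are made up for illustration):

declare -A purity                # must be declared with -A
purity[Pa25T1]=0.69              # assign a value to a key
purity[Pa25T2]=0.26
echo "${purity[Pa25T1]}"         # look up one key: prints 0.69
echo "${!purity[@]}"             # all keys
echo "${purity[@]}"              # all values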
make_commands.sh:
#!/bin/bash
set -euo pipefail

## build the array to contain the tumor purity, like a python dict
## it has to be declared with -A
declare -A cols

while read purity sample
do
    cols[$sample]=$purity
done < all_tumor_purity_no_header.txt

## print all the values as a sanity check
echo "${cols[@]}"

## generate commands
for i in Pa{25..37}
do
   echo PyClone run_analysis_pipeline --in_file ${i}T1_pyclone.tsv ${i}T2_pyclone.tsv --tumour_contents ${cols[${i}T1]} ${cols[${i}T2]} --samples ${i}T1 ${i}T2 --density pyclone_binomial --working_dir ${i}T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000 > ${i}_pyclone_commands.txt
done
chmod u+x make_commands.sh
./make_commands.sh
What you get from cat *commands.txt:
PyClone run_analysis_pipeline --in_file Pa25T1_pyclone.tsv Pa25T2_pyclone.tsv --tumour_contents 0.69 0.26 --samples Pa25T1 Pa25T2 --density pyclone_binomial --working_dir Pa25T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
PyClone run_analysis_pipeline --in_file Pa26T1_pyclone.tsv Pa26T2_pyclone.tsv --tumour_contents 0.49 0.37 --samples Pa26T1 Pa26T2 --density pyclone_binomial --working_dir Pa26T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
PyClone run_analysis_pipeline --in_file Pa27T1_pyclone.tsv Pa27T2_pyclone.tsv --tumour_contents 0.9 0.92 --samples Pa27T1 Pa27T2 --density pyclone_binomial --working_dir Pa27T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
....
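Before submitting anything, a quick sanity check is worth doing: with patients Pa25 to Pa37 there should be one command file per patient, and every command should carry its purity values. A minimal check, assuming the file names produced by the script above:

ls Pa*_pyclone_commands.txt | wc -l                      # expect 13, one file per patient
grep -L -e '--tumour_contents' Pa*_pyclone_commands.txt  # lists any file missing the purity flag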
Then, use the makemsub wrapper:

#!/bin/bash
# Wrapper to make MSUB job format on HPC running Moab/Torque job scheduler.
# @sbamin | nautilus
## getopts schema is modified from a script by @r_sabarinathan
set -e
set -u
set -o pipefail
# usage
show_help() {
cat << EOF
Wrapper to make MSUB job format on HPC running Moab/Torque job scheduler.
Only required parameter is path to file containing commands to be run on cluster.
This file will be copied verbatim following MSUB arguments.
Default MSUB options are: medium queue with 2 hours walltime, approx. 16GB RAM and 4 CPU cores, with the present working directory as the work directory.
Usage: ${0##*/} -a <path to file containing commands> > <job.msub>
-h display this help and exit
-j job name (default: j<random id>_username)
-w work directory (default: present work directory)
-t walltime in HH:MM:SS (default: 02:00:00)
-m memory in gb (default: 16gb)
-n number of nodes (default: 1)
-c cpu cores per node (default: 4)
-o email notifications (default: ae)
-e extra options to MSUB (default: none)
-a REQUIRED: path to file containing commands to be run on cluster. This file will be copied verbatim following MSUB arguments.
Example: ${0##*/} -j "sample_job" -w "/home/foo/myworkdir" -t 26:00:00 -m 64gb -n 1 -c 24 -o e -a "/home/foo/mycommands.txt" > /home/foo/sample.msub
Quotes are important for variable values containing spaces and special characters.
EOF
}
if [[ $# == 0 ]];then show_help;exit 1;fi
# read input
while getopts "j:w:q:t:m:n:c:o:e:a:h" opt; do
case "$opt" in
h) show_help;exit 0;;
j) JOBNAME=$OPTARG;;
w) CWD=$OPTARG;;
t) WALLTIME=$OPTARG;;
m) MEMORY=$OPTARG;;
n) NODES=$OPTARG;;
c) CPU=$OPTARG;;
o) EMAILOPTS=$OPTARG;;
e) EXTRA_OPTS=$OPTARG;;
a) MYARGS=$OPTARG;;
'?')show_help >&2; exit 1 ;;
esac
done
DJOBID=$(printf "j%s_%s" "$RANDOM" "$(whoami)")
JOBNAME=${JOBNAME:-$DJOBID}
CWD=${CWD:-$(pwd)}
STDOUT=$(printf "%s/log_%s.out" "${CWD}" "${JOBNAME}")
STDERR=$(printf "%s/log_%s.err" "${CWD}" "${JOBNAME}")
WALLTIME=${WALLTIME:-"02:00:00"}
MEMORY=${MEMORY:-"16gb"}
NODES=${NODES:-"1"}
CPU=${CPU:-"4"}
EMAILOPTS=${EMAILOPTS:-"ae"}
MYARGS=${MYARGS:-}
if [[ ! -s ${MYARGS} ]];then
echo -e "\nERROR: Command file at ${MYARGS} either does not exist or is empty.\n"
show_help
exit 1
fi
##### Following MSUB block will be parsed based on arguments supplied #####
cat <<EOF
#!/bin/bash
#MSUB -N ${JOBNAME} # name of the job
#MSUB -d ${CWD} # the working dir for each job, this is <flow_run_path>/uniqueid/tmp
#MSUB -o ${STDOUT} # output is sent to logfile, stdout + stderr by default
#MSUB -e ${STDERR} # output is sent to logfile, stdout + stderr by default
#MSUB -l walltime=${WALLTIME} # walltime in HH:MM:SS
#MSUB -l mem=${MEMORY} # memory requirement, e.g. 16gb
#MSUB -l nodes=${NODES}:ppn=${CPU} # CPU reserved
#MSUB -M xxx@mdanderson.org # for notifications
#MSUB -m ${EMAILOPTS} # email notification options, e.g. ae = on abort and end
#MSUB -r y # make the jobs re-runnable
#MSUB -S /bin/bash # use bash shell
#MSUB -V
#MSUB ${EXTRA_OPTS} # Any extra arguments passed onto queue
## following MSUB options are not being used at present.
# For HPC Nautilus at MDAnderson: remove the QUEUE option of MSUB, else the job will fail. The queue will be determined based on the walltime argument.
##MSUB ${DEPENDENCY} # Do not remove; dependency args come here
## --- DO NOT EDIT from below here---- ##
## following will always overwrite previous output file, if any.
set +o noclobber
$(printf "echo \"BEGIN at \$(date)\" >> %s" ${STDOUT})
## File containing commands will be copied here verbatim ##
###################### START USER SUPPLIED COMMANDS ######################
$(cat ${MYARGS})
###################### END USER SUPPLIED COMMANDS ######################
exitstat=\$?
$(printf "echo \"END at \$(date)\" >> %s" ${STDOUT})
$(printf "echo \"exit status was \${exitstat}\" >> %s" ${STDOUT})
$(printf "exit \${exitstat}")
## END ##
EOF

find *commands.txt | parallel 'makemsub -a {} -j {/.} -o a -c 4 -t "48:00:00" -m 16gb > {/.}.pbs'
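If GNU parallel's replacement strings are unfamiliar: {} is the input line as-is, and {/.} is the input with its directory and extension stripped, so Pa25_pyclone_commands.txt becomes the job name Pa25_pyclone_commands. Adding --dry-run prints the generated commands without running them, which is a safe way to check the substitutions first:

find *commands.txt | parallel --dry-run 'makemsub -a {} -j {/.} -o a -c 4 -t "48:00:00" -m 16gb > {/.}.pbs'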

for pbs in *pbs
do
  msub $pbs
  sleep 1
done
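After submission, you can keep an eye on the queue; on a Moab/Torque cluster something like the following works (assuming showq or qstat is available on your HPC):

showq -u $(whoami)          # Moab: show your queued and running jobs
# or: qstat -u $(whoami)    # Torque equivalent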
I usually do this only for simple tasks, e.g. when just one or two commands are invoked. For more complex workflows, a workflow tool such as snakemake is a better choice.
