The problem
I have been running PyClone recently to determine the clonality of our collected tumor samples. It requires tumor purity (estimated by sequenza) as input. I have 24 samples, and I want to generate PyClone commands for all of them.
The solution
I usually generate the commands with bash and then use another bash wrapper to generate PBS files on HPC (thanks to @SBAmin). Now, for each patient I have two samples. How should I generate the commands? I am still more fluent in R and Unix than in Python, so I used an associative array in bash.
First, generate a file containing the tumor purity for each tumor:
head -6 all_tumor_purity_no_header.txt
0.69 Pa25T1
0.26 Pa25T2
0.49 Pa26T1
0.37 Pa26T2
0.9 Pa27T1
0.92 Pa27T2
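How you collect these numbers depends on where your sequenza results live. As one hypothetical sketch, assuming one <sample>_purity.txt file per sample, each holding a single purity value:
## hypothetical layout: Pa25T1_purity.txt etc., one number per file
for f in Pa*_purity.txt
do
echo "$(cat "$f") ${f%_purity.txt}"
done > all_tumor_purity_no_header.txt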
This bash script uses an associative array to hold the tumor purities. Read more at http://www.artificialworlds.net/blog/2012/10/17/bash-associative-array-examples/
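If you have not used one before, here is a minimal sketch of the pattern (names are for illustration only):
declare -A purity            ## declare an associative array (a dict keyed by strings)
purity[Pa25T1]=0.69          ## assign a value to a key
echo "${purity[Pa25T1]}"     ## look up by key -> 0.69
echo "${!purity[@]}"         ## all keys
echo "${purity[@]}"          ## all values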
make_commands.sh
#!/bin/bash
set -euo pipefail
## build an array to hold the tumor purities, like a python dict
## associative arrays must be declared with -A
declare -A cols
while read -r purity sample
do
    cols[$sample]=$purity
done < all_tumor_purity_no_header.txt
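## sanity check: print all purities stored in the array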
echo ${cols[@]}
## generate commands
for i in Pa{25..37}
do
echo PyClone run_analysis_pipeline --in_file ${i}T1_pyclone.tsv ${i}T2_pyclone.tsv --tumour_contents ${cols[${i}T1]} ${cols[${i}T2]} --samples ${i}T1 ${i}T2 --density pyclone_binomial --working_dir ${i}T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000 > ${i}_pyclone_commands.txt
done
chmod u+x make_commands.sh
./make_commands.sh
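A quick sanity check that one command file was written per patient:
ls *_pyclone_commands.txt | wc -l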
what you get:
cat *commands.txt
PyClone run_analysis_pipeline --in_file Pa25T1_pyclone.tsv Pa25T2_pyclone.tsv --tumour_contents 0.69 0.26 --samples Pa25T1 Pa25T2 --density pyclone_binomial --working_dir Pa25T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
PyClone run_analysis_pipeline --in_file Pa26T1_pyclone.tsv Pa26T2_pyclone.tsv --tumour_contents 0.49 0.37 --samples Pa26T1 Pa26T2 --density pyclone_binomial --working_dir Pa26T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
PyClone run_analysis_pipeline --in_file Pa27T1_pyclone.tsv Pa27T2_pyclone.tsv --tumour_contents 0.9 0.92 --samples Pa27T1 Pa27T2 --density pyclone_binomial --working_dir Pa27T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
....
then:
use the makemsub wrapper:
#!/bin/bash
# Wrapper to make MSUB job format on HPC running Moab/Torque job scheduler.
# @sbamin | nautilus
## getopts schema is modified from script by @r_sabarinathan
set -e
set -u
set -o pipefail

# usage
show_help() {
cat << EOF
Wrapper to make MSUB job format on HPC running Moab/Torque job scheduler.
The only required parameter is the path to a file containing commands to be run on the cluster.
This file will be copied verbatim following the MSUB arguments.
Default MSUB options are: medium queue with 2 hours walltime, approx. 16GB RAM and 4 CPU cores, with the present working directory as the job's working directory.
Usage: ${0##*/} -a <path to file containing commands> > <job.msub>
-h display this help and exit
-j job name (default: j<random id>_username)
-w work directory (default: present work directory)
-t walltime in HH:MM:SS (default: 02:00:00)
-m memory in gb (default: 16gb)
-n number of nodes (default: 1)
-c cpu cores per node (default: 4)
-o email notifications (default: ae)
-e extra options to MSUB (default: none)
-a REQUIRED: path to file containing commands to be run on cluster. This file will be copied verbatim following MSUB arguments.
Example: ${0##*/} -j "sample_job" -w "/home/foo/myworkdir" -t 26:00:00 -m 64gb -n 1 -c 24 -o e -a "/home/foo/mycommands.txt" > /home/foo/sample.msub
Quotes are important for variable values containing spaces and special characters.
EOF
}

if [[ $# == 0 ]];then show_help;exit 1;fi

# read input
while getopts "j:w:q:t:m:n:c:o:e:a:h" opt; do
case "$opt" in
h) show_help;exit 0;;
j) JOBNAME=$OPTARG;;
w) CWD=$OPTARG;;
t) WALLTIME=$OPTARG;;
m) MEMORY=$OPTARG;;
n) NODES=$OPTARG;;
c) CPU=$OPTARG;;
o) EMAILOPTS=$OPTARG;;
e) EXTRA_OPTS=$OPTARG;;
a) MYARGS=$OPTARG;;
'?') show_help >&2; exit 1;;
esac
done

DJOBID=$(printf "j%s_%s" "$RANDOM" "$(whoami)")
JOBNAME=${JOBNAME:-$DJOBID}
CWD=${CWD:-$(pwd)}
STDOUT=$(printf "%s/log_%s.out" ${CWD} $JOBNAME)
STDERR=$(printf "%s/log_%s.err" ${CWD} $JOBNAME)
WALLTIME=${WALLTIME:-"02:00:00"}
MEMORY=${MEMORY:-"16gb"}
NODES=${NODES:-"1"}
CPU=${CPU:-"4"}
EMAILOPTS=${EMAILOPTS:-"ae"}

if [[ ! -s ${MYARGS} ]];then
echo -e "\nERROR: Command file either does not exist at ${MYARGS} or is empty.\n"
show_help
exit 1
fi

##### The following MSUB block will be parsed based on arguments supplied #####
cat <<EOF
#!/bin/bash
#MSUB -N ${JOBNAME}                 # name of the job
#MSUB -d ${CWD}                     # working dir for the job, this is <flow_run_path>/uniqueid/tmp
#MSUB -o ${STDOUT}                  # stdout is sent to this logfile
#MSUB -e ${STDERR}                  # stderr is sent to this logfile
#MSUB -l walltime=${WALLTIME}       # walltime in HH:MM:SS
#MSUB -l mem=${MEMORY}              # memory requirement, e.g. 16gb
#MSUB -l nodes=${NODES}:ppn=${CPU}  # nodes and CPU cores reserved
#MSUB -M xxx@mdanderson.org         # for notifications
#MSUB -m ${EMAILOPTS}               # when to send email (a: abort, b: begin, e: end)
#MSUB -r y                          # make the job re-runnable
#MSUB -S /bin/bash                  # use bash shell
#MSUB -V
#MSUB ${EXTRA_OPTS}                 # any extra arguments passed on to MSUB

## following MSUB options are not being used at present.
# For HPC Nautilus at MD Anderson: remove the QUEUE option of MSUB, else the job will fail. The queue is determined based on the walltime argument.
##MSUB ${DEPENDENCY} # Do not remove; dependency args come here

## --- DO NOT EDIT below this line ---- ##
## the following will always overwrite a previous output file, if any.
set +o noclobber

$(printf "echo \"BEGIN at \$(date)\" >> %s" ${STDOUT})

## File containing commands will be copied here verbatim ##
###################### START USER SUPPLIED COMMANDS ######################
$(cat ${MYARGS})
###################### END USER SUPPLIED COMMANDS ######################

exitstat=\$?

$(printf "echo \"END at \$(date)\" >> %s" ${STDOUT})
$(printf "echo \"exit status was \${exitstat}\" >> %s" ${STDOUT})
$(printf "exit \${exitstat}")

## END ##
EOF
Feed each per-patient command file to the wrapper with GNU parallel ({/.} is the input filename with its path and extension stripped, reused here as the job name):
find *commands.txt | parallel 'makemsub -a {} -j {/.} -o a -c 4 -t "48:00:00" -m 16gb > {/.}.pbs'
for pbs in *pbs
do
msub $pbs
sleep 1
done
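The sleep 1 avoids hitting the scheduler with a burst of submissions. On a Moab system you can then check on the queued jobs with something like:
showq -u $(whoami)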
I usually use this approach only for simple tasks, e.g. when each job invokes just one or two commands. For more complex workflows, a workflow tool such as snakemake is a better fit.