Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Thursday, July 13, 2017

mount smb on ubuntu

I use sshfs to mount remote servers, but I also want to connect Windows servers to my Ubuntu machine.

If there's one good thing I can say about Windows XP, it is that it supports the SMB protocol. This enables a computer running Windows to share files, folders, and more with another PC. All that other PC needs is the right software to take advantage of the SMB protocol. Luckily, that software is available for GNU/Linux.

On a Mac, I can click the Finder bar ---> Go ---> Connect to Server and then type in the address. I will show you how to do it on Ubuntu.

Install

First, install cifs-utils:

sudo apt-get install cifs-utils

I got Hash Sum mismatch errors:

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  keyutils libsmbclient libwbclient0 python-crypto python-ldb python-samba python-tdb samba-common samba-common-bin samba-libs
Suggested packages:
  smbclient winbind python-crypto-dbg python-crypto-doc heimdal-clients
The following NEW packages will be installed:
  cifs-utils keyutils python-crypto python-ldb python-samba python-tdb samba-common samba-common-bin
The following packages will be upgraded:
  libsmbclient libwbclient0 samba-libs
3 upgraded, 8 newly installed, 0 to remove and 353 not upgraded.
Need to get 7,317 kB of archives.
After this operation, 11.5 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 samba-libs amd64 2:4.3.11+dfsg-0ubuntu0.16.04.8 [5,178 kB]
Err:1 http://security.ubuntu.com/ubuntu xenial-security/main amd64 samba-libs amd64 2:4.3.11+dfsg-0ubuntu0.16.04.8
  Hash Sum mismatch

After googling around, I cleaned the apt cache and refreshed the package lists:

sudo apt-get clean

# now it works
sudo apt-get update
sudo apt-get install cifs-utils

Mount

# make a folder where the remote server will be mounted
sudo mkdir /mnt/genomic_med
 
sudo mount -t cifs -o username=mtang1 //d1prpccifs/genomic_med /mnt/genomic_med
# You will be prompted to type in the password.
Password for mtang1@//d1prpccifs/genomic_med:  ********
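
To avoid typing the password on every mount, mount.cifs can read it from a credentials file given with the credentials= option. Below is a minimal sketch, not something from my original setup: the helper name write_smb_credentials and the file location ~/.smbcredentials are hypothetical, and the uid/gid options are only there so that my normal user owns the mounted files.

```shell
# Write a CIFS credentials file that only its owner can read.
# Arguments: <file> <username> <password>
# (write_smb_credentials is a hypothetical helper name, not part of cifs-utils.)
write_smb_credentials() {
    local file="$1" user="$2" pass="$3"
    # Create the file with a restrictive umask so it is never world-readable.
    ( umask 077 && printf 'username=%s\npassword=%s\n' "$user" "$pass" > "$file" )
    chmod 600 "$file"
}

# Example usage (assumed file location; run the mount line yourself):
#   write_smb_credentials "$HOME/.smbcredentials" mtang1 'my-password'
#   sudo mount -t cifs \
#       -o credentials="$HOME/.smbcredentials",uid="$(id -u)",gid="$(id -g)" \
#       //d1prpccifs/genomic_med /mnt/genomic_med
```

Passing the password as an argument leaves it in the shell history, so writing the file by hand in an editor may be the safer choice.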

Check if you can access the mounted server:

ls /mnt/genomic_med

Worked :)

Monday, July 10, 2017

cores, cpus and threads

Some reading for the basics

Cores, CPUs and threads:
http://www.slac.stanford.edu/comp/unix/package/lsf/currdoc/lsf_admin/index.htm?lim_core_detection.html~main
Traditionally, the value of ncpus has been equal to the number of physical CPUs. However, many CPUs consist of multiple cores and threads, so the traditional 1:1 mapping is no longer useful. A more useful approach is to set ncpus to equal one of the following:
  • The number of processors
  • Cores—the number of cores (per processor) * the number of processors (this is the ncpus default setting)
  • Threads—the number of threads (per core) * the number of cores (per processor) * the number of processors
Understanding Linux CPU Load - when should you be worried?
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

Quote from our HPC Admin

From our HPC admin Sally Boyd:
On our systems there are actually 2 CPUs with 12 Cores each for a total of 24 ppn (processors per node).
We use CPU and Core interchangeably, but we shouldn’t. We do not use hyperthreading on any of our clusters because it breaks the MPI software (message passing interface, used for multi-node processing). You can consider one thread per processor/core. So the most threads you can have is 24. If various parts of your pipeline use multiple threads and they’re running at the same time, you might want to be sure that all of those add up to 24 and no more. The other thing is that there is some relatively new (to us) code out there that calls a multi-threaded R without specifying number of threads, or else it starts up several iterations of itself, such that the scheduler is not aware. This causes lots of issues. I don’t recall if the code you were running previously that used so many resources was one of those or not.
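
One way to guard against tools that start more threads than the scheduler knows about is to cap the common threading environment variables before launching them. This is only a sketch under assumptions: cap_threads is a hypothetical helper, and whether a given R package or library honors OMP_NUM_THREADS, OPENBLAS_NUM_THREADS or MKL_NUM_THREADS depends on how it was built.

```shell
# Cap implicit multi-threading for programs started from this shell.
# Argument: <number of threads> (defaults to 1)
# (cap_threads is a hypothetical helper name.)
cap_threads() {
    local n="${1:-1}"
    export OMP_NUM_THREADS="$n"        # OpenMP runtimes
    export OPENBLAS_NUM_THREADS="$n"   # OpenBLAS
    export MKL_NUM_THREADS="$n"        # Intel MKL
}

# Example: allow at most 4 threads per process started afterwards.
cap_threads 4
```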

My problem

I was running parallelized freebayes on the cluster and needed to specify the number of cores: https://github.com/ekg/freebayes/blob/master/scripts/freebayes-parallel
The command I ran:
./freebayes-parallel regions_to_include_freebayes.bed 4 -f {config[ref_fa]} \
        --genotype-qualities \
        --ploidy 2 \
        --min-repeat-entropy 1 \
        --no-partial-observations \
        --report-genotype-likelihood-max \
        {params.outputdir}/{input[0]} > {params.outputdir}/{output} 2> {params.outputdir}/{log}
        
It uses GNU parallel under the hood:
regionsfile=$1
shift
ncpus=$1
shift

command=("freebayes" "$@")

(
#$command | head -100 | grep "^#" # generate header
# iterate over regions using gnu parallel to dispatch jobs
cat "$regionsfile" | parallel -k -j "$ncpus" "${command[@]}" --region {}
) | ../vcflib/scripts/vcffirstheader \
  | ../vcflib/bin/vcfstreamsort -w 1000 \
  | vcfuniq # remove duplicates at region edges
Note that the ../vcflib/.. paths are hard-coded in freebayes-parallel. One can instead put the vcflib bin and scripts directories on PATH and call vcffirstheader and vcfstreamsort directly.
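
A small sketch of the PATH approach follows. The install location /opt/vcflib and the helper name add_to_path are assumptions for illustration; point VCFLIB_DIR at wherever vcflib actually lives.

```shell
# Prepend a directory to PATH unless it is already there.
# (add_to_path is a hypothetical helper name.)
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already on PATH, do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

# Assumed vcflib checkout location; override by setting VCFLIB_DIR.
VCFLIB_DIR="${VCFLIB_DIR:-/opt/vcflib}"
add_to_path "$VCFLIB_DIR/bin"       # vcfstreamsort, vcfuniq
add_to_path "$VCFLIB_DIR/scripts"   # vcffirstheader
export PATH
```

With this in place, the last lines of the script can call vcffirstheader and vcfstreamsort without the ../vcflib/ prefix.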
How many threads will be used? In my command, I passed 4 as the ncpus argument, which the script hands to parallel as -j 4. Effectively, the command is:
(cat regions_to_include_freebayes.bed \
| parallel -k -j 4 "freebayes --region {} -f {config[ref_fa]} \
        --genotype-qualities \
        --ploidy 2 \
        --min-repeat-entropy 1 \
        --no-partial-observations \
        --report-genotype-likelihood-max \
        {params.outputdir}/my.sorted.bam" 2> {params.outputdir}/{log}) \
| vcffirstheader \
| vcfstreamsort -w 1000 \
| vcfuniq > {params.outputdir}/{output}

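At least 1 (cat) + 4 (freebayes jobs from -j 4) + 3 (vcffirstheader, vcfstreamsort and vcfuniq in the pipe) = 8 processes, each with at least one thread, will run at the same time.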
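
The same arithmetic can be turned around to pick the -j value from the number of reserved slots. This is a rough sketch: pick_jobs is a hypothetical helper, the default of 4 helper processes comes from the count above, and reading the slot count from the LSF variable LSB_DJOB_NUMPROC is an assumption about how the job environment is set up.

```shell
# Suggest a value for `parallel -j` so that jobs plus helper processes
# stay within the reserved slots.
# Arguments: [slots] [helpers]
#   slots   defaults to $LSB_DJOB_NUMPROC (assumed to be set by LSF), else 1
#   helpers defaults to 4 (cat, vcffirstheader, vcfstreamsort, vcfuniq)
# (pick_jobs is a hypothetical helper name.)
pick_jobs() {
    local slots="${1:-${LSB_DJOB_NUMPROC:-1}}"
    local helpers="${2:-4}"
    local njobs=$((slots - helpers))
    if [ "$njobs" -lt 1 ]; then
        njobs=1                     # always run at least one job
    fi
    echo "$njobs"
}

# Example: with 12 reserved slots and 4 helpers, up to 8 freebayes jobs fit.
pick_jobs 12 4
```

The helpers are mostly idle while freebayes is busy, so subtracting all of them is a conservative choice.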
Checking how many cores I have on the computing nodes:
cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz

grep "model name" /proc/cpuinfo | wc -l
24
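
Counting "model name" lines gives the number of logical processors, which only equals the number of cores when hyperthreading is off. To separate the three quantities from the reading list (processors, cores, threads), here is a sketch that parses the physical id and core id fields of /proc/cpuinfo. cpu_topology is a hypothetical helper, and some virtual machines and non-x86 systems do not report these fields, in which case the socket and core counts come out as 0.

```shell
# Summarize sockets, physical cores and logical processors.
# Argument: [cpuinfo file] (defaults to /proc/cpuinfo)
# (cpu_topology is a hypothetical helper name.)
cpu_topology() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    awk -F'[ \t]*:[ \t]*' '
        $1 == "processor"   { logical++ }                 # one entry per hardware thread
        $1 == "physical id" { phys = $2; sockets[phys] = 1 }
        $1 == "core id"     { cores[phys "," $2] = 1 }    # a core is unique per socket
        END {
            ns = 0; for (s in sockets) ns++
            nc = 0; for (c in cores) nc++
            printf "sockets=%d cores=%d logical=%d\n", ns, nc, logical
        }
    ' "$cpuinfo"
}

# Example usage on a Linux node:
#   cpu_topology
```

If logical is larger than cores, hyperthreading is enabled on that node.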
I reserved 12 cores to run the job. Checking the job after submitting:
bjobs -l 220806 

## some output
 RUNLIMIT                
 1440.0 min of chms025

 MEMLIMIT
     32 G 
Mon Jul  3 16:06:49: Started 12 Task(s) on Host(s) <chms025> <chms025> <chms025
                     > <chms025> <chms025> <chms025> <chms025> <chms025> <chms0
                     25> <chms025> <chms025> <chms025>, Allocated 12 Slot(s) on
                      Host(s) <chms025> <chms025> <chms025> <chms025> <chms025>
                     <chms025> <chms025> <chms025> <chms025> <chms025> <chms025
                     > <chms025>, Execution Home </rsrch2/genomic_med/krai>, Ex
                     ecution CWD </rsrch2/genomic_med/krai/scratch/TCGA_CCLE_SK
                     CM/TCGA_SKCM_FINAL_downsample_RUN/SNV_calling>;
Mon Jul  3 21:15:41: Resource usage collected.
                     The CPU time used is 2132 seconds.
                     MEM: 1.1 Gbytes;  SWAP: 2.3 Gbytes;  **NTHREAD: 17**
                     PGID: 26713;  PIDs: 26713 26719 26722 26729 26734 26783 
                     26784 26786 26788 1301 1302 1303 1304 26785 26787 26789 


 MEMORY USAGE:
 MAX MEM: 1.9 Gbytes;  AVG MEM: 1 Gbytes
It says 17 threads are used.
I logged in to the computing node and checked the processes related to my job:
ssh chms025
uptime
21:19:39 up 410 days,  9:33,  1 user,  load average: **5.94, 5.91, 5.87**
 
top -u krai -M -n 1 -b | grep krai
32381 krai      20   0  486m 314m 1808 R 100.0  0.1   0:01.37 freebayes                                                 
32382 krai      20   0  240m 224m 1808 R 98.4  0.1   0:01.15 freebayes                                                  
32360 krai      20   0  195m 179m 1912 R 92.6  0.0   0:02.95 freebayes                                                  
32390 krai      20   0  204m 188m 1808 R 54.0  0.0   0:00.28 freebayes                                                  
32388 krai      20   0 15568 1648  848 R  1.9  0.0   0:00.02 top                                                        
26713 krai      20   0 20388 2684 1460 S  0.0  0.0   0:41.56 res                                                        
26719 krai      20   0  103m 1256 1032 S  0.0  0.0   0:00.00 1499116008.2208                                            
26722 krai      20   0  103m  804  556 S  0.0  0.0   0:00.00 1499116008.2208                                            
26729 krai      20   0  258m  22m 4352 S  0.0  0.0   0:02.19 python                                                     
26734 krai      20   0  105m 1420 1144 S  0.0  0.0   0:00.00 bash                                                       
26783 krai      20   0  103m 1300 1060 S  0.0  0.0   0:00.00 freebayes-paral                                            
26784 krai      20   0  103m  488  244 S  0.0  0.0   0:00.00 freebayes-paral                                            
26785 krai      20   0  115m 4872 1928 S  0.0  0.0   0:05.03 python                                                     
26786 krai      20   0  100m 1288  480 S  0.0  0.0   0:00.00 cat                                                        
26787 krai      20   0 29152  11m 1344 S  0.0  0.0   1:46.80 vcfstreamsort                                              
26788 krai      20   0  139m 9.9m 2036 S  0.0  0.0   1:11.87 perl                                                       
26789 krai      20   0 21156 1580 1308 S  0.0  0.0   1:34.24 vcfuniq                                                    
31906 krai      20   0 96072 1768  840 S  0.0  0.0   0:00.00 sshd                                                       
31907 krai      20   0  106m 2076 1464 S  0.0  0.0   0:00.07 bash                                                       
32389 krai      20   0  100m  836  732 S  0.0  0.0   0:00.00 grep       
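Indeed, 4 freebayes processes (-j 4 from parallel) are running, along with 1 cat, 1 vcfstreamsort and 1 vcfuniq. I was not sure at first where the 2 python, 1 perl, 1 grep and 2 bash processes come from. The grep is from my own top | grep command, and one bash (together with sshd) is my ssh session. GNU parallel is itself a Perl program, so it is most likely the perl process. My guess is that the rest are wrapper scripts: vcffirstheader does not show up under its own name, so it is probably one of the python processes.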
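
To cross-check the NTHREAD figure from bjobs directly on the node, one can sum the Threads field of /proc/PID/status over the PIDs that bjobs lists. This is a sketch, with count_threads as a hypothetical helper name; PIDs that have already exited are counted as 0.

```shell
# Sum the thread counts of the given PIDs using /proc/<pid>/status.
# Arguments: <pid> [<pid> ...]
# (count_threads is a hypothetical helper name.)
count_threads() {
    local total=0 pid n
    for pid in "$@"; do
        # The "Threads:" line holds the number of threads in the process.
        n=$(awk '/^Threads:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
        total=$((total + ${n:-0}))
    done
    echo "$total"
}

# Example usage with some of the PIDs reported by bjobs:
#   count_threads 26713 26719 26722 26729 26734
```

Because the freebayes processes started by parallel are short-lived, the total changes from moment to moment and will not always match the snapshot that bjobs took.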