Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Monday, July 10, 2017

cores, cpus and threads

Some reading for the basics

cores, cpus and threads:
http://www.slac.stanford.edu/comp/unix/package/lsf/currdoc/lsf_admin/index.htm?lim_core_detection.html~main
Traditionally, the value of ncpus has been equal to the number of physical CPUs. However, many CPUs consist of multiple cores and threads, so the traditional 1:1 mapping is no longer useful. A more useful approach is to set ncpus to equal one of the following:
  • The number of processors
  • Cores—the number of cores (per processor) * the number of processors (this is the ncpus default setting)
  • Threads—the number of threads (per core) * the number of cores (per processor) * the number of processors
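These numbers can be checked directly on a Linux node; a quick sketch (nproc and getconf are standard, lscpu ships with util-linux):

```shell
# logical CPUs visible to the OS (sockets x cores/socket x threads/core)
nproc
getconf _NPROCESSORS_ONLN

# break the count down into sockets, cores per socket, and threads per core
lscpu | grep -E "^(Socket|Core|Thread|CPU\(s\))"
```

On a node without hyperthreading, nproc equals sockets * cores per socket.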
Understanding Linux CPU Load - when should you be worried?
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

Quote from our HPC Admin

From our HPC admin Sally Boyd:
On our systems there are actually 2 CPUs with 12 Cores each for a total of 24 ppn (processors per node).
We use CPU and Core interchangeably, but we shouldn’t. We do not use hyperthreading on any of our clusters because it breaks the MPI software (message passing interface, used for multi-node processing). You can consider one thread per processor/core. So the most threads you can have is 24. If various parts of your pipeline use multiple threads and they’re running at the same time, you might want to be sure that all of those add up to 24 and no more. The other thing is that there is some relatively new (to us) code out there that calls a multi-threaded R without specifying number of threads, or else it starts up several iterations of itself, such that the scheduler is not aware. This causes lots of issues. I don’t recall if the code you were running previously that used so many resources was one of those or not.

My problem

I was running parallelized freebayes on the cluster and needed to specify the number of cores: https://github.com/ekg/freebayes/blob/master/scripts/freebayes-parallel
The command I ran:
./freebayes-parallel regions_to_include_freebayes.bed 4 -f {config[ref_fa]} \
        --genotype-qualities \
        --ploidy 2 \
        --min-repeat-entropy 1 \
        --no-partial-observations \
        --report-genotype-likelihood-max \
        {params.outputdir}/{input[0]} {params.outputdir}/{output} 2> {params.outputdir}/{log} 
        
It uses GNU parallel under the hood. The core of the freebayes-parallel script:
regionsfile=$1
shift
ncpus=$1
shift

command=("freebayes" "$@")

(
#$command | head -100 | grep "^#" # generate header
# iterate over regions using gnu parallel to dispatch jobs
cat "$regionsfile" | parallel -k -j "$ncpus" "${command[@]}" --region {}
) | ../vcflib/scripts/vcffirstheader \
  | ../vcflib/bin/vcfstreamsort -w 1000 \
  | vcfuniq # remove duplicates at region edges
Note that freebayes-parallel hard-codes the ../vcflib/ path; one can add the vcflib bin and scripts directories to PATH and call vcffirstheader and vcfstreamsort directly.
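For example (assuming vcflib was cloned to $HOME/vcflib; adjust the path to your install):

```shell
# put the vcflib binaries and helper scripts on PATH once, e.g. in ~/.bashrc
export PATH="$HOME/vcflib/bin:$HOME/vcflib/scripts:$PATH"

# then the pipeline stages resolve by name, with no hard-coded ../vcflib/ prefix:
# ... | vcffirstheader | vcfstreamsort -w 1000 | vcfuniq
command -v vcffirstheader || echo "vcffirstheader not on PATH yet"
```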
How many threads will be used? In my command, I specified -j 4. Effectively, the command is:
(cat regions_to_include_freebayes.bed \
| parallel -k -j 4 "freebayes --region {} -f {config[ref_fa]} \
        --genotype-qualities \
        --ploidy 2 \
        --min-repeat-entropy 1 \
        --no-partial-observations \
        --report-genotype-likelihood-max \
        {params.outputdir}/my.sorted.bam 2> {params.outputdir}/{log})  \
| vcffirstheader \
| vcfstreamsort -w 1000 \
| vcfuniq > {params.outputdir}/{output}

At least 1 (cat) + 4 (freebayes, from -j 4) + 3 (the downstream pipe stages) = 8 processes will be used.
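One way to verify this on the node is to ask ps for the kernel thread count (NLWP) of a process; a sketch using the current shell's PID ($$) as a stand-in for the real job's PID:

```shell
# NLWP = number of lightweight processes (threads) for a PID;
# substitute the PID of freebayes-parallel to audit the real job
ps -o nlwp= -p $$

# or sum threads across every process you own on the node
ps -u "$(id -un)" -o nlwp= | awk '{s += $1} END {print s}'
```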
Checking how many cores the computing node has:
cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
... (24 identical lines in total)

grep "model name" /proc/cpuinfo | wc -l
24
I reserved 12 cores to run the job. Checking the job after submitting:
bjobs -l 220806 

## some output
 RUNLIMIT                
 1440.0 min of chms025

 MEMLIMIT
     32 G 
Mon Jul  3 16:06:49: Started 12 Task(s) on Host(s) <chms025> <chms025> <chms025
                     > <chms025> <chms025> <chms025> <chms025> <chms025> <chms0
                     25> <chms025> <chms025> <chms025>, Allocated 12 Slot(s) on
                      Host(s) <chms025> <chms025> <chms025> <chms025> <chms025>
                     <chms025> <chms025> <chms025> <chms025> <chms025> <chms025
                     > <chms025>, Execution Home </rsrch2/genomic_med/krai>, Ex
                     ecution CWD </rsrch2/genomic_med/krai/scratch/TCGA_CCLE_SK
                     CM/TCGA_SKCM_FINAL_downsample_RUN/SNV_calling>;
Mon Jul  3 21:15:41: Resource usage collected.
                     The CPU time used is 2132 seconds.
                     MEM: 1.1 Gbytes;  SWAP: 2.3 Gbytes;  **NTHREAD: 17**
                     PGID: 26713;  PIDs: 26713 26719 26722 26729 26734 26783 
                     26784 26786 26788 1301 1302 1303 1304 26785 26787 26789 


 MEMORY USAGE:
 MAX MEM: 1.9 Gbytes;  AVG MEM: 1 Gbytes
It says 17 threads are used.
I went to the computing nodes, and checked PIDs related to my job:
ssh chms025
uptime
21:19:39 up 410 days,  9:33,  1 user,  load average: **5.94, 5.91, 5.87**
 
top -u krai -M -n 1 -b | grep krai
32381 krai      20   0  486m 314m 1808 R 100.0  0.1   0:01.37 freebayes                                                 
32382 krai      20   0  240m 224m 1808 R 98.4  0.1   0:01.15 freebayes                                                  
32360 krai      20   0  195m 179m 1912 R 92.6  0.0   0:02.95 freebayes                                                  
32390 krai      20   0  204m 188m 1808 R 54.0  0.0   0:00.28 freebayes                                                  
32388 krai      20   0 15568 1648  848 R  1.9  0.0   0:00.02 top                                                        
26713 krai      20   0 20388 2684 1460 S  0.0  0.0   0:41.56 res                                                        
26719 krai      20   0  103m 1256 1032 S  0.0  0.0   0:00.00 1499116008.2208                                            
26722 krai      20   0  103m  804  556 S  0.0  0.0   0:00.00 1499116008.2208                                            
26729 krai      20   0  258m  22m 4352 S  0.0  0.0   0:02.19 python                                                     
26734 krai      20   0  105m 1420 1144 S  0.0  0.0   0:00.00 bash                                                       
26783 krai      20   0  103m 1300 1060 S  0.0  0.0   0:00.00 freebayes-paral                                            
26784 krai      20   0  103m  488  244 S  0.0  0.0   0:00.00 freebayes-paral                                            
26785 krai      20   0  115m 4872 1928 S  0.0  0.0   0:05.03 python                                                     
26786 krai      20   0  100m 1288  480 S  0.0  0.0   0:00.00 cat                                                        
26787 krai      20   0 29152  11m 1344 S  0.0  0.0   1:46.80 vcfstreamsort                                              
26788 krai      20   0  139m 9.9m 2036 S  0.0  0.0   1:11.87 perl                                                       
26789 krai      20   0 21156 1580 1308 S  0.0  0.0   1:34.24 vcfuniq                                                    
31906 krai      20   0 96072 1768  840 S  0.0  0.0   0:00.00 sshd                                                       
31907 krai      20   0  106m 2076 1464 S  0.0  0.0   0:00.07 bash                                                       
32389 krai      20   0  100m  836  732 S  0.0  0.0   0:00.00 grep       
Indeed, there are 4 freebayes processes running (-j 4 from parallel), plus 1 cat, 1 vcfstreamsort and 1 vcfuniq. As for the rest: the perl process is most likely GNU parallel itself (it is a Perl program), the python processes are likely wrapper scripts such as vcffirstheader, the bash processes come from the job's shell wrappers, and the grep is from my own top | grep krai command.
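To trace where stray processes come from, one can walk the parent-PID chain; a sketch using the current shell as the starting point (substitute the mystery PID from top):

```shell
# show pid, parent pid, and command name; follow PPID upward to find
# which wrapper launched a given process
ps -o pid=,ppid=,comm= -p $$

# procps can also draw the whole ancestry tree for your user
ps -u "$(id -un)" -o pid,ppid,comm --forest
```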

Wednesday, June 28, 2017

install VEP for annotating variants

VEP is one of the most commonly used variant annotation tools, along with ANNOVAR and SnpEff, but the installation and configuration can be very intimidating.

I just went through an installation process, and put down a gist:


Thursday, June 15, 2017

bwa aln or bwa mem for short reads (36bp)

My ChIP-seq data are 36 bp single-end reads. I usually use bowtie1 for mapping ChIP-seq reads, but bowtie1 does not handle indels. Since I want to call mutations on the ChIP-seq reads, I have to use another aligner, BWA, the most popular mapper, written by Heng Li.

The GitHub page says bwa aln should be used for reads shorter than 70 bp; otherwise, bwa mem should be used.
bwa mem is the more recent algorithm (so it should be better?).
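For reference, the two invocation shapes look like this (file names are hypothetical placeholders; the block only runs bwa if it is installed and the inputs exist):

```shell
# Hypothetical inputs -- substitute your own reference and FASTQ
ref=hg19.fa
fq=chip_36bp.fq.gz

if command -v bwa >/dev/null 2>&1 && [ -f "$ref" ] && [ -f "$fq" ]; then
    bwa index "$ref"                                # index the reference once
    # bwa aln is two steps for single-end reads:
    bwa aln -t 4 "$ref" "$fq" > chip.sai            # 1) suffix-array coordinates
    bwa samse "$ref" chip.sai "$fq" > chip_aln.sam  # 2) convert to SAM
    # bwa mem maps in a single step:
    bwa mem -t 4 "$ref" "$fq" > chip_mem.sam
else
    echo "bwa or input files not found; commands shown for reference only"
fi
```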

I searched on biostar, and found When and why is bwa aln better then bwa mem?

I did a simulation test using Teaser with the default settings for each aligner.

The results are shown below:

The mapping rate:


Memory usage:


Run time:

Indeed, BWA aln is a little better than BWA mem for short reads.

For a real data set, the samtools flagstat results are shown below:

bwa aln:
282967631 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
18963259 + 0 duplicates
240660130 + 0 mapped (85.05% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem:
282967631 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
18332921 + 0 duplicates
236558306 + 0 mapped (83.60% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Indeed, bwa aln has a moderately higher mapping rate (85.05% vs 83.60%) and a shorter run time for short 36 bp reads.

Monday, June 5, 2017

Weather and crimes in Chicago

How the story starts

I was attending the 2017 American Society of Clinical Oncology (ASCO) annual meeting in Chicago. It was my first time taking an Uber from the airport to our hotel. We had a nice Uber driver and we talked a lot about the city. He said he grew up in the south part of Chicago, where there is a lot of crime. He knows the city so well that he tries his best to avoid the bad districts when picking up customers. When strangers meet, the topic is always the weather. I certainly joked about the badness of winter in Chicago, and he replied back with the badness of the heat in Houston.
He said there are around 600 shootings per year in Chicago, and on holidays such as Memorial Day, shootings rise to 40 per day. I asked him why he thought there was such an increase. He said: "I think it has to do with the weather. On Memorial Day it becomes warmer; everybody has access to everybody. That's why the crime rate is higher as well."
As a budding data scientist (that's how I define myself) who interrogates data from DNA sequencing of tumor samples, I should not just take his word for it; I wanted to verify it with some evidence. Initially, I was surprised that weather could be associated with crime rate. Then I googled around when I got to the hotel, and it turns out this notion is very common.
See a post on this topic Chicago Crime vs. Weather: Under the Hood. The author did a very good job to demonstrate the association of weather and crime rate in Chicago.
There is even a Kaggle channel for this topic.
In the post I linked, the analysis was done in python. I want to do similar and further analysis using R. That's why I am firing up RStudio and starting to type.

Download the crime data

To associate crime rate with weather, I need data sets for them respectively.
The city of Chicago has a very nice portal for the crime data.
I downloaded the crime data with the csv format. Note that the data set is relatively big (~1.5G). My computer has enough memory to hold the whole data set.
less -S ~/Downloads/Crimes_-_2001_to_present.csv | wc -l
# 6 million lines
##  6346217
# load in the libraries
library(tidyverse)
library(ggplot2)
library(lubridate)

crime<- read_csv("~/Downloads/Crimes_-_2001_to_present.csv", col_names =T)
## split the Date column, keep only the day information

crime<- crime %>% mutate(Day = gsub("([0-9]{2}/[0-9]{2}/[0-9]{4}) .+", "\\1", Date))

## change the Day column to a date type using mdy function from lubridate
crime<- crime %>% mutate(Day_date = mdy(Day)) %>% dplyr::select(-Date, -Day)

## what's the date range for the crime data?
range(crime$Day_date)
## [1] "2001-01-01" "2017-05-27"

Get the weather data

The weather data is a bit harder to get if you follow the linked post above. Luckily, I found a prepared weather data set on the Kaggle channel (it was uploaded just two days before I wrote this! Lucky me).
UPDATE: unfortunately, that weather data set was not clean; there are many missing dates.
On Twitter, Kevin Johnson @bigkage suggested the weatherData package.
Only the development branch on GitHub works; the package on CRAN does not…
Take a look at the rnoaa package as well; a post on it: https://recology.info/2015/07/weather-data-with-rnoaa/
library("devtools")
#install_github("Ram-N/weatherData")
library(weatherData)

# O'Hare airport code is ORD, worked like a charm!
# other columns can be fetched. ?getWeatherForDate. for me, only temperature is needed
weather <- getWeatherForDate("ORD", "2001-01-01", "2017-05-27")
dim(weather)
However, only 388 records were found.
I opened an issue on github.
The reason for this is that weatherUnderground doesn’t want us to pull a huge volume of data in a single file. (For some reason, they truncate the number of rows to be 400 rows or less.)
To fetch all the data:
multi_weather <- lapply(as.character(2001:2017), 
                        function(x){getWeatherForYear("ORD", x)})

weather <- do.call(rbind, multi_weather)
Merge the two data sets
str(weather)
## 'data.frame':    5977 obs. of  4 variables:
##  $ Date             : POSIXlt, format: "2001-01-01" "2001-01-02" ...
##  $ Max_TemperatureF : int  17 44 50 57 46 54 39 45 46 51 ...
##  $ Mean_TemperatureF: int  6 24 34 42 36 40 32 32 32 40 ...
##  $ Min_TemperatureF : int  -6 3 19 26 26 26 26 18 17 28 ...
str(crime)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6346216 obs. of  22 variables:
##  $ ID                  : int  5437456 5437458 5437459 5437460 5437461 5437462 5437463 5437464 5437465 5437466 ...
##  $ Case Number         : chr  "HN270582" "HN237768" "HN247565" "HN270399" ...
##  $ Block               : chr  "028XX E 90TH ST" "013XX S CHRISTIANA AVE" "024XX S SAWYER AVE" "086XX S LAFLIN ST" ...
##  $ IUCR                : chr  "1822" "2024" "2022" "1320" ...
##  $ Primary Type        : chr  "NARCOTICS" "NARCOTICS" "NARCOTICS" "CRIMINAL DAMAGE" ...
##  $ Description         : chr  "MANU/DEL:CANNABIS OVER 10 GMS" "POSS: HEROIN(WHITE)" "POSS: COCAINE" "TO VEHICLE" ...
##  $ Location Description: chr  "STREET" "SIDEWALK" "SIDEWALK" "OTHER" ...
##  $ Arrest              : chr  "true" "true" "true" "false" ...
##  $ Domestic            : chr  "false" "false" "false" "true" ...
##  $ Beat                : chr  "0423" "1021" "1024" "0614" ...
##  $ District            : chr  "004" "010" "010" "006" ...
##  $ Ward                : int  10 24 22 21 4 10 28 32 16 7 ...
##  $ Community Area      : int  46 29 30 71 35 55 27 7 61 48 ...
##  $ FBI Code            : chr  "18" "18" "18" "14" ...
##  $ X Coordinate        : int  1196743 1154244 1155082 1167841 1180453 1199105 1154131 1165540 1164423 1192802 ...
##  $ Y Coordinate        : int  1845841 1893642 1887583 1847481 1881415 1816541 1900784 1917732 1870859 1843576 ...
##  $ Year                : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Updated On          : chr  "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" ...
##  $ Latitude            : num  41.7 41.9 41.8 41.7 41.8 ...
##  $ Longitude           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ Location            : chr  "(41.73185728, -87.55483341)" "(41.863979388, -87.709253509)" "(41.84733605, -87.706339649)" "(41.737026255, -87.660665675)" ...
##  $ Day_date            : Date, format: "2007-04-07" "2007-03-20" ...
# change the POSIXlt to date 
weather$Date<- as.Date(weather$Date)
crime_weather<- inner_join(crime, weather, by = c("Day_date" = "Date"))

## Note that weather information on some dates were missing
anti_join(crime, weather, by = c("Day_date" = "Date")) %>% .$Day_date %>% unique()
##  [1] "2005-01-23" "2004-07-28" "2001-12-25" "2001-12-24" "2001-12-22"
##  [6] "2001-11-04" "2001-11-03" "2004-08-21" "2001-05-28" "2004-08-22"
## [11] "2002-11-24" "2002-02-05" "2001-07-30" "2001-06-18" "2006-01-07"
## [16] "2001-12-23" "2004-08-20" "2006-01-06" "2006-01-08" "2007-01-01"
## [21] "2007-01-04" "2007-01-05" "2001-07-29"
Continue reading at Rpub.

Monday, May 22, 2017

when NAs are not NAs

Using TCGAbiolinks, I downloaded RNAseq data for LUAD and LUSC
library(TCGAbiolinks)
library(SummarizedExperiment)

# query_rna_LUAD.hg38 <- GDCquery(project = "TCGA-LUAD", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# 
# query_rna_LUSC.hg38 <- GDCquery(project = "TCGA-LUSC", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# GDCdownload(query_rna_LUAD.hg38, method = "client")
# GDCdownload(query_rna_LUSC.hg38, method = "client")
# 
# LUAD_rna_data <- GDCprepare(query_rna_LUAD.hg38)
# LUSC_rna_data <- GDCprepare(query_rna_LUSC.hg38)

# I have saved both R objects into disk
load("~/projects/mix_histology/data/TCGA_rna/TCGA_lung_rna.rda")
# a RangedSummarizedExperiment object
LUSC_rna_data
## class: RangedSummarizedExperiment 
## dim: 57035 551 
## metadata(0):
## assays(1): HTSeq - Counts
## rownames(57035): ENSG00000000003 ENSG00000000005 ...
##   ENSG00000281912 ENSG00000281920
## rowData names(3): ensembl_gene_id external_gene_name
##   original_ensembl_gene_id
## colnames(551): TCGA-77-8009-01A-11R-2187-07
##   TCGA-34-5239-01A-21R-1820-07 ... TCGA-NK-A7XE-01A-12R-A405-07
##   TCGA-43-6773-11A-01R-1949-07
## colData names(69): patient barcode ...
##   subtype_Homozygous.Deletions subtype_Expression.Subtype
The problem is with the metadata:
LUSC_coldata<- colData(LUSC_rna_data)
LUAD_coldata<- colData(LUAD_rna_data)

We will see the different representations of NAs:

table(LUAD_coldata$subtype_Smoking.Status, useNA = "ifany")
## 
##      Current reformed smoker for > 15 years 
##                                          78 
## Current reformed smoker for < or = 15 years 
##                                          77 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                             [Not Available] 
##                                          11 
##                                        <NA> 
##                                         349
table(LUSC_coldata$subtype_Smoking.Status, useNA = "ifany")
## 
##      Current reformed smoker for > 15 years 
##                                          51 
## Current reformed smoker for < or = 15 years 
##                                          87 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                         N/A 
##                                           6 
##                                        <NA> 
##                                         372
We see that NAs are represented either as real <NA>, as [Not Available], or as N/A. The first thing to do is to tidy the metadata, changing all of them to <NA>.
I will use stringr from the tidyverse packages.
library(stringr)
# one needs to \\ to escape [ and ]
str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA)
## Error: `replacement` must be a character vector
This does not quite work: stringr enforces that the replacement is a character vector. Use NA_character_ instead:
str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA_character_) %>% table(useNA = "ifany")
## .
## Current reformed smoker for < or = 15 years 
##                                          77 
##      Current reformed smoker for > 15 years 
##                                          78 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                                        <NA> 
##                                         360
Similarly:
str_replace(LUSC_coldata$subtype_Smoking.Status, "N/A", NA_character_) %>% table(useNA = "ifany")
## .
## Current reformed smoker for < or = 15 years 
##                                          87 
##      Current reformed smoker for > 15 years 
##                                          51 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                        <NA> 
##                                         378
To test whether a vector contains any NAs:
# this is invalid
# LUAD_coldata$subtype_Smoking.Status == NA, instead, use is.na()
is.na(LUAD_coldata$subtype_Smoking.Status) %>% table()
## .
## FALSE  TRUE 
##   245   349
The traditional base-R way to change a string to NA is:
myvector<- c("NA", "NA", "a", "b", "c", NA)
myvector
## [1] "NA" "NA" "a"  "b"  "c"  NA
myvector[myvector =="NA"]<- NA
myvector
## [1] NA  NA  "a" "b" "c" NA

Conclusions

  1. Data are messy, even when they are packaged by a big project such as TCGA.
  2. Different representations of NA can be a devil in your data; tidying them to R’s native NA is needed.
  3. stringr is a great tool to have in your belt for tidying data.