Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Thursday, June 15, 2017

bwa aln or bwa mem for short reads (36bp)

My ChIP-seq data are 36bp single-end reads. I usually use bowtie1 for mapping ChIP-seq reads, but bowtie1 does not handle indels. Since I want to call mutations from the ChIP-seq reads, I have to use another aligner: BWA, the popular mapper written by Heng Li.

The BWA GitHub page says that bwa aln should be used for reads shorter than 70bp; otherwise, bwa mem should be used.
bwa mem is the more recent algorithm (so it should be better?).
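For reference, here is roughly what each workflow looks like for single-end reads (a sketch: the reference and fastq names are placeholders, and the commands are printed rather than executed here, since bwa may not be installed):

```shell
# placeholders: substitute your own reference index and fastq file
REF=hg19.fa
FQ=sample_36bp.fastq.gz

cmds=$(cat <<EOF
# bwa aln is a two-step protocol: compute SA coordinates, then samse (single-end) to make the SAM
bwa aln -t 4 $REF $FQ > sample.sai
bwa samse $REF sample.sai $FQ > sample_aln.sam

# bwa mem is a single step
bwa mem -t 4 $REF $FQ > sample_mem.sam
EOF
)
echo "$cmds"
```

For paired-end data the second step would be bwa sampe instead of bwa samse.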

I searched on Biostars and found the thread When and why is bwa aln better then bwa mem?

I ran a simulation test with Teaser, using the default settings for each aligner.

The results are shown below:

The mapping rate:


Memory usage:


Run time:

Indeed, BWA aln is a little better than BWA mem for short reads.

For a real data set, the samtools flagstat results are shown below:

bwa aln:
282967631 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
18963259 + 0 duplicates
240660130 + 0 mapped (85.05% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem:
282967631 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
18332921 + 0 duplicates
236558306 + 0 mapped (83.60% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Indeed, bwa aln has a moderately higher mapping rate and a shorter run time for short 36bp reads.
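As an aside, to pull just the mapped percentage out of a flagstat report for quick side-by-side comparisons, a small awk helper works (a sketch; the function name is mine):

```shell
# print the mapped percentage from a samtools flagstat report,
# i.e. the "85.05" from a line like "240660130 + 0 mapped (85.05% : N/A)"
flagstat_mapped_pct() {
    awk '/ mapped \(/ { gsub(/[(%]/, "", $5); print $5; exit }' "$1"
}

# usage: flagstat_mapped_pct aln_flagstat.txt   # -> 85.05 for the bwa aln run above
```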

Monday, June 5, 2017

Weather and crimes in Chicago

How the story starts

I was attending the 2017 American Society of Clinical Oncology (ASCO) annual meeting in Chicago. It was my first time taking an Uber from the airport to our hotel. We had a nice Uber driver, and we talked a lot about the city. He said he grew up on the south side of Chicago, where there is a lot of crime. He knows the city so well that he tries his best to avoid the bad districts when picking up customers. When strangers meet, the topic always turns to the weather. I certainly joked about how bad winter is in Chicago, and he replied with how bad the heat is in Houston.
He said there are around 600 shootings per year in Chicago, and that on holidays such as Memorial Day, shootings rise to 40 per day. I asked him why he thinks there is such an increase. He said: "I think it has to do with the weather. Around Memorial Day, it becomes warmer, everybody has access to everybody, and that's why the crime rate is higher as well."
As a budding data scientist (that's how I define myself) who interrogates data from DNA sequencing of tumor samples, I should not just take his word for it; I wanted to verify it with some evidence. Initially, I was surprised that weather could be associated with crime rate. Then I googled around once I got to the hotel, and it turns out that this notion is very common.
See a post on this topic: Chicago Crime vs. Weather: Under the Hood. The author did a very good job demonstrating the association between weather and crime rate in Chicago.
There is even a Kaggle channel for this topic.
In the post I linked, the analysis was done in Python. I wanted to do similar (and some additional) analysis in R. That's why I fired up RStudio and started to type.

Download the crime data

To associate crime rate with weather, I need data sets for them respectively.
The city of Chicago has a very nice portal for the crime data.
I downloaded the crime data in csv format. Note that the data set is relatively big (~1.5G); my computer has enough memory to hold the whole thing.
less -S ~/Downloads/Crimes_-_2001_to_present.csv | wc -l
# 6 million lines
##  6346217
# load in the libraries
library(tidyverse)
library(ggplot2)
library(lubridate)

crime<- read_csv("~/Downloads/Crimes_-_2001_to_present.csv", col_names =T)
## split the Date column, keep only the day information

crime<- crime %>% mutate(Day = gsub("([0-9]{2}/[0-9]{2}/[0-9]{4}) .+", "\\1", Date))

## change the Day column to a date type using mdy function from lubridate
crime<- crime %>% mutate(Day_date = mdy(Day)) %>% dplyr::select(-Date, -Day)

## what's the date range for the crime data?
range(crime$Day_date)
## [1] "2001-01-01" "2017-05-27"

Get the weather data

The weather data is a bit harder to get if you follow the linked post above. Luckily, I found a prepared weather data set in the Kaggle channel (it was uploaded just two days before I wrote this; lucky me!).
UPDATE: unfortunately, that weather data set was not clean, and many dates were missing.
On Twitter, Kevin Johnson (@bigkage) suggested the weatherData package.
Note that only the development branch on GitHub works; the package on CRAN does not…
Take a look at the rnoaa package as well; there is a post on it: https://recology.info/2015/07/weather-data-with-rnoaa/
library("devtools")
#install_github("Ram-N/weatherData")
library(weatherData)

# O'Hare airport code is ORD, worked like a charm!
# other columns can be fetched. ?getWeatherForDate. for me, only temperature is needed
weather <- getWeatherForDate("ORD", "2001-01-01", "2017-05-27")
dim(weather)
However, only 388 records were returned.
I opened an issue on GitHub.
The reason is that Weather Underground does not want us to pull a huge volume of data in a single request. (For some reason, they truncate the result to 400 rows or fewer.)
To fetch all the data:
multi_weather <- lapply(as.character(2001:2017), 
                        function(x){getWeatherForYear("ORD", x)})

weather <- do.call(rbind, multi_weather)

Merge the two data sets

str(weather)
## 'data.frame':    5977 obs. of  4 variables:
##  $ Date             : POSIXlt, format: "2001-01-01" "2001-01-02" ...
##  $ Max_TemperatureF : int  17 44 50 57 46 54 39 45 46 51 ...
##  $ Mean_TemperatureF: int  6 24 34 42 36 40 32 32 32 40 ...
##  $ Min_TemperatureF : int  -6 3 19 26 26 26 26 18 17 28 ...
str(crime)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6346216 obs. of  22 variables:
##  $ ID                  : int  5437456 5437458 5437459 5437460 5437461 5437462 5437463 5437464 5437465 5437466 ...
##  $ Case Number         : chr  "HN270582" "HN237768" "HN247565" "HN270399" ...
##  $ Block               : chr  "028XX E 90TH ST" "013XX S CHRISTIANA AVE" "024XX S SAWYER AVE" "086XX S LAFLIN ST" ...
##  $ IUCR                : chr  "1822" "2024" "2022" "1320" ...
##  $ Primary Type        : chr  "NARCOTICS" "NARCOTICS" "NARCOTICS" "CRIMINAL DAMAGE" ...
##  $ Description         : chr  "MANU/DEL:CANNABIS OVER 10 GMS" "POSS: HEROIN(WHITE)" "POSS: COCAINE" "TO VEHICLE" ...
##  $ Location Description: chr  "STREET" "SIDEWALK" "SIDEWALK" "OTHER" ...
##  $ Arrest              : chr  "true" "true" "true" "false" ...
##  $ Domestic            : chr  "false" "false" "false" "true" ...
##  $ Beat                : chr  "0423" "1021" "1024" "0614" ...
##  $ District            : chr  "004" "010" "010" "006" ...
##  $ Ward                : int  10 24 22 21 4 10 28 32 16 7 ...
##  $ Community Area      : int  46 29 30 71 35 55 27 7 61 48 ...
##  $ FBI Code            : chr  "18" "18" "18" "14" ...
##  $ X Coordinate        : int  1196743 1154244 1155082 1167841 1180453 1199105 1154131 1165540 1164423 1192802 ...
##  $ Y Coordinate        : int  1845841 1893642 1887583 1847481 1881415 1816541 1900784 1917732 1870859 1843576 ...
##  $ Year                : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Updated On          : chr  "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" ...
##  $ Latitude            : num  41.7 41.9 41.8 41.7 41.8 ...
##  $ Longitude           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ Location            : chr  "(41.73185728, -87.55483341)" "(41.863979388, -87.709253509)" "(41.84733605, -87.706339649)" "(41.737026255, -87.660665675)" ...
##  $ Day_date            : Date, format: "2007-04-07" "2007-03-20" ...
# change the POSIXlt to date 
weather$Date<- as.Date(weather$Date)
crime_weather<- inner_join(crime, weather, by = c("Day_date" = "Date"))

## Note that weather information was missing for some dates
anti_join(crime, weather, by = c("Day_date" = "Date")) %>% .$Day_date %>% unique()
##  [1] "2005-01-23" "2004-07-28" "2001-12-25" "2001-12-24" "2001-12-22"
##  [6] "2001-11-04" "2001-11-03" "2004-08-21" "2001-05-28" "2004-08-22"
## [11] "2002-11-24" "2002-02-05" "2001-07-30" "2001-06-18" "2006-01-07"
## [16] "2001-12-23" "2004-08-20" "2006-01-06" "2006-01-08" "2007-01-01"
## [21] "2007-01-04" "2007-01-05" "2001-07-29"
Continue reading at RPubs.

Monday, May 22, 2017

when NAs are not NAs

Using TCGAbiolinks, I downloaded RNA-seq data for LUAD and LUSC:
library(TCGAbiolinks)
library(SummarizedExperiment)

# query_rna_LUAD.hg38 <- GDCquery(project = "TCGA-LUAD", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# 
# query_rna_LUSC.hg38 <- GDCquery(project = "TCGA-LUSC", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# GDCdownload(query_rna_LUAD.hg38, method = "client")
# GDCdownload(query_rna_LUSC.hg38, method = "client")
# 
# LUAD_rna_data <- GDCprepare(query_rna_LUAD.hg38)
# LUSC_rna_data <- GDCprepare(query_rna_LUSC.hg38)

# I have saved both R objects into disk
load("~/projects/mix_histology/data/TCGA_rna/TCGA_lung_rna.rda")
# a RangedSummarizedExperiment object
LUSC_rna_data
## class: RangedSummarizedExperiment 
## dim: 57035 551 
## metadata(0):
## assays(1): HTSeq - Counts
## rownames(57035): ENSG00000000003 ENSG00000000005 ...
##   ENSG00000281912 ENSG00000281920
## rowData names(3): ensembl_gene_id external_gene_name
##   original_ensembl_gene_id
## colnames(551): TCGA-77-8009-01A-11R-2187-07
##   TCGA-34-5239-01A-21R-1820-07 ... TCGA-NK-A7XE-01A-12R-A405-07
##   TCGA-43-6773-11A-01R-1949-07
## colData names(69): patient barcode ...
##   subtype_Homozygous.Deletions subtype_Expression.Subtype
The problem is with the metadata:
LUSC_coldata<- colData(LUSC_rna_data)
LUAD_coldata<- colData(LUAD_rna_data)

We will see the different representations of NAs:

table(LUAD_coldata$subtype_Smoking.Status, useNA = "ifany")
## 
##      Current reformed smoker for > 15 years 
##                                          78 
## Current reformed smoker for < or = 15 years 
##                                          77 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                             [Not Available] 
##                                          11 
##                                        <NA> 
##                                         349
table(LUSC_coldata$subtype_Smoking.Status, useNA = "ifany")
## 
##      Current reformed smoker for > 15 years 
##                                          51 
## Current reformed smoker for < or = 15 years 
##                                          87 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                         N/A 
##                                           6 
##                                        <NA> 
##                                         372
We see that NAs are represented either as real <NA>, as [Not Available], or as N/A. The first thing to do is to tidy the metadata, changing all of these NAs to <NA>.
I will use stringr from the tidyverse.
library(stringr)
# one needs \\ to escape [ and ]
str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA)
## Error: `replacement` must be a character vector
This does not quite work: stringr enforces that the replacement is a character vector, so use NA_character_ instead:
str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA_character_) %>% table(useNA = "ifany")
## .
## Current reformed smoker for < or = 15 years 
##                                          77 
##      Current reformed smoker for > 15 years 
##                                          78 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                                        <NA> 
##                                         360
Similarly:
str_replace(LUSC_coldata$subtype_Smoking.Status, "N/A", NA_character_) %>% table(useNA = "ifany")
## .
## Current reformed smoker for < or = 15 years 
##                                          87 
##      Current reformed smoker for > 15 years 
##                                          51 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                        <NA> 
##                                         378
To test whether a vector contains any NAs:
# this is invalid
# LUAD_coldata$subtype_Smoking.Status == NA, instead, use is.na()
is.na(LUAD_coldata$subtype_Smoking.Status) %>% table()
## .
## FALSE  TRUE 
##   245   349
The traditional way to change a string to NA is:
myvector<- c("NA", "NA", "a", "b", "c", NA)
myvector
## [1] "NA" "NA" "a"  "b"  "c"  NA
myvector[myvector =="NA"]<- NA
myvector
## [1] NA  NA  "a" "b" "c" NA

Conclusions

  1. Data are messy, even if the data are packaged from a big project such as TCGA.
  2. Different representations of NA can be a devil in your data; tidying them to R's native NA is needed.
  3. stringr is a great tool in your belt for tidying data.

Thursday, April 20, 2017

SHERLOCK to detect early circulating tumor DNA

Early detection of circulating tumor DNA is quite challenging. Read this comment in Cell:


And here comes SHERLOCK (Specific High Sensitivity Enzymatic Reporter UnLOCKing)!
from Nucleic acid detection with CRISPR-Cas13a/C2c2, a study led by Feng Zhang at MIT.

Abstract:
Rapid, inexpensive, and sensitive nucleic acid detection may aid point-of-care pathogen detection, genotyping, and disease monitoring. The RNA-guided, RNA-targeting CRISPR effector Cas13a (previously known as C2c2) exhibits a “collateral effect” of promiscuous RNAse activity upon target recognition. We combine the collateral effect of Cas13a with isothermal amplification to establish a CRISPR-based diagnostic (CRISPR-Dx), providing rapid DNA or RNA detection with attomolar sensitivity and single-base mismatch specificity. We use this Cas13a-based molecular detection platform, termed SHERLOCK (Specific High Sensitivity Enzymatic Reporter UnLOCKing), to detect specific strains of Zika and Dengue virus, distinguish pathogenic bacteria, genotype human DNA, and identify cell-free tumor DNA mutations. Furthermore, SHERLOCK reaction reagents can be lyophilized for cold-chain independence and long-term storage, and readily reconstituted on paper for field applications.

This may be a game-changer in the field.


several very interesting histone modification (epigenetics) papers

If you have read my previous post Misconcept of epigenetics, you will know that histone modification is not epigenetics!

However, four recent papers in Science showed that histone modifications actually have a heritable component:

Causal role for inheritance of H3K27me3 in maintaining the OFF state of a Drosophila HOX gene

DNA sequence-dependent epigenetic inheritance of gene silencing and histone H3K9 methylation

Propagation of Polycomb-repressed chromatin requires sequence-specific recruitment to DNA


Transgenerational transmission of environmental information in C. elegans

and one paper in Molecular Cell:

SNF2 Family Protein Fft3 Suppresses Nucleosome Turnover to Promote Epigenetic Inheritance and Proper Replication


and one more in Nature Genetics:

Interestingly, all these studies were carried out in model organisms such as Drosophila and C. elegans!

Read this blog post if you are into this question: what is epigenetics ?

Wednesday, April 19, 2017

sshfs on ubuntu and ssh key

Two things I want to keep a note here:

First, if you have ever set up an ssh key for connecting to a remote server, be aware that a password-less ssh key only works when your home directory on the server is not writable by group or others (e.g., not mode 777).

See this StackExchange post.
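To make the point concrete, here is a sketch of the permissions sshd expects, demonstrated on a throwaway directory standing in for your server-side home (sshd's default StrictModes check is what rejects group/world-writable setups):

```shell
# demonstrate on a temp dir standing in for $HOME on the server
H=$(mktemp -d)
mkdir -p "$H/.ssh"
touch "$H/.ssh/authorized_keys"

chmod 755 "$H"                       # home: no write for group/others (777 breaks key auth)
chmod 700 "$H/.ssh"                  # .ssh: owner only
chmod 600 "$H/.ssh/authorized_keys"  # key file: owner read/write only

stat -c '%a %n' "$H" "$H/.ssh" "$H/.ssh/authorized_keys"
```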

Second, I was following https://www.cyberciti.biz/faq/how-to-mount-remote-directory-filesystems-with-sshfs-on-linux/ to set up sshfs on my Ubuntu machine. I put down a gist below.
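(The gist itself is not reproduced here; the basic steps from the linked tutorial boil down to something like the following, with user, host, and paths as placeholders. The commands are printed rather than executed, since sshfs and a reachable server are assumed.)

```shell
steps=$(cat <<'EOF'
sudo apt-get install sshfs
mkdir -p ~/mnt/server
sshfs user@remote.example.com:/home/user ~/mnt/server
# when done, unmount:
fusermount -u ~/mnt/server
EOF
)
echo "$steps"
```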

Friday, February 24, 2017

use bash associate array to generate commands

The problem

I have been running PyClone recently to determine the clonality of our collected tumor samples. It needs the tumor purity (estimated by sequenza) as input. I have 24 samples, and I want to generate PyClone commands for all of them.

The solution

I usually generate commands with bash and then use another bash wrapper to generate PBS files on the HPC (thanks to @SBAmin). Now, for each patient, I have two samples. How should I generate the commands? I am still better at R and Unix than Python, so I used bash's associative arrays.
First, generate a file containing the tumor purity for each tumor:
head -6 all_tumor_purity_no_header.txt
0.69    Pa25T1
0.26    Pa25T2
0.49    Pa26T1
0.37    Pa26T2
0.9 Pa27T1
0.92    Pa27T2
This bash script uses an associative array to hold the tumor purities. Read more at http://www.artificialworlds.net/blog/2012/10/17/bash-associative-array-examples/
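In short, a bash associative array (bash 4+) behaves like a Python dict, with string keys:

```shell
# toy example: an associative array keyed by sample name (requires bash 4+)
declare -A purity
purity[Pa25T1]=0.69
purity[Pa25T2]=0.26

echo "${purity[Pa25T1]}"   # look up a value by key -> 0.69
echo "${!purity[@]}"       # list all keys (order is not guaranteed)
echo "${#purity[@]}"       # number of entries -> 2
```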
make_commands.sh:
#! /bin/bash
set -euo pipefail

## build the array to contain the tumor purity, like a python dict
## have to declare by -A
declare -A cols

while read purity sample
do
    cols[$sample]=$purity
done < all_tumor_purity_no_header.txt

echo ${cols[@]}

## generate commands
for i in Pa{25..37}
do
   echo PyClone run_analysis_pipeline --in_file ${i}T1_pyclone.tsv ${i}T2_pyclone.tsv --tumour_contents ${cols[${i}T1]} ${cols[${i}T2]} --samples ${i}T1 ${i}T2 --density pyclone_binomial --working_dir ${i}T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000 > ${i}_pyclone_commands.txt
done
chmod u+x make_commands.sh
./make_commands.sh
What you get from cat *commands.txt:
PyClone run_analysis_pipeline --in_file Pa25T1_pyclone.tsv Pa25T2_pyclone.tsv --tumour_contents 0.69 0.26 --samples Pa25T1 Pa25T2 --density pyclone_binomial --working_dir Pa25T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
PyClone run_analysis_pipeline --in_file Pa26T1_pyclone.tsv Pa26T2_pyclone.tsv --tumour_contents 0.49 0.37 --samples Pa26T1 Pa26T2 --density pyclone_binomial --working_dir Pa26T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
PyClone run_analysis_pipeline --in_file Pa27T1_pyclone.tsv Pa27T2_pyclone.tsv --tumour_contents 0.9 0.92 --samples Pa27T1 Pa27T2 --density pyclone_binomial --working_dir Pa27T_pyclone_analysis --min_cluster_size 2 --seed 123 --num_iters 50000
....
Then use the makemsub wrapper:


find *commands.txt | parallel 'makemsub -a {} -j {/.} -o a -c 4 -t "48:00:00" -m 16g > {/.}.pbs'

for pbs in *pbs
do
  msub $pbs
  sleep 1
done
I usually do this for simple tasks, e.g. when only one or two commands are invoked. For more complex workflows, a workflow tool such as Snakemake is better.

Wednesday, February 1, 2017

How to enable the scroll mode for tmux

tmux config file

It changes the key binding from control + b to control + a, which will be familiar if you have used screen's shortcuts.
control + a + c will create a new window. control + a + Space will move to the previous window. control + a + n will move to the next window.
control + a + ? will show you all the shortcuts.
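The prefix remap itself takes only a few lines in ~/.tmux.conf (a sketch of the usual screen-style setup; my full config has more in it):

```
# use control + a as the prefix, like GNU screen
unbind C-b
set -g prefix C-a
bind C-a send-prefix
```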

scroll mode

One problem with screen or tmux is that you have to press control + a + [ to enter copy mode and control + a + ] to paste. I want to just use the mouse to scroll up and down and copy/paste.
Read this long GitHub issue thread: https://github.com/tmux/tmux/issues/145. The solution that worked for me:
git clone https://github.com/tmux-plugins/tpm ~/.tmux/plugins/tpm
Put this at the bottom of .tmux.conf:
# List of plugins
set -g @plugin 'tmux-plugins/tpm'
set -g @plugin 'tmux-plugins/tmux-sensible'

# Other examples:
# set -g @plugin 'github_username/plugin_name'
# set -g @plugin 'git@github.com/user/plugin'
# set -g @plugin 'git@bitbucket.com/user/plugin'

# Initialize TMUX plugin manager (keep this line at the very bottom of tmux.conf)
run '~/.tmux/plugins/tpm/tpm'

To enable mouse mode in tmux 2.1+, open your .tmux.conf again and put the following line in it:
set-option -g mouse on

then add the following line to your .tmux.conf file:
set -g @plugin 'nhdaly/tmux-better-mouse-mode'
  • install it
# start a new session
tmux

# install plugin
`control + a + I` (capital I) will install all the plugins.

Now if you scroll up with your mouse, you will enter into copy mode automatically, and when you scroll down to the end of the current screen, you will exit the copy mode automatically.
  • copy and paste
If you scroll up and select the text you want to copy by left-clicking and dragging, you will exit the copy mode instantly, and the selected content will be copied to the buffer. You just need control + a + ] to paste it. Very cool!