Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Wednesday, May 20, 2015

Change the Mac terminal color scheme and use oh-my-zsh!

I just got my lab Mac Pro. I wanted to change the color scheme of the default terminal, which is just black and white.

I googled around and found solarized.

You can download it and unzip it.
Then, open your terminal (you can search for "Terminal" at the upper-right corner)

and go to Preferences:

then import a color scheme for Terminal from the unzipped solarized folder. I preferred Solarized Dark ansi.terminal, and set it as Default.

Now, download oh-my-zsh by:
$ curl -L | sh

You will have to install git first for the above command to work.
Now, when you fire up your terminal, it looks much prettier! (There are many other themes for oh-my-zsh; I found the default is good. You can change it by modifying the .zshrc file in your home directory.)
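Switching themes comes down to changing the ZSH_THEME line in .zshrc and starting a new shell. A minimal sketch, run on a throwaway copy so nothing real is modified ("agnoster" is just one example theme that ships with oh-my-zsh):

```shell
# Make a throwaway file standing in for ~/.zshrc ("robbyrussell" is the default theme).
echo 'ZSH_THEME="robbyrussell"' > /tmp/zshrc_demo

# Swap the theme line; on a real setup you would edit ~/.zshrc itself.
sed 's/^ZSH_THEME=.*/ZSH_THEME="agnoster"/' /tmp/zshrc_demo > /tmp/zshrc_demo_new
cat /tmp/zshrc_demo_new
```

On a real setup, `source ~/.zshrc` (or just opening a new terminal window) picks up the change.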

Tuesday, May 19, 2015

Using Awk to format the gene expression file to gct format for Gene Set Enrichment Analysis (GSEA)

After running a microarray or RNA-seq analysis, one usually does a Gene Set Enrichment Analysis (GSEA). There are many tools available; one of the most commonly used is GSEA, developed at the Broad Institute.

It requires four data files to be loaded:
1. Expression dataset in res, gct, pcl or txt format
2. Phenotype labels in cls format
3. Gene sets in gmx or gmt format
4. Chip annotations

My first impression was: oh my, why are there so many different formats? After immersing myself in the computational biology field for a while, I find that most of my time is spent on data formatting. That is consistent with many others' experiences.

Well, for this post, I will specifically show you how to format the gene expression file output from affy (for microarray) to gct format using awk. For RNA-seq data, you can do the same with DESeq2 and edgeR outputs (using normalized counts).

Let's look at the expression file output by affy:
# R code
## read in the CEL files in the working directory
Data <- ReadAffy()
## RMA normalization; returns an ExpressionSet object
eset <- rma(Data)
## expression matrix: probe IDs as row names, samples as columns
e <- exprs(eset)
## keep the default row.names=TRUE so the probe IDs are written out;
## this is what produces the shifted header described below
write.table(e, "raw_expression.txt", quote=FALSE, sep="\t")

The file we have:

The required file format:

We see that the first column contains the probe names and the other columns are expression values for different samples. The first problem is that the header line is shifted one cell to the left; the first column should be named "Name". In addition, we need to add two header lines, and a dummy "Description" column as the second column. We will fix it step by step:
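The whole fix can be sketched as one awk command. Since the original screenshots are not shown here, the sketch runs on a tiny made-up input shaped like the affy output (sample names only on the first line); the "na" values in the Description column are just placeholders, which GSEA accepts:

```shell
# Tiny demo input shaped like raw_expression.txt: the first line holds
# only the sample names, so the header is one cell off.
printf 'sampleA\tsampleB\nprobe1\t1.5\t2.5\nprobe2\t3.0\t4.0\n' > raw_expression.txt

# gct format: "#1.2", then "<n_rows><TAB><n_samples>", then a header
# starting with Name and Description, then one row per probe with a
# dummy Description column inserted.
awk 'BEGIN{FS=OFS="\t"}
     NR==1 { nsamp = NF; header = $0; next }                   # sample-name line
     { rows[++n] = $1 OFS "na" OFS substr($0, length($1) + 2) } # insert dummy column
     END {
       print "#1.2"
       print n, nsamp
       print "Name", "Description", header
       for (i = 1; i <= n; i++) print rows[i]
     }' raw_expression.txt > expression.gct

cat expression.gct
```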

Now, we have the desired format:

You can certainly open the file in Excel and edit it very easily; the file is only a few MB. However, when a file is several hundred MB or several GB, you cannot open it with Excel. Avoiding Excel is my ultimate goal, although it is very handy for small data sets. Using Excel for bioinformatics can cause problems:
it can change gene names to dates

Again, learning how to use awk is invaluable!

Thursday, May 14, 2015

Using Awk to join two files based on several columns

I was reading a thread on Stack Overflow and found this post very interesting. I will go through the problem and the awk solution. Again, awk is awesome!

It looks a bit messy when I copy the IPython notebook directly into Blogger. Ideally, one could write the blog in an IPython notebook directly, but it does not work well with Blogger. see here
I do not want to switch blogging platforms for now. So, bear with it, or go here to see the clean IPython notebook. GitHub now renders IPython notebooks.
I created some dummy files.
file_a is a tab-delimited bed file with 6 columns:
In [1]:
cat file_a.bed
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
file_b is the file that contains additional information, which we want to add to file_a:
In [2]:
cat file_b.bed
xxxx abcd chr1 123 aa c d e
yyyy defg chr1 345 aa e f g
We want to annotate file_a based on the fact that columns 3, 4 and 5 in file_b match columns 1, 2 and 3 in file_a.
To do this, we are going to use an awk associative array. see a link
Let me execute the awk one-liner first and then explain what's going on here:
In [7]:
awk 'NR==FNR{a[$3,$4,$5]=$1 OFS $2; next}{$6=a[$1,$2,$3]; print}' OFS='\t' \
file_b.bed file_a.bed
chr1 123 aa b c xxxx abcd
chr1 234 a b c 
chr1 345 aa b c yyyy defg
chr1 456 a b c 
We annotated file_a using file_b; that is, we added the first two columns of file_b to file_a.
There are several things happening here:
We see two built-in awk variables: NR and FNR. When awk reads multiple files,
NR gives the record (line) number counted across all the input files,
while FNR gives the record number within the current input file.
see a link for all the built-in variables in awk.
Let's demonstrate the difference between NR and FNR:
In [5]:
awk '{print FILENAME, NR}' file_a.bed file_b.bed
file_a.bed 1
file_a.bed 2
file_a.bed 3
file_a.bed 4
file_b.bed 5
file_b.bed 6
FILENAME is another built-in variable, holding the name of the current input file.
There are 4 lines in file_a and 2 lines in file_b, and NR increments across the total lines.
Compare with FNR:
In [6]:
awk '{print FILENAME, FNR}' file_a.bed file_b.bed
file_a.bed 1
file_a.bed 2
file_a.bed 3
file_a.bed 4
file_b.bed 1
file_b.bed 2
Now, awk prints the line numbers with respect to each file.
In the one-liner, awk reads file_b first. NR==FNR is true only while file_b is being read,
so for those lines awk runs the first code block: {a[$3,$4,$5]=$1 OFS $2; next}.
It builds an associative array named a, using columns 3, 4 and 5 of file_b as the key
and columns 1 and 2 (joined by OFS, which we set to '\t' at the end of the command) as the value.
next tells awk to proceed to the next input line rather than fall through to the second { } block.
When awk reads the second file (file_a.bed), NR==FNR is no longer true, so awk
executes the second code block: {$6=a[$1,$2,$3]; print}.
It looks up the associative array a, using the first three columns of file_a.bed as the key,
assigns the looked-up value to column 6, and prints the whole line.
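One caveat: as the earlier output shows, rows of file_a with no match in file_b end up with an empty column 6. A small variant of the same associative-array lookup (my addition, using awk's `in` operator, with the dummy files recreated inline) prints "NA" instead:

```shell
# Recreate the two example files from above.
printf 'chr1\t123\taa\tb\tc\td\nchr1\t234\ta\tb\tc\td\nchr1\t345\taa\tb\tc\td\nchr1\t456\ta\tb\tc\td\n' > file_a.bed
printf 'xxxx\tabcd\tchr1\t123\taa\tc\td\te\nyyyy\tdefg\tchr1\t345\taa\te\tf\tg\n' > file_b.bed

# Same join as before, but test key membership with "in" so that
# unmatched rows get an explicit "NA" in column 6.
awk 'NR==FNR { a[$3,$4,$5] = $1 OFS $2; next }
     { $6 = (($1,$2,$3) in a) ? a[$1,$2,$3] : "NA"; print }' \
    OFS='\t' file_b.bed file_a.bed > annotated.bed
cat annotated.bed
```

An explicit "NA" is easier to filter downstream than a silently empty field.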
Conclusion: awk is very powerful for text wrangling. Once you get used to the
syntax, you can do fairly complicated formatting in an awk one-liner.
I strongly recommend learning it.

Monday, May 4, 2015

IBash Notebook for reproducible research

I use the command line a lot. It is awesome for data processing, text formatting and even exploratory data analysis.

Last week, one of my colleagues complained that she forgot how she got and processed her data. With a future "me" in mind, one needs to document extensively where, when and how the data were downloaded and processed, and record the versions of the tools used in the analysis.

Although there are many ways to make command-line tasks reproducible, such as using Drake or GNU make, it is still not as straightforward as using an IPython Notebook for python or R markdown files for R.

Luckily, I learned from Jeroen Janssens, who wrote the book "Data Science at the Command Line", that there is a bash_kernel for the IPython notebook, and I gave it a try.
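For reference, getting the bash kernel working is two commands, a sketch assuming pip and IPython/Jupyter are already installed (bash_kernel is the package name the project uses):

```shell
# Install the kernel package, then register it with IPython/Jupyter.
pip install bash_kernel
python -m bash_kernel.install
```

After that, "Bash" shows up as a kernel choice when creating a new notebook.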
See a screenshot of the notebook:

See the whole notebook here:

Essentially, I copied the .ipynb file (it is a JSON file), pasted it into a gist, and fed the gist link to the nbviewer website.

With an IBash Notebook, one can document Linux commands in real time and make one's research more reproducible!