Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Friday, September 26, 2014

everyone will write a blog post on dplyr

dplyr is a nice R package for manipulating large data sets. Here are several good tutorials I came across:

http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
http://stat545-ubc.github.io/block009_dplyr-intro.html
http://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html
http://gettinggeneticsdone.blogspot.com/2014/08/do-your-data-janitor-work-like-boss.html

I followed the dplyr example in the early edition of Bioinformatics Data Skills and put a gist below. The example uses dplyr to manipulate and summarize the human annotation gff file, which can be downloaded from http://useast.ensembl.org/info/data/ftp/index.html.

I downloaded the Ensembl 74 build, which is different from the 75 build used in the book's example.
I believe the GENCODE v19 annotation file is based on Ensembl 74.

I also included comments in the R code describing how to do the same job on the Linux command line.

Thursday, September 11, 2014

make a soft link

I got to know the Linux ln command when I was reading The Linux Command Line. At the time, I did not see why I would need a command that makes links to the same data or file.

One situation I faced: I have a lot of bam files with long names in one folder, and when I run a program from another folder, I have to pass it the full path to the bam files.
Although TAB auto-completion helps most of the time, typing the full path plus the long names is still too much (and ugly). So I want the bam files to show up in the current working directory, without really copying them, to save me from typing the full path. This is where the ln command comes to the rescue.

The other scenario: you have multiple executables in one folder, ~/myprogram/bin, and you want to run them from anywhere. You can link them into /usr/local/bin, copy them to /usr/local/bin, or add ~/myprogram/bin to $PATH. See this discussion: http://unix.stackexchange.com/questions/116508/adding-to-path-vs-linking-from-bin

Let me demonstrate the usage of ln.

First, look at the man page:
man ln
# I have two files in the playground directory and make two directories 

mkdir dir1 dir2
ls

output:
dir1  dir2  genes_h.txt  genes.txt


I then move the two txt files into dir1, change into dir2, and make soft links there that point to the files in dir1:
mv *txt dir1
cd dir2
ln -s ../dir1/*txt .
ls
If we instead go to dir1 and make the soft links from there:
cd ../dir1
ln -s ./*txt  ../dir2

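These links will not work: trying to use one gives a "Too many levels of symbolic links" error. A relative target is stored exactly as typed and is resolved relative to the directory that holds the link, so ../dir2/genes.txt ends up pointing at ./genes.txt, which is the link itself. Instead, you need to specify the full path: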

ln -s "$PWD"/*txt ../dir2
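
The difference between the broken and the working links can be reproduced in a throwaway directory. Below is a minimal Python sketch, assuming a POSIX system where os.symlink is allowed. The function names build_links and link_resolves and the file genes_rel.txt are made up for this illustration; the dir1/dir2 layout and the two txt file names follow the example above.

```python
import os
import tempfile


def build_links(base):
    """Recreate the dir1/dir2 layout and create three kinds of soft link in dir2."""
    dir1 = os.path.join(base, "dir1")
    dir2 = os.path.join(base, "dir2")
    os.mkdir(dir1)
    os.mkdir(dir2)
    for name in ("genes.txt", "genes_h.txt"):
        with open(os.path.join(dir1, name), "w") as handle:
            handle.write("placeholder\n")

    # label -> (path of the link, target string stored inside the link)
    links = {
        # like `ln -s ./genes.txt ../dir2` run from dir1: the link points at itself
        "self_loop": (os.path.join(dir2, "genes.txt"), "./genes.txt"),
        # like `ln -s $PWD/genes_h.txt ../dir2` run from dir1: absolute target
        "absolute": (os.path.join(dir2, "genes_h.txt"),
                     os.path.join(dir1, "genes_h.txt")),
        # like `ln -s ../dir1/genes.txt genes_rel.txt` run from dir2:
        # the target is relative to the directory that holds the link
        "relative": (os.path.join(dir2, "genes_rel.txt"), "../dir1/genes.txt"),
    }
    for link_path, target in links.values():
        os.symlink(target, link_path)
    return {label: link_path for label, (link_path, _) in links.items()}


def link_resolves(link_path):
    """True if link_path is a soft link whose target can actually be reached."""
    return os.path.islink(link_path) and os.path.exists(link_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        for label, path in build_links(base).items():
            # show the stored target and whether the link is usable
            print(label, os.readlink(path), link_resolves(path))
```

os.readlink shows that the target string is stored verbatim. That is why a relative target only works when it is written relative to the directory that holds the link, not relative to the directory where ln was run.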


See more ln examples here: http://www.thegeekstuff.com/2010/10/linux-ln-command-examples/

Wednesday, September 3, 2014

mapping gene ids with mygene

Mapping gene ids is one of the routine jobs in bioinformatics. I was aware of several ways to do it, including BioMart.

Update on 10/30/14: a mygene Bioconductor package is now online at http://bioconductor.org/packages/release/bioc/html/mygene.html

Recently I got to know mygene, a Python wrapper for the mygene.info services that maps gene ids.
I found it very handy for converting gene ids. See the gist below.

To use it:
cat input.txt | python geneSymbol2Entrez.py > output.txt
or
python geneSymbol2Entrez.py input.txt > output.txt
where input.txt contains one gene name per line. Pretty neat!
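
For readers who cannot see the gist, here is a rough sketch of how a script with this interface could be put together. It is not the gist itself: the function names (read_symbols, symbols_to_entrez, main), the species="human" filter, and the tab-separated output with NA for unmatched symbols are all assumptions made for illustration. The standard-library fileinput module gives the "file argument or stdin" behaviour, and the lookup uses querymany from the mygene Python client, which has to be installed separately.

```python
import fileinput


def read_symbols(lines):
    """Return the non-empty gene symbols, one per input line, whitespace stripped."""
    return [line.strip() for line in lines if line.strip()]


def symbols_to_entrez(symbols, client=None):
    """Map gene symbols to Entrez gene ids; symbols without a match map to None."""
    if client is None:
        import mygene  # third-party package, e.g. `pip install mygene`
        client = mygene.MyGeneInfo()
    # Assumption: human genes only; drop `species` to search all species
    hits = client.querymany(symbols, scopes="symbol",
                            fields="entrezgene", species="human")
    mapping = {symbol: None for symbol in symbols}
    for hit in hits:
        # keep the first hit that carries an Entrez id for each queried symbol
        if "entrezgene" in hit and mapping.get(hit["query"]) is None:
            mapping[hit["query"]] = str(hit["entrezgene"])
    return mapping


def main():
    # fileinput reads the files named on the command line, or stdin if there are none
    symbols = read_symbols(fileinput.input())
    for symbol, entrez in symbols_to_entrez(symbols).items():
        print(symbol, entrez if entrez is not None else "NA", sep="\t")
```

Calling main() at the bottom of the file (under the usual `if __name__ == "__main__":` guard) would let both command forms above work with the same script. The optional client argument is only there so the mapping logic can be tried without a network connection.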