dplyr is a nice R package for manipulating big data. I came across several good tutorials:
http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
http://stat545-ubc.github.io/block009_dplyr-intro.html
http://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html
http://gettinggeneticsdone.blogspot.com/2014/08/do-your-data-janitor-work-like-boss.html
I followed the dplyr example in the early edition of Bioinformatics Data Skills and put a gist below. The example uses dplyr to manipulate and summarize the human annotation gff file downloaded from http://useast.ensembl.org/info/data/ftp/index.html.
I downloaded the Ensembl 74 build, which differs from the 75 build used in the book's example.
I believe the GENCODE v19 annotation file is based on Ensembl 74.
I also included comments in the R code describing how to do the same job with the Linux command line.
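To give a rough idea of what the command-line side can look like, here is a minimal sketch (not the gist itself). It assumes a tab-separated annotation file with the feature type in column 3 and header lines starting with #; the toy records and the variable names are made up for illustration.

```shell
# Toy annotation file; every record below is made up for illustration.
gtf=$(mktemp)
{
  printf '#!genome-build toy\n'
  printf '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
  printf '1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1";\n'
  printf '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1";\n'
  printf '1\ttoy\texon\t500\t900\t.\t+\t.\tgene_id "G1";\n'
  printf '2\ttoy\tgene\t50\t400\t.\t-\t.\tgene_id "G2";\n'
} > "$gtf"

# Skip header lines, keep the feature column, and count each feature type.
# This is roughly what a group-by-feature count does in dplyr.
grep -v '^#' "$gtf" | cut -f3 | sort | uniq -c

# Count a single feature type with awk (column 3 equal to "exon").
exon_count=$(grep -v '^#' "$gtf" | awk -F'\t' '$3 == "exon" {n++} END {print n+0}')
echo "exons: $exon_count"
```

On a real annotation file you would replace "$gtf" with the path to the downloaded file, decompressing it first if needed.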
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Friday, September 26, 2014
Thursday, September 11, 2014
make a soft link
I got to know the ln Linux command when I was reading The Linux Command Line. At first I did not see why I would need a command that makes links to the same data or file.
One situation I faced: I have a lot of bam files with long names in one folder, and when I run a program in another folder, I have to pass it the full path to the bam files.
TAB auto-completion helps most of the time, but typing the full path plus the long names is still too much (and ugly). So I want something like a copy of the bam files in the current working directory to save me from typing the full path. This is where the ln command comes to the rescue.
The other scenario is when you have multiple executable files in one folder, ~/myprogram/bin, and you want to run them from anywhere. You could link them into /usr/local/bin, copy the executables to /usr/local/bin, or add ~/myprogram/bin to $PATH. See the discussion here: http://unix.stackexchange.com/questions/116508/adding-to-path-vs-linking-from-bin
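The linking and $PATH options can be sketched as follows. This is only an illustration: it uses throwaway directories as stand-ins for ~/myprogram/bin and /usr/local/bin so that nothing on the real system is touched, and mytool is a hypothetical program name.

```shell
# Stand-ins for ~/myprogram/bin and /usr/local/bin (both hypothetical here).
work=$(mktemp -d)
mkdir -p "$work/myprogram/bin" "$work/local/bin"

# A tiny executable called mytool (a made-up name).
printf '#!/bin/sh\necho "hello from mytool"\n' > "$work/myprogram/bin/mytool"
chmod +x "$work/myprogram/bin/mytool"
orig_path=$PATH

# Option 1: soft link the executable into a directory that is on PATH.
PATH="$work/local/bin:$orig_path"   # pretend local/bin is already on PATH
ln -s "$work/myprogram/bin/mytool" "$work/local/bin/mytool"
via_link=$(mytool)
echo "via link: $via_link"

# Option 2: add the program's own bin directory to PATH instead.
PATH="$work/myprogram/bin:$orig_path"
via_path=$(mytool)
echo "via PATH: $via_path"
```

With the real /usr/local/bin you would usually need sudo for the ln step, and a PATH change only persists if you put it in your shell startup file.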
Let me demonstrate the usage of ln.
First, look at the man page:
man ln
I have two files in the playground directory, and I make two directories:
mkdir dir1 dir2
ls
output:
dir1 dir2 genes_h.txt genes.txt
I then move the two txt files into dir1, change into dir2, and make soft links in dir2 to the files in dir1:
mv *txt dir1
cd dir2
ln -s ../dir1/*txt .
ls
If we instead go to dir1 and make the soft links from there:
cd ../dir1
ln -s ./*txt ../dir2
the links will not work (you get a "Too many levels of symbolic links" error when you try to use them). A relative target is resolved relative to the directory that holds the link, so a link in dir2 that stores ./genes.txt points back at itself. Instead, you need to specify the full path:
ln -s $PWD/*txt ../dir2
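A self-contained way to see the difference is sketched below. It works in a throwaway directory, and dir3 is an extra directory I add here only so that the broken link and the working link can sit side by side.

```shell
# Work in a throwaway directory so nothing real is touched.
playground=$(mktemp -d)
cd "$playground" || exit 1
mkdir dir1 dir2 dir3
echo "gene1" > dir1/genes.txt

# Relative target made from inside dir1: the link stores "./genes.txt",
# which is looked up relative to dir2, so the link points at itself.
(cd dir1 && ln -s ./genes.txt ../dir2)

# Full path made from inside dir1: the link stores the absolute path
# and resolves correctly no matter where the link lives.
(cd dir1 && ln -s "$PWD/genes.txt" ../dir3)

# Show what each link actually stores.
readlink dir2/genes.txt
readlink dir3/genes.txt

# Try to read through both links.
cat dir3/genes.txt
cat dir2/genes.txt 2>/dev/null || echo "dir2/genes.txt is broken"
```

The version used earlier (cd dir2; ln -s ../dir1/*txt .) works with a relative path because ../dir1 is correct when read from dir2, the directory where the links live.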
see links here http://www.thegeekstuff.com/2010/10/linux-ln-command-examples/
Wednesday, September 10, 2014
converting gene ids using bioconductor with biomaRt and annotation packages
I had a post using mygene to convert gene ids. Bioconductor can do the same job.
I put a gist on github.
For more examples see posts from Dave Tang:
http://davetang.org/muse/2013/12/16/bioconductor-annotation-packages/
http://davetang.org/muse/2013/05/23/using-the-bioconductor-annotation-packages/
http://davetang.org/muse/2013/11/25/thoughts-converting-gene-identifiers/
Wednesday, September 3, 2014
mapping gene ids with mygene
Mapping gene ids is one of the routine jobs in bioinformatics. I was aware of several ways to do it, including BioMart.
Update on 10/30/14, a mygene bioconductor package is online http://bioconductor.org/packages/release/bioc/html/mygene.html
Recently I got to know mygene, a Python wrapper for the mygene.info services that maps gene ids.
I found it very handy for converting gene ids. See the gist below.
To use it, run cat input.txt | python geneSymbol2Entrez.py > output.txt
or python geneSymbol2Entrez.py input.txt > output.txt, where input.txt contains one gene name per line. Pretty neat!