Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Wednesday, July 31, 2013

tools for circular representation of omics data

OmicCircos bioconductor package is here http://bioconductor.org/packages/2.13/bioc/html/OmicCircos.html

I believe it is an R alternative to Circos http://circos.ca/

It looks like OmicCircos is easier to use than Circos....I was too reluctant to read the documentation
of Circos, and the installation and configuration look complicated....

Saturday, July 20, 2013

install bioawk in unbuntu

Bioawk is written by Heng Li, and it can handle formats like:
bed:
     1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam:
    1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf:
    1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff:
    1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx: 1:name 2:seq 3:qual 4:comment

 I followed the tutorial here https://github.com/vsbuffalo/bioawk-tutorial

my first try did not work

tommy@tommy-ThinkPad-T420:~$ git clone git://github.com/lh3/bioawk.git && cd bioawk && make && mv awk bioawk && sudo cp bioawk /usr/local/bin/
Cloning into 'bioawk'...
remote: Counting objects: 163, done.
remote: Compressing objects: 100% (89/89), done.
remote: Total 163 (delta 95), reused 136 (delta 74)
Receiving objects: 100% (163/163), 112.32 KiB, done.
Resolving deltas: 100% (95/95), done.
yacc -d awkgram.y
make: yacc: Command not found
make: *** [ytab.o] Error 127


It looks like I do not have  yacc or bison (the GNU
equivalent) installed.

tommy@tommy-ThinkPad-T420:~/bioawk$ sudo synaptic

search yacc, and install the bison.

after that, it worked.

tommy@tommy-ThinkPad-T420:~/bioawk$ bioawk
usage: bioawk [-F fs] [-v var=value] [-c fmt] [-H] [-f progfile | 'prog'] [file ...]

a quick tutorial for git:
https://github.com/vsbuffalo/git-demo



ubuntu 12.04 networking restart

If I suspend my computer at school, I can not connect to the internet when I get home. I searched on line and found the command at terminal:
 /etc/init.d/networking start

but it did not work, it says the command is deprecated....

Instead, type

tommy@tommy-ThinkPad-T420:~$ ifconfig
eth0      Link encap:Ethernet  HWaddr 00:21:cc:71:ab:50  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:20 Memory:f2500000-f2520000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3526 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3526 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:440923 (440.9 KB)  TX bytes:440923 (440.9 KB)

wlan0     Link encap:Ethernet  HWaddr 60:d8:19:c7:42:9b  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:238663 errors:0 dropped:0 overruns:0 frame:0
          TX packets:168877 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:310906582 (310.9 MB)  TX bytes:20035785 (20.0 MB)

tommy@tommy-ThinkPad-T420:~$ sudo ifconfig wlan0 down
[sudo] password for tommy: 
tommy@tommy-ThinkPad-T420:~$ sudo ifconfig wlan0 up

Now, I had no problem! Thanks to my roommate YuRen!

Wednesday, July 17, 2013

Computational Competence for Biologists

After reading this post http://software-carpentry.org/blog/2013/07/computational-competence-for-biologists.html, I realized that I am far from being competent....

currently, I know some python, R(bioconductor), linux commands, sed, awk, galaxy and feel comfortable  to use command line based software like bedtools, samtools, bowtie etc. I still need to enhance my database skills though. More importantly, I will need to have some statistics knowledge.
I signed up for

Mathematical Biostatistics Boot Camp 1


a demonstration of the need to have good statistics knowledge:
Simon Anders from EMBL has very good slides about statistics behind RNA-seq analysis.


I quite agree that bioinformatics is pretty much about visualizing the data and finding patterns. So, I will need to master skills to clean data, transform data and visualize them. R has very good packages to generate figures and matplotlib from python is also worth learning. Sometimes, I am confused with R and python pandas package. manipulations of matrix and dataframe in these two different languages are somewhat similar but still different...I have to switch my thinking mode back and forth....

Anyway, I still have a long way to go, and I know I can not learn everything, but I can learn anything that I need at that moment. 



Tuesday, July 16, 2013

run ipython in emacs, matplotlib after plot.show( ), the ipython buffer remains hanging

I was learning how to plot in python using the matplotlib library.
when I started in the terminal with

tommy@tommy-ThinkPad-T420:~$ ipython --pylab

In [1]: plot(range(4))
Out[1]: [<matplotlib.lines.Line2D at 0x37b4b50>]

the figure showed up automatically, and the prompt was active in the shell.

However, when I run it through emacs, the figure did not show up until I typed plot.show( ),
and the ipython buffer is inactive unless you close the figure.  If you type plot.show( ) again,
nothing will show up.

A quick google:

http://stackoverflow.com/questions/9753885/pylab-matplotlib-show-waits-until-window-closes

" Add pylab.ion() (interactive mode) before the pylab.show() call. That will make the UI run in a separate thread and the call to show will return immediately."








Friday, July 12, 2013

yes, you need to validate the genomic discoveries

I just came back from San Diego, CA. It is a beautiful place, and I enjoyed the staying there very much.
well, as the title stated, we need to experimentally validate the genomic discoveries, as more and more NGS studies only focus on bioinformatic analysis of the large data  but do not have access to facilities to validate the claimed conclusions.

I am switching from a bench worker to a desktop worker, but I still enjoy doing experiment and the techniques I've learned from the wet lab will serve the purpose of validating the hypothesis generated by large data mining. well, I think that's my advantage. Good understanding of biology and computer are equally important.

See a post here:
http://massgenomics.org/2013/07/functional-validation-genomic-discoveries.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Massgenomics+%28MassGenomics%29


Thursday, July 4, 2013

a python script for extracting fastq sequences

a simple python script to handle fastq file.



fast fastq file parser, extract ~10,000 reads with their sequences and quality 
from a fastq file (~70milion reads) based on names  of the reads. HengLi in Princeton https://github.com/lh3/seqtk
has a wrapper for this kind of task.
the following code is from http://www.biostars.org/p/10353/  
this script demonstrates the usage of set(hash-able), a data structure that is much faster than list
when you have two files, put the information from one file into a container, loop over the other file.
usage: cat file.fastq | python parse_fastq.py id_file.txt > selected.fastq


#! /usr/bin/env python
import sys
# get filename from parameter
idfile = sys.argv[1]

# load ids in a set with  a set comprehension 

ids = set( x.strip() for x in open(idfile) )

# read the fastq file
handle = sys.stdin

while ids:
    #parse fastq
    idline = handle.readline()
    seq   = handle.readline()
    spacer = handle.readline()
    quals = handle.readline()

    id_name = idline[:-1] # except the newline \n
    if id_name in ids:
        #print fastq
        sys.stdout.write( '%s%s%s%s%' % ( idline, seq, spacer,\ quals) )
        ids.remove(id_name)



see link here https://github.com/lh3/seqtk

Monday, July 1, 2013

book recommendation: Sed & Awk 101 Hacks

I highly recommend this book for text manipulation
http://www.thegeekstuff.com/sed-awk-101-hacks-ebook/

Through my limited experience of "bioinformatics", I realized that we spend a lot of time to reformat the data ( mostly in txt format) to some other formats which can be fed into other programs.
python has a very good re (regular expression) module for that purpose, but I do like the one liner of Sed and Awk. Sometimes, I am just too lazy to write something like:

ifile = open(<filename>)

for line in ifile:
    do some parse

ifile.close()


because Sed and Awk read file line by line, execute the command for one line, and then process the next line, it is very concise without writing the explicit for loops.

This book is very comprehensive and has a lot of practical examples making you follow easily. I just finished Chapter 5 and I found it is really helpful, although I had some experience with sed and awk.

The NGS WikiBook: a dynamic collaborative online training effort with long-term sustainability

The NGS wiki book http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)

http://intl-bib.oxfordjournals.org/content/early/2013/06/21/bib.bbt045.full

for more details, refer to the link above.
Nine simple rules to begin with NGS analysis

RULE 1: DO NOT FEAR THE COMMAND LINE

RULE 2: KNOW THE CONVENTIONS

RULE 3: READ INTRODUCTORY REVIEWS

RULE 4: START WITH QUALITY CHECKING

RULE 5: PLAN FOR MISTAKES AND DOCUMENT WORKFLOW

RULE 6: ALWAYS GET INFORMED AND GET HELP IF STUCK

RULE 7: USE AN EFFICIENT INTEGRATIVE APPROACH

RULE 8: AVOID REINVENTING THE WHEEL  ( I mentioned this several times in my blog :) )

RULE 9: EDUCATION IS IMPORTANT