Wednesday, July 31, 2013
The OmicCircos Bioconductor package is here: http://bioconductor.org/packages/2.13/bioc/html/OmicCircos.html
I believe it is an R alternative to Circos: http://circos.ca/
OmicCircos looks easier to use than Circos. I was too reluctant to read the Circos documentation, and its installation and configuration look complicated.
Thursday, July 25, 2013
DESeq2 is now in Bioconductor
http://seqanswers.com/forums/showthread.php?t=28350
http://bioconductor.org/packages/2.12/bioc/html/DESeq2.html
Thank you Simon and your team!
Tuesday, July 23, 2013
jMOSAiCS: joint analysis of multiple ChIP-seq datasets
See the paper here: http://genomebiology.com/content/14/4/R38
It is an R package; I may consider using it.
Saturday, July 20, 2013
install bioawk in Ubuntu
Bioawk is written by Heng Li, and it can handle formats like:
bed:
1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam:
1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf:
1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff:
1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx:
1:name 2:seq 3:qual 4:comment
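With -c fastx, those named columns can be used directly in an awk-style one-liner; a quick sketch (reads.fq is just a placeholder file name):

bioawk -c fastx '{ print $name, length($seq) }' reads.fq

This prints each sequence name and its length.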
I followed the tutorial here: https://github.com/vsbuffalo/bioawk-tutorial
My first try did not work:
tommy@tommy-ThinkPad-T420:~$ git clone git://github.com/lh3/bioawk.git && cd bioawk && make && mv awk bioawk && sudo cp bioawk /usr/local/bin/
Cloning into 'bioawk'...
remote: Counting objects: 163, done.
remote: Compressing objects: 100% (89/89), done.
remote: Total 163 (delta 95), reused 136 (delta 74)
Receiving objects: 100% (163/163), 112.32 KiB, done.
Resolving deltas: 100% (95/95), done.
yacc -d awkgram.y
make: yacc: Command not found
make: *** [ytab.o] Error 127
It looks like I do not have yacc or bison (the GNU equivalent) installed.
tommy@tommy-ThinkPad-T420:~/bioawk$ sudo synaptic
Search for yacc, and install bison. After that, it worked:
tommy@tommy-ThinkPad-T420:~/bioawk$ bioawk
usage: bioawk [-F fs] [-v var=value] [-c fmt] [-H] [-f progfile | 'prog'] [file ...]
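For reference, installing bison straight from the terminal should work just as well as going through Synaptic (I used Synaptic above, so this is an untried but equivalent route):

sudo apt-get install bison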
A quick tutorial for git:
https://github.com/vsbuffalo/git-demo
ubuntu 12.04 networking restart
If I suspend my computer at school, I cannot connect to the internet when I get home. I searched online and found this command to run in the terminal:
/etc/init.d/networking start
but it did not work; it said the command is deprecated.
Instead, type:
tommy@tommy-ThinkPad-T420:~$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:21:cc:71:ab:50
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Interrupt:20 Memory:f2500000-f2520000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:3526 errors:0 dropped:0 overruns:0 frame:0
TX packets:3526 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:440923 (440.9 KB) TX bytes:440923 (440.9 KB)
wlan0 Link encap:Ethernet HWaddr 60:d8:19:c7:42:9b
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:238663 errors:0 dropped:0 overruns:0 frame:0
TX packets:168877 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:310906582 (310.9 MB) TX bytes:20035785 (20.0 MB)
tommy@tommy-ThinkPad-T420:~$ sudo ifconfig wlan0 down
[sudo] password for tommy:
tommy@tommy-ThinkPad-T420:~$ sudo ifconfig wlan0 up
Now, I had no problem! Thanks to my roommate YuRen!
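A tiny shell sketch that wraps the same down/up cycle so it can be run as one command after resuming (the interface name wlan0 comes from the ifconfig output above; adjust it for other machines):

#!/bin/bash
# bounce the wireless interface after a suspend/resume
IFACE=${1:-wlan0}           # default to wlan0; pass another interface name as the first argument
sudo ifconfig "$IFACE" down
sleep 2                     # give the driver a moment before bringing it back up
sudo ifconfig "$IFACE" up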
Wednesday, July 17, 2013
Computational Competence for Biologists
After reading this post http://software-carpentry.org/blog/2013/07/computational-competence-for-biologists.html, I realized that I am far from being competent.
Currently, I know some Python, R (Bioconductor), Linux commands, sed, awk, and Galaxy, and I feel comfortable using command-line software like bedtools, samtools, bowtie, etc. I still need to enhance my database skills, though. More importantly, I need some statistics knowledge.
I signed up for Mathematical Biostatistics Boot Camp 1.
A demonstration of the need to have good statistics knowledge:
Simon Anders from EMBL has very good slides about the statistics behind RNA-seq analysis.
See the post here: http://www.rna-seqblog.com/presentations/comparative-analysis-of-rna-seq-data-with-deseq-and-dexseq/
I quite agree that bioinformatics is largely about visualizing data and finding patterns. So I need to master the skills to clean, transform, and visualize data. R has very good packages for generating figures, and matplotlib from Python is also worth learning. Sometimes I get confused between R and the Python pandas package: manipulating matrices and data frames in these two languages is similar but still different, and I have to switch my thinking mode back and forth.
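As a trivial illustration of the kind of difference I mean (a sketch with a made-up data frame; df and the column names are hypothetical):

import pandas as pd

# a made-up data frame, only to illustrate the syntax difference
df = pd.DataFrame({"gene": ["a", "b", "c"], "score": [5, 12, 20]})

# pandas: rows where score > 10, keeping the gene column
hits = df[df["score"] > 10]["gene"]
print(hits)

# the rough R equivalent would be: df[df$score > 10, "gene"]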
Anyway, I still have a long way to go. I know I cannot learn everything, but I can learn anything I need at the moment.
Tuesday, July 16, 2013
run ipython in emacs: after matplotlib plot.show(), the ipython buffer remains hanging
I was learning how to plot in Python using the matplotlib library.
When I started ipython in the terminal with:
tommy@tommy-ThinkPad-T420:~$ ipython --pylab
In [1]: plot(range(4))
Out[1]: [<matplotlib.lines.Line2D at 0x37b4b50>]
the figure showed up automatically, and the prompt stayed active in the shell.
However, when I ran it through emacs, the figure did not show up until I typed plot.show(), and the ipython buffer stayed inactive until I closed the figure. Typing plot.show() again showed nothing.
A quick Google search turned up:
http://stackoverflow.com/questions/9753885/pylab-matplotlib-show-waits-until-window-closes
" Add
when I started in the terminal with
tommy@tommy-ThinkPad-T420:~$ ipython --pylab
In [1]: plot(range(4))
Out[1]: [<matplotlib.lines.Line2D at 0x37b4b50>]
However, when I run it through emacs, the figure did not show up until I typed plot.show( ),
and the ipython buffer is inactive unless you close the figure. If you type plot.show( ) again,
nothing will show up.
A quick google:
http://stackoverflow.com/questions/9753885/pylab-matplotlib-show-waits-until-window-closes
" Add
pylab.ion()
(interactive mode) before the pylab.show()
call. That will make the UI run in a separate thread and the call to show
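A minimal sketch of that fix, using matplotlib.pyplot (equivalent to the pylab calls quoted above):

import matplotlib.pyplot as plt

plt.ion()            # interactive mode: show() no longer blocks
plt.plot(range(4))   # the same toy plot as above
plt.show()           # returns immediately, so the ipython buffer stays usable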
Friday, July 12, 2013
yes, you need to validate the genomic discoveries
I just came back from San Diego, CA. It is a beautiful place, and I enjoyed staying there very much.
Well, as the title states, we need to experimentally validate genomic discoveries: more and more NGS studies focus only on bioinformatic analysis of large data sets but do not have access to facilities to validate the claimed conclusions.
I am switching from a bench worker to a desktop worker, but I still enjoy doing experiments, and the techniques I've learned in the wet lab will serve the purpose of validating the hypotheses generated by large-scale data mining. I think that's my advantage: a good understanding of biology and of computing are equally important.
See a post here:
http://massgenomics.org/2013/07/functional-validation-genomic-discoveries.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Massgenomics+%28MassGenomics%29
Thursday, July 4, 2013
a python script for extracting fastq sequences
A simple Python script to handle a fastq file.
A fast fastq parser: extract ~10,000 reads, with their sequences and qualities, from a fastq file (~70 million reads) based on the names of the reads. Heng Li's seqtk (https://github.com/lh3/seqtk) has a utility for this kind of task.
The following code is from http://www.biostars.org/p/10353/
The script demonstrates the use of a set (hash-based), a data structure with much faster membership lookup than a list: when you have two files, put the information from one file into a container, then loop over the other file.
usage: cat file.fastq | python parse_fastq.py id_file.txt > selected.fastq
#! /usr/bin/env python
import sys

# get the id file name from the first command-line argument
idfile = sys.argv[1]

# load the read names into a set (fast membership tests)
ids = set(x.strip() for x in open(idfile))

# read the fastq file from stdin
handle = sys.stdin
while ids:
    # parse one fastq record (4 lines)
    idline = handle.readline()
    if not idline:  # end of file reached before all ids were found
        break
    seq = handle.readline()
    spacer = handle.readline()
    quals = handle.readline()
    id_name = idline.rstrip("\n")  # drop the trailing newline
    if id_name in ids:
        # print the whole fastq record
        sys.stdout.write('%s%s%s%s' % (idline, seq, spacer, quals))
        ids.remove(id_name)
see link here https://github.com/lh3/seqtk
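For comparison, seqtk can do this kind of extraction with its subseq command (file names here are just placeholders):

seqtk subseq reads.fq read_names.lst > selected.fq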
Monday, July 1, 2013
book recommendation: Sed & Awk 101 Hacks
I highly recommend this book for text manipulation
http://www.thegeekstuff.com/sed-awk-101-hacks-ebook/
Through my limited experience of "bioinformatics", I have realized that we spend a lot of time reformatting data (mostly in text format) into other formats that can be fed into other programs.
Python has a very good re (regular expression) module for that purpose, but I do like the one-liners of sed and awk. Sometimes I am just too lazy to write something like:
ifile = open(filename)   # filename is a placeholder
for line in ifile:
    # do some parsing on the line
    pass
ifile.close()
Because sed and awk read a file line by line, executing the command on one line and then processing the next, they are very concise: there is no need to write the explicit for loop.
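For instance, the kind of parsing sketched in the Python loop above can be a one-liner (input.txt is a made-up tab-delimited file; this prints only the first column):

awk -F'\t' '{ print $1 }' input.txt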
This book is comprehensive and has a lot of practical examples that are easy to follow. I just finished Chapter 5 and found it really helpful, even though I already had some experience with sed and awk.
The NGS WikiBook: a dynamic collaborative online training effort with long-term sustainability
The NGS wiki book http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)
http://intl-bib.oxfordjournals.org/content/early/2013/06/21/bib.bbt045.full
for more details, refer to the link above.
Nine simple rules to begin with NGS analysis
RULE 1: DO NOT FEAR THE COMMAND LINE
RULE 2: KNOW THE CONVENTIONS
RULE 3: READ INTRODUCTORY REVIEWS
RULE 4: START WITH QUALITY CHECKING
RULE 5: PLAN FOR MISTAKES AND DOCUMENT WORKFLOW
RULE 6: ALWAYS GET INFORMED AND GET HELP IF STUCK
RULE 7: USE AN EFFICIENT INTEGRATIVE APPROACH
RULE 8: AVOID REINVENTING THE WHEEL ( I mentioned this several times in my blog :) )
RULE 9: EDUCATION IS IMPORTANT