Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Saturday, June 29, 2013

VCF file annotation and manipulation tools

VCF (variant calling format) file, as specified in 1000 Genome
is a common file to handle the variant information from the NGS sequencing data.

many tools have been developed to manipulate, extract information with this format.
some popular ones are:
tabix for fast indexing and extracting certain regions of the whole genome.
VCFtools for many more manipulations including annotating, merging, concatenating, comparing etc.
bedtools for genomic region overlapping calculation. etc
I have some experience with the above three tools.

by the way, I have not used the GATK tools the broad Institute yet, I am sure I will have a try sometime later.

after google, I found several more:

Taser is the one I just came across
It is based on R and can extract variant info by gene names fairly easily

Again, there are so many tools out there. Depending on your needs, choose the right one for you.

Wednesday, June 26, 2013

bubble sort and merge sort

I promised to write about bubble sort and merge sort in my previous blog

Finally, I got some time to write it. My undergraduate helped me to do some molecular cloning stuff today, I guess I will be too lazy to do bench work :)

The examples I am going to use are from the MIT open course lecture 9 and 10

The question is that you have a list of numbers, and how do you sort it from small to big.

I will lay down the python code first.

A trick....I copied (M-w) the code from my emacs editor and tried to paste (ctrl-v)  in the blog, but it did not work. In fact, click the  middle mouse button did the trick :)

# a demonstration of buble sort and merge sort, p is a list of integers
# for bubble sort  swap the values if p[i+1] < p[i]

def bubblesort(p):
    for j in range(len(p)):
        for i in range(len(p)-1):
           if p[i] > p[i+1]:
                 temp = p[i]
                 p[i] = p[i+1]
                 p[i+1] = temp
    return p

# for bubble sort, we can do better, when there is no swap, we stop the for loop

def bubblesort2(p):
      swapped = True
      while swapped:
            swapped = False
            for i in range(len(p)-1):
                  if p[i] > p[i+1]:
                        temp = p[i]
                        p[i] = p[i+1]
                        p[i+1] = temp
                        swapped = True
      return p

# divide and conqure method merge-sort

# divide the list in half, continue until we have single list, merge sub-lists

def mergesort(p):
      # print p
      if len(p) < 2 :
            return p[:]
            middle = len(p) / 2
            left = mergesort(p[:middle])   # this is the beauty of recursion, further break down the list to smaller lists
            right =mergesort(p[middle:])
            together = merge(left, right)
            #print "merged", together
            return together

      # now we need to define the merge function, this function merges two sorted lists of numbers.

def merge(left, right):
      result = []
      i,j = 0,0
      while i < len(left) and j < len(right):
                    if left[i] <= right[j]:
                        i = i+1
                        j = j+1
      while i< len(left):
            i = i + 1
      while j < len(right):
            j = j + 1
      return result


The code above is pretty self-explanatory.
for bubblesort, we have a nested loop, the inner loop put the biggest number to the end of the list every time when execute the outer loop.

bubblesort2 uses a variable "swapped" to track  if there is any swap occurs. if no swap occurs, the list is already pre-sorted, quit the loop.  return the result.

mergesort splits the list into smaller sub-list until only single number list. Then, we merge back the sorted list into a big list.

Let's have a look at the performance of these three functions by Ipython's magic function %time

In [15]: time a=bubblesort(range(10000))
CPU times: user 7.86 s, sys: 0.00 s, total: 7.86 s
Wall time: 7.88 s

In [16]: time a=bubblesort2(range(10000))
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

In [17]: time a=mergesort(range(10000))
CPU times: user 0.05 s, sys: 0.01 s, total: 0.06 s
Wall time: 0.05 s

in this particular test, I generated a list of range(10000): [0,1,2,3,4,5......999]
you will see bubblesort took the longest time, because it is not aware that the list is already pre-sorted.
bubblesort2 caught that and return the result instantly.
performance of mergesort is also decent. If we had a not pre-sorted list, the mergesort would out-perform the bubblesort2.

Finally, I want to show off my Emacs python editor :)

Tuesday, June 25, 2013

use sed to print out every n lines, split pair-end FASTQ file

tommy@tommy-ThinkPad-T420:~$ cat test.txt 

#print out the first two lines of every 4 lines. -n flag suppress all of the other lines and only print the line 
you specified. -e option tells sed to accept multiple p (print) command.

tommy@tommy-ThinkPad-T420:~$ sed -ne '1~4p;2~4p' test.txt 

This trick would be useful if you have a pair-end FASTq file and want to split it into two files.

see here:

and here two reads in one fastq

from SRA file:

How to extract paired-end reads from SRA files

SRA(NCBI) stores all the sequencing run as single "sra" or "lite.sra" file. You may want separate files if you want to use the data from paired-end sequencing. When I run SRA toolkit's "fastq-dump" utility on paired-end sequencing SRA files, sometimes I get only one files where all the mate-pairs are stored in one file rather than two or three files.
The solution for the problem is to always run fastq-dump with "--split-3" option. If the experiment is single-end sequencing, only one fastq file will be generated. If it is paired-end sequencing, there may be two or three fastq files.
Two files (with suffix "_1" and "_2") are matched mate-pair read file where as the third one (without any suffix) contains all the reads that do not have any mate-paires (or SRA couldn't resolve mate-paires for them).

Hope my experiences with NCBI SRA data handling help the readership.

Saturday, June 22, 2013

very good introduction to Emacs

see the post here:

Absolute Beginner's Guide to Emacs

I’ve been using Emacs () as my primary text editor for several years now. It takes some getting used to — the keyboard shortcuts are completely different from what you’re probably familiar with, e.g. Ctrl-C for copy and Ctrl-V for paste. Despite the somewhat steep initial learning curve, however, I find that Emacs is invaluable for rapid coding and for flexibly editing all different types of files in the same environment.
I remember how overwhelming it was to figure out how to do anything when I first got started, so in this tutorial I’m going to aim to give you the basics to get started. This is by no means a complete guide to Emacs (though if you have suggestions for things to add, I’d be happy to do so), but hopefully should be enough to start comfortably using Emacs as a text editor.
This tutorial is mainly for people who have primarily used GUI text editors and coding environments and are not used to a primarily text-based program, running commands in the editor itself, and/or using large amounts of keyboard shortcuts.
For reference, here is the list of shortcuts I’ll be introducing in this tutorial:
  • C-h C-h : help
  • C-g : quit
  • C-x b : switch buffers
  • C-x right : right-cycle through buffers
  • C-x left : left-cycle through buffers
  • C-x k : kill buffer
  • C-x 0 : close the active window
  • C-x 1 : close all windows except the active window
  • C-x 2 : split the active window vertically into two horizontal windows
  • C-x 3 : split the active window horizontally into two vertical windows
  • C-x o : change active window to next window
  • C-x C-f : open file
  • C-x C-s : save file
  • C-x C-w : save file as
  • C-space : set region mark
  • C-w : kill region
  • C-k : kill region between point and end of current line
  • M-w : kill region without deleting
  • C-y : yank region from kill ring
  • M-y : move to previous item in the kill ring
  • M-Y : move to next item in the kill ring
  • C-_ : undo
  • C-s : search forwards
  • C-r : search backwards
  • M-% : query replace (‘space’ to replace, ‘n’ to skip, ‘!’ to replace all)
  • M-q : wrap text
  • C-left : move one word left
  • C-right : move one word right
  • C-up : move one paragraph up
  • C-down : move one paragraph down
  • home : move to the beginning of the line
  • end : move to the end of the line
  • page up : move up a page
  • page down : move down a page
  • M- : move to end of buffer

Opening Emacs

When you first open Emacs, you will see a window that looks something like this (click to view larger image):
There is the standard menubar up at the top (With “File”, “Edit”, etc.) and a toolbar right below it. You can use those to explore what Emacs has to offer and to perform operations that you don’t know the keyboard shortcut for, but ultimately, try not to rely on them. The way to use Emacs efficiently is to learn how to navigate it using keyboard shortcuts.
Let’s go over some basic terminology first with regards to the layout of Emacs:
Although in standard terminology the running instance of Emacs would be called a window, in Emacs terminology it is called a frame. Within Emacs itself, there is a window in which we see the welcome “GNU Emacs” buffer (more on windows and buffers in a bit).
The blinking black cursor (over the W in “Welcome”) is called the point. Not only is it like a cursor in your standard text editor (where the text is inserted when you type), but it is the location where you will sometimes need to run functions as well (e.g., “change the word that the point is currently in to be uppercase”). We’ll come back to this later.
The grey bar at the bottom of the screen is the status barand displays various information about the point and the active buffer (there is one status bar per window). The white space below that is called the mini-buffer and will occasionally display status messages (e.g., after saving a file), and is also the place where you enter Emacs commands.

Keyboard Shortcuts

There are two very important keys in Emacs. The first is the “meta” key. For me, this is the “Alt” key (but it could also be the Windows key, for example). You will frequently see the meta key abbreviated as just “M”, e.g. M-x means the “meta key x key” combination.
The second important key is the “Ctrl” key. Like the meta key, you will see combinations with the key abbreviated as just “C”, e.g. C-f means the “ctrl key f key” combination.
You may also sometimes see key combos like C-c | vs. C-c C-|. THESE ARE DIFFERENT KEY COMBOS. The first means “control key c key, release, then | key”. The second means “control key c, release, then control key | key”.
The three most important keyboard shortcuts to know are C-h C-h (help), C-g (quit), and M-x (run command). The help command will put you in a position to figure out how to do something if you’re stuck, and the quit command will cancel an operation (for example, if you are entering a command at the mini-buffer, C-g will quit the mini-buffer and move the point back to the buffer you were in previously — see the next section for more details on buffers and the mini-buffer). The run command will let you run any command in Emacs; you probably won’t need to use it much right away, but it’s good to know if you run into a scenario where you do need to run a command.

Buffers and Windows

Before we go into actually opening files, I’m going to go into a bit more detail about buffers and windows. First, you can have many buffers open at once. Usually they display the contents of a file, but they can also display output from programs or other information. By default, Emacs creates a single window and displays the GNU Emacs buffer in it. It also always opens up a *Messages* buffer to display information and error messages about Emacs itself.
There is also always a *scratch* buffer, which is what it sounds like — a place for notes or other text you don’t want to save.
You can’t see the other buffers until you tell Emacs to view them through a window. To do this, use the C-x b key combination. This will move the point to the mini-buffer and display a message that looks like “Switch to buffer (default *scratch*)”:
(Remember that if you want to cancel the current operation, i.e. you decide you don’t want to switch buffers after all, you can use C-g to quit the mini-buffer).
You will frequently want to know all the buffers that are currently open so you can choose the correct one to switch to. To do this, just press the tab key from the mini-buffer prompt:
This opens a new window beneath the original one and creates a buffer to display the list of completions (notice that the *Completions* buffer is included in the list of buffers!). Note that you don’t have to use the tab completion if you don’t want to, but it is handy.
Type a buffer name into the mini-buffer (for example, *scratch*) and hit enter. This will close the window displaying the *Completions* buffer and open the *scratch* buffer in the window that had previously displayed the GNU Emacs buffer.
You can also cycle through buffers sequentially with the key combos C-x right and C-x left.
Revisiting the concept of windows: they are essentially just views into a buffer. You can open up multiple windows in the same frame (I usually use two vertical windows) and you can have multiple windows displaying the same buffer:
Again, this is just a view into the buffer, so if I edit the buffer in the left window, the changes will be reflected in the right window, because they are both displaying the same buffer:
There are a few relevant key commands for manipulating windows:
  • C-x 0 : close the active window
  • C-x 1 : close all windows except the active window
  • C-x 2 : split the active window vertically into two horizontal windows
  • C-x 3 : split the active window horizontally into two vertical windows
  • C-x o : change active window to next window
Note that closing a window does NOT mean that the buffer it is displaying is closed.
Why the distinction between windows and buffers? It is useful to have buffers open in the background even if they are not currently displayed through a window because reading and displaying a buffer from memory is much, much faster than reading and displaying a buffer from disk. So if you are frequently switching between five different files but you only have two windows open, it is better to open all files once, load them into buffers (memory), and then switch between the buffers instead.

Opening, saving, and closing buffers

To open a file and load it into a buffer, use C-x C-f, which will open a prompt in the mini-buffer that says “Find file: ~/path/to/current/directory/”. Just as with switching buffers, you can press the tab key to list the files in the directory you have specified if you need to.
You can then type in the name (and/or change the path) of the file you want:
Press enter, and a new buffer will be created with the file you specified:
If you make changes to the buffer and you want to save it back to the file on disk, use C-x C-s:
If you want to save the buffer under a new file name (basically, “Save As” functionality), use C-x C-w, which will prompt you to specify the file name:
If the file already exists, it will double check to see whether you are actually intending to overwrite the existing file:
Once you are done with the buffer and want to actually close/kill it, use C-x k, which will prompt you in the mini-buffer for the name of the buffer to kill (similar to the prompt given when switching buffers). If you don’t specify a buffer, it will kill the active buffer by default.
Emacs will display one of the other open buffers in the window that had previously contained the buffer you just killed.

Manipulating text

You are probably familiar with the terms “cut”, “copy”, and “paste” when it comes to manipulating regions of text. Emacs has these functions as well, however they are typically referred to under different names. The killoperation is analogous to “cut”, and yank is analogous to “paste”. There is more going on behind the scenes than just copying, cutting, and pasting; we’ll come back to this in a bit.
First, before I actually tell you how to kill and yank text, you need to know about the region. Just as you might select a region of text with the mouse to copy, you can do so in Emacs. However, this is usually done using the cursor and — you guessed it — more keyboard shortcuts.
To select a region, move the point to one end of your desired region. Then hit C-space; you will see a message in the mini-buffer saying “Mark set”:
Or possibly “Mark deactivated”, if you had previously set the mark (if this is the case, just press C-space again to reactivate the mark):
Now move the point to the other end of the region. The text which is in the region will become highlighted.
Now you can do an operation on the selected region (for example, killing it). To kill a selected region, use C-w:
You can also implicitly define a region from the point to the end of the current line and kill it with C-k.
To perform a yank, use C-y:
You can undo what you just did by pressing C-_ several times (note how the mini-buffer displays an “Undo!” message):
If you want to kill a region without actually deleting the text, use M-w. That doesn’t seem to make a lot of sense — kill a region without deleting it? What’s the difference?
The difference is that in basic text editors, you can only copy or cut one piece of text at a time. In Emacs, there is what is called a kill ring which can hold multiple regions of text that you have killed. When you yank text, you are yanking it off the kill ring and back into the buffer. So if I kill a region without deleting it, I am copying the contents of the region into the kill ring but not actually deleting the region from the buffer.
Here’s a demonstration. If I kill (without deleting) two lines sequentially and then perform a yank, Emacs copies the most recent item in the kill ring:
And if I yank again on the next line, it is again the most recent item in the kill ring:
And if I then use M-y without moving the point first, Emacs will replace the yanked text with the next item from the kill ring:
To move the opposite direction in the kill ring, use M-Y.

Other useful commands


To search in a buffer (or region, if you have one selected), use C-s (to search forward) or C-r (to search backward). Type your search query into the mini-buffer and keep pressing C-s or C-r to cycle through the results. When you reach the end of the search results, Emacs will display a “Failing I-search” message the mini-buffer. If you use the search key combo again, you will wrap over to the beginning of the results and Emacs will display the “Overwrapped I-search” message in the mini-buffer.

Find and replace

To find and replace a search query, use M-%. Enter in the text you want to find:
Then press enter and enter the text you want to replace it with:
Emacs will highlight the text to be replaced (just like if you were searching for it). Press ‘space’ to replace it or ‘n’ to skip it and go to the next one. Press ‘!’ to replace all queries.

Wrapping text

If you type a lot of words into Emacs, you will notice that it does not automatically wrap the text. This can be very annoying both for moving the point around if you are typing full paragraphs, because Emacs will treat the paragraph as a single line, and for readability.
To get around this, I use M-q to wrap the paragraph of text that the point is currently in:

Moving the point

It can be slow to move the point only with the arrow keys; these are the commands that I use to speed up navigation (note that there are some other commands you can use for the same operations — these are just the ones I find most convenient).
To move the point up or down a whole paragraph (instead of a single line), use C-up or C-down. To move the point past a whole word (instead of a single character), use C-left or C-right. To move to the beginning of a line, use the home key; to move to the end of a line, use the endkey. To move up or down a page, you can use the page upand page down keys. To move to the beginning of the buffer, use M-.