Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, September 26, 2013

Three ways to change a space delimited file to a tab delimited file

Let's create a space delimited file:

tommy@tommy-ThinkPad-T420:~$ cat > new.txt
chr1 100 200
chr2 300 400
chr3 240 700

ctrl + D to indicate cat that this is the last line.

use awk:

tommy@tommy-ThinkPad-T420:~$ cat new.txt | awk -F ' ' '{print $1"\t"$2"\t"$3}'
chr1    100 200
chr2    300 400
chr3    240 700

use sed:

tommy@tommy-ThinkPad-T420:~$ cat new.txt | sed 's/ /\t/g'   
chr1    100  200
chr2    300 400
chr3    240 700

use tr :
tommy@tommy-ThinkPad-T420:~$ cat new.txt | tr ' ' '\t'   
chr1    100 200
chr2 300 400
chr3 240 700

tommy@tommy-ThinkPad-T420:~$ rm new.txt

you can redirect the newly generated tab delimited file to a new file
tommy@tommy-ThinkPad-T420:~$ cat new.txt | tr ' ' '\t' > new1.txt

Linux commands are so cool!

Tuesday, September 24, 2013

finding the closest feature given a set of Genomic coordinates

I have a bed file which contains peak information for a ChIP-seq data. I want to find the genes that are closest to these peaks.

I am aware that many softwares can do this job. cistrome-Galaxy pipeline from shirley liu's lab in Harvard,  bioconductor package ChIPpeakAnno
are tools can deal with this kind of task. In addition to them, bedtool is a very good tool for Genomic Interval manipulation.  See a post here

However, I want to go a little bit further to learn the Interval Tree data structure to deal with this kind of problem as mentioned in my last post and the post from Biostar.

I will have a look at the bx-python package.

Friday, September 20, 2013

Genomic Interval Tree, Interval manipulation

we do a lot of work on overlapping genomic features.
Bedtool is a great tool for this kind of task, but sometimes I want to implement it with python
for some custom usage. I found many useful links below. I have to digest them to fully understand how the  tree is constructed.

I am by no means familiar with those algorithms.  Have a peek how other people do it, and learn from them is a good start for me.


Tuesday, September 17, 2013

Amazing GNU sort

I just came across this on Biostar:

Question: How to sort bed format file
Is there any tools or commands to sort bed format file? The correct order is chr1, chr2, chr3, chr4,..., chr22, chrX, chrY.

GNU sort can now sort in alpha-numeric order. See option "-V, --version-sort" of the following version "sort (GNU coreutils) 8.17".
Here is an example:
$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC"
chr10   1       2       A
chr2    1       2       B
chr1    1       2       C
And the result is:
$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC" | sort -k1,1V
chr1    1       2       C
chr2    1       2       B
chr10   1       2       A

Tophat run time

It took me 45 hrs to map a 24G single-end RNA-seq data to human hg19 with the gtf file from GENECODE! 

I run tophat on a cluster with 1 node, 8 processors with 3gb ram for each processor (the max free resource as a regular user). I have six such will take me 12 days to finish the mapping....wish I had better access to the computing resource...

a quick google confirmed that it does take that long...
"It routinely takes ~40 hours for me (~45 million reads, with - p 4 to use four threads and 16G RAM). It will delete the huge temp files it generates, so the disk usage may not be that much of any issue."

Detect mutations from RNA-seq data

Usually we detect mutations by DNA-sequencing or whole exome sequencing, GATK is the commonly used pipeline for this purpose.

But RNA-seq data can also be used to call mutations in addition to detect the differentially expressed genes. However, the pipelines like GATK is not suitable for this kind of data.

A software


was developed by Wei Li in Harvard Dana-Farber cancer Insitute for this purpose
See the post here


Friday, September 13, 2013

The Backtick in the linux command

I was doing my first RNA-seq mapping by Tophat, I downloaded the SRA file from GEO and dump them to fastq files (you need to install the sra-toolkit from the NCBI SRA website). I had total six sra files, and I do not want to write something:

fastq-dump SRR0001 --split-3
fastq-dump SRR0002 --split-3 

for six times.

that's where the backtick come to rescue. (it is usually on the same key as the tilde symbol ~)

The backtick allows you to assign the output of a shell command to a variable. You must surround the entire command line command with backtick characters.

for the task above, you can do:

for f in `ls *.sra`
fastq-dump --split-3 $f

it is a little bit awkward to write something like that, in fact
you can do:

for f in *.sra
fastq-dump --split-3 $f

however, if it becomes very handy if you want to do something like this:

# Run tophat

for fq in *fastq
FQ=`basename $fq .fastq`
tophat -p 8 -G gencode.v18.annotation.gtf -o $FQ
/project/bio/bowtie2/hg19 $fq

in this example, I created a variable FQ to capture the name of the fastq file ( man basename if you do not know it), and I used this variable as the output directory name.

pretty neat!

Thursday, September 12, 2013

FPKMs and counts compared for RNA-seq data

A very nice article comparing  FPKMs and raw counts in RNA-seq data analysis
  1. counts are simply the number of reads overlapping a given feature such as a gene.
  2. FPKMs or Fragments Per Kilobase of exon per Million reads are much more complicated.  Fragment means fragment of DNA, so the two reads that comprise a paired-end read count as one.  Per kilobase of exon means the counts of fragments are then normalized by dividing by the total length of all exons in the gene (or transcript).  This bit of magic makes it possible to compare Gene A to Gene B even if they are of different lengths.  Per million reads means this value is then normalized against the library size.  This bit of magic makes it possible to compare Gene A in Sample 1 to Sample 2 even if Sample 1′s RNA-seq library has 60 million pairs of reads and Sample 2′s library has only 30 million pairs of reads. "

Wednesday, September 11, 2013

reproduce k-means clustering and heatmap

I asked this question in seqanswer and biostar

was plotting a ChIP-seq data using the pheatmap, see code below:

km<- kmeans(m1,2) # determine how many cluster you want, I specify 2 here

m.kmeans<- cbind(m1, km$cluster) # combine the cluster with the matrix

# [1] 903 602
# the last column is 602

o<- order(m.kmeans[,602]) # order the last column

m.kmeans<- m.kmeans[o,] # order the matrix according to the order of the last column

pheatmap( m.kmeans[,1:601], cluster_rows = F, cluster_cols = F, col= hmcols, breaks = bk, legend=FALSE, show_rownames=FALSE, show_colnames=FALSE)

It works fine for me, I clustered the data to two groups by specifying K=2, the problem is that group 1 sometimes shows up in the upper part of the heatmap, sometimes it shows up in the bottom part of the figure if I plot it several times.

I think it has to do with the assignment of the group number, say, the first group is assigned to 1, the other is assigned to 2. However, next time if you plot the same data, the first group assigned to 2, the other is assigned to 1. R randomly assigns the number to the groups.

How can I  control this?

The answer is to use the set.seed() function.

"Yes, calling set.seed(foo) immediately prior to running kmeans(....) will give the same random start and hence the same clustering each time. foo is a seed, like 42 or some other numeric value."

"The seed argument in set.seed is a single value, interpreted as an integer (as defined inhelp(set.seed()). The seed in set.seed produces random values which are unique to thatseed (and will be same irrespective of the computer you run and hence ensures reproducibility). So the random values generated by set.seed(1) and set.seed(123) will not be the same but the random values generated by R in your computer using set.seed(1) and by R in my computer using the same seed are the same."

"For a normal RNG of decent quality, the value doesn't matter. "42" is a reference to a famous book; other people use their birthday or "123" or just "1"."