Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, October 31, 2013

python code for getting the reverse complement DNA strand

Following pythonforbiologist, I wrote a very simple python script to get the reverse complement DNA strand:

It is a very simple script but several things I want to mention:
1. what is the fast string concatenation method in python?
in the first function, we tend to write something like:

out_seq = ' '
for base in reversed(seq):
    out_seq += seq_dict[base]

but it turned out to be very slow. Read this post for more details
the list comprehension is the fastest and most elegant way to concatenate a string.

2. for more complicated IUPAC ambiguity codes, read this post:
Nucleotide Code:  Base:
----------------  -----
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or

Wednesday, October 30, 2013

My first play with GRO-seq data, from sam to bedgraph for visualization

GRO-seq (Global run on sequencing) is a technique to capture nascent RNAs in a global manner.  Different from RNA-seq which sequences the matured mRNA to infer gene expression level, GRO-seq sequences the newly generated short RNAs to infer RNApol2 binding or the production of enhancer RNAs.

It was first carried out by John T.Lis at Cornell University. See their science paper in 2008:
Nascent RNA Sequencing Reveals Widespread Pausing and Divergent Initiation at Human Promoters

Since then, many labs use this technique to study their own subjects.
Michael G. Rosenfeld from UCSD just published a nature paper in June, 

Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation

I want to have a look at the data set since I work with MCF7 breast cancer cells a lot.
although they have a bigwig file there which one can direct upload to UCSC genome browser or local IGV to visualize, the data were mapped to hg18.

I want to have a hg19 version. So, I went to download the raw sequence sra file

then I used the sra toolkit to dump it to fastq file and mapped the data with bowtie 
in UF HPC terminal:

module load sra
fastq-dump SRR817000.sra

the resulting fastq file is around 9Gb.
PBS file for mapping:

[mtang@dev1 MCF7_GRO_seq]$ cat bowtie.pbs 
#PBS -N bowtie
#PBS -m abe
#PBS -o bowtie.test.out
#PBS -e bowtie.test.err
#PBS -l nodes=1:ppn=8
#PBS -l pmem=3000mb
#PBS -l walltime=10:30:00


# Load the module for bowtie
module load bowtie

# Run bowtie
bowtie -p 8 --best  /project/bio/bowtie/hg19  -q SRR817000.fastq  -S  SRR817000.sam

Essentially, the sequences are mapped back to genomic DNA as you see.
It took me around 30 mins to finish the mapping which is pretty fast compared with RNA-seq data. 

using Tophat, same configuration as the pbs file above, it takes me 50hrs to finish mapping. RNA-seq data are bigger (~25Gb fastq) and more complicated when mapping (figure out exon-exon junctions).

Now, I have the sam file after bowtie mapping. 
Next, I want to generate a bedGraph file for visualization.

There are several ways to do it:

I strongly recommend you read the tutorials to understand the various sequencing techniques. Although the formatting is not that great, it is still pretty informative.

first make Tag directory 
makeTagDirectory MCF7_gro_seq_ctrl/  SRR817000.sam

then make the bedgraph file for plus and minus strand respectively:

makeUCSCfile MCF7_gro_seq_ctrl/  -strand + -o minus.bedGraph
makeUCSCfile MCF7_gro_seq_ctrl/ -strand - -o plus.bedGraph

2. use bedtools genomecov

bedtools genomecov -ibam SRR817000.sorted.bam -bg -strand + > bedtools_plus.bedGraph 
bedtools genomecov -ibam SRR817000.sorted.bam -bg -strand - > bedtools_minus.bedGraph 

the bam file needs to be sorted first with samtools

it takes much longer for bedtools compared with Homer (Homer generates the tagDirectory for later quick access). gzip the bedGraph file to .gz

all the files I have:
[mtang@dev1 MCF7_GRO_seq]$ ls -sh .
total 19G
 87M bedtools_plus.bedGraph.gz  4.0K fetchChromSizes     16M plus.bedGraph.gz  1.5G SRR817000.sorted.bam
4.0K bowtie.pbs                 4.0K hg19.chrom.sizes   1.9G SRR817000.bam     1.2G SRR817000.sra
3.1M bowtie.test.err            4.0K MCF7_gro_seq_ctrl  8.6G SRR817000.fastq
   0 bowtie.test.out             14M minus.bedGraph.gz  5.6G SRR817000.sam

I then imported the bedgraph files to IGV and had a look at my favourite gene VEGFA.
I published my first paper on it:) 

Restraint of angiogenesis by zinc finger transcription factor CTCF-dependent chromatin insulation

The first track is the minus-strand bedgraph got from Homer
the second track is the plus-strand bedgraph got from Homer
the third track is the plus-strand bedgraph got from bedtools genomecov

I see clearly short RNAs spiking  at the promoter of VEGFa, suggesting that RNApol2 paused at promoter and does not undergo productive elongation. This gene is actively transcribed with full-length mRNA, but the promoter is still paused. ( read the 2008 science GRO-seq paper for more details of different groups of genes)

one thing I do noticed is that the bedgraph file from bedtools is bigger than the one from Homer 87M vs 16M, and if I zoom in I saw some small differences:

It looks like the bedgraph from bedtools exhibits a more detailed view of the data, but overall either bedgraph file would be fine for a quick inspection of interested genes.( I guess Homer uses some bigger bin for calculating or merge bins with small difference together to save space).

Finally, if you want to make a bigwig file for UCSC genome browser, you need to use bedGraphToBigWig  from here

Thursday, October 24, 2013

useful unix one liners for bioinformatics

I have been using ubuntu 12.04LTS for one year, and I really love it and do not look back at windows at all.  Currently, I have dual booting system for linux and windows, but now I barely fire the windows.

The power of Linux comes from, at least in one aspect, its command line. As I have stated before, in  the field of bioinformatics, we spend a lot of time in reformatting, cleaning, extracting the data. Linux commands become very handy in these routine jobs.
several good resources:
1. From the author of BWA samtools Heng Li
2. Stephen Turner from UVA
4. UT at Austin's+list+of+linux+one-liners very good resource for NGS analysis.

Most commonly used ones for me are:
head, tail, wc, tr, sort, uniq, cat, cut, paste, join, grep(or the more powerful ack), find, xargs, comm, diff, awk, sed etc

Everyday, I learn new stuffs.
yesterday, from twitter, I learned this one:  
Compare 2 arbitrary columns of different files: paste <(cut -f2 file1.txt) <(cut -f7 file2.txt) | awk '{if ($1 != $2) { print "do stuff"} }'

some others:
let's say you have a fasta file contain multiple sequences, and you want to split it to many files with one record per file.

tommy@tommy-ThinkPad-T420:~$ cat contig 

tommy@tommy-ThinkPad-T420:~$ cat contig | awk  '/^>/{close("out"n);n++}{print > "out"n}'
tommy@tommy-ThinkPad-T420:~$ ls out*
out1  out2  out3
tommy@tommy-ThinkPad-T420:~$ cat out1
tommy@tommy-ThinkPad-T420:~$ cat out2
tommy@tommy-ThinkPad-T420:~$ cat out3

print hidden characters ( cat -v is not what I want)
tommy@tommy-ThinkPad-T420:~$ sed -n l contig

it can also print out the tabs as \t

I have a bed file that contains the Transcription start sites information, some genes have multiple records because they have isoforms, but I only want to use the unique TSSs for my downstream analysis. of course, you can do cut -f1-3 | sort | uniq, but I just show you another way to do it by awk

tommy@tommy-ThinkPad-T420:~$ head tss_-3kb_+3kb_hg19.txt 
chr1 66996824 67002824 NM_032291 +
chr1 50486626 50492626 NM_032785 -
chr1 33543713 33549713 NM_052998 +
chr1 8381389 8387389 NM_001080397 +
chr1 25068759 25074759 NM_013943 +
chr1 16764166 16770166 NM_018090 +
chr1 16764166 16770166 NM_001145278 +
chr1 16764166 16770166 NM_001145277 +
chr1 92368559 92374559 NM_001195684 -
chr1 92348836 92354836 NM_001195683 -

you can see a record with identical first three columns, but with three different transcript ids. To get only the first occurrence of the TSS

tommy@tommy-ThinkPad-T420:~$ head tss_-3kb_+3kb_hg19.txt | awk '!a[$1,$2,$3]++'
chr1 66996824 67002824 NM_032291 +
chr1 50486626 50492626 NM_032785 -
chr1 33543713 33549713 NM_052998 +
chr1 8381389 8387389 NM_001080397 +
chr1 25068759 25074759 NM_013943 +
chr1 16764166 16770166 NM_018090 +
chr1 92368559 92374559 NM_001195684 -
chr1 92348836 92354836 NM_001195683 -

That just gives you a flavour of how powerful command lines are:)

Friday, October 18, 2013

group the same ID to the same line by python, use awk to do SQL like groupby and sum

I was on Seqanswer forum and found someone asked a question:
"Hi guys,
i hv got a GO file for my differentially expressed genes file, it goes like:

FBgn00001 GO:0016301 [Name:****(annotation)]
FBgn00002 GO:0016301 [Name:****(annotation)]
FBgn00003 GO:0016301 [Name:****(annotation)]
FBgn00004 GO:0003700 [Name:****(annotation)]
FBgn00004 GO:0009651 [Name:****(annotation)]
FBgn00004 GO:0006355 [Name:****(annotation)]
FBgn00005 GO:0009556 [Name:****(annotation)]
FBgn00005 GO:0005515 [Name:****(annotation)]
FBgn00005 GO:0080019 [Name:****(annotation)]
FBgn00005 GO:0016563 [Name:****(annotation)]
FBgn00005 GO:0016627 [Name:****(annotation)]
FBgn00006 GO:0003700 [Name:****(annotation)]
FBgn00006 GO:0010018 [Name:****(annotation)]

now i want to use WEGO ,so i need to convert it like:

FBgn00001 GO:0016301
FBgn00002 GO:0016301
FBgn00003 GO:0016301
FBgn00004 GO:0003700 GO:0009651 GO:0006355
FBgn00005 GO:0009556 GO:0005515 GO:0080019 GO:0016563 GO:0016627
FBgn00006 GO:0003700 GO:0010018

I think this could be solved using a perl script. I am not able to do this since i am a beginner. Can someone help me out? A simple perl script is good enough for me^^ "

The question is on very basic text manipulation, and I quickly wrote a python script for that:

import csv
reader = csv.reader(open("GO.txt","r"), delimiter="\t")
for row in reader:
    if row[0] not in new.keys():
        new[row[0]] = [row[1]]

with open("wego.txt","w") as f:
    for key, value in sorted(new.items()):

update 10/19/13, I used the gist in github to display the code

basically, I store the ID as the key, and a list of GO terms as the value, and then write the dictionary
to a txt file.
I was thinking if there are any easier ways to do it, I mean by awk or sed.
A quick google search

the question is a little bit different from the one above:
"I have a large file containing data like this:
a 23
b 8
a 22
b 1
I want to be able to get this:
a 45
b 9

awk array is very handy for that purpose see a post here
let's first creat the file:
tommy@tommy-ThinkPad-T420:~$ cat > foo.txt
a 23
b 8
a 22
b 1
#control+D to indicate cat this is the end of the file.

tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{new[$1]+=$2}END{for (key in new) print key, new[key]}'
a 45
b 9

#or you can redirect the result to a new file

tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{new[$1]+=$2}END{for (key in new) print key, new[key]}' > new.txt

I do not know whether the array value can hold multiple strings like a list in python.
updated on 10/21/13,  an awk one liner can solve this problem

tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{ if (new[$1]) new[$1]=new[$1]"\t"$2; else new[$1]=$2;} END { for (i in new) print i, new[i]}' OFS="\t"
a 23 22
b 8 1

Amazing awk!

Friday, October 11, 2013

A beginner’s guide to bioinformatics Fwd Homologus

See the posts here
and here

5 layers of expertise in bioinformatics were mentioned:
"Layer 1 – Using web to analyze biological data
Layer 2 – Ability to install and run new programs
Layer 3 – Writing own scripts for analysis in PERL, python or R
Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java"
level 1-3 are required for all the biologists, while level 4-5 are left to experts. I am trying to accomplish level 3, and that will be good enough for my research:)

Tuesday, October 8, 2013

Does every new biology PhD student need to learn how to program?

The answer is Yes!
see the paper from Nature Biotech


Box 1: Does every new biology PhD student need to learn how to program?

Durbin: That's a little like asking is it good for a molecular biologist to know chemistry. I would say that computation is now as important to biology as chemistry is. Both are useful background knowledge. Data manipulation and use of information are part of the technology of biology research now. Knowing how to program also gives people some idea about what's going on inside data analysis. It helps them appreciate what they can and can't expect from data analysis software.
Trapnell: It's probably not just that experimental biologists need to program, but it's also enormously helpful when computational folks learn how to do experiments. For me, for example, coming from a computer science background, the opposite way of thinking was hard to learn. How do I learn to argue with wet-lab data? How do I learn what to trust, what to distrust, how to cross-validate things? That's a radically different way of thinking when you're used to proofs and writing code and validating it on a computer.
Krzywinski: To some, the answer might be “no” because that's left to the experts, to the people downstairs who sit in front of a computer. But a similar question would be: does every graduate student in biology need to learn grammar? Clearly, yes. Do they all need to learn to speak? Clearly, yes. We just don't leave it to the literature experts. That's because we need to communicate. Do students need to tie their shoes? Yes. It has now come to the point where using a computer is as essential as brushing your teeth. If you want some kind of a competitive edge, you're going to want to make as much use of that computer as you can. The complexity of the task at hand will mean that canned solutions don't exist. It means that if you're using a canned solution, you're not at the edge of research.
Robinson: Yes. Even if they don't program in their research, they will have to use software and likely will communicate with software developers. It helps tremendously to have some basic knowledge. Additionally, in the research environment, the ability to do basic tasks in the Linux/Unix environment is essential.
Rasband: All scientists should learn how to program."

Monday, October 7, 2013

choose random lines from a file

I have a bed file which contains promoter regions. I want to get 1000 random promoters from it. First I think I would write a python script using the random module, but before I do it, I did a quick google search since it is such an ordinary job, there may be tools out there already.

I like the sort  -R solution
Sort the file randomly and pick first 100 lines:
$ sort -R input | head -n 100 >output

but it would be very slow if the file is very huge. The promoter file only contains 50k lines, so it is not a very big deal.

It turns out that shuf is a built-in command for this kind of task.
Use shuf as shown below, to get N random lines:
shuf -n N input > output

of course, there are other solutions using awk and sed, but why not use this simple one:)
Linux is awesome!

By the way, the random command from the bedtools  generates random bed Intervals from a genome. If that's what you want, use it.

Thursday, October 3, 2013

compare ChIP-seq data for different conditions and different transcription factors?

I have done a lot of ChIP-seq data analysis, but most of them are just mapping to reference genome and then calling peaks (use MACS from Shirley liu's lab at Harvard). A little bit more is to intersect different peak bed files with bedtools and to make heatmaps.

The reads mapping is much simpler and faster than RNA-seq data. The data are smaller (~5Gb compare to RNA-seq ~25Gb). I recently asked myself how to compare different ChIP-seq data for different conditions ( control vs knockdown, different developmental stages etc).

I quick google search:

The available packages are:
"- ChIPDiff, which you already mentioned but which I haven't tried, but it was actually developed for histone marks so I'd be surprised if it didn't work for those?! I recall someone saying that there was some other issue with it (sorry I can't be more specific)

- DiffBind, which you also mentioned (; it uses DESeq internally

- DBChIP (, which appears to use edgeR

- You can also use edgeR and DESeq directly. The DESeq paper shows you how to re-analyze differential TF binding data in the Kasowski et al Science paper ( "

After reading a little bit, I decided to give diffReps a try.  The author says it is suitable for both histone modifications and transcription factors.
The paper was published in PLOSONE, it compares the results  with  that from DESeq and others.
The author Shen Li developed the other package ngsplot.

Another interesting R package is here
mentioned by me before
I will have to give it a try also.

At the same time, I am reading the Nature protocol:
 2011 Dec 15;7(1):45-61. doi: 10.1038/nprot.2011.420.

A computational pipeline for comparative ChIP-seq analyses.

it gives you a much better understanding of how to make the comparisons.

update 11/19/13

A table from the newly published paper

Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data

Table S3.  Software packages for the analysis of differential binding in ChIP-seq.
Software tool
ChIPDiff [36]
Differential histone modification sites using a hidden Markov model
Comparative ChIP-seq [25]
Fold change ratio between normalized peak heights
DBChIP [33]
Assigns uncertainty measures in a test of non-differential binding (uses edgeR)
DESeq§ [31]
Test based on a model using the negative binomial distribution
Differential binding affinity analysis (uses edgeR and DESeq)
DIME [35]
Differential identification using mixtures ensemble
edgeR§ [32]
Empirical Bayes estimation and exact tests based on the negative binomial distribution
MACS [17] (version 2)
Differential peak detection based on paired four bedGraph files
MAnorm [34]
Robust regression to derive a linear model
Differences in shape using Kernel methods
Shape-based analysis of variation using functional PCA
Non-linear normalization on RNA Pol II profiling

§ Originally developed for gene expression count data.

updated on 02/06/2014

Two more useful links:

Tuesday, October 1, 2013

RNA-seq analysis samtools sort and htseq-count

I am following the RNA-seq nature protocol from Simon Anders. See my previous post here:

However, I am using some human data from this paper:
Epigenomic plasticity enables human pancreatic α to β cell reprogramming

I downloaded the SRA files, converted to fastq files ( 3 files for α cells 3 for β cells, each is around 24Gb)
I then mapped them with tophat
it took around 50 hrs for each file
using 8 threads and 3Gb memory for each thread on the HPC cluster at University of Florida.

After tophat mapping, I got six bam files. each file is around 5Gb. Following the nature protocol, I need to sort them by name with samtools sort and convert to sam file for htseq-count

The problem I encountered is that the bam file is too big that the merging of the sorted small bam files never occur.

module load samtools
samtools sort -n my.bam my_sn

I googled and found
Taking advantage of the multi-thread property of samtools:

samtools sort -n -@ 8 -m 3G my.bam my_sn

it worked and was much faster.

There are many ways to get the counts for each gene. 
popular ones:

there are some issues for the bedtools multicov when feeding the bed files converted from gtf file.

" The fact that in the above example and many others throughout the genome that there are overlapping exons. If we don't collapse them, then reads assigned to those overlapping regions could be counted multiple times and provide an inaccurate profile of expression.

The solution is to merge overlapping exons right? Well yes, but there is another major caveat in that exons from different genes can sometimes have overlapping exons and we can't collapse exons from different genes, otherwise we wouldn't be able to distinguish them. The solution is to merge exons which only belong to the same gene. If there is a read which aligns to a region with more than one annotation, then it is counted twice because we can't differentiate between the two genes at that position."

I am going to stick to the nature protocol using htseq-count (Simon Anders wrote this python package HTSeq), but please keep that in mind. Different sources of gtf files are a little bit different. 

first several lines of the gtf file from UCSC:
chr6    refGene exon    26924772        26924962        .       +       .       gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "1"; exon_id "NR_026775.1"; gene_name "LINC00240";
chr6    refGene exon    26936237        26936301        .       +       .       gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "2"; exon_id "NR_026775.2"; gene_name "LINC00240";

first several lines of the gtf file from ENSEMBL:

HG104_HG975_PATCH       retained_intron exon    144766622       144766712       .       +       .        gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "1"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00002614945";
HG104_HG975_PATCH       retained_intron exon    144772227       144772293       .       +       .        gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "2"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00003062908";

first several lines of the gtf file from GENCODE:
chr1    HAVANA  gene    11869   14412   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

when using htseq, you need to specify the -i flag to gene_name if you prefer the gene names, otherwise, it will give you a count table with the ENSEMBL gene_id  gene_id like "ENSG00000223972.4". Remember htseq is designed to count for genes

-i <id attribute>--idattr=<id attribute>
GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.
on the HPC cluster:
module load htseq
htseq-count -s no -a 10 -t exon -i gene_name my_sn.sam gencode.v18.annotation.gtf > my.count

the sam file is around 25Gb, it took around 1.5 hr to finish the counting for each file.

Finally, I got the count table for all of them. 
first several lines of the count file
0R7H2P  20
5S_rRNA 54
7SK     170
A1BG    14
A1BG-AS1        64
A1CF    292
A2M     252
A2M-AS1 14
A2ML1   352
A2ML1-AS1       14
A2ML1-AS2       4
A2MP1   91

Now, I am ready to use DESeq R package for the downstream analysis.  I will post later.