This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, October 31, 2013

python code for getting the reverse complement DNA strand

Following pythonforbiologist http://pythonforbiologists.com/index.php/business-card-solutions/, I wrote a very simple python script to get the reverse complement DNA strand:

It is a very simple script but several things I want to mention:
1. what is the fast string concatenation method in python?
in the first function, we tend to write something like:

out_seq = ' '
for base in reversed(seq):
out_seq += seq_dict[base]

but it turned out to be very slow. Read this post for more details http://www.skymind.com/~ocrow/python_string/
the list comprehension is the fastest and most elegant way to concatenate a string.

2. for more complicated IUPAC ambiguity codes, read this post:
http://pythonforbiologists.com/index.php/how-to-count-non-dna-bases-in-a-sequence-using-python/

Nucleotide Code:  Base:
----------------  -----
A.................Adenine
C.................Cytosine
G.................Guanine
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or -............gap

Wednesday, October 30, 2013

My first play with GRO-seq data, from sam to bedgraph for visualization

GRO-seq (Global run on sequencing) is a technique to capture nascent RNAs in a global manner. Different from RNA-seq which sequences the matured mRNA to infer gene expression level, GRO-seq sequences the newly generated short RNAs to infer RNApol2 binding or the production of enhancer RNAs.

It was first carried out by John T.Lis at Cornell University. See their science paper in 2008:
Nascent RNA Sequencing Reveals Widespread Pausing and Divergent Initiation at Human Promoters
http://www.ncbi.nlm.nih.gov/pubmed/19056941

Since then, many labs use this technique to study their own subjects.
Michael G. Rosenfeld from UCSD just published a nature paper in June,

Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation

http://www.nature.com/nature/journal/v498/n7455/full/nature12210.html

I want to have a look at the data set since I work with MCF7 breast cancer cells a lot.

They put the data in GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE45822

although they have a bigwig file there which one can direct upload to UCSC genome browser http://www.biostars.org/p/42844/ or local IGV to visualize, the data were mapped to hg18.

I want to have a hg19 version. So, I went to download the raw sequence sra file

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR817/SRR817000/SRR817000.sra

then I used the sra toolkit to dump it to fastq file and mapped the data with bowtie

in UF HPC terminal:

module load sra

fastq-dump SRR817000.sra

the resulting fastq file is around 9Gb.

PBS file for mapping:

[mtang@dev1 MCF7_GRO_seq]$ cat bowtie.pbs

#!/bin/bash

#PBS -N bowtie

#PBS -M mtang@ufl.edu

#PBS -m abe

#PBS -o bowtie.test.out

#PBS -e bowtie.test.err

#PBS -l nodes=1:ppn=8

#PBS -l pmem=3000mb

#PBS -l walltime=10:30:00

cd $PBS_O_WORKDIR

# Load the module for bowtie

module load bowtie

# Run bowtie

bowtie -p 8 --best /project/bio/bowtie/hg19 -q SRR817000.fastq -S SRR817000.sam

Essentially, the sequences are mapped back to genomic DNA as you see.

It took me around 30 mins to finish the mapping which is pretty fast compared with RNA-seq data.

using Tophat, same configuration as the pbs file above, it takes me 50hrs to finish mapping. RNA-seq data are bigger (~25Gb fastq) and more complicated when mapping (figure out exon-exon junctions).

Now, I have the sam file after bowtie mapping.

Next, I want to generate a bedGraph file for visualization.

There are several ways to do it:

1. use Homer http://biowhat.ucsd.edu/homer/

I strongly recommend you read the tutorials to understand the various sequencing techniques. Although the formatting is not that great, it is still pretty informative.

first make Tag directory

makeTagDirectory MCF7_gro_seq_ctrl/ SRR817000.sam

then make the bedgraph file for plus and minus strand respectively:

makeUCSCfile MCF7_gro_seq_ctrl/ -strand + -o minus.bedGraph

makeUCSCfile MCF7_gro_seq_ctrl/ -strand - -o plus.bedGraph

2. use bedtools genomecov

bedtools genomecov -ibam SRR817000.sorted.bam -bg -strand + > bedtools_plus.bedGraph

bedtools genomecov -ibam SRR817000.sorted.bam -bg -strand - > bedtools_minus.bedGraph

the bam file needs to be sorted first with samtools

it takes much longer for bedtools compared with Homer (Homer generates the tagDirectory for later quick access). gzip the bedGraph file to .gz

all the files I have:

[mtang@dev1 MCF7_GRO_seq]$ ls -sh .

total 19G

87M bedtools_plus.bedGraph.gz 4.0K fetchChromSizes 16M plus.bedGraph.gz 1.5G SRR817000.sorted.bam

4.0K bowtie.pbs 4.0K hg19.chrom.sizes 1.9G SRR817000.bam 1.2G SRR817000.sra

3.1M bowtie.test.err 4.0K MCF7_gro_seq_ctrl 8.6G SRR817000.fastq

0 bowtie.test.out 14M minus.bedGraph.gz 5.6G SRR817000.sam

3. Use HTSeq python package: http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html#tour

I then imported the bedgraph files to IGV and had a look at my favourite gene VEGFA.

I published my first paper on it:)

Restraint of angiogenesis by zinc finger transcription factor CTCF-dependent chromatin insulation

http://www.ncbi.nlm.nih.gov/pubmed/21896759

The first track is the minus-strand bedgraph got from Homer

the second track is the plus-strand bedgraph got from Homer

the third track is the plus-strand bedgraph got from bedtools genomecov

I see clearly short RNAs spiking at the promoter of VEGFa, suggesting that RNApol2 paused at promoter and does not undergo productive elongation. This gene is actively transcribed with full-length mRNA, but the promoter is still paused. ( read the 2008 science GRO-seq paper for more details of different groups of genes)

one thing I do noticed is that the bedgraph file from bedtools is bigger than the one from Homer 87M vs 16M, and if I zoom in I saw some small differences:

It looks like the bedgraph from bedtools exhibits a more detailed view of the data, but overall either bedgraph file would be fine for a quick inspection of interested genes.( I guess Homer uses some bigger bin for calculating or merge bins with small difference together to save space).

Finally, if you want to make a bigwig file for UCSC genome browser, you need to use bedGraphToBigWig from here http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

and follow http://www.biostars.org/p/42844/.

further readings: http://www.biostars.org/p/64495/#64680
http://davetang.org/muse/2012/03/15/visualising-rna-seq-like-data/
http://davetang.org/muse/2012/03/15/ucsc-genome-browser-custom-overlap-tracks/

Thursday, October 24, 2013

useful unix one liners for bioinformatics

I have been using ubuntu 12.04LTS for one year, and I really love it and do not look back at windows at all. Currently, I have dual booting system for linux and windows, but now I barely fire the windows.

The power of Linux comes from, at least in one aspect, its command line. As I have stated before, in the field of bioinformatics, we spend a lot of time in reformatting, cleaning, extracting the data. Linux commands become very handy in these routine jobs.
several good resources:
1. From the author of BWA samtools Heng Li http://lh3lh3.users.sourceforge.net/biounix.shtml
2. Stephen Turner from UVA http://gettinggeneticsdone.blogspot.com/2013/10/useful-linux-oneliners-for-bioinformatics.html
3. http://genomics-array.blogspot.com/2010/11/some-unixperl-oneliners-for.html
4. UT at Austin https://wikis.utexas.edu/display/bioiteam/Scott's+list+of+linux+one-liners very good resource for NGS analysis.

Most commonly used ones for me are:
head, tail, wc, tr, sort, uniq, cat, cut, paste, join, grep(or the more powerful ack), find, xargs, comm, diff, awk, sed etc

Everyday, I learn new stuffs.
yesterday, from twitter, I learned this one:

‏@vsbuffalo22h

Compare 2 arbitrary columns of different files: paste <(cut -f2 file1.txt) <(cut -f7 file2.txt) | awk '{if ($1 != $2) { print "do stuff"} }'

some others:
let's say you have a fasta file contain multiple sequences, and you want to split it to many files with one record per file.

tommy@tommy-ThinkPad-T420:~$ cat contig
>contig1
ATCGGGTC
>contig2
GCTCGTTCAA
>contig3
TACGGGGT

tommy@tommy-ThinkPad-T420:~$ cat contig | awk '/^>/{close("out"n);n++}{print > "out"n}'
tommy@tommy-ThinkPad-T420:~$ ls out*
out1 out2 out3
tommy@tommy-ThinkPad-T420:~$ cat out1
>contig1
ATCGGGTC
tommy@tommy-ThinkPad-T420:~$ cat out2
>contig2
GCTCGTTCAA
tommy@tommy-ThinkPad-T420:~$ cat out3
>contig3
TACGGGGT

print hidden characters ( cat -v is not what I want)
tommy@tommy-ThinkPad-T420:~$ sed -n l contig
>contig1$
ATCGGGTC$
>contig2$
GCTCGTTCAA$
>contig3$
TACGGGGT$

it can also print out the tabs as \t

I have a bed file that contains the Transcription start sites information, some genes have multiple records because they have isoforms, but I only want to use the unique TSSs for my downstream analysis. of course, you can do cut -f1-3 | sort | uniq, but I just show you another way to do it by awk

tommy@tommy-ThinkPad-T420:~$ head tss_-3kb_+3kb_hg19.txt
chr1 66996824 67002824 NM_032291 +
chr1 50486626 50492626 NM_032785 -
chr1 33543713 33549713 NM_052998 +
chr1 8381389 8387389 NM_001080397 +
chr1 25068759 25074759 NM_013943 +
chr1 16764166 16770166 NM_018090 +
chr1 16764166 16770166 NM_001145278 +
chr1 16764166 16770166 NM_001145277 +
chr1 92368559 92374559 NM_001195684 -
chr1 92348836 92354836 NM_001195683 -

you can see a record with identical first three columns, but with three different transcript ids. To get only the first occurrence of the TSS

tommy@tommy-ThinkPad-T420:~$ head tss_-3kb_+3kb_hg19.txt | awk '!a[$1,$2,$3]++'
chr1 66996824 67002824 NM_032291 +
chr1 50486626 50492626 NM_032785 -
chr1 33543713 33549713 NM_052998 +
chr1 8381389 8387389 NM_001080397 +
chr1 25068759 25074759 NM_013943 +
chr1 16764166 16770166 NM_018090 +
chr1 92368559 92374559 NM_001195684 -
chr1 92348836 92354836 NM_001195683 -

That just gives you a flavour of how powerful command lines are:)

Friday, October 18, 2013

group the same ID to the same line by python, use awk to do SQL like groupby and sum

I was on Seqanswer forum and found someone asked a question:
http://seqanswers.com/forums/showthread.php?p=119263&posted=1#post119263
"Hi guys,
i hv got a GO file for my differentially expressed genes file, it goes like:

FBgn00001 GO:0016301 [Name:****(annotation)]
FBgn00002 GO:0016301 [Name:****(annotation)]
FBgn00003 GO:0016301 [Name:****(annotation)]
FBgn00004 GO:0003700 [Name:****(annotation)]
FBgn00004 GO:0009651 [Name:****(annotation)]
FBgn00004 GO:0006355 [Name:****(annotation)]
FBgn00005 GO:0009556 [Name:****(annotation)]
FBgn00005 GO:0005515 [Name:****(annotation)]
FBgn00005 GO:0080019 [Name:****(annotation)]
FBgn00005 GO:0016563 [Name:****(annotation)]
FBgn00005 GO:0016627 [Name:****(annotation)]
FBgn00006 GO:0003700 [Name:****(annotation)]
FBgn00006 GO:0010018 [Name:****(annotation)]

now i want to use WEGO ,so i need to convert it like:

FBgn00001 GO:0016301
FBgn00002 GO:0016301
FBgn00003 GO:0016301
FBgn00004 GO:0003700 GO:0009651 GO:0006355
FBgn00005 GO:0009556 GO:0005515 GO:0080019 GO:0016563 GO:0016627
FBgn00006 GO:0003700 GO:0010018

I think this could be solved using a perl script. I am not able to do this since i am a beginner. Can someone help me out? A simple perl script is good enough for me^^ "

The question is on very basic text manipulation, and I quickly wrote a python script for that:

import csv
reader = csv.reader(open("GO.txt","r"), delimiter="\t")
new={}
for row in reader:
if row[0] not in new.keys():
new[row[0]] = [row[1]]
else:
new[row[0]].append(row[1])

with open("wego.txt","w") as f:
for key, value in sorted(new.items()):
f.write(key+"\t"+"\t".join(value)+"\n")

update 10/19/13, I used the gist in github to display the code

basically, I store the ID as the key, and a list of GO terms as the value, and then write the dictionary
to a txt file.
I was thinking if there are any easier ways to do it, I mean by awk or sed.
A quick google search
http://stackoverflow.com/questions/10286522/group-by-sum-from-shell/10289422#10289422

the question is a little bit different from the one above:

"I have a large file containing data like this:

a 23
b 8
a 22
b 1

I want to be able to get this:

a 45
b 9

"

awk array is very handy for that purpose see a post here http://www.thegeekstuff.com/2010/03/awk-arrays-explained-with-5-practical-examples/:
let's first creat the file:
tommy@tommy-ThinkPad-T420:~$ cat > foo.txt
a 23
b 8
a 22
b 1
#control+D to indicate cat this is the end of the file.

tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{new[$1]+=$2}END{for (key in new) print key, new[key]}'
a 45
b 9

#or you can redirect the result to a new file

tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{new[$1]+=$2}END{for (key in new) print key, new[key]}' > new.txt

I do not know whether the array value can hold multiple strings like a list in python.
-------------------------------------------
updated on 10/21/13, an awk one liner can solve this problem

tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{ if (new[$1]) new[$1]=new[$1]"\t"$2; else new[$1]=$2;} END { for (i in new) print i, new[i]}' OFS="\t"
a 23 22
b 8 1

Amazing awk!

Friday, October 11, 2013

A beginner’s guide to bioinformatics Fwd Homologus

See the posts here http://www.homolog.us/blogs/blog/2011/07/22/a-beginners-guide-to-bioinformatics-part-i/
and here http://www.homolog.us/blogs/blog/2011/07/22/a-beginners-guide-to-bioinformatics-part-ii/

5 layers of expertise in bioinformatics were mentioned:

"Layer 1 – Using web to analyze biological data
Layer 2 – Ability to install and run new programs
Layer 3 – Writing own scripts for analysis in PERL, python or R
Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java"

level 1-3 are required for all the biologists, while level 4-5 are left to experts. I am trying to accomplish level 3, and that will be good enough for my research:)

Tuesday, October 8, 2013

Does every new biology PhD student need to learn how to program?

The answer is Yes!
see the paper from Nature Biotech http://www.nature.com/nbt/journal/v31/n10/box/nbt.2721_BX1.html

"

Box 1: Does every new biology PhD student need to learn how to program?

Durbin: That's a little like asking is it good for a molecular biologist to know chemistry. I would say that computation is now as important to biology as chemistry is. Both are useful background knowledge. Data manipulation and use of information are part of the technology of biology research now. Knowing how to program also gives people some idea about what's going on inside data analysis. It helps them appreciate what they can and can't expect from data analysis software.

Trapnell: It's probably not just that experimental biologists need to program, but it's also enormously helpful when computational folks learn how to do experiments. For me, for example, coming from a computer science background, the opposite way of thinking was hard to learn. How do I learn to argue with wet-lab data? How do I learn what to trust, what to distrust, how to cross-validate things? That's a radically different way of thinking when you're used to proofs and writing code and validating it on a computer.

Krzywinski: To some, the answer might be “no” because that's left to the experts, to the people downstairs who sit in front of a computer. But a similar question would be: does every graduate student in biology need to learn grammar? Clearly, yes. Do they all need to learn to speak? Clearly, yes. We just don't leave it to the literature experts. That's because we need to communicate. Do students need to tie their shoes? Yes. It has now come to the point where using a computer is as essential as brushing your teeth. If you want some kind of a competitive edge, you're going to want to make as much use of that computer as you can. The complexity of the task at hand will mean that canned solutions don't exist. It means that if you're using a canned solution, you're not at the edge of research.

Robinson: Yes. Even if they don't program in their research, they will have to use software and likely will communicate with software developers. It helps tremendously to have some basic knowledge. Additionally, in the research environment, the ability to do basic tasks in the Linux/Unix environment is essential.

Rasband: All scientists should learn how to program."

Monday, October 7, 2013

choose random lines from a file

I have a bed file which contains promoter regions. I want to get 1000 random promoters from it. First I think I would write a python script using the random module, but before I do it, I did a quick google search since it is such an ordinary job, there may be tools out there already.

http://stackoverflow.com/questions/9245638/select-random-lines-from-a-file-in-bash
http://stackoverflow.com/questions/448005/whats-an-easy-way-to-read-random-line-from-a-file-in-unix-command-line

I like the sort -R solution

Sort the file randomly and pick first 100 lines:

$ sort -R input | head -n 100 >output

but it would be very slow if the file is very huge. The promoter file only contains 50k lines, so it is not a very big deal.

It turns out that shuf is a built-in command for this kind of task.

Use shuf as shown below, to get N random lines:

shuf -n N input > output

of course, there are other solutions using awk and sed, but why not use this simple one:)
Linux is awesome!

By the way, the random command from the bedtools http://bedtools.readthedocs.org/en/latest/content/tools/random.html generates random bed Intervals from a genome. If that's what you want, use it.

Thursday, October 3, 2013

compare ChIP-seq data for different conditions and different transcription factors?

I have done a lot of ChIP-seq data analysis, but most of them are just mapping to reference genome and then calling peaks (use MACS from Shirley liu's lab at Harvard). A little bit more is to intersect different peak bed files with bedtools and to make heatmaps.

The reads mapping is much simpler and faster than RNA-seq data. The data are smaller (~5Gb compare to RNA-seq ~25Gb). I recently asked myself how to compare different ChIP-seq data for different conditions ( control vs knockdown, different developmental stages etc).

I quick google search:
http://seqanswers.com/forums/showthread.php?t=31048&highlight=compare+ChIP-seq
http://www.biostars.org/p/10609/
http://www.biostars.org/p/42291/

The available packages are:
"- ChIPDiff, which you already mentioned but which I haven't tried, but it was actually developed for histone marks so I'd be surprised if it didn't work for those?! I recall someone saying that there was some other issue with it (sorry I can't be more specific)

- DiffBind, which you also mentioned (http://www.bioconductor.org/packages...iffBind.html); it uses DESeq internally

- DBChIP (http://pages.cs.wisc.edu/~kliang/DBChIP/), which appears to use edgeR

- You can also use edgeR and DESeq directly. The DESeq paper shows you how to re-analyze differential TF binding data in the Kasowski et al Science paper (http://www.sciencemag.org/content/328/5975/232.short). "

After reading a little bit http://seqanswers.com/forums/showthread.php?t=31048&highlight=compare+ChIP-seq, I decided to give diffReps http://code.google.com/p/diffreps/ a try. The author says it is suitable for both histone modifications and transcription factors.
The paper was published in PLOSONE, it compares the results with that from DESeq and others.
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0065598
The author Shen Li developed the other package ngsplot.

Another interesting R package is here http://genomebiology.com/2013/14/4/R38
mentioned by me before http://crazyhottommy.blogspot.com/2013/07/jmosaics-joint-analysis-of-multiple.html
I will have to give it a try also.

At the same time, I am reading the Nature protocol:
http://www.nature.com/nprot/journal/v7/n1/full/nprot.2011.420.html

2011 Dec 15;7(1):45-61. doi: 10.1038/nprot.2011.420.

A computational pipeline for comparative ChIP-seq analyses.

it gives you a much better understanding of how to make the comparisons.

=========================
update 11/19/13

A table from the newly published paper

Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data

http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003326

Table S3. Software packages for the analysis of differential binding in ChIP-seq.

Software tool	Availability	Notes
ChIPDiff [36]	http://cmb.gis.a-star.edu.sg/ChIPSeq/paperChIPDiff.htm	Differential histone modification sites using a hidden Markov model
Comparative ChIP-seq [25]	http://www.starklab.org/data/bardet_natprotoc_2011/	Fold change ratio between normalized peak heights
DBChIP [33]	http://pages.cs.wisc.edu/~kliang/DBChIP/	Assigns uncertainty measures in a test of non-differential binding (uses edgeR)
DESeq^§[31]	http://www.bioconductor.org/packages/release/bioc/html/DESeq.html	Test based on a model using the negative binomial distribution
DiffBind	http://www.bioconductor.org/packages/release/bioc/html/DiffBind.html	Differential binding affinity analysis (uses edgeR and DESeq)
DIME [35]	http://cran.r-project.org/web/packages/DIME/	Differential identiﬁcation using mixtures ensemble
edgeR^§[32]	http://www.bioconductor.org/packages/release/bioc/html/edgeR.html	Empirical Bayes estimation and exact tests based on the negative binomial distribution
MACS [17] (version 2)	https://github.com/taoliu/MACS/	Differential peak detection based on paired four bedGraph files
MAnorm [34]	http://bcb.dfci.harvard.edu/~gcyuan/MAnorm/MAnorm.htm	Robust regression to derive a linear model
MMDiff	http://bioconductor.org/packages/release/bioc/html/MMDiff.html	Differences in shape using Kernel methods
NarrowPeaks	http://bioconductor.org/packages/release/bioc/html/NarrowPeaks.html	Shape-based analysis of variation using functional PCA
POLYPHEMUS [37]	http://cran.r-project.org/web/packages/polyphemus/	Non-linear normalization on RNA Pol II profiling

^§Originally developed for gene expression count data.

=====================
updated on 02/06/2014

Two more useful links:
http://www.r-bloggers.com/methods-of-calling-differential-region-of-chip-seq/
http://omictools.com/differential-peak-calling/

Tuesday, October 1, 2013

RNA-seq analysis samtools sort and htseq-count

I am following the RNA-seq nature protocol from Simon Anders. See my previous post here:http://crazyhottommy.blogspot.com/2013/08/two-nature-protocols-for-rna-seq.html

However, I am using some human data from this paper:
Epigenomic plasticity enables human pancreatic α to β cell reprogramming
http://www.ncbi.nlm.nih.gov/pubmed/23434589

I downloaded the SRA files, converted to fastq files ( 3 files for α cells 3 for β cells, each is around 24Gb)
I then mapped them with tophat http://crazyhottommy.blogspot.com/2013/09/the-backtick-in-linux-command.html
it took around 50 hrs for each file http://crazyhottommy.blogspot.com/2013/09/tophat-run-time.html
using 8 threads and 3Gb memory for each thread on the HPC cluster at University of Florida.

After tophat mapping, I got six bam files. each file is around 5Gb. Following the nature protocol, I need to sort them by name with samtools sort and convert to sam file for htseq-count http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count.

The problem I encountered is that the bam file is too big that the merging of the sorted small bam files never occur.

module load samtools

samtools sort -n my.bam my_sn

I googled and found http://www.biostars.org/p/18933/

Taking advantage of the multi-thread property of samtools:

samtools sort -n -@ 8 -m 3G my.bam my_sn

it worked and was much faster.

There are many ways to get the counts for each gene.

there are some issues for the bedtools multicov when feeding the bed files converted from gtf file.

as mentioned here http://genomespot.blogspot.com/2013/06/generate-rna-seq-count-matrix-part-1.html?showComment=1380634400271#c1772339925972310636

and here http://www.cureffi.org/2013/09/12/counts-vs-fpkms-in-rna-seq/

" The fact that in the above example and many others throughout the genome that there are overlapping exons. If we don't collapse them, then reads assigned to those overlapping regions could be counted multiple times and provide an inaccurate profile of expression.

The solution is to merge overlapping exons right? Well yes, but there is another major caveat in that exons from different genes can sometimes have overlapping exons and we can't collapse exons from different genes, otherwise we wouldn't be able to distinguish them. The solution is to merge exons which only belong to the same gene. If there is a read which aligns to a region with more than one annotation, then it is counted twice because we can't differentiate between the two genes at that position."

I am going to stick to the nature protocol using htseq-count (Simon Anders wrote this python package HTSeq), but please keep that in mind. Different sources of gtf files are a little bit different.

commonly we get them from UCSC table browser http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format, ENSEMBL ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/ or GENCODE http://www.gencodegenes.org/

first several lines of the gtf file from UCSC:

chr6 refGene exon 26924772 26924962 . + . gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "1"; exon_id "NR_026775.1"; gene_name "LINC00240";

chr6 refGene exon 26936237 26936301 . + . gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "2"; exon_id "NR_026775.2"; gene_name "LINC00240";

first several lines of the gtf file from ENSEMBL:

HG104_HG975_PATCH retained_intron exon 144766622 144766712 . + . gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "1"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00002614945";

HG104_HG975_PATCH retained_intron exon 144772227 144772293 . + . gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "2"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00003062908";

first several lines of the gtf file from GENCODE:

chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

when using htseq

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count, you need to specify the -i flag to gene_name if you prefer the gene names, otherwise, it will give you a count table with the ENSEMBL gene_id gene_id like "ENSG00000223972.4". Remember htseq is designed to count for genes http://seqanswers.com/forums/showthread.php?t=18068.

-i <id attribute>, --idattr=<id attribute>

GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.

on the HPC cluster:

module load htseq

htseq-count -s no -a 10 -t exon -i gene_name my_sn.sam gencode.v18.annotation.gtf > my.count

the sam file is around 25Gb, it took around 1.5 hr to finish the counting for each file.

Finally, I got the count table for all of them.

first several lines of the count file

0R7H2P 20

5S_rRNA 54

7SK 170

A1BG 14

A1BG-AS1 64

A1CF 292

A2M 252

A2M-AS1 14

A2ML1 352

A2ML1-AS1 14

A2ML1-AS2 4

A2MP1 91

Now, I am ready to use DESeq R package for the downstream analysis. I will post later.

Diving into Genetics and Genomics

My github papge