Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, October 1, 2013

RNA-seq analysis samtools sort and htseq-count

I am following the RNA-seq nature protocol from Simon Anders. See my previous post here:

However, I am using some human data from this paper:
Epigenomic plasticity enables human pancreatic α to β cell reprogramming

I downloaded the SRA files, converted to fastq files ( 3 files for α cells 3 for β cells, each is around 24Gb)
I then mapped them with tophat
it took around 50 hrs for each file
using 8 threads and 3Gb memory for each thread on the HPC cluster at University of Florida.

After tophat mapping, I got six bam files. each file is around 5Gb. Following the nature protocol, I need to sort them by name with samtools sort and convert to sam file for htseq-count

The problem I encountered is that the bam file is too big that the merging of the sorted small bam files never occur.

module load samtools
samtools sort -n my.bam my_sn

I googled and found
Taking advantage of the multi-thread property of samtools:

samtools sort -n -@ 8 -m 3G my.bam my_sn

it worked and was much faster.

There are many ways to get the counts for each gene. 
popular ones:

there are some issues for the bedtools multicov when feeding the bed files converted from gtf file.

" The fact that in the above example and many others throughout the genome that there are overlapping exons. If we don't collapse them, then reads assigned to those overlapping regions could be counted multiple times and provide an inaccurate profile of expression.

The solution is to merge overlapping exons right? Well yes, but there is another major caveat in that exons from different genes can sometimes have overlapping exons and we can't collapse exons from different genes, otherwise we wouldn't be able to distinguish them. The solution is to merge exons which only belong to the same gene. If there is a read which aligns to a region with more than one annotation, then it is counted twice because we can't differentiate between the two genes at that position."

I am going to stick to the nature protocol using htseq-count (Simon Anders wrote this python package HTSeq), but please keep that in mind. Different sources of gtf files are a little bit different. 

first several lines of the gtf file from UCSC:
chr6    refGene exon    26924772        26924962        .       +       .       gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "1"; exon_id "NR_026775.1"; gene_name "LINC00240";
chr6    refGene exon    26936237        26936301        .       +       .       gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "2"; exon_id "NR_026775.2"; gene_name "LINC00240";

first several lines of the gtf file from ENSEMBL:

HG104_HG975_PATCH       retained_intron exon    144766622       144766712       .       +       .        gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "1"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00002614945";
HG104_HG975_PATCH       retained_intron exon    144772227       144772293       .       +       .        gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "2"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00003062908";

first several lines of the gtf file from GENCODE:
chr1    HAVANA  gene    11869   14412   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

when using htseq, you need to specify the -i flag to gene_name if you prefer the gene names, otherwise, it will give you a count table with the ENSEMBL gene_id  gene_id like "ENSG00000223972.4". Remember htseq is designed to count for genes

-i <id attribute>--idattr=<id attribute>
GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.
on the HPC cluster:
module load htseq
htseq-count -s no -a 10 -t exon -i gene_name my_sn.sam gencode.v18.annotation.gtf > my.count

the sam file is around 25Gb, it took around 1.5 hr to finish the counting for each file.

Finally, I got the count table for all of them. 
first several lines of the count file
0R7H2P  20
5S_rRNA 54
7SK     170
A1BG    14
A1BG-AS1        64
A1CF    292
A2M     252
A2M-AS1 14
A2ML1   352
A2ML1-AS1       14
A2ML1-AS2       4
A2MP1   91

Now, I am ready to use DESeq R package for the downstream analysis.  I will post later.


  1. Great post Tommy! I have found featureCounts to be quite useful for read counting. It has many options and is much faster than HT-Seq or Bedtools Multicov. It can even be run from within R.

    1. Thank you, I keep learning everyday. I am aware of featureCounts and it looks like it runs faster!

    2. Dear tommy and freinds, hello, I used HTseq count for my data. As an imput I just put a sam file in the command line of HTseq and ultimately I got a file output. Then I imported to my R studio to analyses RNA seq by the DESEQ2. But I don not know how i can use the HTseq count columns or variable received by HTseq as input for deseq2. For example the first columns of my HTseq conut output is the name of genes list but I do not know other columns and how they should be used in Rstudio for the DEseq analyzing. Would you please help me out and tell me which columns of HTseq should be considered after introducing them?
      my eamil is i would appreciate if you could send your reply to my
      sincerely yours

  2. Where can I download the SRA files?

    1. go to NCBI GEO and search

  3. I use htseq-count for a RNAseq, it constantly gives error Error occured when reading beginning of SAM/BAM file.

    [Exception type: StopIteration, raised in]. Do you know how to solve this problem. Thanks

    1. I just googled and found this