However, I am using some human data from this paper:
Epigenomic plasticity enables human pancreatic α to β cell reprogramming
http://www.ncbi.nlm.nih.gov/pubmed/23434589
I downloaded the SRA files, converted to fastq files ( 3 files for α cells 3 for β cells, each is around 24Gb)
I then mapped them with tophat http://crazyhottommy.blogspot.com/2013/09/the-backtick-in-linux-command.html
it took around 50 hrs for each file http://crazyhottommy.blogspot.com/2013/09/tophat-run-time.html
using 8 threads and 3Gb memory for each thread on the HPC cluster at University of Florida.
After tophat mapping, I got six bam files. each file is around 5Gb. Following the nature protocol, I need to sort them by name with samtools sort and convert to sam file for htseq-count http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count.
The problem I encountered is that the bam file is too big that the merging of the sorted small bam files never occur.
module load samtools
samtools sort -n my.bam my_sn
I googled and found http://www.biostars.org/p/18933/
Taking advantage of the multi-thread property of samtools:
samtools sort -n -@ 8 -m 3G my.bam my_sn
it worked and was much faster.
There are many ways to get the counts for each gene.
popular ones:
there are some issues for the bedtools multicov when feeding the bed files converted from gtf file.
as mentioned here http://genomespot.blogspot.com/2013/06/generate-rna-seq-count-matrix-part-1.html?showComment=1380634400271#c1772339925972310636
" The fact that in the above example and many others throughout the genome that there are overlapping exons. If we don't collapse them, then reads assigned to those overlapping regions could be counted multiple times and provide an inaccurate profile of expression.
The solution is to merge overlapping exons right? Well yes, but there is another major caveat in that exons from different genes can sometimes have overlapping exons and we can't collapse exons from different genes, otherwise we wouldn't be able to distinguish them. The solution is to merge exons which only belong to the same gene. If there is a read which aligns to a region with more than one annotation, then it is counted twice because we can't differentiate between the two genes at that position."
I am going to stick to the nature protocol using htseq-count (Simon Anders wrote this python package HTSeq), but please keep that in mind. Different sources of gtf files are a little bit different.
commonly we get them from UCSC table browser http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format, ENSEMBL ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/ or GENCODE http://www.gencodegenes.org/
first several lines of the gtf file from UCSC:
chr6 refGene exon 26924772 26924962 . + . gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "1"; exon_id "NR_026775.1"; gene_name "LINC00240";
chr6 refGene exon 26936237 26936301 . + . gene_id "LINC00240"; transcript_id "NR_026775"; exon_number "2"; exon_id "NR_026775.2"; gene_name "LINC00240";
first several lines of the gtf file from ENSEMBL:
HG104_HG975_PATCH retained_intron exon 144766622 144766712 . + . gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "1"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00002614945";
HG104_HG975_PATCH retained_intron exon 144772227 144772293 . + . gene_id "ENSG00000261516"; transcript_id "ENST00000561901"; exon_number "2"; gene_name "ZNF707"; gene_biotype "protein_coding"; transcript_name "ZNF707-012"; exon_id "ENSE00003062908";
first several lines of the gtf file from GENCODE:
chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
when using htseq
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count, you need to specify the -i flag to gene_name if you prefer the gene names, otherwise, it will give you a count table with the ENSEMBL gene_id gene_id like "ENSG00000223972.4". Remember htseq is designed to count for genes http://seqanswers.com/forums/showthread.php?t=18068.
on the HPC cluster:
module load htseq
htseq-count -s no -a 10 -t exon -i gene_name my_sn.sam gencode.v18.annotation.gtf > my.count
the sam file is around 25Gb, it took around 1.5 hr to finish the counting for each file.
Finally, I got the count table for all of them.
first several lines of the count file
0R7H2P 20
5S_rRNA 54
7SK 170
A1BG 14
A1BG-AS1 64
A1CF 292
A2M 252
A2M-AS1 14
A2ML1 352
A2ML1-AS1 14
A2ML1-AS2 4
A2MP1 91
Now, I am ready to use DESeq R package for the downstream analysis. I will post later.
Great post Tommy! I have found featureCounts to be quite useful for read counting. It has many options and is much faster than HT-Seq or Bedtools Multicov. It can even be run from within R.
ReplyDeletehttp://www.ncbi.nlm.nih.gov/pubmed/24227677
Thank you, I keep learning everyday. I am aware of featureCounts and it looks like it runs faster!
DeleteDear tommy and freinds, hello, I used HTseq count for my data. As an imput I just put a sam file in the command line of HTseq and ultimately I got a file output. Then I imported to my R studio to analyses RNA seq by the DESEQ2. But I don not know how i can use the HTseq count columns or variable received by HTseq as input for deseq2. For example the first columns of my HTseq conut output is the name of genes list but I do not know other columns and how they should be used in Rstudio for the DEseq analyzing. Would you please help me out and tell me which columns of HTseq should be considered after introducing them?
Deletemy eamil is hamid_fiuji@yahoo.com i would appreciate if you could send your reply to my
email
sincerely yours
Hamid
Where can I download the SRA files?
ReplyDeletego to NCBI GEO and search http://www.ncbi.nlm.nih.gov/geo/
DeleteI use htseq-count for a RNAseq, it constantly gives error Error occured when reading beginning of SAM/BAM file.
ReplyDelete[Exception type: StopIteration, raised in count.py:84]. Do you know how to solve this problem. Thanks
I just googled and found this http://seqanswers.com/forums/showthread.php?t=36022
DeleteAll thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
ReplyDeleteEmail: tdameritrade077@gmail.com