Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Saturday, April 6, 2013

How to make TSS plot using RNA-seq and ChIP-seq data

many times, we want to plot the ChIP-seq signal across the TSS in a genome wide scale.
I figured out how to do it by several ways. Using  Encode K562 cells H3K4Me3 ChIP-seq data as an example 

1.  Using HTSeq python package 

1) Download the H3K4me3 ChIP-seq and RNA-seq data sets.
wgEncodeBroadHistoneK562H3k4me1StdAlnRep1.bam                   29-Oct-2010 11:03  919M 

2) prepare the gtf file for hg19 version
http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
And set the permissions:
$ chmod 600 .hg.conf
Now you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file refGene.gtf:
$ genePredToGtf -utr hg19 refGene refGene.gtf
add -utr option to add 5'UTR information which contains the TSS position.
for RNA-seq data analysis we need the refGene.gtf (contains TSS and exons) 
for ChIP-seq plotting at TSS, we need to extract TSSs from refGene.gtf
it can be done easily by linux command line
cat refGene.gtf | grep 5UTR > hg19.TSS.gtf
3)  group the genes to high, medium, low expression by analyzing the RNA-seq data
#RNA-Seq analysis
#The following block is to sort the gene expression data into three groups based on level of expression.
#Use python to count the number of tags of genes obtained via RNA-Seq and write that information into a new output file.
bam file is converted to sam file by samtools, because the htseq count script only takes sam file as input.
samtools view -h -o K562RNAseqRep1V2.sam K562RNAseqRep1V2.bam
python -m HTSeq.scripts.count -s no K562RNASeqRep1V2.sam hg19_UTR_exon.gtf > K562_htseq.count.out
#stranded=count of how many tags in each gene. Add up all the counts for each exon.
#Once we have the K562_htseq_count.out file, first cound the number of genes that are there.
cat K562_htseq_count.out | head -n -5 > K562_htseq_count.out.clean # get rid of the last five lines which  are the summary of the count results
cat K562_htseq_count.out.clean | wc -l
#For the working file we had 23705 counts.
#So each group will have 1/3 of the total (7902 genes) and highest tag density will be the top 33% followed by mid 33% and low 33%.
#Now sort the results into three groups depending on tag density in Linux command line.
#Sort the file according to the second column and group into three groups.
cat K562_htseq.count.out.clean | sort -k2,2nr| head -7902 > top33_percent.txt
cat K562_htseq.count.out.clean | sort -k2,2nr| tail -15803 | head -7902 > mid33_percent.txt
cat K562_htseq.count.out.clean | sort -k2,2nr| tail -7902 > low33_percent.txt
4)python code:
update on 10/20/13, I put my code in gist and embed it here
it generates figure below.
2.  use the ngsplot package 
follow the installation instruction.
at command line
ngs.plot.r -G hg19 -R genebody -C  config.k562.txt -O K562.H3k4me3.genebody -T H3k4me3.genebody -L 3000 -FL 300
# this is different from the plotting in method 1, I am plotting across the gene body,
# you can also use -R tss  to produce the same plot above.
# the bam file must be sorted by samtools first
the config.k562.txt file is like this: tab delimited 
K562H3k4me3.sorted.bam          top30_percent.txt           "High" 
K562H3k4me3.sorted.bam           medium30_percent.txt   "med" 
K562H3k4me3.sorted.bam           low30_percent.txt           "low"
it generates a tss plot and a heat map 
===============================
update on 11/19/13 
see this post on biostar http://www.biostars.org/p/83800/ it is very slow though, I tried it on my desktop (~4Gb Ram one core), it took some time to finish.

13 comments:

  1. if you are not familiar with linux and python.
    you can try Seqmonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/Help/3%20Visualisation/3.2%20Figures%20and%20Graphs/3.2.6%20The%20Probe%20Trend%20Plot.html
    or Seqminer http://bips.u-strasbg.fr/seqminer/tiki-index.php?page=Clusters+heatmap&structure=seqMINER+Wiki

    They can produce similar graphs.

    ReplyDelete
  2. This tool from shirley liu's lab at harvard can make similar plots:
    http://liulab.dfci.harvard.edu/CEAS/

    ReplyDelete
    Replies
    1. For the CEAS package, a .bed and .wig file is required. What is the best way to convert a .bam into the .bed and .wig that CEAAS will accept? I have found several scripts that do the conversion into .wig, but I am not sure the proper way to make both so that CEAS will accept it.

      Thank you very much!

      Delete
    2. Hi, just a quick google
      use the bedtool http://seqanswers.com/forums/showthread.php?t=7638

      a post from Biostar is also very informative
      http://www.biostars.org/p/2699/

      Delete
  3. Is it possible to create a similar picture without the bam files? I got a file with chromosome, start, end, distance, score. What is the best way to approach?

    ReplyDelete
    Replies
    1. Is it a bed peak file generated by MACS or other peak calling software?

      Delete
    2. Sorry for the delayed responds. I got MACS output file, with chromosome, start, end, tags ect. is it possible to use your Python script?

      Delete
    3. MACS outputs the bed file and that's the peak file, you can not use this script for this purpose. you need to use the raw read file bam file or the wiggle file.

      Delete
  4. This comment has been removed by the author.

    ReplyDelete
  5. Hii Tommy Tang

    Can you please provide a python script for plotting the RNQ Seq data to find differentially expressed genes. Thank you !!

    ReplyDelete
  6. I really need these resource, are they all free for using? CP

    ReplyDelete
  7. Great jobs !
    Now I know how to draw that kind of figures by myself .
    Thank you very much.

    ReplyDelete