Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Saturday, April 6, 2013

How to make TSS plot using RNA-seq and ChIP-seq data

many times, we want to plot the ChIP-seq signal across the TSS in a genome wide scale.
I figured out how to do it by several ways. Using  Encode K562 cells H3K4Me3 ChIP-seq data as an example 

1.  Using HTSeq python package 

1) Download the H3K4me3 ChIP-seq and RNA-seq data sets.
wgEncodeBroadHistoneK562H3k4me1StdAlnRep1.bam                   29-Oct-2010 11:03  919M 

2) prepare the gtf file for hg19 version
http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
And set the permissions:
$ chmod 600 .hg.conf
Now you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file refGene.gtf:
$ genePredToGtf -utr hg19 refGene refGene.gtf
add -utr option to add 5'UTR information which contains the TSS position.
for RNA-seq data analysis we need the refGene.gtf (contains TSS and exons) 
for ChIP-seq plotting at TSS, we need to extract TSSs from refGene.gtf
it can be done easily by linux command line
cat refGene.gtf | grep 5UTR > hg19.TSS.gtf
3)  group the genes to high, medium, low expression by analyzing the RNA-seq data
#RNA-Seq analysis
#The following block is to sort the gene expression data into three groups based on level of expression.
#Use python to count the number of tags of genes obtained via RNA-Seq and write that information into a new output file.
bam file is converted to sam file by samtools, because the htseq count script only takes sam file as input.
samtools view -h -o K562RNAseqRep1V2.sam K562RNAseqRep1V2.bam
python -m HTSeq.scripts.count -s no K562RNASeqRep1V2.sam hg19_UTR_exon.gtf > K562_htseq.count.out
#stranded=count of how many tags in each gene. Add up all the counts for each exon.
#Once we have the K562_htseq_count.out file, first cound the number of genes that are there.
cat K562_htseq_count.out | head -n -5 > K562_htseq_count.out.clean # get rid of the last five lines which  are the summary of the count results
cat K562_htseq_count.out.clean | wc -l
#For the working file we had 23705 counts.
#So each group will have 1/3 of the total (7902 genes) and highest tag density will be the top 33% followed by mid 33% and low 33%.
#Now sort the results into three groups depending on tag density in Linux command line.
#Sort the file according to the second column and group into three groups.
cat K562_htseq.count.out.clean | sort -k2,2nr| head -7902 > top33_percent.txt
cat K562_htseq.count.out.clean | sort -k2,2nr| tail -15803 | head -7902 > mid33_percent.txt
cat K562_htseq.count.out.clean | sort -k2,2nr| tail -7902 > low33_percent.txt
4)python code:
update on 10/20/13, I put my code in gist and embed it here
#! /usr/bin/env python
# group the genes according to expression level
# analyze RNAseq data by counting tags for each gene using HTSeq.scripts.count or use bedtools muticov
# it genrates a file (K562_htseq_count.out.clean) with two columns, column 1 are gene names, column 2 are
#counts that mapped to all the exons of the same gene.
# compare the counts from different methods! and visualize them in IGV browser.
# top 30% midum 30% and low 30% gene names were obtained by linux command line
# sort -k2 -nrs K562_htseq_count.out.clean | wc -l
# sort -k2 -nrs K562_htseq_count.out.clean | head -n7992 > top30_percent.txt
# note that the counts are orignial raw data which need to be normalized, but for this simple test, the
# counts are proportional to the expression levles especially in the same sample.
# to compare gene expression from different samples, need to take account in the library size, the
# depth of sequencing and the length of genes.(longer genes have more counts by chance)
# to calculate the differential expression of genes, use the bioconductor(R) package DEseq which was written
# by Simon Andrew, the same guy wrote the HTSeq!(Seqmonk is written by hime also...) or differential exon usage by DEXseq package.
# samtools was used to prepare sorted and indexed bam file. bam file was converted to sam file by samtools also.
# HTSeq.scripts.count only takes sam file as input.
# 03/12/13 modified for shell use
import sys
import os
Usage='''this program takes in five input files, sortedbamfile,
GFF file and grouped gene lists. it generates the transcription factor
binding or histone modification profiling around the TSS in a genome wide scale.
How to use:
in a shell type
HTSeq_TSS4.py sorted.bam hg19.gtf top30_gene.txt medium30_gene.txt low30_gene.txt
'''
if len(sys.argv) <2 :
print Usage
else:
sortedbamfile=sys.argv[1]
gtf_file=sys.argv[2]
top30_genes=sys.argv[3]
medium30_genes=sys.argv[4]
low30_genes=sys.argv[5]
def TSS_Profile(ifile1,ifile2,ifile3):
'''read in three files, ifile1 is the sortedbamfile prepared by samtool
ifile2 is the GFF file, ifile3 is the grouped gene list'''
import sys
import HTSeq
import numpy
import itertools
sortedbamfile=HTSeq.BAM_Reader(ifile1)
gtffile=HTSeq.GFF_Reader(ifile2)
group_genes=open(ifile3)
halfwinwidth=3000
fragmentsize=200
gtf_dict={} #make a dictionary, key is the gene_name, value is the iv.start_d_as_pos
for feature in gtffile:
gtf_dict[feature.name]=feature.iv.start_d_as_pos
tsspos=set() #extract TSS for each group of genes
for line in group_genes:
linelist=line.split()
try:
tsspos.add(gtf_dict[linelist[0]])
except:
continue #in case the key is not in the gtf file
# highexpr=set()
#for line in open(highexpression):
# highexpr.add( line.split()[0] )
# for feature in gtf_file:
# if feature.name in highexpr:
# do_something()
#As you can see, using a container to store the information from one file in memory
#allows you to process your two files separately and so avoid the nested loop. This,
# of course is a very general and basic design pattern that you will encounter very often.
#(The different kinds of data containers that programming languages offer and how and when
#to use them is probably the next thing you need to study to further your programming skills.)
for tss in itertools.islice(tsspos,10):
sys.stderr.write("first 10 transcription start sites in %s are %s \n" %(ifile3,tss))
profile=numpy.zeros(2*halfwinwidth, dtype='i')
for p in tsspos:
try:
window=HTSeq.GenomicInterval(p.chrom, p.pos-halfwinwidth-fragmentsize,p.pos+halfwinwidth+fragmentsize,".")
for almnt in sortedbamfile[window]:
almnt.iv.length=fragmentsize
if p.strand=="+":
start_in_window=almnt.iv.start- p.pos +halfwinwidth
end_in_window =almnt.iv.end - p.pos +halfwinwidth
else:
start_in_window=p.pos+halfwinwidth-almnt.iv.end
end_in_window =p.pos+halfwinwidth-almnt.iv.start
start_in_window=max(start_in_window,0)
end_in_window=min(end_in_window, 2*halfwinwidth)
if start_in_window >= 2*halfwinwidth or end_in_window <0:
continue
profile[start_in_window : end_in_window] +=1
except:
continue
return profile
halfwinwidth=3000
profile1=TSS_Profile(sortedbamfile,gtf_file,top30_genes)
profile2=TSS_Profile(sortedbamfile,gtf_file,medium30_genes)
profile3=TSS_Profile(sortedbamfile,gtf_file,low30_genes)
from matplotlib import pyplot
import numpy
line1=pyplot.plot(numpy.arange(-halfwinwidth, halfwinwidth), profile1, color="red",label="high")
line2=pyplot.plot(numpy.arange(-halfwinwidth, halfwinwidth), profile2, color="blue",label="medium")
line3=pyplot.plot(numpy.arange(-halfwinwidth, halfwinwidth), profile3, color="green",label="low")
pyplot.legend()
pyplot.xlabel("distance related to TSS bp")
pyplot.ylabel("tag density")
pyplot.title("%s enrichment around TSS in different expression gene groups" %sortedbamfile)
pyplot.show()
view raw TSS_profile.py hosted with ❤ by GitHub
it generates figure below.
2.  use the ngsplot package 
follow the installation instruction.
at command line
ngs.plot.r -G hg19 -R genebody -C  config.k562.txt -O K562.H3k4me3.genebody -T H3k4me3.genebody -L 3000 -FL 300
# this is different from the plotting in method 1, I am plotting across the gene body,
# you can also use -R tss  to produce the same plot above.
# the bam file must be sorted by samtools first
the config.k562.txt file is like this: tab delimited 
K562H3k4me3.sorted.bam          top30_percent.txt           "High" 
K562H3k4me3.sorted.bam           medium30_percent.txt   "med" 
K562H3k4me3.sorted.bam           low30_percent.txt           "low"
it generates a tss plot and a heat map 
===============================
update on 11/19/13 
see this post on biostar http://www.biostars.org/p/83800/ it is very slow though, I tried it on my desktop (~4Gb Ram one core), it took some time to finish.
# From http://www.biostars.org/p/83800/:
# "What I want to do is to plot reads of my histone marks (in bam file)
# around TSS with CpG and TSS without CpG (Essentially a coverage profile)."
#
#
# To install metaseq and dependencies, see:
#
#
# https://pythonhosted.org/metaseq/install.html
#
#
# To download the example data used here, make sure you're in the directory
# this script is saved in, and then use:
#
# git clone https://gist.github.com/a2e63a2fb93d05341de5.git demo_data
#
# (Or see https://gist.github.com/daler/a2e63a2fb93d05341de5 and download the
# files individually)
#
import metaseq
import pybedtools
import numpy as np
from matplotlib import pyplot as plt
bam = metaseq.genomic_signal('demo_data/h3k4me3-chr21.bam', 'bam')
cpg = pybedtools.BedTool('demo_data/cpg-chr21.bed.gz')
tss = pybedtools.BedTool('demo_data/tss-chr21.bed.gz')
# extend by 5 kb up/downstream
tss = tss.slop(b=5000, genome='hg19')
tss_with_cpg = tss.intersect(cpg, u=True)
tss_without_cpg = tss.intersect(cpg, v=True)
# change this to as many CPUs as you have in order to run in parallel
processes = 1
# each read will be extended 3' to a total size of this many bp
fragment_size = 200
# the region +/-5kb around each TSS will be split into a total of 100 bins,
# change as needed
bins = 100
x = np.linspace(-5000, 5000, bins)
# most of the work happens here
y1 = bam.array(tss_with_cpg, bins=bins, processes=processes, fragment_size=fragment_size)
y2 = bam.array(tss_without_cpg, bins=bins, processes=processes, fragment_size=fragment_size)
plt.rcParams['font.size'] = 11
plt.rcParams['font.family'] = 'Arial'
plt.plot(x, y1.mean(axis=0), label='with cpg', color='k')
plt.plot(x, y2.mean(axis=0), label='without cpg', color='r', linestyle='--')
plt.legend(loc='best')
plt.xlabel('Distance from TSS (bp)')
plt.ylabel('Mean H3K4me3 read density')
plt.show()
view raw metaseq_demo.py hosted with ❤ by GitHub

14 comments:

  1. if you are not familiar with linux and python.
    you can try Seqmonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/Help/3%20Visualisation/3.2%20Figures%20and%20Graphs/3.2.6%20The%20Probe%20Trend%20Plot.html
    or Seqminer http://bips.u-strasbg.fr/seqminer/tiki-index.php?page=Clusters+heatmap&structure=seqMINER+Wiki

    They can produce similar graphs.

    ReplyDelete
  2. This tool from shirley liu's lab at harvard can make similar plots:
    http://liulab.dfci.harvard.edu/CEAS/

    ReplyDelete
    Replies
    1. For the CEAS package, a .bed and .wig file is required. What is the best way to convert a .bam into the .bed and .wig that CEAAS will accept? I have found several scripts that do the conversion into .wig, but I am not sure the proper way to make both so that CEAS will accept it.

      Thank you very much!

      Delete
    2. Hi, just a quick google
      use the bedtool http://seqanswers.com/forums/showthread.php?t=7638

      a post from Biostar is also very informative
      http://www.biostars.org/p/2699/

      Delete
  3. Is it possible to create a similar picture without the bam files? I got a file with chromosome, start, end, distance, score. What is the best way to approach?

    ReplyDelete
    Replies
    1. Is it a bed peak file generated by MACS or other peak calling software?

      Delete
    2. Sorry for the delayed responds. I got MACS output file, with chromosome, start, end, tags ect. is it possible to use your Python script?

      Delete
    3. MACS outputs the bed file and that's the peak file, you can not use this script for this purpose. you need to use the raw read file bam file or the wiggle file.

      Delete
  4. This comment has been removed by the author.

    ReplyDelete
  5. Hii Tommy Tang

    Can you please provide a python script for plotting the RNQ Seq data to find differentially expressed genes. Thank you !!

    ReplyDelete
  6. I really need these resource, are they all free for using? CP

    ReplyDelete
  7. Great jobs !
    Now I know how to draw that kind of figures by myself .
    Thank you very much.

    ReplyDelete
  8. All thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
    Email: tdameritrade077@gmail.com

    ReplyDelete