I figured out how to do it by several ways. Using Encode K562 cells H3K4Me3 ChIP-seq data as an example
1. Using HTSeq python package
1) Download the H3K4me3 ChIP-seq and RNA-seq data sets.
wgEncodeCaltechRnaSeqK562R1x75dAlignsRep1V2.bam 26-Jul-2010 18:31 1.4G
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/
wgEncodeBroadHistoneK562H3k4me1StdAlnRep1.bam 29-Oct-2010 11:03 919M
2) prepare the gtf file for hg19 version
http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format
$ cat $HOME/.hg.conf db.host=genome-mysql.cse.ucsc.edu db.user=genomep db.password=passwordAnd set the permissions:$ chmod 600 .hg.confNow you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file refGene.gtf:$ genePredToGtf -utr hg19 refGene refGene.gtfadd -utr option to add 5'UTR information which contains the TSS position.for RNA-seq data analysis we need the refGene.gtf (contains TSS and exons)for ChIP-seq plotting at TSS, we need to extract TSSs from refGene.gtfit can be done easily by linux command linecat refGene.gtf | grep 5UTR > hg19.TSS.gtf3) group the genes to high, medium, low expression by analyzing the RNA-seq data#RNA-Seq analysis#The following block is to sort the gene expression data into three groups based on level of expression.#Use python to count the number of tags of genes obtained via RNA-Seq and write that information into a new output file.bam file is converted to sam file by samtools, because the htseq count script only takes sam file as input.samtools view -h -o K562RNAseqRep1V2.sam K562RNAseqRep1V2.bampython -m HTSeq.scripts.count -s no K562RNASeqRep1V2.sam hg19_UTR_exon.gtf > K562_htseq.count.out#stranded=count of how many tags in each gene. Add up all the counts for each exon.#Once we have the K562_htseq_count.out file, first cound the number of genes that are there.cat K562_htseq_count.out | head -n -5 > K562_htseq_count.out.clean # get rid of the last five lines which are the summary of the count resultscat K562_htseq_count.out.clean | wc -l#For the working file we had 23705 counts.#So each group will have 1/3 of the total (7902 genes) and highest tag density will be the top 33% followed by mid 33% and low 33%.#Now sort the results into three groups depending on tag density in Linux command line.#Sort the file according to the second column and group into three groups.cat K562_htseq.count.out.clean | sort -k2,2nr| head -7902 > top33_percent.txtcat K562_htseq.count.out.clean | sort -k2,2nr| tail -15803 | head -7902 > mid33_percent.txtcat K562_htseq.count.out.clean | sort -k2,2nr| tail -7902 > low33_percent.txt4)python code:update on 10/20/13, I put my code in gist and embed it here
it generates figure below.This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
#! /usr/bin/env python # group the genes according to expression level # analyze RNAseq data by counting tags for each gene using HTSeq.scripts.count or use bedtools muticov # it genrates a file (K562_htseq_count.out.clean) with two columns, column 1 are gene names, column 2 are #counts that mapped to all the exons of the same gene. # compare the counts from different methods! and visualize them in IGV browser. # top 30% midum 30% and low 30% gene names were obtained by linux command line # sort -k2 -nrs K562_htseq_count.out.clean | wc -l # sort -k2 -nrs K562_htseq_count.out.clean | head -n7992 > top30_percent.txt # note that the counts are orignial raw data which need to be normalized, but for this simple test, the # counts are proportional to the expression levles especially in the same sample. # to compare gene expression from different samples, need to take account in the library size, the # depth of sequencing and the length of genes.(longer genes have more counts by chance) # to calculate the differential expression of genes, use the bioconductor(R) package DEseq which was written # by Simon Andrew, the same guy wrote the HTSeq!(Seqmonk is written by hime also...) or differential exon usage by DEXseq package. # samtools was used to prepare sorted and indexed bam file. bam file was converted to sam file by samtools also. # HTSeq.scripts.count only takes sam file as input. # 03/12/13 modified for shell use import sys import os Usage='''this program takes in five input files, sortedbamfile, GFF file and grouped gene lists. it generates the transcription factor binding or histone modification profiling around the TSS in a genome wide scale. How to use: in a shell type HTSeq_TSS4.py sorted.bam hg19.gtf top30_gene.txt medium30_gene.txt low30_gene.txt ''' if len(sys.argv) <2 : print Usage else: sortedbamfile=sys.argv[1] gtf_file=sys.argv[2] top30_genes=sys.argv[3] medium30_genes=sys.argv[4] low30_genes=sys.argv[5] def TSS_Profile(ifile1,ifile2,ifile3): '''read in three files, ifile1 is the sortedbamfile prepared by samtool ifile2 is the GFF file, ifile3 is the grouped gene list''' import sys import HTSeq import numpy import itertools sortedbamfile=HTSeq.BAM_Reader(ifile1) gtffile=HTSeq.GFF_Reader(ifile2) group_genes=open(ifile3) halfwinwidth=3000 fragmentsize=200 gtf_dict={} #make a dictionary, key is the gene_name, value is the iv.start_d_as_pos for feature in gtffile: gtf_dict[feature.name]=feature.iv.start_d_as_pos tsspos=set() #extract TSS for each group of genes for line in group_genes: linelist=line.split() try: tsspos.add(gtf_dict[linelist[0]]) except: continue #in case the key is not in the gtf file # highexpr=set() #for line in open(highexpression): # highexpr.add( line.split()[0] ) # for feature in gtf_file: # if feature.name in highexpr: # do_something() #As you can see, using a container to store the information from one file in memory #allows you to process your two files separately and so avoid the nested loop. This, # of course is a very general and basic design pattern that you will encounter very often. #(The different kinds of data containers that programming languages offer and how and when #to use them is probably the next thing you need to study to further your programming skills.) for tss in itertools.islice(tsspos,10): sys.stderr.write("first 10 transcription start sites in %s are %s \n" %(ifile3,tss)) profile=numpy.zeros(2*halfwinwidth, dtype='i') for p in tsspos: try: window=HTSeq.GenomicInterval(p.chrom, p.pos-halfwinwidth-fragmentsize,p.pos+halfwinwidth+fragmentsize,".") for almnt in sortedbamfile[window]: almnt.iv.length=fragmentsize if p.strand=="+": start_in_window=almnt.iv.start- p.pos +halfwinwidth end_in_window =almnt.iv.end - p.pos +halfwinwidth else: start_in_window=p.pos+halfwinwidth-almnt.iv.end end_in_window =p.pos+halfwinwidth-almnt.iv.start start_in_window=max(start_in_window,0) end_in_window=min(end_in_window, 2*halfwinwidth) if start_in_window >= 2*halfwinwidth or end_in_window <0: continue profile[start_in_window : end_in_window] +=1 except: continue return profile halfwinwidth=3000 profile1=TSS_Profile(sortedbamfile,gtf_file,top30_genes) profile2=TSS_Profile(sortedbamfile,gtf_file,medium30_genes) profile3=TSS_Profile(sortedbamfile,gtf_file,low30_genes) from matplotlib import pyplot import numpy line1=pyplot.plot(numpy.arange(-halfwinwidth, halfwinwidth), profile1, color="red",label="high") line2=pyplot.plot(numpy.arange(-halfwinwidth, halfwinwidth), profile2, color="blue",label="medium") line3=pyplot.plot(numpy.arange(-halfwinwidth, halfwinwidth), profile3, color="green",label="low") pyplot.legend() pyplot.xlabel("distance related to TSS bp") pyplot.ylabel("tag density") pyplot.title("%s enrichment around TSS in different expression gene groups" %sortedbamfile) pyplot.show() 2. use the ngsplot packagefollow the installation instruction.at command linengs.plot.r -G hg19 -R genebody -C config.k562.txt -O K562.H3k4me3.genebody -T H3k4me3.genebody -L 3000 -FL 300# this is different from the plotting in method 1, I am plotting across the gene body,# you can also use -R tss to produce the same plot above.# the bam file must be sorted by samtools firstthe config.k562.txt file is like this: tab delimitedK562H3k4me3.sorted.bam top30_percent.txt "High"K562H3k4me3.sorted.bam medium30_percent.txt "med"K562H3k4me3.sorted.bam low30_percent.txt "low"it generates a tss plot and a heat map===============================
update on 11/19/13see this post on biostar http://www.biostars.org/p/83800/ it is very slow though, I tried it on my desktop (~4Gb Ram one core), it took some time to finish.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# From http://www.biostars.org/p/83800/: | |
# "What I want to do is to plot reads of my histone marks (in bam file) | |
# around TSS with CpG and TSS without CpG (Essentially a coverage profile)." | |
# | |
# | |
# To install metaseq and dependencies, see: | |
# | |
# | |
# https://pythonhosted.org/metaseq/install.html | |
# | |
# | |
# To download the example data used here, make sure you're in the directory | |
# this script is saved in, and then use: | |
# | |
# git clone https://gist.github.com/a2e63a2fb93d05341de5.git demo_data | |
# | |
# (Or see https://gist.github.com/daler/a2e63a2fb93d05341de5 and download the | |
# files individually) | |
# | |
import metaseq | |
import pybedtools | |
import numpy as np | |
from matplotlib import pyplot as plt | |
bam = metaseq.genomic_signal('demo_data/h3k4me3-chr21.bam', 'bam') | |
cpg = pybedtools.BedTool('demo_data/cpg-chr21.bed.gz') | |
tss = pybedtools.BedTool('demo_data/tss-chr21.bed.gz') | |
# extend by 5 kb up/downstream | |
tss = tss.slop(b=5000, genome='hg19') | |
tss_with_cpg = tss.intersect(cpg, u=True) | |
tss_without_cpg = tss.intersect(cpg, v=True) | |
# change this to as many CPUs as you have in order to run in parallel | |
processes = 1 | |
# each read will be extended 3' to a total size of this many bp | |
fragment_size = 200 | |
# the region +/-5kb around each TSS will be split into a total of 100 bins, | |
# change as needed | |
bins = 100 | |
x = np.linspace(-5000, 5000, bins) | |
# most of the work happens here | |
y1 = bam.array(tss_with_cpg, bins=bins, processes=processes, fragment_size=fragment_size) | |
y2 = bam.array(tss_without_cpg, bins=bins, processes=processes, fragment_size=fragment_size) | |
plt.rcParams['font.size'] = 11 | |
plt.rcParams['font.family'] = 'Arial' | |
plt.plot(x, y1.mean(axis=0), label='with cpg', color='k') | |
plt.plot(x, y2.mean(axis=0), label='without cpg', color='r', linestyle='--') | |
plt.legend(loc='best') | |
plt.xlabel('Distance from TSS (bp)') | |
plt.ylabel('Mean H3K4me3 read density') | |
plt.show() |
if you are not familiar with linux and python.
ReplyDeleteyou can try Seqmonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/Help/3%20Visualisation/3.2%20Figures%20and%20Graphs/3.2.6%20The%20Probe%20Trend%20Plot.html
or Seqminer http://bips.u-strasbg.fr/seqminer/tiki-index.php?page=Clusters+heatmap&structure=seqMINER+Wiki
They can produce similar graphs.
This tool from shirley liu's lab at harvard can make similar plots:
ReplyDeletehttp://liulab.dfci.harvard.edu/CEAS/
For the CEAS package, a .bed and .wig file is required. What is the best way to convert a .bam into the .bed and .wig that CEAAS will accept? I have found several scripts that do the conversion into .wig, but I am not sure the proper way to make both so that CEAS will accept it.
DeleteThank you very much!
Hi, just a quick google
Deleteuse the bedtool http://seqanswers.com/forums/showthread.php?t=7638
a post from Biostar is also very informative
http://www.biostars.org/p/2699/
Is it possible to create a similar picture without the bam files? I got a file with chromosome, start, end, distance, score. What is the best way to approach?
ReplyDeleteIs it a bed peak file generated by MACS or other peak calling software?
DeleteSorry for the delayed responds. I got MACS output file, with chromosome, start, end, tags ect. is it possible to use your Python script?
DeleteMACS outputs the bed file and that's the peak file, you can not use this script for this purpose. you need to use the raw read file bam file or the wiggle file.
DeleteThis comment has been removed by the author.
ReplyDeleteHii Tommy Tang
ReplyDeleteCan you please provide a python script for plotting the RNQ Seq data to find differentially expressed genes. Thank you !!
I really need these resource, are they all free for using? CP
ReplyDeleteYes, they are all free.
DeleteGreat jobs !
ReplyDeleteNow I know how to draw that kind of figures by myself .
Thank you very much.
All thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
ReplyDeleteEmail: tdameritrade077@gmail.com