I figured out how to do it by several ways. Using Encode K562 cells H3K4Me3 ChIP-seq data as an example
1. Using HTSeq python package
1) Download the H3K4me3 ChIP-seq and RNA-seq data sets.
wgEncodeCaltechRnaSeqK562R1x75dAlignsRep1V2.bam 26-Jul-2010 18:31 1.4G
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/
wgEncodeBroadHistoneK562H3k4me1StdAlnRep1.bam 29-Oct-2010 11:03 919M
2) prepare the gtf file for hg19 version
http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format
$ cat $HOME/.hg.conf db.host=genome-mysql.cse.ucsc.edu db.user=genomep db.password=passwordAnd set the permissions:$ chmod 600 .hg.confNow you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file refGene.gtf:$ genePredToGtf -utr hg19 refGene refGene.gtfadd -utr option to add 5'UTR information which contains the TSS position.for RNA-seq data analysis we need the refGene.gtf (contains TSS and exons)for ChIP-seq plotting at TSS, we need to extract TSSs from refGene.gtfit can be done easily by linux command linecat refGene.gtf | grep 5UTR > hg19.TSS.gtf3) group the genes to high, medium, low expression by analyzing the RNA-seq data#RNA-Seq analysis#The following block is to sort the gene expression data into three groups based on level of expression.#Use python to count the number of tags of genes obtained via RNA-Seq and write that information into a new output file.bam file is converted to sam file by samtools, because the htseq count script only takes sam file as input.samtools view -h -o K562RNAseqRep1V2.sam K562RNAseqRep1V2.bampython -m HTSeq.scripts.count -s no K562RNASeqRep1V2.sam hg19_UTR_exon.gtf > K562_htseq.count.out#stranded=count of how many tags in each gene. Add up all the counts for each exon.#Once we have the K562_htseq_count.out file, first cound the number of genes that are there.cat K562_htseq_count.out | head -n -5 > K562_htseq_count.out.clean # get rid of the last five lines which are the summary of the count resultscat K562_htseq_count.out.clean | wc -l#For the working file we had 23705 counts.#So each group will have 1/3 of the total (7902 genes) and highest tag density will be the top 33% followed by mid 33% and low 33%.#Now sort the results into three groups depending on tag density in Linux command line.#Sort the file according to the second column and group into three groups.cat K562_htseq.count.out.clean | sort -k2,2nr| head -7902 > top33_percent.txtcat K562_htseq.count.out.clean | sort -k2,2nr| tail -15803 | head -7902 > mid33_percent.txtcat K562_htseq.count.out.clean | sort -k2,2nr| tail -7902 > low33_percent.txt4)python code:update on 10/20/13, I put my code in gist and embed it here
it generates figure below.2. use the ngsplot packagefollow the installation instruction.at command linengs.plot.r -G hg19 -R genebody -C config.k562.txt -O K562.H3k4me3.genebody -T H3k4me3.genebody -L 3000 -FL 300# this is different from the plotting in method 1, I am plotting across the gene body,# you can also use -R tss to produce the same plot above.# the bam file must be sorted by samtools firstthe config.k562.txt file is like this: tab delimitedK562H3k4me3.sorted.bam top30_percent.txt "High"K562H3k4me3.sorted.bam medium30_percent.txt "med"K562H3k4me3.sorted.bam low30_percent.txt "low"it generates a tss plot and a heat map===============================
update on 11/19/13see this post on biostar http://www.biostars.org/p/83800/ it is very slow though, I tried it on my desktop (~4Gb Ram one core), it took some time to finish.
if you are not familiar with linux and python.
ReplyDeleteyou can try Seqmonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/Help/3%20Visualisation/3.2%20Figures%20and%20Graphs/3.2.6%20The%20Probe%20Trend%20Plot.html
or Seqminer http://bips.u-strasbg.fr/seqminer/tiki-index.php?page=Clusters+heatmap&structure=seqMINER+Wiki
They can produce similar graphs.
This tool from shirley liu's lab at harvard can make similar plots:
ReplyDeletehttp://liulab.dfci.harvard.edu/CEAS/
For the CEAS package, a .bed and .wig file is required. What is the best way to convert a .bam into the .bed and .wig that CEAAS will accept? I have found several scripts that do the conversion into .wig, but I am not sure the proper way to make both so that CEAS will accept it.
DeleteThank you very much!
Hi, just a quick google
Deleteuse the bedtool http://seqanswers.com/forums/showthread.php?t=7638
a post from Biostar is also very informative
http://www.biostars.org/p/2699/
Is it possible to create a similar picture without the bam files? I got a file with chromosome, start, end, distance, score. What is the best way to approach?
ReplyDeleteIs it a bed peak file generated by MACS or other peak calling software?
DeleteSorry for the delayed responds. I got MACS output file, with chromosome, start, end, tags ect. is it possible to use your Python script?
DeleteMACS outputs the bed file and that's the peak file, you can not use this script for this purpose. you need to use the raw read file bam file or the wiggle file.
DeleteThis comment has been removed by the author.
ReplyDeleteHii Tommy Tang
ReplyDeleteCan you please provide a python script for plotting the RNQ Seq data to find differentially expressed genes. Thank you !!
I really need these resource, are they all free for using? CP
ReplyDeleteYes, they are all free.
DeleteGreat jobs !
ReplyDeleteNow I know how to draw that kind of figures by myself .
Thank you very much.
All thanks to Mr Anderson for helping with my profits and making my fifth withdrawal possible. I'm here to share an amazing life changing opportunity with you. its called Bitcoin / Forex trading options. it is a highly lucrative business which can earn you as much as $2,570 in a week from an initial investment of just $200. I am living proof of this great business opportunity. If anyone is interested in trading on bitcoin or any cryptocurrency and want a successful trade without losing notify Mr Anderson now.Whatsapp: (+447883246472 )
ReplyDeleteEmail: tdameritrade077@gmail.com