Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Tuesday, April 30, 2013

make an average conservation plot based on ChIP-seq data

I asked this question at SeqAnswers:
http://seqanswers.com/forums/showthread.php?t=29535

I am saving it here for later reference. Thanks.



How to make an average conservation plot from ChIP-seq data

Hi all,

I want to know how to make an average conservation plot.
http://ceas.cbi.pku.edu.cn/ can make something like what I want, but it can only plot
one set of regions at a time; I want to plot two sets of regions and compare them.

" 2. GC content and evolutionary conservation of each ChIP-region and their
average. CEAS uses PhastCons conservation scores from UCSC Genome
Bioinformatics, which is based on multiz alignment of human, chimp, mouse,
rat, dog, chicken, fugu, and zebrafish genomic DNA. CEAS generates thumbnail
conservation plot for each ChIP-region and the average conservation plot for
all the ChIP-regions, which can be directly used in ChIP-chip biologists'
manuscript.
"


Are there any Python scripts or Bioconductor packages that can do this?

Thanks
Reply from paolo.kunder (04-22-2013, Milano, Italy):

Hi,

I encountered the same problem a few weeks ago.

It seems that there are not many experts on PhastCons in this forum.

Let me explain what I did:

I took my ChIP-seq regions in BED format and intersected them with the PhastCons elements (mouse example) from

Code:
http://hgdownload-test.cse.ucsc.edu/goldenPath/mm9/phastCons30way/

and calculated the conservation score for each genomic position.
Then I averaged all the scores.

Note that you will need to transform the PhastCons file format a bit,

for example:

original format (with some words stripped out):
Code:
chrom=chr1 start=3000306
0.006
0.010
0.014
chrom=chrX start=40000306
0.014
chrom=chr9 start=80000306 
0.1
0.2
processed format
Code:
chr1 3000306 3000307 0.006
chr1 3000307 3000308 0.010
chr1 3000308 3000309 0.014
chrX 40000306 40000307 0.014
chr9 80000306 80000307 0.1
chr9 80000307 80000308 0.2
with the following script:
Code:
awk '/^chrom/{split($1,a,"=");split($2,b,"=");next} {printf "%s\t%d\t%d\t%f\n",a[2],b[2],b[2]+1,$1; b[2]++}' filename
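
To make the averaging step concrete, here is a minimal Python sketch (not from the original post; the file names and window size are made up) that reads the processed per-base BED from above and builds an average conservation profile around each peak midpoint:
Code:
from collections import defaultdict

window = 500  # bp on each side of each peak midpoint

# load per-base phastCons scores: chrom -> {position: score}
# (fine for a small file; whole-genome data would need an indexed format)
score = defaultdict(dict)
for line in open("phastcons.perbase.bed"):
    chrom, start, end, s = line.split()
    score[chrom][int(start)] = float(s)

profile = [0.0] * (2 * window)
counts = [0] * (2 * window)  # how many regions contribute at each offset
for line in open("peaks.bed"):
    chrom, start, end = line.split()[:3]
    mid = (int(start) + int(end)) // 2
    for i, pos in enumerate(range(mid - window, mid + window)):
        if pos in score[chrom]:
            profile[i] += score[chrom][pos]
            counts[i] += 1

# average conservation at each offset relative to the peak midpoint
avg = [p / c if c else 0.0 for p, c in zip(profile, counts)]
print avg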


Another possibility that I am investigating is using Circos:
Code:
http://circos.ca/
but you will need to study how the software works first.

Cheers,
Paolo

batch converting coordinates to sequences

If you have a file that contains chromosome coordinates (let's say a BED file, https://genome.ucsc.edu/FAQ/FAQformat.html#format1, from a
ChIP-seq peak call), how can you get the sequences for the corresponding coordinates? (You may need the sequences for motif analysis with MEME: http://meme.nbcr.net/meme/intro.html.)

1. A simple way is to use the Table Browser in the UCSC Genome Browser.
Extracting sequence in batch from an assembly
Question: 
"I have a lot of coordinates for an assembly and want to extract the corresponding sequences. What is the best way to proceed?
Response:
There are two ways to extract genomic sequence in batch from an assembly:
A. Download the appropriate fasta files from our ftp server and extract sequence data using your own tools or the tools from our source tree. This is the recommended method when you have very large sequence datasets or will be extracting data frequently. Sequence data for most assemblies is located in the assembly's "chromosomes" subdirectory on the downloads server. For example, the sequence for human assembly hg17 can be found in ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/. You'll find instructions for obtaining our source programs and utilities here. Some programs that you may find useful are nibFrag and twoBitToFa, as well as other fa* programs. To obtain usage information about most programs, execute the program without arguments.
B. Use the Table browser to extract sequence. This is a convenient way to obtain small amounts of sequence.
  1. Create a custom track of the genomic coordinates in BED format and upload into the Genome Browser.
  2. Select the custom track in the Table browser, then select the "sequence" output format to retrieve data. We recommend that you save the file locally as gzip.

2. Use bedtools:


$ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> -fo <output FASTA>
but you will need to download the whole-genome sequence in FASTA format first.
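
If you prefer to stay in Python, the pybedtools package (mentioned later in this blog) wraps the same bedtools call. A minimal sketch, assuming pybedtools and bedtools are installed, with hypothetical file names:

import pybedtools

peaks = pybedtools.BedTool("peaks.bed")
peaks = peaks.sequence(fi="hg19.fa")  # runs bedtools getfasta under the hood
print open(peaks.seqfn).read()        # FASTA records for the intervals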


3. Another option is the Python package pygr: https://code.google.com/p/pygr/
>>> from pygr import worldbase
>>> worldbase.dir('Bio.Seq.Genome.HUMAN')
['Bio.Seq.Genome.HUMAN.hg17', 'Bio.Seq.Genome.HUMAN.hg18', 'Bio.Seq.Genome.HUMAN.hg19']


>>> hg17 = worldbase.Bio.Seq.Genome.HUMAN.hg17()

>>> len(hg17)
46
>>> hg17.keys()
['chr6_random', 'chr19_random', 'chr8_random', 'chrY', 'chrX', 'chr13', 'chr12', 'chr11', 'chr15_random', 'chr17', 'chr16', 'chr15', 'chr14', 'chr19', 'chr18', 'chrM', 'chr1_random', 'chr13_random', 'chr3_random', 'chr6_hla_hap2', 'chr9_random', 'chr22_random', 'chr10', 'chr4_random', 'chr18_random', 'chr2_random', 'chr22', 'chr20', 'chr21', 'chr10_random', 'chr6_hla_hap1', 'chr7', 'chr6', 'chr5', 'chr4', 'chr3', 'chr2', 'chr1', 'chr7_random', 'chrX_random', 'chr9', 'chr8', 'chr16_random', 'chr5_random', 'chr17_random', 'chr12_random']
>>> chr1 = hg17['chr1']
>>> len(chr1)
245522847
>>> s = chr1[100000000:100001000]
>>> len(s)
1000
>>> repr(s)
'chr1[100000000:100001000]'
>>> print s
ATTCAGGCAATGTTGACTTCATTAGTGTTGACAGAAAATAAGCATAAAAATGCAAAACATTGTTGGCTTAACCTGAATATACCTGCATTACCTATTATTAACCAGTTTGTGTTGCCAAAAGACTTATTCCTTGGCATTAAAATGGAGCACTTAAAAATATTTCTAAAAAGCAAATGCCCACACGCTGGTCTTTGCAGCAAATAAGGGTATTTATACTTTTAAAATATTTTAAGTCCATAATTGGATTAATATACACACCTTCTTATGTATAAGGAGTTCAGATCATATAAACACCGTACAATCCAAAAAACCCTACTGAGAATAAAACTAAATAGGCTTATGATAAGAAATACAGATATTCCCATGTATTTACAAATATCATAGACACACAAATTTGGTCAAATACTGTAAAGAAAGAAGAAGtacctgtacctctacctctacctctacctcctctacctcctctacctcctctacctcctctacctcctctacctcctcttcctctacctctacctctacctctacctctacccacggtctccctttccctctctttccacggtctccctctgatgccgagccgaagctggactgtactgctgccatctcggctcactgcaacctccctgcctgattctcctgcctcaacctgccgagtgcctgcgattgcaggcgcgcgccgccacgcctgactggttttcgtatttttttggtggagacggggtttcgctgtgttggccgggctggtctccagctcctaacctcgagtgatccgccagcctcggcctcccgaggtgccgggtttgcagaggagtctcattcactcagtgctcaatggtgcccaggctggagtgcagtggcgtgatctcggctcgctacaacctccacctcccagcagcctgccttggcctcccaaagtgccgagattgcagtctccgcccggctgccaccccatctgggaagtgaggagcatctctgcctggccgcccatcgtctg

You can easily pull out sequences based on chromosome name, start, and end.
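
For example, here is a minimal sketch (hypothetical file names) that fetches the sequence for every interval in a BED file using the hg17 object from above:

out = open("peaks.fa", "w")
for line in open("peaks.bed"):
    chrom, start, end = line.split()[:3]
    seq = hg17[chrom][int(start):int(end)]  # lazy slice; data fetched on str()
    out.write(">%s:%s-%s\n%s\n" % (chrom, start, end, str(seq)))
out.close()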

" Again, note that all of these operations were more or less instantaneous (even over the network). These Pygr objects (chr1s) are just abstract representations of the specified pieces of the human genome, so creating these objectsdoes not require Pygr to download all of human chromosome 1. Only when you force it to show a sequence string does Pygr obtain that data over the network – and then only the piece that you need."



Tuesday, April 23, 2013

slow google chromium in ubuntu solved

The Google Chromium browser on my Ubuntu 12.04 LTS box frequently dies, and I have to restart my computer.

google is the best teacher :)
http://falcon1986.wordpress.com/2010/06/01/how-to-speed-up-google-chrome-on-ubuntu/


[HOW-TO] Speed up Google Chrome on Ubuntu

Although I use Mozilla Firefox more than I use Google Chrome, the latter does come in handy when testing how websites display on a different browser or when there is need to diagnose internet connectivity problems. On Windows, Google Chrome is fast; perhaps even faster than Mozilla Firefox. However, on Ubuntu 9.10, I have noticed that it can be a little slow to initiate or even load a simple web page. In an effort to find a solution to this problem I stumbled across quite a few user-submitted posts that described a simple fix. After applying the fix, there was a significant improvement. Read on to find out how you can speed up Google Chrome on Ubuntu.
Apparently, the slow response of Google Chrome on Ubuntu has to do with the slow DNS lookup of this browser. These instructions should not be limited to Chrome on Ubuntu 9.10 alone and should work on any version of Ubuntu where the latest version of Chrome is installed.
  1. Close any instance of Google Chrome that might be running.
  2. Open a terminal window by going to Applications > Accessories > Terminal.
  3. Enter the following command and press ENTER. It will open nsswitch.conf in gedit.
    sudo gedit /etc/nsswitch.conf
  4. Locate the following line within the file:
    hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4
  5. Replace that line with the following, save the file and exit gedit and the terminal.
    hosts: files dns
  6. Start up Google Chrome and enjoy a faster browsing experience!
----------------------------------------------------------------------------------------------------------
it worked for me.



Monday, April 22, 2013

awk for simple text manipulation

let's say you have a bed file (tab delimited):


tommy@tommy-ThinkPad-T420:~$ cat file.bed 
chr1     100 302
chr2     600 901
chr3     250 383

you want to calculate the average peak length of this file:


tommy@tommy-ThinkPad-T420:~$ cat file.bed | awk '{print $3-$2}'| awk '{sum+=$0} END {print "Average= " sum/NR}'
Average= 212


if you calculate the middle point of the peak, $2 + ($3 - $2)/2, you may get a float and want to round the column:
cat file.bed | awk '{print $2+($3-$2)/2}'
201
750.5
316.5

tommy@tommy-ThinkPad-T420:~$ cat file.bed | awk '{print $2+($3-$2)/2}'| awk 'function round (A) { return int(A+0.5)} { printf("%d\n", round($0))}'
201
751
317

if you want to add an artificial column with peak1, peak2, peak3.....

cat file.bed | awk '{print $1"\t"$2"\t"$3"\t""peak"NR}'
chr1	100	302	peak1
chr2	600	901	peak2
chr3	250	383	peak3

if you want to change a space delimited file to a tab delimited file

cat foo.txt | awk '{print $1"\t"$2"\t"$3}' > newfile.txt

you have two files, each with three columns, and you want to subtract the third column of file1 from the second column of file2 (after paste, file2's second column is $5):

paste foo1.txt foo2.txt | awk '{print $5 - $3}'



you want to cut column 3 from file2 and column 1 from file1, and paste them together:

paste <(cut -f3 foo2.txt)  <(cut -f1 foo1.txt)


Saturday, April 20, 2013

Liftover between different Genome assemblies (hg18 to hg19)

see a post here:

http://manuelcorpas.com/2011/02/02/838/


Given the huge response I have had at work about remapping features onto another assembly, I present here an adapted version of how to remap a feature from NCBI36/hg18 to GRCh37/hg19 using UCSC’s liftOver tool.

Important:

Please make sure you know in advance which assembly your aberration data is currently mapped to. If by mistake you remap an aberration that is already in GRCh37 to GRCh37, you will get new coordinates for the region, mapped to the wrong position.
UCSC’s Genome Browser provides a web facility to convert coordinates from one assembly into another. To convert coordinates using their liftOver tool do the following:
  1. Make sure that your data is in BED format, e.g.  “chr3     100000  999990  myPatientId0000123” –> aberration in NCBI36/hg18
  2. Note that each field is separated by a tab and each line by a character return. Please follow this strictly or the remapping tool may throw an error.
  3. Add as many lines as aberrations you would like to remap.
  4. Go to the liftOver page
  5. Select “Original Assembly” Mar. 2006 (NCBI36/hg18) and “New Assembly” Feb. 2009 (GRCh37/hg19)
  6. Leave all other parameters (Minimum ratio of bases that must remap, etc) with default values
  7. Paste your aberration in the input box where it says “Paste in data” and hit submit
  8. To get results, scroll down the page and click on the “View Conversions” link.
  9. Here is the result I get:
chr3  125000      1024990     myPatientId0000123
Please note that your feature may not remap because the region is partially or entirely deleted in the new assembly, or split in GRCh37. In this case I recommend that you use another start or end point position, maybe using the start/end of alternative probes until you find a region where it maps. Another possibility would be to look at the genes in the region in the old assembly and select a region in GRCh37 that includes the same genes as in NCBI36. Each of these solutions requires careful deliberation and may not be applicable to your particular case (e.g. genes in different chromosomes would not allow remapping based on genes).
I hope this is helpful.

Tips for Remapping from NCBI36 to GRCh37 Genome Assembly

July 28, 2009
It might seem straightforward to some people, but I had to spend quite some time trying to understand how to remap my array probes from ncbi36 to GRCh37. If you use the Ensembl genome browser, you might have noticed that as of July 2009 the ncbi37 assembly is now in use. For DECIPHER (the database I help develop), this is a bit of a headache, because it means that all of the probes from the array CGH that we used have to be remapped to the new assembly. If this does not interest you, I recommend that you stop reading here.
First I learned that there is a program called liftOver by UCSC that is able to do this remapping. Since the number of probes I have to map (around 6 million) is not something I would wish to throw at anyone’s server, I decided to do this in-house. You can download this program from here. I did not know which was the right binary to download, as they had linux32 and linux64 versions. I decided to go for the former, since I am using debian and it sounds like a conservative option.
Once I downloaded the program, I needed to make it executable:
chmod u+x liftOver
OK, so I was in a position to run it:
./liftOver
In the usage information it appears that I need several arguments and files to be able to run this program correctly:
liftOver oldFile map.chain newFile unMapped
Now I learned that I also need to get a file called the map.chain. I was not sure what it meant. I learned that this map.chain file has the parameters used by liftOver, and that there are different map.chain files depending on the remapping one wants to do. In my case, I want to remap from ncbi36 to GRCh37 in human. However, when I looked at the different remappings, I did not see ncbi names anywhere. I learned here that what I am looking for is a map chain file called:
hg18toHg19.over.chain
Apparently hg18 refers to ncbi36 and hg19 to ncbi37. Doing a Google search, I found that file here.
Now I get quite a few options and learn that I need to have my probes in bed format to run liftOver. Apparently there are quite a few formats I can use, according to the UCSC FAQ on formats. Here is an example of what my bed file looks like (chromosome-tab-start_position-tab-end_position):
chrY       12308579        12468100
chrY       12468101        12581699
chrY       12581700        12759636
chrY       12759637        12838587
Now I am in a position to run liftOver. I notice now that in the usage one has the following description:
liftOver oldFile map.chain newFile unMapped
‘newFile’ and ‘unMapped’ are the names of the files the output goes into, and therefore they start out empty. This can be confusing, as the user might think that these are some other kind of files one has to get hold of.
OK, so now I am ready to transform our old array probe mapping ncbi36 to the new ncbi37 one:
./liftOver probes.ncbi36 hg18toHg19.over.chain probes.grch37 unmapped-to-grch37
I got the following output to console:
Reading liftover chains
Mapping coordinates
ERROR: start coordinate is after end coordinate (chromStart > chromEnd) on line 5171240 of bed file probes.decipher.ncbi36
ERROR: 4 2515512 2515453
…which is a bit worrying.
I’ve gone through my probes and found that some of them (just 44757!) had start coordinates greater than their ends. I guess that if you encounter those you’ll have to decide what to do. For the time being I just took them out and re-ran liftOver.
This time it worked.
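
If you hit the same error, here is a quick sketch (not from the original post) for dropping BED lines whose start is greater than their end before running liftOver; redirect stdout to a clean file and feed that to liftOver:

import sys

# keep only BED lines with start <= end; send the bad ones to stderr for review
for line in open("probes.ncbi36"):
    fields = line.split()
    if int(fields[1]) <= int(fields[2]):
        sys.stdout.write(line)
    else:
        sys.stderr.write(line)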
--------------------------------------------------------------------------------------------------------------------------------------------------------
You can find the liftOver chain files on the UCSC downloads server.
Web-based: Galaxy (https://main.g2.bx.psu.edu/) also has a tool for this purpose.

Tuesday, April 16, 2013

Statistical measure of biological significance for overlapping genomic regions

Updated 06/25/13: look at this Python package; it answers this question:

http://www.cgat.org/~andreas/documentation/gat/

The paper was published in Bioinformatics:

GAT: a simulation framework for testing the association of genomic intervals


--------------------------------------------------------------------------------------------------------


I had the same question as the title of this blog post.

A simple Google search turned up these discussions:
http://www.biostars.org/p/8480/
and here:
http://www.biostars.org/p/5484/

https://github.com/brentp/bio-playground/blob/master/utils/list_overlap_p.py
is the Python script for the hypergeometric test:


import scipy.stats as ss

def hypergeom(m, n, n1, n2, p=False):
    """
>>> hypergeom(1, 1000, 1000, 1000) # has to be shared.
1.0
>>> all(hypergeom(i, 1000, 1000, 1000) == 1.0 for i in range(100))
True
>>> hypergeom(1, 30000, 20, 20)
0.013253396616299651
>>> hypergeom(2, 30000, 20, 20)
7.9649366037104485e-05
>>> hypergeom(11, 30000, 20, 20)
4.516176321800458e-11
>>> hypergeom(10, 30000, 20, 20) # very low prob.
4.516176321800458e-11
>>> hypergeom(20, 30000, 20, 20) # very low chance that all are shared.
4.516176321800458e-11
"""
    if m <= 0: return 1.0
    mmin = m - 1
    mmax = min(n1, n2)
    return ss.hypergeom.cdf(mmax, n, n1, n2) - ss.hypergeom.cdf(mmin, n, n1, n2)
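
To spell out the parameters (matching the calls in the docstring above): m is the number of shared items, n is the size of the universe, and n1 and n2 are the sizes of the two lists. A usage sketch with made-up numbers:

# P(observing >= 120 shared peaks by chance) for two lists of 1500 and 2000
# peaks drawn from a universe of 30000 possible locations (numbers made up)
p = hypergeom(120, 30000, 1500, 2000)
print p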
"Also, checkout randomstats in pybedtools by Ryan Dale." 
The R package ChIPpeakAnno can do a similar job.
R code:
library(ChIPpeakAnno)
TF1.bed <- read.table(file="TF1.txt", header=FALSE)
TF1.rangedData = BED2RangedData(TF1.bed)
TF2.bed <- read.table(file="TF2.txt", header=FALSE)
TF2.rangedData = BED2RangedData(TF2.bed)
t1 <- findOverlappingPeaks(TF1.rangedData, TF2.rangedData, maxgap=200,
    NameOfPeaks1="TF1", NameOfPeaks2="TF2", select="all")
makeVennDiagram(RangedDataList(TF1.rangedData, TF2.rangedData),
    NameOfPeaks=c("TF1", "TF2"), maxgap=0, totalTest=50000, cex=1,
    counts.col="red", useFeature=FALSE)
Here is a discussion about the choice of the totalTest number:
"It would be helpful to describe your problem and post to the whole message board. (There are many experts who probably can be more helpful than myself :-)) That said, I think you are referring to the "NaN" error and below are my thoughts (Julie Zhu also answered this a couple of times and her reply is probably in the archives).
When calling the makeVennDiagram function you want to set the totalTest number to something that is larger than the experimentally determined peak number.  As far as I know, the totalTest number is used for the hypergeometric sampling that is used to determine if the overlap between two datasets is more than would be expected by chance.  So one way to sort this out using biological information is to think about the maximum number of possible binding events and use that as the totalTest number.  For example, if you are studying a sequence-specific DNA binding protein with a known motif you could count that number of times that motif occurs in the genome and compare that to the number of peaks you have experimentally determined.

Motifs = 500
Peaks = 200
Peaks w/ motif = 180 (90%)
"upper limit"    = 500
new "upper limit" for totalTest = .9 x 500 = 450

Now if you're working with a sequence-independent binding factor, it can get tricky.  One approach would be to determine the mean peak width.  Then divide the whole genome sequence by this number to get an upper limit.  This is probably way too high, so using additional information, such as whether the protein binds intergenic regions or ORFs, could bring the number down but make it more relevant to the biological experiment.  For example:

peaks   = 75
intergenic peaks = 70
ORF peaks  = 5
mean peak width = 50 base pairs
genome size  = 10000 base pairs
"upper limit"   = 10000/ 50 = 200 (possible peaks)
intergenic seq  = 4000 base pairs
new "upper limit" =  4000/50 = 80 (possible intergenic peaks)

I was working with something more like the second case, and I felt the totalTest based on the total genome was quite relaxed while the one based on the intergenic sequence only was quite stringent, so somewhere in the middle might be better. Most importantly, I feel I am standing on some solid biological reasoning for determining the amount of sampling.

Hope this helps, and I would be interested to hear if anybody has some critiques of this approach or additional suggestions.

Best,

Noah
"

Monday, April 8, 2013

custom sort in python, the built-in sorted()

Let's say you have a list and you want to sort it. You can use the built-in function sorted().


Help on built-in function sorted in module __builtin__:

sorted(...)
    sorted(iterable, cmp=None, key=None, reverse=False) --> new sorted list
(END)


It takes any iterable as input.
For example:

>>> a=[4,2,1,6]
>>> sorted(a)
[1, 2, 4, 6]

>>> sorted(a, reverse=True)
[6, 4, 2, 1]

>>> a=['ccc','aaaa','d','bb']
>>> sorted(a)
['aaaa', 'bb', 'ccc', 'd']

If you want to sort according to the length of the strings, pass the len function as the key argument:
>>> sorted(a, key=len)
['d', 'bb', 'ccc', 'aaaa']

>>> a=['ccc','aaaz','a','bb']
If you want to sort according to the last letter of each string:

>>> def last(s):
...     return s[-1]
... 
>>> sorted(a, key=last)
['a', 'bb', 'ccc', 'aaaz']


Let's say you have a list of tuples. The sorted function sorts according to the first element
of each tuple; if those are equal, it compares the second element.

>>> a=[(1,'b'),(2,'a'),(1,'a')]
>>> sorted(a)
[(1, 'a'), (1, 'b'), (2, 'a')]

So, if you want to sort by one thing first and then by another, you can write
a key function that returns a tuple, as shown below.
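
For example, a minimal sketch that sorts strings by length first and alphabetically within the same length:

>>> words = ['ccc', 'aaaa', 'd', 'bb', 'cc']
>>> sorted(words, key=lambda s: (len(s), s))
['d', 'bb', 'cc', 'ccc', 'aaaa']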


Saturday, April 6, 2013

python scripts for real bioinformatic problems

https://github.com/chapmanb
Practical, useful Python scripts for NGS analysis, etc.

http://gettinggeneticsdone.blogspot.com/
follow this blog

How to make a TSS plot using RNA-seq and ChIP-seq data

Many times we want to plot the ChIP-seq signal around TSSs on a genome-wide scale.
I figured out several ways to do it, using the ENCODE K562 H3K4me3 ChIP-seq data as an example.

1.  Using HTSeq python package 

1) Download the H3K4me3 ChIP-seq and RNA-seq data sets.
wgEncodeBroadHistoneK562H3k4me1StdAlnRep1.bam                   29-Oct-2010 11:03  919M 

2) Prepare the GTF file for hg19:
http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
And set the permissions:
$ chmod 600 .hg.conf
Now you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file refGene.gtf:
$ genePredToGtf -utr hg19 refGene refGene.gtf
Add the -utr option to include 5'UTR information, which contains the TSS position.
For RNA-seq data analysis we need refGene.gtf (it contains TSSs and exons).
For ChIP-seq plotting at TSSs, we need to extract the TSSs from refGene.gtf.
This can be done easily on the Linux command line:
cat refGene.gtf | grep 5UTR > hg19.TSS.gtf
3) Group the genes into high, medium, and low expression by analyzing the RNA-seq data.
#RNA-Seq analysis
#The following block is to sort the gene expression data into three groups based on level of expression.
#Use python to count the number of tags of genes obtained via RNA-Seq and write that information into a new output file.
The BAM file is converted to a SAM file with samtools, because the htseq-count script only takes SAM input.
samtools view -h -o K562RNAseqRep1V2.sam K562RNAseqRep1V2.bam
python -m HTSeq.scripts.count -s no K562RNAseqRep1V2.sam hg19_UTR_exon.gtf > K562_htseq.count.out
#-s no means the data is unstranded; htseq-count adds up, for each gene, the tags falling in its exons.
#Once we have the K562_htseq.count.out file, first count the number of genes in it.
cat K562_htseq.count.out | head -n -5 > K562_htseq.count.out.clean # drop the last five lines, which are a summary of the count results
cat K562_htseq.count.out.clean | wc -l
#For the working file we had 23705 counts.
#So each group will have 1/3 of the total (7902 genes) and highest tag density will be the top 33% followed by mid 33% and low 33%.
#Now sort the results into three groups depending on tag density in Linux command line.
#Sort the file according to the second column and group into three groups.
cat K562_htseq.count.out.clean | sort -k2,2nr| head -7902 > top33_percent.txt
cat K562_htseq.count.out.clean | sort -k2,2nr| tail -15803 | head -7902 > mid33_percent.txt
cat K562_htseq.count.out.clean | sort -k2,2nr| tail -7902 > low33_percent.txt
4) Python code:
Update on 10/20/13: I put my code in a gist and embedded it here.
It generates the figure below. A sketch of the approach follows.
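
Since the embedded gist does not reproduce here, below is a minimal sketch along the lines of the TSS-plot example in the HTSeq documentation (file names are hypothetical; to get one curve per expression group, filter the TSS set by the gene lists from step 3):

import HTSeq
import numpy
from matplotlib import pyplot

halfwinwidth = 3000
fragmentsize = 200

# genome-wide ChIP-seq coverage, with reads extended to the fragment size
coverage = HTSeq.GenomicArray("auto", stranded=False, typecode="i")
for almnt in HTSeq.BAM_Reader("K562H3k4me3.sorted.bam"):
    if almnt.aligned:
        almnt.iv.length = fragmentsize
        coverage[almnt.iv] += 1

# collect TSS positions from the 5'UTR GTF prepared in step 2
tsspos = set()
for feature in HTSeq.GFF_Reader("hg19.TSS.gtf"):
    tsspos.add(feature.iv.start_d_as_pos)

# sum coverage in a window around every TSS, flipping minus-strand genes
profile = numpy.zeros(2 * halfwinwidth, dtype="i")
for p in tsspos:
    if p.pos < halfwinwidth:
        continue  # skip TSSs too close to a chromosome start
    window = HTSeq.GenomicInterval(p.chrom, p.pos - halfwinwidth,
                                   p.pos + halfwinwidth, ".")
    wincvg = numpy.fromiter(coverage[window], dtype="i",
                            count=2 * halfwinwidth)
    if p.strand == "+":
        profile += wincvg
    else:
        profile += wincvg[::-1]

pyplot.plot(numpy.arange(-halfwinwidth, halfwinwidth), profile)
pyplot.savefig("tss_plot.png")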
2. Use the ngsplot package.
Follow the installation instructions, then at the command line:
ngs.plot.r -G hg19 -R genebody -C  config.k562.txt -O K562.H3k4me3.genebody -T H3k4me3.genebody -L 3000 -FL 300
# this is different from the plotting in method 1, I am plotting across the gene body,
# you can also use -R tss  to produce the same plot above.
# the bam file must be sorted by samtools first
the config.k562.txt file is like this (tab delimited):
K562H3k4me3.sorted.bam	top33_percent.txt	"high"
K562H3k4me3.sorted.bam	mid33_percent.txt	"med"
K562H3k4me3.sorted.bam	low33_percent.txt	"low"
It generates a TSS plot and a heatmap.
===============================
Update on 11/19/13:
See this post on Biostar: http://www.biostars.org/p/83800/. It is very slow, though; I tried it on my desktop (~4 GB RAM, one core) and it took some time to finish.

Thursday, April 4, 2013

Must-read blog

http://bcbio.wordpress.com/2009/04/29/finding-and-displaying-short-reads-clustered-in-the-genome/

http://bcbio.wordpress.com/2009/07/26/sorting-genomic-alignments-using-python/

http://bcbio.wordpress.com/2009/02/07/automated-protein-conservation-display-from-blast-alignments/

http://bcbio.wordpress.com/2010/02/15/python-query-interface-to-biogateway-sparql-endpoint-and-intermine/

http://bcbio.wordpress.com/2010/01/02/automated-retrieval-of-expression-data-with-python-and-r/

http://bcbio.wordpress.com/2009/02/13/organization-of-literature-using-pubmed-related-articles/

http://bcbio.wordpress.com/2009/01/31/location-and-duplication-information-from-ensembl/

http://bcbio.wordpress.com/2009/01/10/extracting-protein-characteristics-for-an-interpro-domain/

http://bcbio.wordpress.com/2009/01/04/extracting-keywords-from-biological-text-using-zemanta/

Cool use of Unix paste with NGS sequences: FASTQ to FASTA


http://thegenomefactory.blogspot.com.au/search?updated-max=2012-09-08T12:11:00%2B10:00&max-results=7

Cool use of Unix paste with NGS sequences

While browsing SeqAnswers.com today I came across a post where Uwe Appelt provided a couple of lines of Unix shell wizardry to solve some problem. What attracted my attention was the following:
paste - - - - < in.fq | filter | tr "\t" "\n" > out.fq
Now, I've done a reasonable amount of shell one-liners in my life, but I'd never seen this before.  I've used the paste command a couple of times, but clearly its potential power did not sink in! Here is the man page description for paste:
Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output. With no FILE, or when FILE is -, read standard input.
So what's happening here? Well, in Unix, the "-" character means to use STDIN instead of a filename. Here Uwe is providing paste with four filenames, each of which is the same stdin filehandle. So lines 1..4 of in.fq are put onto one line (with tab separator), and lines 5..8 on the next line and so on. Now, our stream has the four lines of each FASTQ entry on a single line, which makes it much more amenable to Unix line-based manipulation, represented by filter in my example. Once that's all done, we need to put it back into the standard 4-line FASTQ format, which is as simple as converting the tabs "\t" back to newlines "\n" with the tr command.

Example 1: FASTQ to FASTA



A common thing to do is convert FASTQ to FASTA, and we don't always have our favourite tool or script to do this when we aren't on our own servers:


paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa
  1. paste converts the input FASTQ into a 4-column file
  2. cut command extracts out just column 1 (the ID) and column 2 (the sequence)
  3. sed replaces the FASTQ ID prefix "@" with the FASTA ID prefix ">"
  4. tr converts the 2 columns back into 2 lines

And because the shell command above uses a pipe connecting four commands (paste, cut, sed, tr) the operating system will run them all in parallel, which will make it run faster assuming your disk I/O can keep up. 
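
And for the times when even paste is not at hand, the same conversion is a few lines of Python (a sketch; file names hypothetical):

import itertools

with open("in.fq") as fq, open("out.fa", "w") as fa:
    while True:
        record = list(itertools.islice(fq, 4))  # one FASTQ record = 4 lines
        if len(record) < 4:
            break
        fa.write(">" + record[0][1:])  # swap the leading "@" for ">"
        fa.write(record[1])            # the sequence line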

Example 2: Removing redundant FASTQ ID in line 3

The third line in the FASTQ format is somewhat redundant - it is usually a duplicate of the first line, except with "+" instead of "@" to denote that a quality string is coming next rather than an ID. Most parsers ignore it, and happily accept a blank ID after the "+", which saves a fair chunk of disk space. If you have legacy files with the redundant IDs and want to convert them, here's how we can do it with our new paste trick:

paste -d ' ' - - - - < in.fq | sed 's/ +[^ ]*/ +/' | tr " " "\n" > out.fq
  1. paste converts the input FASTQ into a 4-column file, but using SPACE instead of TAB as the separator character
  2. sed finds and replaces the "+DUPE_ID" line with just a "+"
  3. tr converts the 4 columns back into 4 lines

That's it for today, hope you learnt something, because I certainly did.