Diving into Genetics and Genomics: October 2014

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, October 30, 2014

Test significance of overlapping genes from 2 gene sets

The question is very straightforward:
I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control). There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?

The easiest way is to use an online program http://nemates.org/MA/progs/overlap_stats.html
The program calculates the p value by using Hypergeometric_distribution.

see below for a python and an R script for this problem.

read posts here http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c
http://blog.nextgenetics.net/?e=94
https://www.biostars.org/p/90662/

list1=3000, list2=400, Total gene number (population)=15000,
overlapping between list1 and list2 =100

	#! /usr/bin/env python

	import sys
	import scipy.stats as stats

	#The result will be
	# a p-value where by random chance number of genes with both condition A and B will be <= to your number with condition A and B
	# a p-value where by random chance number of genes with both condition A and B will be >= to your number with condition A and B
	# The second p-value is probably what you want.

	if len(sys.argv) < 5:
	print '''use it by: python gene_sets_hypergeomtric_test.py [total number of genes in the list] [total number of genes in the list with condition A] [total number of genes in the list with condition B] [number of genes with both condition A and B]'''
	sys.exit()
	print
	print 'total number in population: ' + sys.argv[1]
	print 'total number with condition in population: ' + sys.argv[2]
	print 'number in subset: ' + sys.argv[3]
	print 'number with condition in subset: ' + sys.argv[4]
	print
	print 'p-value <= ' + sys.argv[4] + ': ' + str(stats.hypergeom.cdf(int(sys.argv[4]) + 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3])))
	print
	print 'p-value >= ' + sys.argv[4] + ': ' + str(stats.hypergeom.sf(int(sys.argv[4]) - 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3])))
	print

view raw gene_sets_hypergeometric_test.py hosted with ❤ by GitHub

	# phyper(overlap-1,list1,PopSize-list1,list2,lower.tail = FALSE, log.p = FALSE)

	phyper(100-1, 3000, 15000-3000, 400,lower.tail = FALSE, log.p = FALSE)

	#0.00784

view raw hypergeometric.r hosted with ❤ by GitHub

From the stackexchange link:
"The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)

These give 0.0078.

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more "

If I run with the python script, I got the same p-value:

tommy@tommy-ThinkPad-T420 ~/scripts_general_use
$ ./hyper_geometric.py 15000 3000 400 100

total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977

The bottom line is that we need more statistics!!

Thursday, October 23, 2014

convert bam file to bigwig file and visualize in UCSC genome browser in a Box (GBiB)

I just came across UCSC genome browser in a box Gbib, and I want to try it out.

what is Gbib?
from the link above:

"Genome Browser in a Box (GBiB) is a "virtual machine" of the entire UCSC Genome Browser website that is designed to run on most PCs (Windows, Mac OSX or Linux). GBiB allows you to access much of the UCSC Genome Browser's functionality from the comfort of your own computer. It is particularly directed at individuals who want to use the Genome Browser's functionality to view protected data. With the anticipation that the majority of protected data use will focus on the human genomes (primarily GRCh37/hg19, and transitioning to GRCh38/hg38), the current version of GBiB has been optimized for use with the hg19 assembly. Many other recent genome assemblies can also be viewed, but without mirroring of additional data, they may be slow."

Download the compressed file from here https://genome-store.ucsc.edu/ and follow the installation instructions. Open virtual box and start the virtual machine and then direct your web browser to 127.0.0.1:1234, you should see the UCSC genome browser page.

It looks the same as the UCSC genome browser page https://genome.ucsc.edu/. You can then go to the browser and do what you usually do in the genome browser. when I searched a gene, and errors came out:

nextRow failed
mySQL error 1194: Table 'rmsk' is marked as crashed and should be repaired (profile=db, host=localhost, db=hg19)

I asked the UCSC mail list, and I typed the command below in the console:
sudo mysqlcheck --all-databases --auto-repair --fast
it works like a charm now!

I have some bam files and I want to visualize them in the UCSC genome browser. I would first convert them to bigwig http://genome.ucsc.edu/goldenpath/help/bigWig.html and then upload to some open http https or ftp locations. see my old post:
hosting bigwig by Dropbox for UCSC

However, with GBiB one only needs to share a local folder containing the bigwig files with virtual box.

"Loading Local Big Data Tracks

Your computer can share directories with GBiB so that you can load big files without the need to upload them to a web server. The big file formats are compressed, indexed, binary files, and include bigBed,bigWig, BAM, and VCF files. Normally, one would have to place these types files onto a publicly accessible web server to upload them as a custom track. However, GBiB acts as its own web server, meaning that you can share local files with GBiB for easy uploading as a custom track."

I will start from converting bam to bigwig.
several posts you may want to have a look:
https://www.biostars.org/p/64495/#64680
https://www.biostars.org/p/2699/
http://onetipperday.blogspot.com/2012/07/three-ways-to-convert-bambed-file-to.html

I used genomeCoverageBed from bedtools to convert bam to bedgraph first, and then used UCSC utilities bedClip and bedGraphToBigWig http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to do the conversion.

bedtools requires the bam files be sorted. so samtools sort was used to sort the bam file first.
I used a script from Tao Liu (the author of famous MACS ChIP-seq calling software) to do the conversion after I got the bedgraph files. I then put the bigwig files in a folder and shared it with virtual box. Follow the instructions http://genome.ucsc.edu/goldenPath/help/gbib#LocalTracks. I successfully visualized a bunch of bigwig files!

	#! /bin/bash

	for bam in *bam
	do
	echo $bam
	genomeCoverageBed -ibam $bam -bg -g hg19.genome.info > $(basename $bam .bam).bdg
	done

	for bdg in *bdg
	do
	echo $bdg
	bdg2bw $bdg hg19.genome.info
	done

view raw bam2bw.sh hosted with ❤ by GitHub

	#!/bin/bash

	# this script is from Tao Liu https://gist.github.com/taoliu/2469050
	# check commands: slopBed, bedGraphToBigWig and bedClip

	which bedtools &>/dev/null \|\| { echo "bedtools not found! Download bedTools: <http://code.google.com/p/bedtools/>"; exit 1; }
	which bedGraphToBigWig &>/dev/null \|\| { echo "bedGraphToBigWig not found! Download: <http://hgdownload.cse.ucsc.edu/admin/exe/>"; exit 1; }
	which bedClip &>/dev/null \|\| { echo "bedClip not found! Download: <http://hgdownload.cse.ucsc.edu/admin/exe/>"; exit 1; }

	# end of checking

	if [ $# -lt 2 ];then
	echo "Need 2 parameters! <bedgraph> <chrom info>"
	exit
	fi

	F=$1
	G=$2

	bedtools slop -i ${F} -g ${G} -b 0 \| bedClip stdin ${G} ${F}.clip

	bedGraphToBigWig ${F}.clip ${G} ${F/bdg/bw}

	rm -f ${F}.clip

view raw bdg2bw.sh hosted with ❤ by GitHub

Wednesday, October 22, 2014

speed up grep

I had a list of gene names in txt file. There are around 500 genes with one gene name in one line, and I want to filter the gtf file from ensemble Homo_sapines.GRCh37.74gtf.gz

the gtf file contains 2244857 lines. I used grep to do it, but it takes very long (~1 hour).

what I used:

zcat Homo_sapines.GRCh37.74gtf.gz | grep -f gene_names.txt -w > my_genes.gtf

I searched on line, and found several posts in stackoverflow to speed up grep:
http://stackoverflow.com/questions/14602963/faster-grep-function-for-big-27gb-files
http://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up
http://stackoverflow.com/questions/9066609/fastest-possible-grep

options to speed up:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression

I then used:

zcat Homo_sapines.GRCh37.74gtf.gz | LC_ALL=C fgrep -f gene_names.txt -w > my_genes.gtf

It runs much faster!

Diving into Genetics and Genomics

My github papge