Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Showing posts with label hypergeometric. Show all posts
Showing posts with label hypergeometric. Show all posts

Thursday, October 30, 2014

Test significance of overlapping genes from 2 gene sets

The question is very straightforward:
I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control).  There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?

The easiest way is to use an online program http://nemates.org/MA/progs/overlap_stats.html
The program calculates the p value by using Hypergeometric_distribution.

see below for a python and an R script for this problem.

read  posts here http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c
http://blog.nextgenetics.net/?e=94
https://www.biostars.org/p/90662/

list1=3000, list2=400, Total gene number (population)=15000,
overlapping between list1 and list2 =100
From the stackexchange link:
"The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
These give 0.0078.
dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.
phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).
The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more "

If I run with the python script, I got the same p-value:
tommy@tommy-ThinkPad-T420 ~/scripts_general_use
$ ./hyper_geometric.py 15000 3000 400 100

total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977
The bottom line is that we need more statistics!!