Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, October 30, 2014

Test significance of overlapping genes from 2 gene sets

The question is very straightforward:
I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control).  There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?

The easiest way is to use an online program
The program calculates the p value by using Hypergeometric_distribution.

see below for a python and an R script for this problem.

read  posts here

list1=3000, list2=400, Total gene number (population)=15000,
overlapping between list1 and list2 =100
From the stackexchange link:
"The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
These give 0.0078.
dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.
phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).
The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more "

If I run with the python script, I got the same p-value:
tommy@tommy-ThinkPad-T420 ~/scripts_general_use
$ ./ 15000 3000 400 100

total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977
The bottom line is that we need more statistics!!

1 comment:

  1. Alternative way to do it would be to use fisher.test in R :
    reprfact # representation factor
    OR # odds ratio