Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, October 30, 2014

Test significance of overlapping genes from 2 gene sets

The question is very straightforward:
I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control).  There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?

The easiest way is to use an online program http://nemates.org/MA/progs/overlap_stats.html
The program calculates the p value by using Hypergeometric_distribution.

see below for a python and an R script for this problem.

read  posts here http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c
http://blog.nextgenetics.net/?e=94
https://www.biostars.org/p/90662/

list1=3000, list2=400, Total gene number (population)=15000,
overlapping between list1 and list2 =100
From the stackexchange link:
"The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
These give 0.0078.
dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.
phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).
The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more "

If I run with the python script, I got the same p-value:
tommy@tommy-ThinkPad-T420 ~/scripts_general_use
$ ./hyper_geometric.py 15000 3000 400 100

total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977
The bottom line is that we need more statistics!!

1 comment:

  1. Alternative way to do it would be to use fisher.test in R :
    list1=3000
    list2=400
    overlap=100
    popsize=15000
    a=popsize-list1-list2+overlap
    b=list2-overlap
    c=list1-overlap
    d=overlap
    fisher.test(matrix(c(a,b,c,d),nrow=2,byrow=T),alternative="greater")
    reprfact=overlap/((list1*list2)/popsize)
    reprfact # representation factor
    OR=(a*d)/(b*c)
    OR # odds ratio

    ReplyDelete