I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control). There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?
The easiest way is to use an online program http://nemates.org/MA/progs/overlap_stats.html
The program calculates the p value by using Hypergeometric_distribution.
see below for a python and an R script for this problem.
read posts here http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c
http://blog.nextgenetics.net/?e=94
https://www.biostars.org/p/90662/
list1=3000, list2=400, Total gene number (population)=15000,
overlapping between list1 and list2 =100
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#! /usr/bin/env python | |
import sys | |
import scipy.stats as stats | |
#The result will be | |
# a p-value where by random chance number of genes with both condition A and B will be <= to your number with condition A and B | |
# a p-value where by random chance number of genes with both condition A and B will be >= to your number with condition A and B | |
# The second p-value is probably what you want. | |
if len(sys.argv) < 5: | |
print '''use it by: python gene_sets_hypergeomtric_test.py [total number of genes in the list] [total number of genes in the list with condition A] [total number of genes in the list with condition B] [number of genes with both condition A and B]''' | |
sys.exit() | |
print 'total number in population: ' + sys.argv[1] | |
print 'total number with condition in population: ' + sys.argv[2] | |
print 'number in subset: ' + sys.argv[3] | |
print 'number with condition in subset: ' + sys.argv[4] | |
print 'p-value <= ' + sys.argv[4] + ': ' + str(stats.hypergeom.cdf(int(sys.argv[4]) + 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3]))) | |
print 'p-value >= ' + sys.argv[4] + ': ' + str(stats.hypergeom.sf(int(sys.argv[4]) - 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3]))) | |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# phyper(overlap-1,list1,PopSize-list1,list2,lower.tail = FALSE, log.p = FALSE) | |
phyper(100-1, 3000, 15000-3000, 400,lower.tail = FALSE, log.p = FALSE) | |
#0.00784 |
"The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
These give 0.0078.
dhyper(x, m, n, k)
gives the probability of drawing exactly x
. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.phyper(x, m, n, k)
gives the probability of getting x
or fewer, so phyper(x, m, n, k)
is the same as sum(dhyper(0:x, m, n, k))
.
The
If I run with the python script, I got the same p-value:lower.tail=FALSE
is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE)
is the same as 1-phyper(x, m, n, k)
, and so is the probability of x+1
or more "
tommy@tommy-ThinkPad-T420 ~/scripts_general_use
$ ./hyper_geometric.py 15000 3000 400 100
total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100
p-value <= 100: 0.996049535673
p-value >= 100: 0.00784035207977
The bottom line is that we need more statistics!!
$ ./hyper_geometric.py 15000 3000 400 100
total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100
p-value <= 100: 0.996049535673
p-value >= 100: 0.00784035207977