I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control). There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?

The easiest way is to use an online program http://nemates.org/MA/progs/overlap_stats.html

The program calculates the p value by using Hypergeometric_distribution.

see below for a python and an R script for this problem.

read posts here http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c

http://blog.nextgenetics.net/?e=94

https://www.biostars.org/p/90662/

list1=3000, list2=400, Total gene number (population)=15000,

overlapping between list1 and list2 =100

From the stackexchange link:

"The p-value you want is the probability of getting 100

*or more*white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

```
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
```

These give 0.0078.

`dhyper(x, m, n, k)`

gives the probability of drawing exactly `x`

. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.`phyper(x, m, n, k)`

gives the probability of getting `x`

or fewer, so `phyper(x, m, n, k)`

is the same as `sum(dhyper(0:x, m, n, k))`

.
The

If I run with the python script, I got the same p-value:`lower.tail=FALSE`

is a bit confusing. `phyper(x, m, n, k, lower.tail=FALSE)`

is the same as `1-phyper(x, m, n, k)`

, and so is the probability of `x+1`

or more "
tommy@tommy-ThinkPad-T420 ~/scripts_general_use

$ ./hyper_geometric.py 15000 3000 400 100

total number in population: 15000

total number with condition in population: 3000

number in subset: 400

number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977

The bottom line is that we need more statistics!!
$ ./hyper_geometric.py 15000 3000 400 100

total number in population: 15000

total number with condition in population: 3000

number in subset: 400

number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977

Alternative way to do it would be to use fisher.test in R :

ReplyDeletelist1=3000

list2=400

overlap=100

popsize=15000

a=popsize-list1-list2+overlap

b=list2-overlap

c=list1-overlap

d=overlap

fisher.test(matrix(c(a,b,c,d),nrow=2,byrow=T),alternative="greater")

reprfact=overlap/((list1*list2)/popsize)

reprfact # representation factor

OR=(a*d)/(b*c)

OR # odds ratio