**UPDATE 09/23/2016.**

Please also read this Nature Biotechnology primer.

## How does multiple testing correction work?

I am writing this post for my own later reference. A deep understanding of p-values, FDR and q-values is not trivial, and many biologists misuse and/or misinterpret them.

For biologists' sake, I will use a gene expression example. Suppose we have two groups of cells: control and treatment (the treatment can be anything: a chemical, radiation, etc.). We want to know whether Gene A is differentially expressed under treatment. Each group has 12 replicates.

What we usually do is compare the 12 replicates of each group with a t-test to ask whether the difference in means is significant (assuming the expression values are normally distributed). We then get a p-value, say p = 0.035. It is smaller than 0.05 (the threshold we set), so we conclude that expression of Gene A changes significantly after treatment. But what does a p-value of 0.035 actually mean?

Everything starts with a null hypothesis:

H0: There is no difference in expression of Gene A after treatment.

and an alternative hypothesis:

H1: After treatment, expression of Gene A changes.

**The definition of every P value begins by assuming a null hypothesis is True**[1].

**A p-value of 0.035 means that, under the null, the probability of seeing a difference in gene expression at least as extreme as the one we observed is 0.035,** which is quite low. If we choose a significance level of alpha = 0.05, we reject the null hypothesis and accept the alternative. So, if you cannot state what the null hypothesis is, you cannot understand the p-value [1].
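As a concrete sketch of the single-gene comparison (in Python rather than R; the group means, standard deviation, and simulated shift below are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical expression values for Gene A: 12 control and 12 treated replicates.
control = rng.normal(loc=10.0, scale=1.0, size=12)
treatment = rng.normal(loc=11.0, scale=1.0, size=12)  # simulated true shift of 1 unit

# Two-sample t-test: H0 says the two group means are equal.
t_stat, p_value = stats.ttest_ind(control, treatment)
print(p_value)
```

If the printed p-value falls below our chosen alpha = 0.05, we reject H0 for this gene.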

For a typical genomic study, there are thousands of genes to compare. How do we decide which genes to report as differentially expressed? We could run a t-test for each gene and report every gene with a p-value smaller than 0.05. However, that would give us many false positives, because we have ignored the multiple-testing problem.

**P-values are random variables.** Mathematically, one can show that under the null hypothesis (when the assumptions are met; in this case, the test statistic T follows a standard normal distribution),

**p-values follow a Uniform(0, 1) distribution, which means that P(p < p1) = p1**. In other words, the probability of seeing a p-value smaller than p1 is exactly p1. So with 100 t-tests, all under the null (no difference between control and treatment), we expect 1 test with a p-value smaller than 0.01, 2 tests with a p-value smaller than 0.02, and so on.
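A quick simulation illustrates this (a Python sketch with made-up parameters: 100 genes, all truly null, 12 replicates per group):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
m = 100  # number of genes, all with NO true difference between groups

# One t-test per gene on two null samples drawn from the same distribution.
pvals = np.array([
    stats.ttest_ind(rng.normal(10, 1, 12), rng.normal(10, 1, 12)).pvalue
    for _ in range(m)
])

# Under the null, p-values are Uniform(0, 1): about 5 of the 100 tests
# are expected to fall below 0.05 purely by chance.
print((pvals < 0.05).sum())
```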

This explains why some p-values fall below 0.05 even in randomly generated data with no true differences.

How do we control the false positives for multiple comparisons?

One way is to use the **Bonferroni correction**, which controls the family-wise error rate (FWER):

define a particular comparison as statistically significant only when its p-value is less than alpha (often 0.05) divided by the number of comparisons m, i.e. p < alpha/m [2]. Say we ran 100 t-tests and got 100 p-values; we only call the genes with a p-value smaller than 0.05/100 significant. This approach is very conservative and is used in genome-wide association studies (GWAS): since GWAS compare millions of genetic variants between (tens of thousands of) cases and controls, the threshold becomes very small! [3]
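A minimal Python sketch of the Bonferroni rule (the p-values are hypothetical: three planted strong signals among 97 nulls):

```python
import numpy as np

alpha, m = 0.05, 100
rng = np.random.default_rng(1)

# Three genes with very small p-values (true signals) plus 97 null p-values.
pvals = np.concatenate([np.array([1e-6, 3e-5, 2e-4]), rng.uniform(0, 1, 97)])

# Naive rule: p < alpha lets several null genes slip through by chance.
naive_hits = pvals < alpha

# Bonferroni: significant only when p < alpha / m (here 0.0005).
bonferroni_hits = pvals < alpha / m

print(naive_hits.sum(), bonferroni_hits.sum())
```

The Bonferroni threshold keeps the planted signals while screening out the chance hits among the nulls.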

Alternatively, we can use False Discovery Rate (FDR) to report the gene list.

FDR = (expected) # false positives / # called significant.

This approach does not use the term "statistically significant"; instead it uses the term "discovery".

**Let's control FDR for a gene list with FDR = 0.05.**

**It means that, of all the discoveries, 5% of them are expected to be false positives.**

Benjamini & Hochberg (BH method) in 1995 proposed a way to control FDR:

Let k be the largest i such that p(i) <= (i/m) * alpha, (m is the number of comparisons)

then reject H(i) for i =1, 2, ...k

This procedure controls the FDR at level alpha. Note that it sets a different p-value threshold for each comparison. Say we ran 100 t-tests, got 100 p-values, and want to control the FDR at 0.05. We rank the p-values from smallest to largest:

if p(1) <= (1/100) * 0.05, we reject the null and accept the alternative;

if p(2) <= (2/100) * 0.05, we reject the null and accept the alternative;

.....
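The step-up procedure above can be sketched in Python (the toy p-values below are made up for illustration):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: return a boolean mask of rejected hypotheses."""
    m = len(pvals)
    order = np.argsort(pvals)                      # indices sorting p-values ascending
    sorted_p = pvals[order]
    thresholds = np.arange(1, m + 1) / m * alpha   # (i/m) * alpha for i = 1..m
    below = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:                                 # k = largest i with p(i) <= (i/m)*alpha
        k = below.max()
        reject[order[: k + 1]] = True              # reject the k smallest p-values
    return reject

# Toy example: six tests, some small p-values.
p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.60])
print(bh_reject(p, alpha=0.05))  # rejects the two smallest p-values
```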

Let's zoom in on the 15 smallest p-values.

We can see that the 14th p-value is bigger than its own threshold, which is computed as (14/m) * 0.05 = 7.960878e-05, so only the 13 smallest p-values are called significant.

In R we can use the p.adjust function with method = "fdr" (or "BH") to adjust the p-values. What p.adjust does is rescale each p-value: the condition

p(i) <= (i/m) * alpha

is equivalent to

p(i) * m/i <= alpha

so we call a test significant when its adjusted p-value satisfies p.adjust(pvals) <= alpha:

```r
sum( p.adjust(pvals, method = "fdr") < 0.05 )
# 13, the same as we saw from the figure
```

Another method by Storey in 2002 is the direct approach to FDR:

Let k be the largest i such that pi_0 * p(i) <= (i/m) * alpha,

then reject H(i) for i = 1, 2, ..., k.

pi_0 is an estimate of the proportion of hypotheses in the gene list that are truly null; it ranges from 0 to 1.

When pi_0 = 1, this reduces to the Benjamini & Hochberg correction.

This method is less conservative than the BH method.
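A Python sketch of the pi_0 idea, using Storey's simple estimator (the rescaled fraction of p-values above a tuning point lambda; lambda = 0.5 and the mixture of null and signal p-values below are assumptions for illustration):

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Storey-style pi_0 estimate: null p-values are uniform, so the
    fraction above lambda, divided by (1 - lambda), estimates the
    proportion of truly null hypotheses."""
    pi0 = (pvals > lam).mean() / (1 - lam)
    return min(pi0, 1.0)

rng = np.random.default_rng(7)
# Simulated mixture: 80 null p-values (uniform) plus 20 small signal p-values.
pvals = np.concatenate([rng.uniform(0, 1, 80), rng.uniform(0, 0.005, 20)])

pi0 = estimate_pi0(pvals)
print(pi0)
```

Rejecting when pi_0 * p(i) <= (i/m) * alpha is just BH run at the looser level alpha / pi_0, which is why a pi_0 below 1 makes the method less conservative.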

Use the qvalue function in the Bioconductor package "qvalue":

```r
library(qvalue)
sum( qvalue(pvals)$qvalues < 0.05 )
# 17, less conservative than the BH method
```

**Note that FDR is a property of a list of genes.**

**q value is defined for a specific gene:**

"But if you do want to assign a number to each gene, a simple thing you can do, is you can go gene by gene, and decide what would be the smallest FDR I would consider, that would include this gene in the list. And once you do that, then you have defined a q-value. And this is something that is very often reported in the list of genes"[4]

"To define the q-value we order features we tested by p-value then compute the FDRs for a list with the most significant, the two most significant, the three most significant, etc... The FDR of the list with the, say, m most significant tests is defined as the q-value of the m-th most significant feature. In other words, the q-value of a feature, is the FDR of the biggest list that includes that gene" [5]updated on 03/16/2015

See also a paper in PLOS ONE, "The Extent and Consequences of P-Hacking in Science", and a post here: http://acceleratingscience.com/proteomics/whats-true-whats-false-proteostats-and-the-fdr/?utm_source=twitterfeed&utm_medium=twitter

[1] Harvey Motulsky, Intuitive Biostatistics, 3rd edition, p. 127.

[2] Harvey Motulsky, Intuitive Biostatistics, 3rd edition, p. 187.

[3] Harvey Motulsky, Intuitive Biostatistics, 3rd edition, p. 188.

[4] HarvardX: PH525.3x Advanced Statistics for the Life Sciences, week1, video lecture for FDR.

[5] HarvardX: PH525.3x Advanced Statistics for the Life Sciences, week1, quiz for FDR.

https://courses.edx.org/courses/HarvardX/PH525.3x/1T2015/courseware/92dbe89dc80c4fc084cad3f00b0381f7/
