UPDATE 09/23/2016: please also read this Nature Biotech primer.
I am writing this post for my own later reference. A deep understanding of p-values, FDR and q-values is not trivial, and many biologists misuse and/or misinterpret them.
For biologists' sake, I will use a gene expression example. Suppose we have two groups of cells: control and treatment (the treatment can be anything: a chemical, radiation, etc.). We want to know whether Gene A is differentially expressed under treatment. Each group has 12 replicates.
What we usually do is compare the means of the two groups (12 replicates each) with a t-test (assuming the data are roughly normally distributed). We then get a p-value, say p = 0.035. Because it is smaller than 0.05 (the threshold we set), we conclude that expression of Gene A is significantly changed after treatment. But what does a p-value of 0.035 actually mean?
Everything starts with a null hypothesis:
H0: There is no difference in the expression of Gene A after treatment.
and an alternative hypothesis:
H1: After treatment, expression of Gene A changes.
The definition of every p-value begins by assuming that the null hypothesis is true [1].
A p-value of 0.035 means that, if the null hypothesis were true, the probability of observing a difference in Gene A's expression at least as large as the one we saw is 0.035, which is quite low. If we choose a significance level of alpha = 0.05, we reject the null hypothesis and accept the alternative. So, if you cannot state what the null hypothesis is, you cannot understand the p-value [1].
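To make the single-gene setup concrete, here is a minimal R sketch of such a comparison; the expression values below are made up for illustration and are not from any real experiment:

# hypothetical expression values for Gene A, 12 replicates per group (made-up numbers)
set.seed(1)
control   <- rnorm(12, mean = 10, sd = 1)
treatment <- rnorm(12, mean = 11, sd = 1)
# two-sample t-test comparing the two groups (assumes roughly normal data)
t.test(treatment, control)$p.value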
For a typical genomic study, there are thousands of genes to compare. How do we report the list of genes that are differentially expressed? We could perform a t-test for each gene and report every gene with a p-value smaller than 0.05, but that would give us many false positives because we have not accounted for multiple testing. See the gist below.
Even though the data were randomly generated, some p-values are still smaller than 0.05 (lower figure)! Since the data are random, no genes should be differentially expressed, and yet the histogram of p-values is a flat line across (0, 1).
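The gist itself is embedded in the original post; a minimal sketch of the same kind of simulation (assuming 10,000 "genes" with 12 random replicates per group, so no gene is truly differentially expressed) could look like this:

set.seed(123)
m <- 10000
# every "gene" is pure noise: both groups are drawn from the same distribution
pvals <- replicate(m, t.test(rnorm(12), rnorm(12))$p.value)
sum(pvals < 0.05)   # roughly 5% of the 10000 tests, i.e. around 500 "significant" genes
hist(pvals)         # approximately flat histogram on (0, 1)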
p-values are random variables. Mathematically, one can show that under the null hypothesis (when its assumptions are met; here, the test statistic T follows a standard normal distribution), p-values follow a uniform(0, 1) distribution, which means P(p < p1) = p1: the probability of seeing a p-value smaller than p1 is exactly p1. So with 100 t-tests under the null (no difference between control and treatment), we expect about 1 test with a p-value smaller than 0.01, about 2 tests with a p-value smaller than 0.02, and so on.
This explains why we see some p-values smaller than 0.05 in our randomly generated data.
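Continuing with the simulated pvals above (purely illustrative), the uniformity is easy to check empirically:

mean(pvals < 0.01)   # close to 0.01
mean(pvals < 0.02)   # close to 0.02
mean(pvals < 0.50)   # close to 0.50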
How do we control the false positives for multiple comparisons?
One way is to use the Bonferroni correction to control the family-wise error rate (FWER): call a comparison statistically significant only when its p-value is less than alpha (often 0.05) divided by the number of comparisons m, i.e. p < alpha/m [2]. Say we computed 100 t-tests and got 100 p-values; we would then consider only the genes with a p-value smaller than 0.05/100 as significant. This approach is very conservative and is used in genome-wide association studies (GWAS): since we often compare millions of genetic variants between (tens of thousands of) cases and controls, the threshold becomes very small! [3]
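With the simulated pvals from above, the Bonferroni correction is one line in R; thresholding the raw p-values at alpha/m is equivalent to thresholding p.adjust's Bonferroni-adjusted p-values at alpha:

m <- length(pvals)
sum(pvals < 0.05 / m)                                # compare each raw p-value to alpha/m
sum(p.adjust(pvals, method = "bonferroni") < 0.05)   # same count via adjusted p-values (p * m, capped at 1)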
Alternatively, we can use the False Discovery Rate (FDR) to report the gene list:
FDR = (# false positives) / (# called significant).
This approach does not use the term statistically significant but instead uses the term discovery.
Let's control the FDR for a gene list at FDR = 0.05.
It means that, of all the discoveries (genes called significant), 5% are expected to be false positives. For example, if we call 200 genes significant at FDR = 0.05, we expect about 10 of them to be false positives.
Benjamini & Hochberg (BH method) in 1995 proposed a way to control FDR:
Let k be the largest i such that p(i) <= (i/m) * alpha (m is the number of comparisons),
then reject H(i) for i = 1, 2, ..., k.
This procedure controls the FDR at level alpha. It sets a different p-value threshold for each comparison. Say we computed 100 t-tests and got 100 p-values, and we want to control the FDR at 0.05. We first rank the p-values from smallest to largest:
if p(1) <= (1/100) * 0.05, we reject the null hypothesis for that gene and accept the alternative;
if p(2) <= (2/100) * 0.05, we reject the null and accept the alternative;
.....
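The BH procedure is only a few lines of R if written out by hand (a sketch, assuming pvals holds the raw p-values):

alpha    <- 0.05
m        <- length(pvals)
p.sorted <- sort(pvals)
below    <- p.sorted <= (seq_len(m) / m) * alpha   # compare each ranked p-value to its own threshold
k        <- if (any(below)) max(which(below)) else 0
k                                                  # reject the hypotheses with the k smallest p-values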
Let's zoom in on the 15 smallest p-values.
We can see that the 14th p-value is bigger than its own threshold, which is computed as (14/m) * 0.05 = 7.960878e-05.
We can use the p.adjust function with method "fdr" or "BH" to adjust the p-values. What p.adjust essentially does is rescale each p-value, rearranging
p(i) <= (i/m) * alpha
into
p(i) * m/i <= alpha
so a gene is called significant if its adjusted p-value satisfies p.adjust(pvals) <= alpha.
sum( p.adjust(pvals, method="fdr") < 0.05 )
# 13, the same as we saw from the figure
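For completeness, p.adjust's BH adjustment is, in effect, p(i) * m / i followed by a running minimum taken from the largest p-value downward so that the adjusted values stay monotone; a hand-rolled version (a sketch) for comparison:

m         <- length(pvals)
p.sorted  <- sort(pvals)
bh.manual <- rev(pmin(1, cummin(rev(p.sorted * m / seq_len(m)))))   # min over j >= i of p(j)*m/j, capped at 1
sum(bh.manual < 0.05)   # should agree with sum(p.adjust(pvals, method = "BH") < 0.05)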
Another method, by Storey (2002), takes a direct approach to the FDR:
Let k be the largest i such that pi_0 * p(i) < (i/m) * alpha,
then reject H(i) for i = 1, 2, ..., k.
Here pi_0 is the estimated proportion of hypotheses in the gene list for which the null is true; it ranges from 0 to 1.
When pi_0 = 1, this reduces to the Benjamini & Hochberg correction; because pi_0 is usually less than 1, this method is less conservative than the BH method.
Use the qvalue function in the Bioconductor package "qvalue":
sum( qvalue(pvals)$qvalues < 0.05)
# 17, less conservative than the BH method
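A minimal usage sketch (assuming pvals holds your raw p-values; the qvalue object also carries the pi_0 estimate):

library(qvalue)            # Bioconductor package "qvalue"
qobj <- qvalue(pvals)
qobj$pi0                   # estimated proportion of true null hypotheses
sum(qobj$qvalues < 0.05)   # genes called significant at an FDR of 0.05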
Note that the FDR is a property of a list of genes, whereas the q-value is defined for a specific gene:
"But if you do want to assign a number to each gene, a simple thing you can do, is you can go gene by gene, and decide what would be the smallest FDR I would consider, that would include this gene in the list. And once you do that, then you have defined a q-value. And this is something that is very often reported in the list of genes"[4]
"To define the q-value we order features we tested by p-value then compute the FDRs for a list with the most significant, the two most significant, the three most significant, etc... The FDR of the list with the, say, m most significant tests is defined as the q-value of the m-th most significant feature. In other words, the q-value of a feature, is the FDR of the biggest list that includes that gene" [5]
Updated on 03/16/2015: see the paper in PLOS ONE, "The Extent and Consequences of P-Hacking in Science".
References:
[1] Harvey Motulsky, Intuitive Biostatistics, Third Edition, page 127.
[2] Harvey Motulsky, Intuitive Biostatistics, Third Edition, page 187.
[3] Harvey Motulsky, Intuitive Biostatistics, Third Edition, page 188.
[4] HarvardX PH525.3x: Advanced Statistics for the Life Sciences, week 1, video lecture on FDR.
[5] HarvardX PH525.3x: Advanced Statistics for the Life Sciences, week 1, quiz on FDR.
https://courses.edx.org/courses/HarvardX/PH525.3x/1T2015/courseware/92dbe89dc80c4fc084cad3f00b0381f7/