In Blog No. 6, “Détente – The Peaceful Co-existence of Significance Levels and Bayes,” I introduced the notion of using the Bayes Factor to assess the probability that a hypothesis is true or false. As a reminder to those who have read “Détente,” or as new material for those who are not familiar with it, I have copied the relevant paragraph from Blog No. 6 below.
“Using a point prior, there is a simple approximation for computing the probability of H_{0} being false using the Bayes Factor Bound (BFB), which is based on reasonable, practical assumptions [1]. Let p_{0} be the prior probability that H_{0} is false and let p be the p-value from the test of H_{0} in the current experiment. Then the Bayes Factor Bound is
BFB=1/[-e*p*ln(p)],
and the upper bound on the posterior probability that H_{0} is false (p_{1}) given the observed data is
p_{1} ≤ p_{0}*BFB/(1-p_{0}+p_{0}*BFB) (Equation 1).”
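For readers who want to play with Eq. 1, here is a minimal Python sketch (the function names are mine, not standard; the BFB formula applies for p < 1/e):

```python
from math import e, log

def bfb(p):
    """Bayes Factor Bound for an observed p-value p: 1 / (-e * p * ln(p)).
    Valid for p < 1/e (about 0.368)."""
    return 1.0 / (-e * p * log(p))

def posterior_upper_bound(p0, p):
    """Eq. 1: upper bound on the posterior probability that H0 is false,
    given prior probability p0 (that H0 is false) and observed p-value p."""
    b = bfb(p)
    return p0 * b / (1.0 - p0 + p0 * b)

print(posterior_upper_bound(0.2, 0.05))  # ~0.38
```

So a prior of 0.2 and a p-value of 0.05 cap the posterior at about 0.38, exactly the kind of calculation used throughout this post.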
I like this representation of knowledge – or at least the probability of that knowledge – on the probability scale rather than on the odds scale, which is what is often conveyed in publications like those from Jim Berger and colleagues [2, 3] (BTW – I really like Jim Berger’s work on this … and many other things). I think that people deal with probability better than odds (i.e. 67% chance of success rather than 2:1 odds), but I could be wrong. Maybe more people bet at the racetrack than I am aware of!
In Blog No. 5 “pr(You’re Bayesian) > 0.50,” hopefully I convinced you that one cannot interpret a p-value by itself. It can only be interpreted in the context of prior knowledge, or more specifically a prior probability.
In any case, I really like this representation of knowledge, and it provides an insight into a statement I have heard used by many …
“A p-value of 0.05 is not very strong evidence against the null hypothesis.”
I have heard this from Drs. Bob O’Neill and Bob Temple, who are or have been prominent leaders at the Food & Drug Administration; I have heard it in talks from many Bayesians; and I have even heard the view ascribed to, or at least interpreted to be from, Sir Ronald Fisher himself. [Note that I have not tried to track down all these references specifically for this blog. I am speaking from years of personal experience sitting in audiences at statistical conferences, etc.]
Interestingly, in the Statistics titles from the Wiley “Dummies” brand, one can find the following statement:
“A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.”
This might be a view held by many scientists and statisticians who are ingrained in the hypothesis-testing/frequentist approach to inference and knowledge generation. In fact, you can find such statements in many introductory statistics textbooks. [Again, no references (yet), just some personal experience and simple Google searching.]
Let’s investigate these divergent views a little further.
Since becoming more thoughtful about (a) evidence, (b) Bayesian thinking and (c) understanding what is likely to be true or false, my reaction to the statement, “A p-value of 0.05 is not very strong evidence against the null hypothesis” has been … well, it depends on your prior. If you start with a low prior belief in the null hypothesis being false, say less than 0.20 (e.g. as at the beginning of Phase 2 studies), then indeed p=0.05 is informative and moving in the right direction but is not very convincing in the sense of strong evidence. If you start with a high prior belief in the null hypothesis being false, say 0.70 (e.g. perhaps at the beginning of Phase 3 studies), then a p-value=0.05 is indeed much more convincing in the sense of strong evidence against the null hypothesis.
This prompted me to investigate the relationship between the prior belief/probability that the null hypothesis is false and a p-value of 0.05. Using Eq. 1 above, it is simple to create Table 1; the last column is the most informative.
Table 1
The relationship between prior belief/probability that the
null hypothesis is false and a p-value=0.05
for testing that hypothesis using Bayes Factor Bound.
| Prior (belief, probability) | p-value | Posterior (upper bound) | Increase (post – prior) |
|------|------|------|------|
| 0.1 | 0.05 | 0.214 | 0.114 |
| 0.2 | 0.05 | 0.380 | 0.180 |
| 0.3 | 0.05 | 0.513 | 0.213 |
| 0.4 | 0.05 | 0.621 | 0.221 |
| 0.5 | 0.05 | 0.711 | 0.211 |
| 0.6 | 0.05 | 0.787 | 0.187 |
| 0.7 | 0.05 | 0.851 | 0.151 |
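Table 1 can be reproduced in a few lines of Python (a sketch, not the code actually used for this blog):

```python
from math import e, log

p = 0.05
b = 1.0 / (-e * p * log(p))  # Bayes Factor Bound for p = 0.05, ~2.456

rows = []
for p0 in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    p1 = p0 * b / (1.0 - p0 + p0 * b)  # Eq. 1: upper bound on the posterior
    rows.append((p0, round(p1, 3), round(p1 - p0, 3)))
    print(rows[-1])
```

Each printed tuple is (prior, posterior upper bound, increase), matching a row of Table 1.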
As seen in the last column, a p=0.05 doesn’t move the evidentiary needle very much. If your prior belief is expressed as a probability that the null hypothesis is false of 0.20, and you observe a p-value of 0.05, then your maximum posterior probability that the null hypothesis is false is 0.38. Thus, you have moved the evidentiary needle from 0.20 (prior) to 0.38 (posterior) – a nudge in the right direction but hardly convincing. In the last column, all values for (gain = posterior – prior) are about 0.22 or less. So, a p-value of 0.05 really is rather meager evidence … or might I say rather insignificant.
Thus, in the Bayesian way of thinking, “A p-value of 0.05 is not very strong evidence against the null hypothesis” is a reasonable statement, but is probably better stated as,
“A p-value = 0.05 does not move the ‘evidentiary needle’ very much!”
In my experience, the statement, “A p-value of 0.05 is not very strong evidence against the null hypothesis” comes from some pretty staunch frequentists. It is interesting, and a bit ironic, that it takes a Bayesian perspective to explain this notion formally and quantify the degree of evidence p=0.05 conveys!
We can take this one step further and calculate the Maximum Gain that any p-value produces, thus conveying the value of a p-value in terms of its evidentiary gain or, perhaps, scientific value. A few technical details are given in the separate section below, and the summary of the results is in Table 2.
Table 2
Maximum Gain in “Evidence” as Measured by the Increase in Probability
from the Prior to the Posterior based on an Observed P-Value
| p-value | Bayes Factor Bound | Max Gain in Evidence | Prior at Max Gain |
|------|------|------|------|
| 0.1 | 1.598 | 0.117 | 0.442 |
| 0.05 | 2.456 | 0.221 | 0.390 |
| 0.01 | 7.988 | 0.477 | 0.261 |
| 0.005 | 13.89 | 0.577 | 0.212 |
| 0.001 | 53.26 | 0.759 | 0.121 |
| 0.0005 | 96.80 | 0.815 | 0.092 |
| 0.0001 | 399.4 | 0.905 | 0.048 |
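The Max Gain column can be checked without any calculus at all by brute force: for each p-value, scan a fine grid of priors and keep the largest (posterior − prior). A sketch, assuming nothing beyond Eq. 1:

```python
from math import e, log

def bfb(p):
    """Bayes Factor Bound: 1 / (-e * p * ln(p)), valid for p < 1/e."""
    return 1.0 / (-e * p * log(p))

def max_gain(p, steps=10000):
    """Grid search over priors p0 in (0, 1) for the largest
    (posterior upper bound - prior); returns (gain, argmax prior)."""
    b = bfb(p)
    best_gain, best_p0 = 0.0, 0.0
    for i in range(1, steps):
        p0 = i / steps
        gain = p0 * b / (1.0 - p0 + p0 * b) - p0
        if gain > best_gain:
            best_gain, best_p0 = gain, p0
    return best_gain, best_p0

for p in (0.1, 0.05, 0.01, 0.005, 0.001):
    g, p0 = max_gain(p)
    print(f"p={p}: max gain ~ {g:.3f} at prior ~ {p0:.3f}")
```

The printed values agree with Table 2 to three decimals.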
Note that the second row of the body of Table 2 aligns with the fourth row of the body of Table 1 – a p-value of 0.05 confers at most a gain of 0.22 probability on the “evidentiary scale.” This maximum gain occurs when the prior is 0.39. It is also worth noting that this Max Gain is the maximum across all prior values (p_{0}), but also, in and of itself, it is derived from the maximum posterior probability (Eq. 1). So, it is a maximum of all possible maximums.
Also interesting to note in the context of Table 2: there has been a serious proposal by influential and knowledgeable scientists and statisticians to make the scientific standard for significance p<0.005 [4]. In the context of this blog, one can see that such a p-value can confer an increase of almost 0.60 in probability on the evidentiary scale. Now that is a significant movement of the evidentiary needle. So, it is not an unreasonable idea to demand such small p-values (i.e. stronger evidence), but it does have other issues (e.g. p-hacking may become even more severe to achieve this), as well as NOT addressing the fundamental issue:
A p-value cannot be interpreted without a prior.
For example, a prior of 0.01 for the null hypothesis being false (say, for some very exploratory study/analysis of multiple biomarkers or other predictors of response) that ends up with a “significant” p-value of 0.005 for that null hypothesis results in an upper bound on the posterior probability of the null hypothesis being false of only 0.12 using Eq. 1. While such a posterior probability may be noteworthy enough to continue investing in that line of research, it is hardly worth declaring victory and asserting a new scientific finding.
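Plugging the numbers from this example into Eq. 1 (a throwaway check in Python):

```python
from math import e, log

p0, p = 0.01, 0.005                # very exploratory prior, observed p-value
b = 1.0 / (-e * p * log(p))        # Bayes Factor Bound, ~13.89
p1 = p0 * b / (1.0 - p0 + p0 * b)  # Eq. 1: upper bound on the posterior
print(round(b, 2), round(p1, 2))   # prints 13.89 0.12
```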
So, now you know.
For the Mathematically Inclined
The maximal gain in probability (posterior – prior) conferred by any p-value can be written by taking Eq. 1 as an equality (i.e. the maximum posterior) and simply subtracting the prior.
Max Gain = f(p_{0}) = (p_{1} – p_{0}) = p_{0}*BFB/(1-p_{0}+p_{0}*BFB) – p_{0} (Equation 2).
Thus, for any given p-value (which is encapsulated in the BFB), the gain will vary as a function of the prior, p_{0}, and we would like to know at what value of p_{0} the maximum occurs and what that maximum is. Differentiating Eq. 2 with respect to p_{0} gives
f’(p_{0}) = [BFB / (1 – p_{0} + p_{0}*BFB)^{2}] – 1.
Setting this equal to zero and solving for p_{0}, we get a quadratic equation in p_{0} as follows (for simplicity let B=BFB):
(B^{2} – 2B +1) p_{0}^{2} + (2B – 2) p_{0} + (1-B) = 0.
The quadratic formula can be used to solve for p_{0} and then compute the Max Gain across all p_{0} in the interval (0,1). Table 2 shows the evidentiary value of a p-value, i.e. the maximum amount that it moves the evidentiary needle from prior to posterior. Keep in mind that this is the MAXIMUM evidentiary gain.
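For what it’s worth, the quadratic has a tidy closed-form root: the discriminant simplifies to 4B(B–1)^{2}, giving p_{0} = 1/(1+√B), and substituting back into Eq. 2 gives Max Gain = (√B–1)/(√B+1). A quick numerical check of this (my own derivation, so verify before relying on it):

```python
from math import e, log, sqrt

def bfb(p):
    """Bayes Factor Bound: 1 / (-e * p * ln(p))."""
    return 1.0 / (-e * p * log(p))

def max_gain_closed_form(p):
    """Positive root of the quadratic in p0: p0* = 1/(1 + sqrt(B)).
    Plugging p0* back into Eq. 2 gives Max Gain = (sqrt(B) - 1)/(sqrt(B) + 1)."""
    rb = sqrt(bfb(p))
    return (rb - 1.0) / (rb + 1.0), 1.0 / (1.0 + rb)

for p in (0.05, 0.005, 0.0001):
    gain, p0_star = max_gain_closed_form(p)
    print(f"p={p}: max gain = {gain:.3f} at prior = {p0_star:.3f}")
```

The printed rows match the corresponding rows of Table 2.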
References
[1] Sellke, T., Bayarri, M. J., and Berger, J. O. (2001), “Calibration of p Values for Testing Precise Null Hypotheses,” The American Statistician, 55, 62–71.
[2] Berger, J. O., and Sellke, T. (1987), “Testing a Point Null Hypothesis: The Irreconcilability of p Values and Evidence,” Journal of the American Statistical Association, 82, 112–122.
[3] Benjamin, D. J., and Berger, J. O. (2019), “Three Recommendations for Improving the Use of p-Values,” The American Statistician, 73.
[4] Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E., Berk, R., … Johnson, V. (2017), “Redefine Statistical Significance,” PsyArXiv, 22 July, available: https://psyarxiv.com/mky9j/