**Background**

Once again, I start this blog with a reminder about the Bayes Factor and its use in placing a probability on whether the null hypothesis is true or false. Obviously, knowing one gives us the other, since

pr(H_{0} true) = 1 – pr(H_{0} false).

In blogs Nos. 6 and 7, I noted

“Using a point prior, there is a simple approximation for computing the probability of H_{0} being false using the Bayes Factor Bound (BFB), which is based on reasonable, practical assumptions [1]. Let p_{0} be the prior probability that H_{0} is false and let p be the p-value from the test of H_{0} from the current experiment. Then the Bayes Factor Bound is

BFB = 1 / [-e * p * ln(p)],

and the upper bound on the posterior probability that H_{0} is false (p_{1}) given the observed data is

p_{1} ≤ p_{0} * BFB / (1 - p_{0} + p_{0} * BFB) (Equation 1).

I like this representation of knowledge – or at least the probability of that knowledge – on the probability scale rather than on the odds scale, which is what is often conveyed in publications like those from Jim Berger and colleagues [2, 3].”

Another way to write Eq. 1 is

p_{1} ≤ {1 + [(1 - p_{0}) / p_{0}] / BFB}^{-1} (Equation 2).

This formulation clearly shows how the bound on the **posterior probability of H_{0} being false** is a function of the **prior odds** and the **current data (encapsulated by the p-value in the BFB)**. This is precisely the heuristic for Bayesian thinking – using prior belief (preferably quantified as a probability or probability distribution) and current data (quantified as a p-value in this formulation) to create an updated probability (or at least its upper bound, as in Eq. 2) of the assertion of interest (i.e. whether a hypothesis is really false or not).
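To see that the two forms agree, here is a minimal Python sketch (my own illustration, not code from the original post; the function names are arbitrary) that evaluates both expressions for a given prior p_{0} and observed p-value:

```python
import math

def bfb(p):
    """Bayes Factor Bound: 1 / (-e * p * ln(p)), valid for p < 1/e."""
    return 1.0 / (-math.e * p * math.log(p))

def bound_eq1(p0, p):
    """Equation 1: p1 <= p0 * BFB / (1 - p0 + p0 * BFB)."""
    b = bfb(p)
    return p0 * b / (1 - p0 + p0 * b)

def bound_eq2(p0, p):
    """Equation 2: p1 <= {1 + [(1 - p0) / p0] / BFB}^(-1)."""
    return 1.0 / (1.0 + ((1.0 - p0) / p0) / bfb(p))

# Both forms give the same number for any prior and p-value:
print(bound_eq1(0.002, 0.0001), bound_eq2(0.002, 0.0001))
```

Dividing the numerator and denominator of Eq. 1 by p_{0} * BFB shows the algebraic equivalence directly.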

**Biomarker Application**

It is fairly common these days in the drug development
process (especially in oncology) to be interested in biomarkers. There are two categories
of biomarkers – prognostic and predictive. Prognostic biomarkers are useful in
assessing a patient’s outcome regardless of any treatment. For example, women with a BRCA1 or BRCA2 mutation are at much higher risk of developing breast or ovarian cancers. Predictive biomarkers are used to predict **differential efficacy or safety response** for patients taking a particular treatment. That is, for
example, if the biomarker is present in a patient (known as marker positive or
M+), then such patients will respond better (or worse) to a particular
treatment than to a control treatment. This is a key element in personalized
medicine or targeted therapeutics. The medical field can measure predictive
biomarkers (if they exist) in a patient in order to find the best treatment for
THAT patient.

So, the hunt for biomarkers (prognostic and predictive) is an extremely important area of medical research and drug development – but one that is fraught with difficulties since human biology is enormously complex. Thus, in many instances there may only be a vague notion of which biomarkers might be involved in a disease process and which treatments might be best for patients with a particular biomarker. It is not uncommon to explore dozens of biomarkers, and in the case of genetic markers, scientists/clinicians may want to search many thousands of genetic variants to see which, if any, can be useful predictive biomarkers.

Of course, this can lead to false positive findings due to multiple testing. The more biomarkers involved in the search process and the smaller the sample size of patients in the investigation, the more likely there will be false positive findings. I am sure that everyone who is reading this blog (especially those directly involved in drug development) knows stories of research findings that look so promising in the lab or in early, small clinical trials, only to fail upon more rigorous testing or evaluation in larger clinical trials.

So, why do so many findings turn out to be false? (Some have referred to this as a “reproducibility crisis” in science, and I refer you to the initial paper – “Why Most Published Research Findings Are False” [4] – that opened this thinking to the scientific community.) How do we determine which effects are real versus spurious? Or said
more appropriately, what is the *probability* that an observed biomarker
effect is real?

Here is a Bayesian perspective on this that I believe goes a
long way toward explaining this problem. **The basic premise herein is that
there has been an over-reliance on p-values when we should be putting our attention
on Bayesian probabilities.** I will use a simple example, but one that is
based in reality, to make the point.

**Example – Contrasting p-values and Bayesian Probabilities**

Suppose there is interest in assessing 100 biomarkers as
potential predictive biomarkers (the same could be true for prognostic
biomarkers). A study is done, and a hypothesis test is done for each biomarker
regarding its predictive ability or “significance.” For one of the biomarkers, denoted
B*, the hypothesis test produces a p-value of 0.0001. Now, normally that might
seem amazing and strong evidence against the null hypothesis, but enlightened
researchers know that such a p-value needs to be considered in light of some
sort of adjustment for multiple testing. Let’s take the easiest AND *most conservative
approach* (since I want to guard against false positive findings) – the Bonferroni
correction.

This can be done easily in one of two ways. First, take the traditional
significance level of the hypothesis test of 0.05 and divide by the number of
tests – in this case 100. That gives an adjusted significance level of 0.0005. The
observed p-value of 0.0001 is smaller than the adjusted significance level and
therefore, is still statistically significant by the frequentist null hypothesis
significance testing (NHST) approach even using the most conservative multiplicity
adjustment method. The other and equivalent approach is to take the observed
p-value and multiply it by the number of comparisons (i.e. 100) to get an
adjusted p-value of 0.01, which is still considered highly significant and
cause to reject the null hypothesis. **Thus, B* seems to be a very promising biomarker.**
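The two equivalent Bonferroni computations can be sketched in a few lines of Python (an illustration of the arithmetic above, not code from the post; the variable names are mine):

```python
m = 100           # number of biomarker hypothesis tests
alpha = 0.05      # traditional significance level
p_observed = 0.0001

# Way 1: shrink the significance threshold to alpha / m (= 0.0005 here).
adjusted_alpha = alpha / m
significant = p_observed < adjusted_alpha

# Way 2 (equivalent): inflate the p-value to p * m (= 0.01 here), capped at 1,
# and compare it against the original alpha.
adjusted_p = min(p_observed * m, 1.0)

print(significant)  # True: B* survives even the Bonferroni correction
```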

Now, we all know that the frequentist/NHST approach uses
only the data from this singular experiment for making inference. If we wanted
to use other information about these 100 biomarkers and incorporate it *formally*
into our analysis, we would have to take a Bayesian approach. What prior
information might we have and how would it be incorporated into our thinking
and inference about a large array of biomarkers?

When taking a Bayesian perspective (one which I had taken in my former life at Lilly), there is a need to develop a prior for the 100 hypotheses related to each biomarker. A simple, yet VERY valuable and credible, way to do this is to talk to the scientists and get their opinions about individual biomarkers or the set of biomarkers collectively. There is an entire literature on “prior elicitation” in the Bayesian inference world [5] that I will not get into at this time, but essentially, that is what I am proposing herein.

In a particular situation in my former work that was similar to this example (and was the motivation for this approach), a small group of scientists with deep expertise in the biology of the disease and the mechanism of action of the experimental treatment collectively agreed that there was about a 20% chance that one of these 100 biomarkers would be a predictive biomarker to distinguish responders and non-responders to the experimental treatment. Written more formally, the prior is

pr(at least one H_{0} of the 100 tested is false) = 0.20.

The scientists were not willing to bet on one biomarker or even a small set of biomarkers but were interested in this entire panel of biomarkers related to the disease process. This low probability for a prior is quite reasonable given the highly exploratory nature of this research goal.

Since there are 100 biomarkers, each with equal chance of
being predictive according to the scientists, we can calculate a prior for each
H_{0} by dividing it uniformly over all biomarker hypotheses as

pr(any individual H_{0} is false) = 0.20 / 100 = 0.002.

Now, back to the data with a p-value of 0.0001 for B* and the use of Eq. 1 or Eq. 2 (they are equivalent), we can obtain the *upper bound* on the posterior probability of H_{0} being false as

p_{1} < 0.44.

That is, our most significant biomarker (B*) in the experiment (p = 0.01 even after multiplicity adjustment!) has less than a 50/50 chance of being truly predictive. That’s quite a different answer than what is conveyed by the frequentist approach. If that surprises you, then see Blog No. 5, “The Pr(You’re Bayesian) > 0.50,” for a clear explanation of how different a p-value is from a Bayesian posterior probability. They are as different as pr(A|B) and pr(B|A) or, in some concrete terms, as I like to say, pr(cloudy|rain) and pr(rain|cloudy). Clearly, it is the latter – a Bayesian probability – that is relevant.
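For readers who want to reproduce the arithmetic, here is a short Python sketch of the Eq. 1 calculation (my own illustration; the function name is arbitrary):

```python
import math

def posterior_upper_bound(p0, p):
    """Upper bound on pr(H0 is false | data), via the Bayes Factor Bound."""
    bfb = 1.0 / (-math.e * p * math.log(p))  # for p = 0.0001, BFB is roughly 399
    return p0 * bfb / (1 - p0 + p0 * bfb)

# Uniform prior over 100 biomarkers (0.20 / 100 = 0.002) and the observed p-value:
p1 = posterior_upper_bound(p0=0.002, p=0.0001)
print(round(p1, 2))  # 0.44
```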

We do not have to divide the overall prior of 0.20 equally
over the 100 biomarker hypotheses if it is not warranted. For example, in some
situations I have experienced, the scientists/physicians may order the
biomarkers according to their plausibility of being a predictive biomarker.
This can be based on previous research, the published literature or exquisite
knowledge of the disease or treatment. Thus, in this example, we could identify
biomarkers B_{1}-B_{10} as the most likely to be predictive.
Then, biomarkers B_{11}-B_{35} are remote possibilities for
being predictive, and finally B_{36}-B_{100} are purely
exploratory. One way we could arithmetically divide the 0.20 prior is as
follows:

- A prior of 0.10 is allocated to the most promising set of biomarkers, and therefore B_{1}-B_{10} individually get (0.10 / 10) = 0.01 as a prior for their hypothesis tests.
- A prior of 0.05 is allocated to the next most important tier of biomarkers, and therefore B_{11}-B_{35} individually get (0.05 / 25) = 0.002 as a prior for their hypothesis tests.
- The remaining prior of 0.05 is allocated to the exploratory set of biomarkers, and therefore B_{36}-B_{100} individually get (0.05 / 65) = 0.00077 as a prior for their hypothesis tests.

Now, we can calculate the upper bound on the posterior probability of B* really being a predictive biomarker (Eq. 1 or 2), but it depends on which of the three categories it falls into as described above.

| Category | Prior Probability that the Biomarker Effect is Real (i.e. H_{0} is False) | Upper Bound on Posterior Probability that its Predictive Biomarker Effect is Real (i.e. H_{0} is False) |
| --- | --- | --- |
| 1 | 0.01 | 0.80 |
| 2 | 0.002 | 0.44 |
| 3 | 0.00077 | 0.24 |

Thus, for the SAME LEVEL OF EVIDENCE from the current experiment (p = 0.0001), we have VERY DIFFERENT CONCLUSIONS about whether B* is really a predictive biomarker for the new treatment. That is quite reasonable. If B* were in Category 3, a highly significant finding (i.e. p = 0.0001) would be quite surprising and should be regarded cautiously as a possible false positive finding. The advantage of the Bayesian approach is that it quantifies the probability of H_{0} being false. Said differently, there is *at least* a (1 – 0.24) = 0.76 probability that B* is a false positive finding (i.e. H_{0} is really true). If B* were in Category 1, where there was some evidence external to the current experiment that it could be a predictive biomarker, then the same significant finding should be viewed as adding evidence to what is already believed/known, and there is a notably higher probability that the finding is real – as much as 0.80.
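The table's three rows follow from the same Eq. 1 computation with different priors; as a quick check, this Python sketch (illustrative, not from the post) reproduces them:

```python
import math

def posterior_upper_bound(p0, p):
    """Upper bound on pr(H0 is false | data), via the Bayes Factor Bound (Eq. 1)."""
    bfb = 1.0 / (-math.e * p * math.log(p))
    return p0 * bfb / (1 - p0 + p0 * bfb)

p = 0.0001  # the same observed p-value for B* in every category
priors = {1: 0.01, 2: 0.002, 3: 0.00077}
bounds = {cat: round(posterior_upper_bound(p0, p), 2) for cat, p0 in priors.items()}
print(bounds)  # {1: 0.8, 2: 0.44, 3: 0.24}
```

The p-value never changes; only the prior does, yet the upper bound moves from 0.24 to 0.80.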

I will note that there are more sophisticated and comprehensive approaches to allocating a prior across many hypotheses [6] that will not be covered in this blog that is intended for a more general scientific audience.

**A Path Forward**

**So, what do we believe?** We should believe in the
Bayesian posterior probability, which is “calibrated” by our prior
knowledge/belief/evidence. The range of possible posterior probabilities for asserting
that we have identified a new predictive biomarker for our experimental
treatment is quite wide, depending on the prior. Unless B* is in Category 1, that
probability is low. Even the *adjusted* p-value of 0.01 by itself has very
little meaning in this context and should NOT be interpreted as strong evidence
against the null hypothesis (see Blog No. 7 for what a p-value is really
worth).

**Now, what should we do?** For this argument, assume B*
is in Category 2 or that we used a uniform prior across all 100 potential
biomarkers (i.e. the prior is 0.002). In drug development, especially earlier stages
and exploratory research, a probability of ~40% that a new scientific finding is real
(note: the bound calculated from the BFB is reasonably sharp, at least in one in-depth
examination [7]) could very well be enough evidence to move forward with additional
research. Our decisions should be governed by some sort of utility function. If
the next step in the research process costs $3 million, then proceeding
with the identified biomarker could be quite reasonable. If the next step in
the research process costs $30 million, then proceeding with the
identified biomarker might be questionable. It also depends on the pay-off at
the end of this research journey. If a successful research program results in a
new drug for an unmet medical need or for a large population of patients for
which the new treatment would be a better alternative to existing treatments,
then even a $30 million investment may be worth the risk.
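As a toy illustration of the utility-function idea (all numbers below, including the assumed payoff and success probability, are hypothetical and mine, not the author's):

```python
def expected_value(p_real, payoff, cost):
    """Expected net value of funding the next research step."""
    return p_real * payoff - cost

p_real = 0.40    # roughly the posterior upper bound for a Category 2 biomarker
payoff = 100e6   # hypothetical payoff (in dollars) if the biomarker is truly predictive

for cost in (3e6, 30e6):
    ev = expected_value(p_real, payoff, cost)
    print(f"next step costs ${cost/1e6:.0f}M -> expected net value ${ev/1e6:.0f}M")
```

With these assumed numbers both steps have positive expected value, but the margin at $30 million is far thinner; lower the assumed payoff or the probability and the conclusion flips, which is exactly why a prior-calibrated probability, not a p-value, should feed the decision.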

**The point is this:**

- You must accurately quantify the probability of the real/true state of nature (is this identified biomarker really predictive or not?) and base decisions on that probability, which is decidedly NOT the p-value (a measure of evidence from the current experiment). What is appropriate is the Bayesian posterior probability that combines prior knowledge with the current evidence.
- To calculate the Bayesian posterior probability, a *numerical* prior is needed. This can be derived or described in many ways and is not as difficult as many opponents of Bayesian approaches state.
- Decisions should be based on the probability that the finding is a false positive (or a false negative in some cases) as well as the costs (financial or human burden/safety) and benefits (financial or medical/societal need) that are quantified in a utility function.

**References**

[1] Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. *The American Statistician*, 55, 62-71.

[2] Berger, J. O., and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. *Journal of the American Statistical Association*, 82, 112-122.

[3] Benjamin, D., and Berger, J. (2019). Three recommendations for improving the use of p-values. *The American Statistician*, 73:sup1, 186-191.

[4] Ioannidis, J. P. A. (2005). Why most published research findings are false. *PLoS Medicine*, 2(8), e124.

[5] O’Hagan, A. (2019). Expert knowledge elicitation: Subjective but scientific. *The American Statistician*, 73:sup1, 69-81.

[6] Berger, J. O., Wang, X., and Shen, L. (2014). A Bayesian approach to subgroup identification. *Journal of Biopharmaceutical Statistics*, 24(1), 110-129.

[7] Bayarri, M. J., Benjamin, D., Berger, J., and Sellke, T. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. *Journal of Mathematical Psychology*, 72, 90-103.

Wonderful post, Steve. The only thing I’ll argue with is the way you derived a multiplicity-adjusted prior. The prior probability or prior belief that a certain assertion is true should come solely from the belief about that individual assertion. The number of other assertions evaluated should not matter. I prefer to put this all in terms of getting the model right. Here we have a data model and a prior distribution. The prior distribution can refer to the population of candidate markers, and it can be a sparsity prior such as the horseshoe prior (a Laplace prior if you’re a lasso fan, but that prior has several disadvantages relative to the horseshoe prior). The prior chosen is independent of the number of features that you will be playing it against. The prior is used in a joint model of all candidate markers.


Frank,

Thank you for your comments. I am glad you are initiating some dialogue on what I have written. Your insights are ALWAYS welcome here!

I understand (I think) your argument against dividing the “total prior” into many individual priors. What you say – “The prior probability or prior belief that a certain assertion is true should come solely from the belief about that individual assertion.” – is very good advice. What you say can and should be done when dealing with a handful of biomarker hypotheses.

However, when there are many potential biomarker hypotheses that are being evaluated (especially in the early stages of drug development), it is extremely difficult to get scientists to evaluate and agree on an individual prior for each individual biomarker hypothesis when there are dozens if not hundreds of hypotheses. They can think in terms of overall probability that a biomarker hypothesis is false, and they can come to some reasonable consensus on an approximate point prior – i.e. there is a 20% chance that we will “find something” important among these 100 biomarkers – which is very valuable input for evaluating the final results. Even if they can be slightly specific by placing biomarkers into Categories or Tiers of prior belief, we have gone a long way in making subsequent inferences much more informative. [I must give former colleagues Lei Shen and Rick Higgs a shout-out for this practical approach that is quite understandable by scientists and quite useful to formal statistical inference.] As shown in the Table in Blog No. 8, the interpretation of the results hinges on such prior input and weighs heavily on the decision to proceed with or fund additional research on a biomarker.

So, the approach described in this blog is meant to give a reasonable idea of how likely it is that the null hypothesis is false using a posterior probability and to give a more realistic and accurate assessment than a p-value, no matter how highly significant that p-value appears.


Hi Steve,

I understand what you’re saying. I still feel that the prior for one assertion should not change if there are other assertions. A sparsity prior will let you specify the “population” of expected biomarker effects without having to give the population size.
