No. 8: Let’s Get Real – Bayes and Biomarkers

Background

Once again, I start this blog with a reminder about the Bayes Factor and its use in placing a probability on whether the null hypothesis is true or false. Obviously, knowing one gives us the other since

pr(H0 true) = 1 – pr(H0 false).

In blogs Nos. 6 and 7, I noted

“Using a point prior, there is a simple approximation for computing the probability of H0 being false using the Bayes Factor Bound (BFB), which is based on reasonable, practical assumptions [1]. Let p0 be the prior probability that H0 is false and let p=p-value from the test of H0 from the current experiment. Then the Bayes Factor Bound is

BFB=1/[-e * p * ln(p)],

and the upper bound on the posterior probability that H0 is false (p1) given the observed data is

p1 ≤  p0 * BFB/(1-p0+p0 * BFB)               (Equation 1).

I like this representation of knowledge – or at least the probability of that knowledge – on the probability scale rather than on the odds scale, which is what is often conveyed in publications like those from Jim Berger and colleagues [2, 3].”

Another way to write Eq. 1 is

p1 ≤ {1 + [(1-p0)/p0] / BFB}^(-1)                (Equation 2).

This formulation clearly shows how the bound on the posterior probability of H0 being false is a function of the prior odds and the current data (encapsulated by the p-value in the BFB). This is precisely the heuristic for Bayesian thinking – using prior belief (preferably quantified as a probability or probability distribution) and current data (quantified as a p-value in this formulation) to create an updated probability (or at least its upper bound, as in Eq. 2) of the assertion of interest (i.e. whether a hypothesis is really false or not).
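For readers who want to see the mechanics, here is a minimal sketch in Python of how Eq. 1 (equivalently Eq. 2) can be computed. The function name and structure are my own illustration, not code from the cited references.

```python
import math

def posterior_bound(p_value, prior_prob_h0_false):
    """Upper bound on pr(H0 false | data) via the Bayes Factor Bound (Eq. 1 / Eq. 2)."""
    # BFB = 1 / (-e * p * ln(p)), per Sellke, Bayarri, and Berger [1]; sensible for p < 1/e
    bfb = 1.0 / (-math.e * p_value * math.log(p_value))
    prior_odds = prior_prob_h0_false / (1.0 - prior_prob_h0_false)
    posterior_odds = prior_odds * bfb               # update the prior odds by the bound on the Bayes factor
    return posterior_odds / (1.0 + posterior_odds)  # back to the probability scale; this equals Eq. 1
```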

Biomarker Application

It is fairly common these days in the drug development process (especially oncology) to be interested in biomarkers. There are two categories of biomarkers – prognostic and predictive. Prognostic biomarkers are useful in assessing a patient’s outcome regardless of any treatment. For example, women with the BRCA gene are at much higher risk of developing breast or ovarian cancers. Predictive biomarkers are used to predict differential efficacy or safety response for patients taking a particular treatment. That is, for example, if the biomarker is present in a patient (known as marker positive or M+), then such patients will respond better (or worse) to a particular treatment than to a control treatment. This is a key element in personalized medicine or targeted therapeutics. The medical field can measure predictive biomarkers (if they exist) in a patient in order to find the best treatment for THAT patient.

So, the hunt for biomarkers (prognostic and predictive) is an extremely important area of medical research and drug development – but one that is fraught with difficulties since human biology is enormously complex. Thus, in many instances there may only be a vague notion of which biomarkers might be involved in a disease process and which treatments might be best for patients with a particular biomarker. It is not uncommon to explore dozens of biomarkers, and in the case of genetic markers, scientists/clinicians may want to search many thousands of genetic variants to see which, if any, can be useful predictive biomarkers.

Of course, this can lead to false positive findings due to multiple testing. The more biomarkers involved in the search process and the smaller the sample size of patients in the investigation, the more likely there will be false positive findings. I am sure that everyone who is reading this blog (especially those directly involved in drug development) knows stories of research findings that look so promising in the lab or in early, small clinical trials, only to fail upon more rigorous testing or evaluation in larger clinical trials.

So, why do so many findings turn out to be false? (Some have referred to this as a “reproducibility crisis” in science, and I refer you to the paper that opened this thinking to the scientific community – “Why Most Published Research Findings Are False” [4].) How do we determine which effects are real versus spurious? Or, said more appropriately, what is the probability that an observed biomarker effect is real?

Here is a Bayesian perspective on this that I believe goes a long way toward explaining this problem. The basic premise herein is that there has been an over-reliance on p-values when we should be putting our attention on Bayesian probabilities. I will use a simple example, but one that is based in reality, to make the point.

Example – Contrasting p-values and Bayesian Probabilities

Suppose there is interest in assessing 100 biomarkers as potential predictive biomarkers (the same could be true for prognostic biomarkers). A study is done, and a hypothesis test is done for each biomarker regarding its predictive ability or “significance.” For one of the biomarkers, denoted B*, the hypothesis test produces a p-value of 0.0001. Now, normally that might seem amazing and strong evidence against the null hypothesis, but enlightened researchers know that such a p-value needs to be considered in light of some sort of adjustment for multiple testing. Let’s take the easiest AND most conservative approach (since I want to guard against false positive findings) – the Bonferroni correction.

This can be done easily in one of two ways. First, take the traditional significance level of the hypothesis test of 0.05 and divide by the number of tests – in this case 100. That gives an adjusted significance level of 0.0005. The observed p-value of 0.0001 is smaller than the adjusted significance level and, therefore, is still statistically significant by the frequentist null hypothesis significance testing (NHST) approach even using the most conservative multiplicity adjustment method. The other and equivalent approach is to take the observed p-value and multiply it by the number of comparisons (i.e. 100) to get an adjusted p-value of 0.01, which is still considered highly significant and cause to reject the null hypothesis. Thus, B* seems to be a very promising biomarker.
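For those who like to check the arithmetic, here is a minimal sketch (plain Python, using the numbers in this example) of the two equivalent Bonferroni views just described.

```python
m = 100            # number of biomarker hypothesis tests
alpha = 0.05       # traditional significance level
p_obs = 0.0001     # observed p-value for B*

alpha_adj = alpha / m            # adjusted significance level: 0.0005
p_adj = min(1.0, p_obs * m)      # adjusted (Bonferroni) p-value: 0.01

print(p_obs <= alpha_adj, p_adj <= alpha)  # True True: B* stays "significant" either way
```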

Now, we all know that the frequentist/NHST approach uses only the data from this singular experiment for making inference. If we wanted to use other information about these 100 biomarkers and incorporate it formally into our analysis, we would have to take a Bayesian approach. What prior information might we have and how would it be incorporated into our thinking and inference about a large array of biomarkers?

When taking a Bayesian perspective (one which I had taken in my former life at Lilly), there is a need to develop a prior for each of the 100 hypotheses, one per biomarker. A simple, yet VERY valuable and credible, way to do this is to talk to the scientists and get their opinions about individual biomarkers or the set of biomarkers collectively. There is an entire literature on “prior elicitation” in the Bayesian inference world [5] that I will not get into at this time, but essentially, that is what I am proposing herein. In a particular situation in my former work – one similar to this example and the motivation for this approach – a small group of scientists with deep expertise in the biology of the disease and the mechanism of action of the experimental treatment collectively agreed that there was about a 20% chance that one of these 100 biomarkers would be a predictive biomarker distinguishing responders from non-responders to the experimental treatment. Written more formally, the prior is

pr(at least one H0 of the 100 tested is false) = 0.20.

The scientists were not willing to bet on one biomarker or even a small set of biomarkers but were interested in this entire panel of biomarkers related to the disease process. This low probability for a prior is quite reasonable given the highly exploratory nature of this research goal.

Since there are 100 biomarkers, each with equal chance of being predictive according to the scientists, we can calculate a prior for each H0 by dividing it uniformly over all biomarker hypotheses as

pr(any individual H0 is false) = 0.20 / 100 = 0.002.

Now, going back to the data with a p-value of 0.0001 for B* and using Eq. 1 or Eq. 2 (they are equivalent), we can obtain the upper bound on the posterior probability of H0 being false as

p1 < 0.44.

That is, our most significant biomarker (B*) in the experiment (p=0.01 even after multiplicity adjustment!) has less than a 50/50 chance of being truly predictive. That’s quite a different answer than what is conveyed by the frequentist approach. If that surprises you, then see Blog No. 5 “The Pr( You’re Bayesian) > 0.50” for a clear explanation of how different a p-value is from a Bayesian posterior probability. They are as different as pr(A|B) and pr(B|A) or, in more concrete terms, as I like to say, pr(cloudy|rain) and pr(rain|cloudy). Clearly, it is the latter – a Bayesian probability – that is relevant.
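As a quick numerical check, the sketch given earlier reproduces this bound (assuming the posterior_bound helper defined above).

```python
# B*: p = 0.0001 with the uniform prior of 0.20/100 = 0.002 on each individual H0
print(round(posterior_bound(0.0001, 0.002), 2))  # -> 0.44
```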

We do not have to divide the overall prior of 0.20 equally over the 100 biomarker hypotheses if that is not warranted. For example, in some situations I have experienced, the scientists/physicians may order the biomarkers according to their plausibility of being a predictive biomarker. This can be based on previous research, the published literature, or exquisite knowledge of the disease or treatment. Thus, in this example, we could identify biomarkers B1-B10 as the most likely to be predictive. Then, biomarkers B11-B35 are remote possibilities for being predictive, and finally B36-B100 are purely exploratory. One way we could arithmetically divide the 0.20 prior is as follows:

  1. A prior of 0.10 is allocated to the most promising set of biomarkers and therefore B1-B10 individually get (0.10 / 10) = 0.01 as a prior for their hypothesis tests.
  2. A prior of 0.05 is allocated to the next most important tier of biomarkers and therefore B11-B35 individually get (0.05 / 25) = 0.002 as a prior for their hypothesis tests.
  3. The remaining prior of 0.05 is allocated to the exploratory set of biomarkers and therefore B36-B100 individually get (0.05 / 65) = 0.00077 as a prior for their hypothesis tests.

Now, we can calculate the upper bound on the posterior probability of B* really being a predictive biomarker (Eq. 1 or 2), but it depends on which of the three categories it falls into as described above.

Category | Prior Probability that the Biomarker Effect is Real (i.e. H0 is False) | Upper Bound on Posterior Probability that its Predictive Biomarker Effect is Real (i.e. H0 is False)
1 | 0.01 | 0.80
2 | 0.002 | 0.44
3 | 0.00077 | 0.24
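The table values can be reproduced with the same illustrative helper from earlier, looping over the three tier priors.

```python
tier_priors = {"Category 1": 0.01, "Category 2": 0.002, "Category 3": 0.05 / 65}
for tier, prior in tier_priors.items():
    print(tier, round(posterior_bound(0.0001, prior), 2))
# Category 1 0.8, Category 2 0.44, Category 3 0.24
```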

Thus, for the SAME LEVEL OF EVIDENCE from the current experiment (p=0.0001), we have VERY DIFFERENT CONCLUSIONS about whether B* is really a predictive biomarker for the new treatment. That is quite reasonable. If B* were in Category 3, a highly significant finding (i.e. p=0.0001) would be quite surprising and should be regarded cautiously as a possible false positive finding. The advantage of the Bayesian approach is that it quantifies the probability of H0 being false. Said differently, there is at least a (1-0.24)=0.76 probability that B* is a false positive finding (i.e. H0 is really true). If B* were in Category 1, where there was some evidence external to the current experiment that it could be a predictive biomarker, then the same significant finding should be viewed as adding evidence to what is already believed/known, and there is a notably higher probability that the finding is real – as much as 0.80.

I will note that there are more sophisticated and comprehensive approaches to allocating a prior across many hypotheses [6] that will not be covered in this blog that is intended for a more general scientific audience.

A Path Forward

So, what do we believe? We should believe in the Bayesian posterior probability, which is “calibrated” by our prior knowledge/belief/evidence. The range of possible posterior probabilities for asserting that we have identified a new predictive biomarker for our experimental treatment is quite wide, depending on the prior. Unless B* is in Category 1, that probability is low. Even the adjusted p-value of 0.01 by itself has very little meaning in this context and should NOT be interpreted as strong evidence against the null hypothesis (see Blog No. 7 for what a p-value is really worth).

Now, what should we do? For this argument, assume B* is in Category 2 or that we used a uniform prior across all 100 potential biomarkers (i.e. the prior is 0.002). In drug development, especially in earlier stages and exploratory research, a probability of a new scientific finding of ~40% (note: the bound calculated from the BFB is reasonably sharp, at least in one in-depth examination [7]) could very well be enough evidence to move forward with additional research. Our decisions should be governed by some sort of utility function. If the next step in the research process costs $3 million, then proceeding with the identified biomarker could be quite reasonable. If the next step in the research process costs $30 million, then proceeding with the identified biomarker might be questionable. It also depends on the pay-off at the end of this research journey. If a successful research program results in a new drug for an unmet medical need or for a large population of patients for which the new treatment would be a better alternative to existing treatments, then even a $30 million investment may be worth the risk.
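To make the utility-function idea concrete, here is a toy sketch of an expected-value calculation. Every number in it is my own hypothetical assumption (the blog does not quantify the pay-off), and a real decision analysis would also account for downstream failure rates, timelines, and safety.

```python
# Toy expected-value comparison; every figure here is hypothetical and purely illustrative.
p_finding_real = 0.44     # (upper bound on) probability that B* is truly predictive, uniform-prior case
payoff_if_real = 200e6    # hypothetical value of a successful downstream program, in dollars
cost_next_step = 30e6     # hypothetical cost of the next study, in dollars

expected_net = p_finding_real * payoff_if_real - cost_next_step
print(f"Expected net value: ${expected_net / 1e6:.0f}M")  # positive here, so proceeding may be defensible
```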

The point is this:

  • You must accurately quantify the probability of the real/true state of nature (is this identified biomarker really predictive or not?) and base decisions on that probability, which is decidedly NOT the p-value (a measure of evidence from the current experiment). What is appropriate is the Bayesian posterior probability that combines prior knowledge with the current evidence.
  • To calculate the Bayesian posterior probability, a numerical prior is needed. This can be derived or described in many ways and is not as difficult as many opponents of Bayesian approaches state.
  • Decisions should be based on the probability that the finding is a false positive (or a false negative in some cases) as well as the costs (financial or human burden/safety) and benefits (financial or medical/societal need) that are quantified in a utility function.

References

[1] Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62-71.

[2] Berger, J. O., and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82, 112-122.

[3] Benjamin, D., and Berger, J. (2019). Three recommendations for improving the use of p-values. The American Statistician, 73:sup1, 186-191.

[4] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.

[5] O’Hagan, A. (2019). Expert knowledge elicitation: Subjective but scientific. The American Statistician, 73:sup1, 69-81.

[6] Berger, J. O., Wang, X., and Shen, L. (2014). A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics, 24(1), 110-129.

[7] Bayarri, M. J., Benjamin, D., Berger, J., and Sellke, T. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology, 72, 90-103.

3 thoughts on “No. 8: Let’s Get Real – Bayes and Biomarkers”

  1. Wonderful post, Steve. The only thing I’ll argue with is the way you derived a multiplicity-adjusted prior. The prior probability or prior belief that a certain assertion is true should come solely from the belief about that individual assertion. The number of other assertions evaluated should not matter. I prefer to put this all in terms of getting the model right. Here we have a data model and a prior distribution. The prior distribution can refer to the population of candidate markers, and it can be a sparsity prior such as the horseshoe prior (or the Laplace prior if you’re a lasso fan, but that prior has several disadvantages relative to the horseshoe prior). The prior chosen is independent of the number of features that you will be playing it against. The prior is used in a joint model of all candidate markers.


    1. Frank,
      Thank you for your comments. I am glad you are initiating some dialogue on what I have written. Your insights are ALWAYS welcome here!

      I understand (I think) your argument against dividing the “total prior” into many individual priors. What you say – “The prior probability or prior belief that a certain assertion is true should come solely from the belief about that individual assertion.” – is very good advice. What you say can and should be done when dealing with a handful of biomarker hypotheses.

      However, when there are many potential biomarker hypotheses being evaluated (especially in the early stages of drug development), it is extremely difficult to get scientists to evaluate and agree on an individual prior for each individual biomarker hypothesis when there are dozens if not hundreds of hypotheses. They can think in terms of the overall probability that at least one biomarker hypothesis is false, and they can come to some reasonable consensus on an approximate point prior – i.e. there is a 20% chance that we will “find something” important among these 100 biomarkers – which is very valuable input for evaluating the final results. Even if they can only be slightly more specific by placing biomarkers into Categories or Tiers of prior belief, we have gone a long way in making subsequent inferences much more informative. [I must give former colleagues Lei Shen and Rick Higgs a shout-out for this practical approach that is quite understandable by scientists and quite useful to formal statistical inference.] As shown in the Table in Blog No. 8, the interpretation of the results hinges on such prior input and weighs heavily on the decision to proceed with or fund additional research on a biomarker.

      So, the approach described in this blog is meant to give a reasonable idea of how likely it is that the null hypothesis is false using a posterior probability and to give a more realistic and accurate assessment than a p-value, no matter how highly significant that p-value appears.


      1. Hi Steve,
        I understand what you’re saying. I still feel that the prior for one assertion should not change if there are other assertions. A sparsity prior will let you specify the “population” of expected biomarker effects without having to give the population size.

