Everything you are about to read is NOT controversial, though it may be surprising to you. In the end, I hope it fundamentally changes how you think about interpreting a clinical trial result.

Let’s get started.

**Experiments as Diagnostic Tests**

The Merriam-Webster definition of a bioassay is the “determination of the relative strength of a substance (such as a drug) by comparing its effect on a test organism with that of a standard preparation.” In this regard, a clinical trial is nothing more than a very sophisticated bioassay, and indeed, any research experiment may be considered as an assay. It is the attempt to quantify an unknown characteristic of a substance or organism through chemical or biological analysis.

The design, operating characteristics and interpretation of a diagnostic test are well-known and serve as an excellent analogy for clinical trials [1, 2] and experimental research in general. This is clearly depicted in the familiar 2×2 table shown in Figure 1.

**Figure 1**

The analogy of diagnostic testing (**bold font**) and null hypothesis significance testing (NHST) (*italicized font*).

In both diagnostic testing and experimental research, the goal is to make inference about an unknown characteristic of the patient (i.e. is the patient pregnant?) or an unknown truth about the state of nature (i.e. does this drug work?). The diagnostic test is designed to have suitable sensitivity (i.e. the ability to identify patients with the characteristic) and specificity (i.e. the ability to identify those without the characteristic). In the NHST paradigm, a statistical test is designed to have adequate power (i.e. the ability to detect an effect if it exists) while controlling the Type 1 error (i.e. limiting false positive findings). Of course, the Type 2 error (denoted β) is 1 − power and corresponds to a false negative finding. The sensitivity and specificity that constitute an acceptable diagnostic test depend on the costs or consequences of false positive and false negative findings. At its inception, NHST likewise required one to “decide about α, β, and sample size before the experiment, **based on subjective cost-benefit considerations**” (my emphasis added) [3]. This seems to have been lost in the modern, mindless application of NHST, whereby α = 0.05 and power is set to 0.80 or 0.90 in most applications, regardless of the scientific problem or societal circumstance.

In both diagnostic testing and NHST, the design of the tests depends on the conditional existence or non-existence of the characteristic or truth. Sensitivity, specificity, Type 1 error, and power (and equivalently Type 2 error) are conditional probabilities that operate in the “vertical direction” of the 2×2 table in Figure 1. That is, sensitivity and power start from the assumption that the characteristic is present or the effect is positive, while specificity and Type 1 error start from the assumption that the characteristic is absent or the effect is null. In diagnostic testing, sensitivity and specificity can be tuned by changing the cut-off value defining a positive/negative result, and various cut-offs are often examined to optimize sensitivity and specificity appropriately for a medical condition. In NHST, the sample size may be calculated to meet α = 0.05 and a power of 0.80 or 0.90, but rarely are α and power themselves changed. Furthermore, α is used solely to define the critical value (i.e. cut-off) that demarcates the rejection region for the hypothesis test (e.g. 1.96 for α = 0.05).
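To make the design interplay concrete, here is a minimal sketch of how α, power, and effect size jointly determine the sample size and critical value. The two-sample z-test approximation and the standardized effect size of 0.5 are illustrative assumptions, not values from the text:

```python
import math
from statistics import NormalDist

def per_arm_sample_size(alpha: float, power: float, effect_size: float) -> int:
    """Per-arm n for a two-sided, two-sample z-test detecting a
    standardized effect size, using the normal-approximation formula."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, e.g. 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # e.g. 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Tightening alpha or raising power inflates n: the design knobs trade off.
n_conventional = per_arm_sample_size(0.05, 0.80, 0.5)  # the "default" design
n_stringent = per_arm_sample_size(0.01, 0.90, 0.5)     # substantially larger
```

The point of the sketch is that α and power are design dials, exactly like a diagnostic cut-off, and could in principle be set by cost-benefit reasoning rather than convention.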

Thus, the concepts underlying the design of a diagnostic test **are identical** to those for designing a clinical trial or research experiment under the NHST paradigm.

**Clinical Trial Interpretation**

So, how should one interpret the outcome of a single experiment or clinical trial? If we follow the diagnostic testing analogy, we need to understand the population prevalence of the characteristic that is unknown in the individual patient. The important quantity is the Positive Predictive Value (PPV) which is the conditional probability of the patient having the characteristic given (i.e. assuming) the test result is positive. This conditional probability operates in the “horizontal direction” of the 2×2 table in Figure 1 and is computed by evaluating the fraction of true positives relative to all positive findings (i.e. true and false positives). For example, if sensitivity is written as

pr(the diagnostic test is positive|the patient has the characteristic) = pr(A|B),

then PPV is the inverse probability

pr(the patient has the characteristic|the diagnostic test is positive) = pr(B|A).

**It is only this latter probability that is meaningful to the physician and patient.** Similar statements can be written for the negative predictive value (NPV) as

pr(the patient does NOT have the characteristic|the diagnostic test is negative).
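These inverse probabilities follow directly from Bayes’ theorem applied to the cells of the 2×2 table. A minimal sketch, where the 95%/95% test characteristics and the 1% prevalence are hypothetical values chosen for illustration:

```python
def ppv_npv(sensitivity: float, specificity: float, prevalence: float):
    """Positive and negative predictive values from the 2x2 table cells."""
    tp = sensitivity * prevalence              # true positives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# A test with 95% sensitivity and 95% specificity, applied where
# prevalence is only 1%: the PPV is roughly 0.16, nowhere near 0.95.
ppv, npv = ppv_npv(0.95, 0.95, 0.01)
```

The same sensitivity and specificity yield very different predictive values as prevalence changes, which is exactly why the prior matters for interpretation.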

In NHST, prevalence is directly analogous to the likelihood that H_{a} is true – or equivalently that H_{0} is false – in a research experiment (i.e. the prior). The inverse probabilities for PPV and NPV are provided to us by none other than Bayes’ theorem, but are rarely considered in the NHST paradigm. Yet no one with any understanding of diagnostic testing would ever conclude that there is a 95% probability that a patient has a characteristic of interest because they tested positive on a diagnostic test with 95% sensitivity! **Why then, when we perform a sophisticated “diagnostic test” such as a research experiment or clinical trial, do we conclude we have a positive finding when the observed p-value is <0.05, just because we designed the assay with a significance level of 0.05?** **The interpretation of an individual research experiment must be based on the posterior probability of the alternative (or conversely the null) hypothesis being true, in the same way that any individual result from a diagnostic test can only be interpreted using PPV or NPV.**

**Détente**

As noted in the Introduction, there have been long-standing debates on the use of frequentist and Bayesian approaches to inference. At times, the arguments on both sides have been quite philosophical or abstract or mathematical, leaving many practitioners and consumers of statistical information confused and alienated, even to the point of abandoning their use.

In fact, the frequentist and Bayesian approaches can be harmonized by adopting a diagnostic testing mindset. The design of a research experiment can use the interplay between the α-level and power of the statistical test, with the resulting sample size and critical value, to optimize the performance of the experiment or trial. However, when interpreting the results of the experiment, a Bayesian evaluation is most appropriate. The additional requirement and complexity of the Bayesian approach lie in the quantification of a prior for H_{0} being false, or equivalently H_{a} being true.

In its simplest form, the prior could be stated as a point probability – a single number in the interval (0,1). For example, for a Phase 2 clinical trial of a new treatment, one may argue the probability that the new treatment works is 0.30. This may be derived from historical data on such treatments in this therapeutic class, preclinical models of disease, pharmacokinetic/pharmacodynamic models, the success/failure of other treatments in the same mechanistic class or other sources of scientific knowledge. There is a full literature on rigorous, scientific elicitation and construction of a prior for a hypothesis of interest.

Using a point prior, there is a simple approximation for computing the probability of H_{0} being false using the Bayes Factor Bound (BFB), which rests on reasonable, practical assumptions [4]. Let p_{0} be the prior probability that H_{0} is false and let p be the p-value from the test of H_{0} in the current experiment. Then the Bayes Factor Bound is

BFB = 1/[-e*p*ln(p)],

and the upper bound on the posterior probability that H_{0} is false (p_{1}), given the observed data, is

p_{1} ≤ p_{0}*BFB/(1 - p_{0} + p_{0}*BFB)   (Equation 1).

In words, this formula contains the prior probability that the null hypothesis is false and the current level of evidence against the null hypothesis to update the probability of H_{0} being false after the experiment. This posterior probability is directly related to our belief against H_{0}, which is decidedly **NOT** what a p-value is, and it is more understandable and interpretable. The p-value and the posterior probability are as distinct as sensitivity and PPV, and we cannot think of the Bayesian posterior probability as (1 – p-value) since we all know that pr(B|A) ≠ 1 – pr(A|B). Finally, note that this posterior can be used in constructing a prior for subsequent experimentation and hypotheses of the same or similar nature.

To complete the example, with a prior probability of 0.30 that an experimental drug works in a clinical trial (i.e. that H_{0} is false), suppose our hypothetical clinical trial produces a p-value of 0.05. Using Equation 1, the posterior probability that H_{0} is false is less than or equal to 0.513. There are more sophisticated approaches that use a full probability distribution of effect size, rather than a point probability, as a prior, but those are beyond this paper; the concepts and principles are identical. Equation 1 may serve as a quick or approximate assessment of the likelihood of H_{0} being false. Individual scientists may have different priors based on their knowledge, experience, or even bias, leading to different levels of posterior belief. That is OK. What is important is to discuss the sources of prior data and information rather than fixate on a single p-value from the current experiment.
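The worked example can be checked in a few lines. This is a direct transcription of the BFB and Equation 1, with the 0.30 prior and p = 0.05 taken from the text:

```python
import math

def bayes_factor_bound(p_value: float) -> float:
    """Bayes Factor Bound, BFB = 1/(-e * p * ln p); valid for p < 1/e [4]."""
    return 1.0 / (-math.e * p_value * math.log(p_value))

def posterior_upper_bound(prior: float, p_value: float) -> float:
    """Equation 1: upper bound on pr(H0 is false | data) from a point prior."""
    bfb = bayes_factor_bound(p_value)
    return prior * bfb / (1 - prior + prior * bfb)

# Prior of 0.30 that the drug works and an observed p-value of 0.05:
# the posterior bound is about 0.513 -- far below the 0.95 one might
# naively read off as (1 - p-value).
post = posterior_upper_bound(0.30, 0.05)
```

Note that even a prior of 0.50 with p = 0.05 yields a posterior bound around 0.71, underscoring that "p < 0.05" is weaker evidence than it appears.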

As noted when I started this blog, the fact that the PPV and NPV are the primary quantities of interest for interpreting the results of a diagnostic test is undisputed and commonplace. The analogy of a diagnostic test to a research experiment, including clinical trials, is evident. Yet, the interpretation of the vast majority of scientific research ignores PPV (a Bayesian posterior probability) in favor of sensitivity (significance level). Are you shocked and surprised enough to retire statistical significance? I hope so!!!

**References**

1. Diamond, G. A., Forrester, J. S. Clinical trials and statistical verdicts: Probable grounds for appeal. *Ann Int Med* 98, 385-394 (1983).
2. Browner, W. S., Newman, T. B. Are all significant p values created equal? The analogy between diagnostic tests and clinical research. *JAMA* 257, 2459–2463 (1987).
3. Gigerenzer, G. Mindless statistics. *J Socio-Econ.* 33, 587-606 (2004).
4. Sellke, T., Bayarri, M. J., Berger, J. O. Calibration of p values for testing precise null hypotheses. *The Amer. Statist.* 55, 62-71 (2001).