Everything that you are about to read is NOT controversial, though it may be surprising to you. In the end I hope that it fundamentally changes your thinking or understanding about how to interpret a clinical trial result.
Let’s get started.
Experiments as Diagnostic Tests
The Merriam-Webster definition of a bioassay is the “determination of the relative strength of a substance (such as a drug) by comparing its effect on a test organism with that of a standard preparation.” In this regard, a clinical trial is nothing more than a very sophisticated bioassay, and indeed, any research experiment may be considered as an assay. It is the attempt to quantify an unknown characteristic of a substance or organism through chemical or biological analysis.
The design, operating characteristics and interpretation of a diagnostic test are well-known and serve as an excellent analogy for clinical trials [1, 2] and experimental research in general. This is clearly depicted in the familiar 2×2 table shown in Figure 1.
Figure 1. The analogy between diagnostic testing (bold font) and null hypothesis significance testing (NHST) (italicized font).
In both diagnostic testing and experimental research, the goal is to make an inference about an unknown characteristic of the patient (e.g. is the patient pregnant?) or an unknown truth about the state of nature (e.g. does this drug work?). The diagnostic test is designed to have suitable sensitivity (i.e. ability to identify patients with the characteristic) and specificity (i.e. ability to identify those without the characteristic). In the NHST paradigm, a statistical test is designed to have adequate power (i.e. ability to detect an effect if it exists) while controlling the Type 1 error, denoted α (i.e. limiting false positive findings). Of course, the Type 2 error (denoted β) is (1-power) and is referred to as a false negative finding. The sensitivity and specificity that constitute an acceptable diagnostic test depend on the costs or consequences of false positive and false negative findings. At its inception, NHST was likewise meant to “decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations” (my emphasis added) [3]. This seems to have been lost in the modern mindless application of NHST, whereby α=0.05 and power=(0.80, 0.90), i.e. β=(0.20, 0.10), in most applications, regardless of the scientific problem or societal circumstance.
In both diagnostic testing and NHST, the design of the tests depends on the conditional existence or non-existence of the characteristic or truth. Sensitivity, specificity, Type 1 error and power (and equivalently Type 2 error) are conditional probabilities that operate in the “vertical direction” of the 2×2 table in Figure 1. That is, sensitivity and power start with an assumption of the characteristic being present or a positive effect, while specificity and Type 1 error start with the assumption of the characteristic being absent or there being a null effect. In diagnostic testing, the sensitivity and specificity can be tuned by changing the cut-off value for defining a positive/negative result, and most often various cut-offs are used to optimize sensitivity and specificity appropriately for a medical condition. In NHST, the sample size may be calculated to meet α=0.05 and power=(0.80, 0.90), but rarely are α and β themselves changed. Furthermore, α is used solely to define the critical value (i.e. cut-off) that demarcates the rejection region for the hypothesis test (e.g. 1.96 for a two-sided test at α=0.05).
Thus, the concepts underlying the design of a diagnostic test are identical to those for designing a clinical trial or research experiment under the NHST paradigm.
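To make the design step concrete, here is a minimal Python sketch of how α fixes the critical value and how α and power together drive the sample size. It assumes a two-sided z-test comparing the means of two equal-sized arms with a known common standard deviation; the function names and the example effect size (delta) and standard deviation (sigma) are hypothetical and are not taken from any particular trial.

```python
from math import ceil
from statistics import NormalDist  # standard library, Python 3.8+


def critical_value(alpha, two_sided=True):
    """Cut-off on the z scale that demarcates the rejection region."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided, two-sample z-test.

    Assumption: normal approximation with known common SD `sigma`
    and a true difference in means of `delta`.
    """
    z_alpha = critical_value(alpha)
    z_beta = NormalDist().inv_cdf(power)  # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


if __name__ == "__main__":
    print(round(critical_value(0.05), 2))  # 1.96
    for power in (0.80, 0.90):
        n = sample_size_per_arm(delta=0.5, sigma=1.0, power=power)
        print(f"power={power:.2f}: n per arm = {n}")
```

Tightening α or raising the power plays the same role as moving the cut-off of a diagnostic test: it trades one kind of error against the other, at the cost of sample size.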
Clinical Trial Interpretation
So, how should one interpret the outcome of a single experiment or clinical trial? If we follow the diagnostic testing analogy, we need to understand the population prevalence of the characteristic that is unknown in the individual patient. The important quantity is the Positive Predictive Value (PPV) which is the conditional probability of the patient having the characteristic given (i.e. assuming) the test result is positive. This conditional probability operates in the “horizontal direction” of the 2×2 table in Figure 1 and is computed by evaluating the fraction of true positives relative to all positive findings (i.e. true and false positives). For example, if sensitivity is written as
pr(the diagnostic test is positive|the patient has the characteristic) = pr(A|B),
then PPV is the inverse probability
pr(the patient has the characteristic|the diagnostic test is positive) = pr(B|A).
It is only this latter probability that is meaningful to the physician and patient. Similar statements can be written for negative predictive value (NPV) as
pr(the patient does NOT have the characteristic|the diagnostic test is negative).
In NHST, prevalence is directly analogous to the likelihood that Ha is true, or equivalently that H0 is false, in a research experiment (i.e. the prior). The inverse probabilities for PPV and NPV are provided to us by none other than Bayes' Theorem, but are rarely considered in the NHST paradigm. Yet, no one with any understanding of diagnostic testing would ever conclude that there is a 95% probability that a patient has a characteristic of interest if they tested positive when using a diagnostic test with 95% sensitivity! Why then, when we do a sophisticated “diagnostic test” such as a research experiment or clinical trial, do we conclude we have a positive finding if the observed result is a p-value < 0.05 just because we designed the assay with a significance level of 0.05? The interpretation of an individual research experiment must be based on the posterior probability of the alternative (or conversely the null) hypothesis being true, in the same way that any individual result from a diagnostic test can only be interpreted using PPV or NPV.
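The gap between sensitivity and PPV is easy to see numerically. The sketch below applies Bayes' Theorem to the four cells of the 2×2 table; the function name and the example values (95% sensitivity, 95% specificity, 1% prevalence) are illustrative assumptions, not figures from any real test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) via Bayes' Theorem.

    Each cell of the 2x2 table is expressed as a joint probability.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)

    ppv = true_pos / (true_pos + false_pos)  # pr(characteristic | test positive)
    npv = true_neg / (true_neg + false_neg)  # pr(no characteristic | test negative)
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical test: 95% sensitivity and specificity, 1% prevalence.
    ppv, npv = predictive_values(0.95, 0.95, 0.01)
    print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")  # PPV is about 0.161
```

With these assumed inputs, a positive result from a "95% sensitive" test leaves only about a 16% probability that the patient has the characteristic, because the rare condition is swamped by false positives.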
As noted in the Introduction, there have been long-standing debates on the use of frequentist and Bayesian approaches to inference. At times, the arguments on both sides have been quite philosophical or abstract or mathematical, leaving many practitioners and consumers of statistical information confused and alienated, even to the point of abandoning their use.
In fact, the frequentist and Bayesian approaches could be harmonized by taking a diagnostic testing mindset. The design of a research experiment can be done using the interplay between the α-level and power of the statistical test, with the resulting sample size and critical value chosen to optimize the performance of the experiment or trial. However, when interpreting the results of the experiment, a Bayesian evaluation is most appropriate. The additional requirements and complexity of the Bayesian approach lie in the quantification of a prior for H0 being false, or equivalently Ha being true.
In its simplest form, the prior could be stated as a point probability – a single number in the interval (0,1). For example, for a Phase 2 clinical trial of a new treatment, one may argue the probability that the new treatment works is 0.30. This may be derived from historical data on such treatments in this therapeutic class, preclinical models of disease, pharmacokinetic/pharmacodynamic models, the success/failure of other treatments in the same mechanistic class or other sources of scientific knowledge. There is a full literature on rigorous, scientific elicitation and construction of a prior for a hypothesis of interest.
Using a point prior, there is a simple approximation for computing the probability of H0 being false using the Bayes Factor Bound (BFB), which is based on reasonable, practical assumptions [4]. Let p0 be the prior probability that H0 is false and let p be the p-value from the test of H0 in the current experiment. Then, for p < 1/e, the Bayes Factor Bound is
BFB = 1/(-e*p*ln(p)),
and the upper bound on the posterior probability that H0 is false (p1) given the observed data is
p1 ≤ p0*BFB/(1-p0+p0*BFB) (Equation 1).
In words, this formula contains the prior probability that the null hypothesis is false and the current level of evidence against the null hypothesis to update the probability of H0 being false after the experiment. This posterior probability is directly related to our belief against H0, which is decidedly NOT what a p-value is, and it is more understandable and interpretable. The p-value and the posterior probability are as distinct as sensitivity and PPV, and we cannot think of the Bayesian posterior probability as (1 – p-value) since we all know that pr(B|A) ≠ 1 – pr(A|B). Finally, note that this posterior can be used in constructing a prior for subsequent experimentation and hypotheses of the same or similar nature.
To complete the example, with a prior probability of 0.30 that an experimental drug works in a clinical trial (i.e. H0 is false), suppose our hypothetical clinical trial produces a p-value of 0.05. Using Equation 1, the posterior probability that H0 is false is less than or equal to 0.513. There are more sophisticated approaches that use a full probability distribution of effect size rather than a point probability as a prior, but those are beyond the scope of this post. However, the concepts and principles are identical. Equation 1 may serve as a quick or approximate assessment of the likelihood of H0 being false. Individual scientists may have different priors based on their knowledge, experience or even bias, leading to different levels of posterior belief. That is OK. What is important is to discuss the sources of prior data and information rather than fixate on a single p-value from the current experiment.
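The worked example can be checked with a few lines of Python. This is a sketch that assumes the Bayes Factor Bound takes the form 1/(-e*p*ln(p)) for p < 1/e, as given by Sellke, Bayarri and Berger [4]; that form reproduces the 0.513 quoted above. The function names are my own.

```python
from math import e, log


def bayes_factor_bound(p_value):
    """Upper bound on the Bayes factor against H0 for a given p-value.

    Assumed form: 1 / (-e * p * ln(p)), valid only for 0 < p < 1/e.
    """
    if not 0 < p_value < 1 / e:
        raise ValueError("the bound is defined only for 0 < p < 1/e")
    return 1 / (-e * p_value * log(p_value))


def posterior_upper_bound(prior, p_value):
    """Equation 1: upper bound on the posterior probability that H0 is false."""
    bfb = bayes_factor_bound(p_value)
    return prior * bfb / (1 - prior + prior * bfb)


if __name__ == "__main__":
    # Prior of 0.30 and p-value of 0.05, as in the example.
    print(round(posterior_upper_bound(0.30, 0.05), 3))  # 0.513

    # Different scientists, different priors, same p-value.
    for prior in (0.10, 0.30, 0.50):
        bound = posterior_upper_bound(prior, 0.05)
        print(f"prior={prior:.2f} -> posterior <= {bound:.3f}")
```

Running the loop with your own prior is a quick way to see how much a single p-value of 0.05 should, at most, move your belief.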
As noted when I started this post, the fact that the PPV and NPV are the primary quantities of interest for interpreting the results of a diagnostic test is undisputed and commonplace. The analogy of a diagnostic test to a research experiment, including clinical trials, is evident. Yet, the interpretation of the vast majority of scientific research ignores PPV (a Bayesian posterior probability) in favor of the false positive rate, i.e. 1 - specificity (the significance level). Are you shocked and surprised enough to retire statistical significance? I hope so!!!
[1] Diamond, G. A., Forrester, J. S. Clinical trials and statistical verdicts: Probable grounds for appeal. Ann Int Med 98, 385-394 (1983).
[2] Browner, W. S., Newman, T. B. Are all significant p values created equal? The analogy between diagnostic tests and clinical research. JAMA 257, 2459-2463 (1987).
[3] Gigerenzer, G. Mindless Statistics. J Socio-Econ 33, 587-606 (2004).
[4] Sellke, T., Bayarri, M. J., Berger, J. O. Calibration of p values for testing precise null hypotheses. The Amer Statist 55, 62-71 (2001).