No. 10 – Always do Subgroup IDENTIFICATION

Note: This blog is a little longer than usual because it challenges some traditional notions of statistical thinking about subgroups, and I include a Case Study to explicitly make some key points about my thinking when evaluating subgroups in clinical trials.

Background and Definitions

I recently spoke in a webinar entitled “Subgroup Analysis,” hosted by the National Institute of Statistical Sciences and Merck and organized by Dan Holder of Merck. The entire webinar audio can be found at https://www.niss.org/events/niss-merck-virtual-meet-subgroup-analysis. The other speakers were Ilya Lipkovich and Rob Hemmings, and I will note that their talks were very good. You should give them a listen. As for my talk, I will let you be the judge if you want to give it a listen. What I write here provides greater emphasis on one part of my talk in that webinar.

Sir Richard Peto has been heard to say, “Always do subgroup analysis, but never believe them.”

Subgroup analysis has received many negative or cautionary commentaries (1, 2 are just two examples with long lists of additional references), and it would be difficult to give an exhaustive review of the many perspectives on this topic. To address these concerns and balance them against this blog title, I must first make a distinction between ‘subgroup analysis’ and ‘subgroup identification.’

In (2), the authors state, “By ‘subgroup analysis,’ we mean any evaluation of treatment effects for a specific end point in subgroups of patients defined by baseline characteristics.” Most of the articles dedicated to this topic are explicitly or implicitly discussing post hoc analysis: (a) routine assessment for heterogeneity of treatment effects; (b) sometimes for exploratory evaluation purposes; and (c) sometimes to “salvage” a failed trial. The latter occurs when the primary efficacy result is not significant in the overall population enrolled in the trial, but there may be a subgroup of patients for which the treatment effect is medically meaningful or statistically significant. Various authors have provided rules, guidelines or checklists for subgroup analysis (3, 4, 5, 6).

By subgroup identification, [I believe the phrase was first mentioned in (7)], I mean the systematic, pre-planned, statistically rigorous evaluation of subgroups defined by baseline covariates to make valid inferences. Built into this definition is the notion of “valid inferences” that goes beyond the post hoc or exploratory subgroup analysis approach. Subgroup identification is characterized by what I have called Disciplined Subgroup Search (DSS) which is defined in (8), and has the following six characteristics:

  1. Prespecification: the algorithm/methodology to be used for identifying subgroups, the list of baseline characteristics/biomarkers that form the covariate space to be searched, complexity of subgroup definitions (i.e., how many covariates are allowed to define the subgroup), as well as any other options/decisions that can be made in the analysis process. (In short, this is no different than prespecification of any important analysis in a Phase 3 trial that adheres to the ICH-E9 Guideline.)
  2. Adjusting for multiplicity: how statistical significance (i.e., p-values) of a subgroup finding will be adjusted for multiplicity. [Also, Bayesian approaches will be noted further into this blog.]
  3. Bias correction: how estimates of treatment effect are corrected for bias due to the selection bias associated with searching multiple subgroups.
  4. Biomarker effects: allows for separating prognostic biomarker effects from predictive biomarker effects.
  5. Interactions: allows for multiple biomarkers to be included in the definition of a subgroup – i.e. not multiple one-variable-at-a-time analyses.
  6. Partition: allows for identification of a cut-off value for a continuous biomarker that separates smaller treatment effects from larger treatment effects.
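
Characteristic 2 can be made concrete with a small simulation. The sketch below is my own hypothetical illustration, not any specific DSS tool: it adjusts the nominal p-value of the best subgroup among K prespecified subgroups for the multiplicity of the search, assuming independent interaction tests. Real implementations use permutation or resampling methods that preserve the correlation among overlapping subgroups.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 10          # number of prespecified subgroups (assumed for illustration)
p_obs = 0.027   # nominal p-value of the best (smallest) subgroup interaction test

# Under the global null, each interaction p-value is ~ Uniform(0, 1).
# Approximate the null distribution of the minimum p-value over K
# independent tests by Monte Carlo.
n_sim = 100_000
min_p_null = rng.uniform(size=(n_sim, K)).min(axis=1)
p_adjusted = (min_p_null <= p_obs).mean()

# Closed form under independence: 1 - (1 - p)^K
p_closed = 1 - (1 - p_obs) ** K
print(f"adjusted p (Monte Carlo): {p_adjusted:.3f}")
print(f"adjusted p (closed form): {p_closed:.3f}")
```

Even a nominally impressive p=0.027 becomes roughly 0.24 after accounting for a 10-subgroup search, which is the quantitative version of Peto's skepticism.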

With the clear, intentional approach described by the six characteristics above, I am now prepared to argue why one should always do subgroup identification and how it renders subgroup analysis unnecessary and inadvisable, perhaps even irrelevant. There are two scenarios for subgroup identification, both of which require the use of DSS.

1. Homogeneity of Treatment Effect

One scenario is to evaluate the homogeneity or consistency of a treatment effect across subgroups. This is typically done for well-defined patient characteristics such as gender, race and geographic region, as well as a set of medical characteristics or disease-state parameters (e.g., disease severity, etiology, previous treatments). Traditional analyses are done one-variable-at-a-time and displayed as tornado diagrams or tables comparing the treatment effect in a subgroup and its complement. A typical example is below for Toprol-XL, taken from the US FDA label (9).

There are two subgroups that are highlighted where one might see a suspected differential treatment effect – US vs Non-US populations and male vs female. In response to these findings, the US label for Toprol-XL states the following (my emphasis is noted in bold):

“The figure … illustrates principal results for a wide variety of subgroup comparisons, including US vs. non-US populations (the latter of which was not pre-specified). The combined endpoints of all-cause mortality plus all-cause hospitalization and of mortality plus heart failure hospitalization showed consistent effects in the overall study population and the subgroups, including women and the US population. However, in the US subgroup and women, overall mortality and cardiovascular mortality appeared less affected. Analyses of female and US patients were carried out because they each represented about 25% of the overall population. Nonetheless, subgroup analyses can be difficult to interpret, and it is not known whether these represent true differences or chance effects.”

Now, the subgroups defined in the above Figure were likely pre-planned or specified in the protocol, and even if not, they are still a generally familiar or recognizable list. The problem is that there was likely no disciplined approach to their assessment as proposed by DSS. Consequently, the label, which should be informative to physicians and patients, is left with ambiguous statements. It is as if the Sponsor and FDA settled on something like, “We observed a difference in treatment effect in a couple of the many subgroups explored, but we don’t know whether they are real or spurious. We can’t figure it out, so go figure it out for yourself.” The official labeling statements and my parody are useless, and if the public cannot rely on learned people – who have access to all the data, and much more data on many, many other drugs and their gender effects – to give a better explanation, then how can we expect them to know what to believe or decide?

What if the Sponsor had taken a DSS approach from which reliable estimates of treatment effects in these subgroups, and their statistical significance, could have been derived? There were ten baseline covariates evaluated that could have been pre-specified in the Statistical Analysis Plan (SAP). Any subgroup identification tool could have been used as long as it adjusted the p-values for multiplicity and corrected the bias in treatment effect estimates due to the multiplicity of the search. I refer you to a review article and other materials/tools from Ilya Lipkovich and Alex Dmitrienko (10, 11). With the DSS approach, more accurate, reliable estimates of the treatment effect in each subgroup could have been produced, resulting in more definitive conclusions and advice for physicians and patients.

I have seen numerous other labels with confusing statements about subgroups (e.g. belimumab, BENLYSTA®, use in Black/African American Patients), and my own personal experience in pharma companies includes dealing with situations in which there is a differential treatment effect in one of the 10-20 subgroups evaluated. Usually, this resulted in an inordinate amount of time spent trying to explain what is likely to be, or appears to be, a spurious significant result. When these explanations are done post hoc or ad hoc, which they usually are, it is too late to make definitive assertions.

2. Heterogeneity of Treatment Effects

The second scenario is when there is a desire to have more targeted therapeutics in which a subgroup of patients with measurable characteristics can be identified to have an exceptional efficacy response or increased risk for adverse effects. The same DSS approach can be taken to attain more credible results in a subgroup of interest, even in a so-called failed trial, when there is no significant treatment effect in the overall population.

My example here is ramucirumab for hepatocellular carcinoma (HCC) and its Phase 3 clinical trial (REACH) of ramucirumab versus placebo on top of best supportive care (12). The primary analysis yielded a hazard ratio of 0.87 for overall survival for ramucirumab vs placebo (95% CI 0.72–1.05; p=0.14). Thus, the study was a failure. The article notes that there were 10 pre-specified subgroups. One subgroup was defined by a biomarker called α-fetoprotein (AFP) and whether its value was <400ng/ml or >400ng/ml. The article did note that an AFP cut-off of 400 ng/ml was pre-specified because it is commonly used in prognostic scoring systems for HCC. As it turns out, this biomarker-defined subgroup displayed a differential treatment effect that was noteworthy (interaction p-value=0.027), not only for its magnitude but also for its biological plausibility. When analyzing the patient subgroup with AFP>400ng/ml, there was significantly longer survival on ramucirumab versus placebo (HR=0.67 for overall survival; p=0.006).

Based on these findings, the Sponsor did a follow-up study (REACH-2) that enrolled only the enriched population of patients with AFP>400ng/ml (13). That study confirmed the overall survival benefit of ramucirumab versus placebo in patients with baseline AFP>400ng/ml (HR=0.71; p-value=0.0199). [I am intrigued at the official reporting of a p-value of 0.0199 as opposed to 0.02. Certainly, a p-value that has 0.01 in it seems to be quite strong and is probably remembered better … but that’s just me.] So, the bottom line is that ramucirumab (CYRAMZA®) was approved for this indication by FDA in May, 2019.

So, this sounds like a success story, and it is … but wait! What if DSS had been formally done in REACH? What if the AFP subgroup was formally included in a pre-specified analysis plan along with other subgroups using DSS? What if a specific subgroup identification search methodology was pre-specified? What if adjusted p-values and effect estimates were calculated? What if they were still significant and meaningful? Could ramucirumab have been approved in the AFP>400ng/ml subgroup based on REACH in 2015 using DSS instead of

  • Spending 3 years, and
  • Many, many millions of dollars, and
  • Tens of thousands of patients not having access to an effective medication?

To make this happen, one would have had to use a multiple comparison procedure that allocated, say, α=0.04 to the primary analysis and α=0.01 to the DSS analysis. With a treatment-by-AFP interaction p-value of 0.027 in the first study, REACH, the subgroup finding still would not have been significant, thereby necessitating another confirmatory trial. HOWEVER, it seems that the Sponsor took a big risk betting on such a marginal interaction p-value in a set of 10 subgroup analyses. If a DSS analysis had been done, then a more objective assessment of the significance of the treatment effect in this subgroup could have been computed, as well as an unbiased (or less biased) estimate of the treatment effect, resulting in a more- or better-informed decision!

So, to counter Sir Richard Peto, in my humble perspective: “Always do subgroup identification using DSS so the results are more believable.”

As noted above, there were other considerations such as biological plausibility that made this subgroup more appealing, and the article does describe analyses of different cut-off values for defining the AFP subgroup that showed a consistent pattern of improved survival with increasing baseline AFP.

But wait … other considerations? Biological plausibility? Sounds to me like a prior!

Bayesian Thinking

As described in Blog No. 8: Let’s Get Real – Bayes and Biomarkers, one could assign a prior to each biomarker and then use the Bayes Factor to help compute a posterior probability for the hypothesis that AFP is truly a predictive biomarker. Because of the known strong prognostic effects of AFP and its link to the biology of HCC, one might have been willing to give it a higher prior probability of being a predictive biomarker. Table 1 shows various priors and two columns of posterior probabilities. The first posterior column is based on the interaction p-value of 0.027. The second posterior column comes from the p-value of 0.006 based on the test of treatment effect (ramucirumab vs placebo) for overall survival within the AFP>400ng/ml subgroup alone. This latter analysis is intended to be more indicative of what might be seen in a subsequent trial in this population, though one must recognize that the observed results are quite likely optimistic due to selection bias.

Table 1

Posterior probabilities that AFP>400ng/ml is a predictive biomarker for ramucirumab treatment effect in hepatocellular carcinoma.

Prior        Upper Bound on Posterior        Upper Bound on Posterior Probability
Probability  Probability based on            based on Treatment Effect in Subgroup
             Interaction Test (p=0.027)      with AFP>400ng/ml (p=0.006)
0.20         0.485                           0.750
0.30         0.617                           0.837
0.40         0.715                           0.889
0.50         0.790                           0.923

Posterior probabilities (p1) are computed from prior probabilities (p0) and the stated p-values in the Table using p1 ≤ {1 + [(1-p0)/p0] / BFB}^(-1), where BFB is the Bayes Factor Bound.
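
The upper bounds in Table 1 can be reproduced directly. The short sketch below (my own illustration in Python) uses the Bayes Factor Bound of Sellke, Bayarri and Berger, BFB = -1/(e·p·ln p) for p < 1/e, plugged into the formula above:

```python
import math

def bayes_factor_bound(p):
    """Bayes Factor Bound: BFB = -1/(e * p * ln p), valid for p < 1/e."""
    assert 0 < p < 1 / math.e
    return -1.0 / (math.e * p * math.log(p))

def posterior_upper_bound(prior, p):
    """Upper bound on posterior probability: {1 + [(1-p0)/p0] / BFB}^(-1)."""
    return 1.0 / (1.0 + ((1.0 - prior) / prior) / bayes_factor_bound(p))

# Reproduce Table 1 (interaction p=0.027; within-subgroup p=0.006)
for prior in (0.20, 0.30, 0.40, 0.50):
    print(f"{prior:.2f}  {posterior_upper_bound(prior, 0.027):.3f}  "
          f"{posterior_upper_bound(prior, 0.006):.3f}")
```

As a side note relevant to the Toprol-XL discussion further below, the same function with a skeptical prior of 0.05 and p=0.027 gives an upper bound of only about 0.17.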

Armed with these posterior probabilities (albeit upper bounds), the decision to proceed with another VERY expensive trial for multiple years seems more comfortable (at least to me). At least such probabilities more explicitly address the question of interest (i.e., is AFP>400ng/ml truly a predictive biomarker?) rather than post hoc p-values, which carry a known optimistic selection bias that is unquantifiable without using DSS or a Bayesian approach.

So, I can go yet another step in modifying Sir Richard Peto’s perspective and I say, “Always use Bayesian thinking when doing subgroup identification so you can quantify how believable the results are.”

In fact, in the Toprol-XL label, for example, I suspect the prior on there being a male/female differential treatment effect would have been quite low – perhaps as low as 0.05 or 0.01. Consequently, the upper bound on the posterior probability of that effect would have also been quite low, unless there was a very small p-value (i.e., less than 0.001). I do not know the p-value for that interaction test; it is not reported in the original publication. However, it is worth noting that the observed US/non-US and male/female differential treatment effects were not even mentioned in the original article, so I suspect the p-values were somewhat inconsequential. Again, had this been done using DSS or a Bayesian approach, a more quantifiable answer could have been obtained and a clearer conclusion could have been presented in the label rather than the ambiguous “non-statement” that is currently there.

Recommendations

  1. So, always do subgroup identification using DSS!
    • This can be done for “defensive purposes” when the goal is to show homogeneity of treatment effect and to avoid consternation (and endless label haggling with regulators) trying to “explain away” what appear to be spurious findings. It can also help make labeling statements less ambiguous and more informative to physicians and patients.
    • When done for “offensive purposes” – targeted therapeutics or rescuing a “failed trial” – one has to consider using a Statistical Analysis Plan that allocates some of its Type 1 Error Rate control (alpha) to DSS. This could be a split of alpha=0.04 to the primary analysis and alpha=0.01 to DSS. The slight reduction in alpha for the primary analysis does require an increase in sample size to maintain power. Alternatively, the sample size could remain the same with the recognition that the Sponsor is giving up some power on the primary analysis in order to gain power for finding a targeted therapeutic. This requires thoughtful analysis of the subgroup size, the potential differential treatment effect size in the subgroup and perhaps other factors in order to optimize the overall chances of a successful trial.
  2. Interpretation of subgroup findings, no matter how they are achieved, requires Bayesian thinking. In Blog No. 5: pr(You’re Bayesian)>0.50, I argue (and demonstrate) that one cannot interpret a p-value by itself; it requires a prior. More formal Bayesian thinking could be implemented, even with a crude approach, by declaring your prior probability that each protocol-specified subgroup has a differential treatment effect. This does not completely solve the problem, but again, it minimizes post hoc confusion and arguments over what to believe is real or spurious.
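
To give a rough sense of the sample-size cost of the alpha split in Recommendation 1, the sketch below (my own illustration using the standard normal-approximation relationship, n proportional to (z_alpha/2 + z_beta)^2) computes the inflation needed to keep 90% power when the primary analysis moves from a two-sided alpha of 0.05 to 0.04:

```python
from statistics import NormalDist

def n_ratio(alpha_new, alpha_ref=0.05, power=0.90):
    """Relative sample size to keep the same power when the two-sided
    significance level changes from alpha_ref to alpha_new
    (normal approximation: n is proportional to (z_{a/2} + z_beta)^2)."""
    z = NormalDist().inv_cdf
    z_beta = z(power)
    ref = (z(1 - alpha_ref / 2) + z_beta) ** 2
    new = (z(1 - alpha_new / 2) + z_beta) ** 2
    return new / ref

print(f"inflation for alpha 0.05 -> 0.04: {n_ratio(0.04):.3f}")
```

The cost is modest: roughly a 6% larger trial buys a formally controlled DSS analysis at alpha=0.01.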

Finally, in a subsequent blog, I will tackle the notion of subgroups defined by post-randomization effects, which is another “No-No” in the orthodoxy of statistical practice. Yet, this is what physicians do in treating patients and ‘personalizing medicine.’ Until then, make it a stellar day.

References

  1. Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA 1991; 266:93-98.
  2. Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials. N Engl J Med 2007; 357:2189-2194.
  3. Brookes ST, Whitley E, Peters TJ, Mulheran PA, Egger M, Davey Smith G. Subgroup analyses in randomized controlled trials: quantifying the risks of false-positives and false-negatives. Health Technology Assessment 2001; 5(33).
  4. Rothwell PM. Subgroup analysis in randomized controlled trials: importance, indications, and interpretation. Lancet 2005; 365:176-186.
  5. Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ 2010; 340:c117. doi:10.1136/bmj.c117.
  6. Burke JF, Sussman JB, Kent DM, Hayward RA. Three simple rules to ensure reasonably credible subgroup analyses. BMJ 2015; 351:h5651. doi:10.1136/bmj.h5651.
  7. Foster JC, Taylor JMG, Ruberg SJ. Subgroup identification from randomized clinical trial data. Stat Med 2011; 30:2867-2880.
  8. Ruberg SJ, Shen L. Personalized Medicine: Four Perspectives of Tailored Medicine. Stat Biopharm Res 2015; 7:214-229.
  9. https://www.accessdata.fda.gov/drugsatfda_docs/label/2009/019962s038lbl.pdf (accessed 27 Sep 2019).
  10. Lipkovich I, Dmitrienko A, D’Agostino RB Sr. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Stat Med 2017; 36:136-196.
  11. http://biopharmnet.com/subgroup-analysis-software/ (accessed 27 Sep 2019).
  12. Zhu AX, et al. Ramucirumab versus placebo as second-line treatment in patients with advanced hepatocellular carcinoma following first-line therapy with sorafenib (REACH): a randomised, double-blind, multicentre, phase 3 trial. Lancet Oncol 2015; 16:859-870.
  13. Zhu AX, et al. Ramucirumab after sorafenib in patients with advanced hepatocellular carcinoma and increased α-fetoprotein concentrations (REACH-2): a randomised, double-blind, placebo-controlled, phase 3 trial. Lancet Oncol 2019; 20:282-296.
