Note: This blog is a little longer than usual because it challenges some traditional notions of statistical thinking about subgroups, and I include a Case Study to make explicit some key points about my thinking when evaluating subgroups in clinical trials.

**Background and Definitions**

I recently presented in a webinar entitled “Subgroup Analysis,” hosted by the National Institute of Statistical Sciences and Merck and moderated by Dan Holder of Merck. The entire webinar audio can be found at https://www.niss.org/events/niss-merck-virtual-meet-subgroup-analysis. The other speakers were Ilya Lipkovich and Rob Hemmings, and I will note that their talks were very good. You should give them a listen. As for my talk, I will let you be the judge if you want to give it a listen. What I am writing here provides greater emphasis on one part of my talk in that webinar.

**Sir Richard Peto has been heard to say, “Always do subgroup analysis, but never believe them.”**

Subgroup analysis has received many negative or cautionary commentaries (1, 2 are just two examples with long lists of additional references), and it would be difficult to give an exhaustive review of the many perspectives on this topic. To address these concerns and balance them against this blog title, I must first make a distinction between ‘subgroup analysis’ and ‘subgroup identification.’

In (2), the authors state, “By ‘subgroup analysis,’ we mean any evaluation of treatment effects for a specific end point in subgroups of patients defined by baseline characteristics.” Most of the articles dedicated to this topic are explicitly or implicitly discussing *post hoc* analysis: (a) routine assessment for heterogeneity of treatment effects; (b) sometimes for exploratory evaluation purposes; and (c) sometimes to “salvage” a failed trial. The latter occurs when the primary efficacy result is not significant in the overall population enrolled in the trial, but there may be a subgroup of patients for which the treatment effect is medically meaningful or statistically significant. Various authors have provided rules, guidelines or checklists for subgroup analysis (3, 4, 5, 6).

By subgroup identification [I believe the phrase was first mentioned in (7)], I mean the systematic, pre-planned, statistically rigorous evaluation of subgroups defined by baseline covariates to make valid inferences. Built into this definition is the notion of “valid inferences” that goes beyond the post hoc or exploratory subgroup *analysis* approach. Subgroup *identification* is characterized by what I have called Disciplined Subgroup Search (DSS), which is defined in (8) and has the following six characteristics:

- **Prespecification:** the algorithm/methodology to be used for identifying subgroups; the list of baseline characteristics/biomarkers that form the covariate space to be searched; the complexity of subgroup definitions (i.e., how many covariates are allowed to define a subgroup); as well as any other options/decisions that can be made in the analysis process. (In short, this is no different than prespecification of any important analysis in a Phase 3 trial that adheres to the ICH-E9 Guideline.)
- **Adjusting for multiplicity:** how statistical significance (i.e., *p*-values) of a subgroup finding will be adjusted for multiplicity. [Also, Bayesian approaches will be noted further into this blog.]
- **Bias correction:** how estimates of treatment effect are corrected for the selection bias associated with searching multiple subgroups.
- **Biomarker effects:** allows for separating prognostic biomarker effects from predictive biomarker effects.
- **Interactions:** allows for multiple biomarkers to be included in the definition of a subgroup, i.e. not multiple one-variable-at-a-time analyses.
- **Partition:** allows for identification of a cut-off value for a continuous biomarker that separates smaller treatment effects from larger treatment effects.
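To make the prespecification element concrete, here is a minimal, purely hypothetical sketch of what a DSS prespecification recorded in a SAP might pin down before unblinding. Every name and value below (including the choice of SIDES, one of the search algorithms reviewed in (10)) is illustrative only, not taken from any actual trial.

```python
# Hypothetical sketch of a DSS prespecification -- illustrative values only.
dss_spec = {
    "algorithm": "SIDES",                      # subgroup-search method, fixed in advance
    "covariates": ["age", "sex", "region",     # covariate space to be searched
                   "baseline_severity", "AFP"],
    "max_covariates_per_subgroup": 2,          # complexity limit on subgroup definitions
    "multiplicity_adjustment": "permutation",  # how subgroup p-values will be adjusted
    "bias_correction": "cross_validation",     # how selection bias in estimates is handled
    "continuous_cutoff_search": True,          # partition search for continuous biomarkers
}
```

The point is not the particular values but that every such choice is written down before the data are examined, exactly as one would for any other pre-specified Phase 3 analysis.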

With the clear, intentional approach described by the six characteristics above, **I am now prepared to argue why one should always do subgroup identification** and how it renders subgroup analysis unnecessary and inadvisable, perhaps even irrelevant. There are two scenarios for subgroup identification, both of which require the use of DSS.

**1. Homogeneity of Treatment Effect**

One scenario is to evaluate the homogeneity or consistency of a treatment effect across subgroups. This is typically done for well-defined patient characteristics such as gender, race, and geographic region, as well as, in general, a set of medical characteristics or disease-state parameters (e.g. disease severity, etiology, previous treatments). Traditional analyses are done one-variable-at-a-time and displayed as tornado diagrams or tables comparing the treatment effect in a subgroup and its complement. A typical example is below for Toprol-XL, taken from the US FDA label (9).

There are two subgroups that are highlighted where one might see a suspected differential treatment effect – US vs Non-US populations and male vs female. In response to these findings, the US label for Toprol-XL states the following (my emphasis is noted in bold):

“The figure … illustrates principal results for a wide variety of subgroup comparisons, including US vs. non-US populations (the latter of which was not pre-specified). **The combined endpoints of all-cause mortality plus all-cause hospitalization and of mortality plus heart failure hospitalization showed consistent effects in the overall study population and the subgroups, including women and the US population.** However, in the US subgroup and women, overall mortality and cardiovascular mortality appeared less affected. Analyses of female and US patients were carried out because they each represented about 25% of the overall population. Nonetheless, subgroup analyses can be difficult to interpret, and **it is not known whether these represent true differences or chance effects.**”

Now, the subgroups defined in the above Figure were likely pre-planned or specified in the protocol, and if not, they are still a generally familiar or recognizable list. The problem is that there was likely no disciplined approach to their assessment, as proposed by DSS. Consequently the label, which should be informative to physicians and patients, is left with ambiguous statements. It is as if the Sponsor and FDA settled on something like, “We observed a difference in treatment effect in a couple of the many subgroups explored, but we don’t know whether they are real or spurious. We can’t figure it out, so go figure it out for yourself.” The official labeling statements and my parody are useless, and if the public cannot rely on learned people – who have access to all the data and much more data on many, many other drugs and their gender effects – to give a better explanation, then how can we expect them to know what to believe or how to decide?

What if the Sponsor had taken a DSS approach, from which reliable estimates of treatment effects in these subgroups, along with their statistical significance, could have been obtained? There were ten baseline covariates evaluated that could have been pre-specified in the Statistical Analysis Plan (SAP). Any subgroup identification tool could have been used, as long as it adjusted the p-values for multiplicity and corrected the bias in treatment effect estimates due to multiplicity. I refer you to a review article and other materials/tools from Ilya Lipkovich and Alex Dmitrienko (10, 11). With the DSS approach, more accurate, reliable estimates of the treatment effect in each subgroup could have been produced, resulting in more definitive conclusions and advice for physicians and patients.
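To illustrate the kind of multiplicity adjustment DSS calls for, here is a minimal sketch of a permutation-based adjustment: the most extreme subgroup interaction statistic is referred to the permutation distribution of the maximum statistic over all candidate covariates. This is not the analysis of the Toprol-XL data; the data are simulated and the interaction statistic is deliberately crude.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated trial: 400 patients, 1:1 treatment, continuous outcome, and
# 10 binary baseline covariates. All data are synthetic (hypothetical).
n, n_cov = 400, 10
treat = rng.integers(0, 2, n)
X = rng.integers(0, 2, (n, n_cov))
y = 0.3 * treat + rng.normal(size=n)  # homogeneous true treatment effect

def interaction_z(y, treat, x):
    """Crude z-statistic for a treatment-by-covariate interaction:
    difference of the two within-level treatment-effect estimates."""
    stats = []
    for level in (0, 1):
        m = x == level
        y1, y0 = y[m][treat[m] == 1], y[m][treat[m] == 0]
        stats.append((y1.mean() - y0.mean(),
                      y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size))
    (d0, v0), (d1, v1) = stats
    return (d1 - d0) / np.sqrt(v0 + v1)

obs = np.array([abs(interaction_z(y, treat, X[:, j])) for j in range(n_cov)])

# Permuting treatment labels destroys any treatment-by-covariate interaction,
# so the permutation distribution of max|z| over all 10 covariates gives a
# multiplicity-adjusted reference for the single most extreme finding.
B = 500
max_null = np.empty(B)
for b in range(B):
    t_perm = rng.permutation(treat)
    max_null[b] = max(abs(interaction_z(y, t_perm, X[:, j]))
                      for j in range(n_cov))

adj_p = (max_null >= obs.max()).mean()  # adjusted p-value for the "best" subgroup
```

Because the reference distribution is the maximum over all searched covariates, a subgroup that looks striking in a one-at-a-time analysis is judged against how extreme the best of ten null findings typically is.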

I have seen numerous other labels with confusing statements about subgroups (e.g. belimumab, BENLYSTA®, use in Black/African American patients), and my own personal experience in pharma companies includes dealing with situations in which there is a differential treatment effect in some subgroup out of the 10-20 that are evaluated. Usually, this resulted in an inordinate amount of time spent trying to explain what is likely to be, or appears to be, a spurious significant result. When these explanations are done *post hoc* or *ad hoc*, which they usually are, it is too late to make definitive assertions.

**2. Heterogeneity of Treatment Effects**

The second scenario is when there is a desire to have more targeted therapeutics, in which a subgroup of patients with measurable characteristics can be identified as having an exceptional efficacy response or an increased risk for adverse effects. The same DSS approach can be taken to attain more credible results in a subgroup of interest, even in a so-called failed trial when there is no significant treatment effect in the overall population.

My example here is ramucirumab for hepatocellular carcinoma (HCC) and its Phase 3 clinical trial (REACH) of ramucirumab versus placebo on top of best supportive care (12). The trial’s primary endpoint analysis resulted in a hazard ratio of 0.87 for overall survival for ramucirumab vs placebo (95% CI 0.72-1.05; p=0.14). Thus, the study was a failure. The article notes that there were 10 pre-specified subgroups. One subgroup was defined by a biomarker called α-fetoprotein (AFP) and whether its value was <400 ng/ml or >400 ng/ml. The article did note that an AFP cut-off of 400 ng/ml was pre-specified because it is commonly used in prognostic scoring systems for HCC. As it turns out, this biomarker-defined subgroup displayed a differential treatment effect that was noteworthy (interaction p-value = 0.027), not only for its magnitude but also for its biological plausibility. When analyzing the patient subgroup with AFP>400 ng/ml, there was significantly longer survival on ramucirumab versus placebo (HR = 0.67 for overall survival; p = 0.006).

Based on these findings, the Sponsor did a follow-up study (REACH-2) that enrolled only the enriched population of patients with AFP>400 ng/ml (13). That study confirmed the overall survival benefit of ramucirumab versus placebo in patients with baseline AFP>400 ng/ml (HR = 0.71; p-value = 0.0199). [I am intrigued at the official reporting of a p-value of 0.0199 as opposed to 0.02. Certainly, a p-value that has 0.01 in it seems to be quite strong and is probably remembered better … but that’s just me.] So, the bottom line is that ramucirumab (CYRAMZA®) was approved for this indication by FDA in May 2019.

So, this sounds like a success story, and it is … but wait! What if DSS had been formally done in REACH? What if the AFP subgroup was formally included in a pre-specified analysis plan along with other subgroups using DSS? What if a specific subgroup identification search methodology was pre-specified? What if *adjusted* p-values and effect estimates were calculated? *What if they were still significant and meaningful?* Could ramucirumab have been approved in the AFP>400 ng/ml subgroup based on REACH in 2015 using DSS instead of

- Spending 3 years, and
- Many, many millions of dollars, and
- Tens of thousands of patients not having access to an effective medication?

To make this happen, one would have had to use a multiple comparison procedure that allocated, say, α = 0.04 to the primary analysis and α = 0.01 to the DSS analysis. With a treatment-by-AFP interaction p-value of 0.027 in the first study, REACH, it still would not have been significant, thereby necessitating another confirmatory trial. HOWEVER, it seems that *the Sponsor took a big risk betting on such a marginal interaction p-value in a set of 10 subgroup analyses*. If a DSS analysis had been done, then a more objective assessment of the significance of the treatment effect in this subgroup could have been computed, as well as an unbiased (or less biased) estimate of the treatment effect, resulting in a more- or better-informed decision!

So, to counter Sir Richard Peto, **in my humble perspective: “Always do subgroup identification using DSS so the results are more believable.”**

As noted above, there were other considerations such as biological plausibility that made this subgroup more appealing, and the article does describe analyses of different cut-off values for defining the AFP subgroup that showed a consistent pattern of improved survival with increasing baseline AFP.

But wait … other considerations? Biological plausibility? Sounds to me like a prior!

**Bayesian Thinking**

As described in Blog No. 8: Let’s Get Real – Bayes and Biomarkers, one could assign a prior to each biomarker and then use the Bayes Factor to help compute a posterior probability for the hypothesis that AFP is truly a predictive biomarker. Because of the known strong prognostic effects of AFP and its link to the biology of HCC, one might have been willing to give it a higher prior probability of being a predictive biomarker. Table 1 shows various priors and two columns of posterior probabilities. The first posterior column is based on the interaction p-value of 0.027. The second posterior column comes from the p-value of 0.006 based on the test of treatment effect (ramucirumab vs placebo) for overall survival *within* the AFP>400 ng/ml subgroup alone. This latter analysis is intended to be more indicative of what might be seen in a subsequent trial in this population, though one must recognize the observed results are quite likely optimistic due to selection bias.

**Table 1**

Posterior probabilities that AFP>400ng/ml is a predictive biomarker for ramucirumab treatment effect in hepatocellular carcinoma.

| Prior Probability | Upper Bound on Posterior Probability based on Interaction Test (p=0.027) | Upper Bound on Posterior Probability based on Treatment Effect in Subgroup with AFP>400 ng/ml (p=0.006) |
| --- | --- | --- |
| 0.20 | 0.485 | 0.750 |
| 0.30 | 0.617 | 0.837 |
| 0.40 | 0.715 | 0.889 |
| 0.50 | 0.790 | 0.923 |

Posterior probabilities (p1) are computed from prior probabilities (p0) and the stated p-values in the Table using p1 ≤ {1 + [(1 - p0)/p0] / BFB}^(-1), where BFB = 1/(-e · p · ln p) is the Bayes Factor Bound.
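The Table can be reproduced, to within rounding, with a few lines of code implementing the Bayes Factor Bound of Sellke, Bayarri and Berger, BFB = 1/(-e · p · ln p), which is valid for p < 1/e:

```python
from math import e, log

def bfb(p):
    """Bayes Factor Bound (Sellke, Bayarri and Berger): an upper bound on
    the Bayes factor in favor of the alternative, valid for p < 1/e."""
    return 1.0 / (-e * p * log(p))

def posterior_upper_bound(prior, p):
    """Upper bound on the posterior probability that the alternative is true,
    given a prior probability and an observed p-value."""
    return 1.0 / (1.0 + ((1.0 - prior) / prior) / bfb(p))

# Recompute Table 1: priors of 0.20-0.50 against the two observed p-values.
for prior in (0.20, 0.30, 0.40, 0.50):
    print(prior,
          round(posterior_upper_bound(prior, 0.027), 3),  # interaction test
          round(posterior_upper_bound(prior, 0.006), 3))  # within-subgroup test
```

Note that these are upper bounds over a wide class of priors on the effect size, which is what makes them conservative in the direction of the alternative.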

Armed with these posterior probabilities (albeit upper bounds), the decision to proceed with another VERY expensive trial for multiple years seems more comfortable (at least to me). At least such probabilities more explicitly address the question of interest (i.e. is AFP>400 ng/ml truly a predictive biomarker?) rather than post hoc p-values, for which there is a known optimistic selection bias that is unquantifiable without using DSS or a Bayesian approach.

So, I can go yet another step in modifying Sir Richard Peto’s perspective and **I say, “Always use Bayesian thinking when doing subgroup identification so you can quantify how believable the results are.”**

In fact, in the Toprol-XL label, for example, I suspect the prior on there being a male/female differential treatment effect would have been quite low, perhaps as low as 0.05 or 0.01. Consequently, the upper bound on the posterior probability of that effect would have also been quite low, unless there was a **very** small p-value (i.e. less than 0.001). I do not know the p-value for that interaction test, which was not reported in the original publication. However, it is worth noting that the observed US/non-US and male/female differential treatment effects were not even mentioned in the original article, so I suspect the p-values were somewhat inconsequential. Again, had this been done using DSS or a Bayesian approach, a more quantifiable answer could have been obtained and a clearer conclusion could have been represented in the label rather than the ambiguous “non-statement” that is currently there.

**Recommendations**

- So, always do subgroup identification using DSS!
- This can be done for “defensive purposes” when the goal is to show homogeneity of treatment effect and to avoid consternation (and endless label haggling with regulators) in trying to “explain away” what appear to be spurious findings. It can also help make labeling statements less ambiguous and more informative to physicians and patients.
- When done for “offensive purposes” – targeted therapeutics or rescuing a “failed trial” – one has to consider using a Statistical Analysis Plan that allocates some of its Type 1 Error Rate control (alpha) to DSS. This could be a split of alpha = 0.04 to the primary analysis and alpha = 0.01 to DSS. The slight reduction in alpha for the primary analysis does require an increase in sample size to maintain power. Alternatively, the sample size could remain the same with the recognition that the Sponsor is giving up some power on the primary analysis in order to gain power for finding a targeted therapeutic. This requires thoughtful analysis of the subgroup size, the potential differential treatment effect size in the subgroup, and perhaps other factors in order to *optimize the overall chances of a successful trial*.

- Interpretation of subgroup findings, no matter how they are achieved, requires Bayesian thinking. In Blog No. 5: pr(You’re Bayesian)>0.50, I argue (and demonstrate) that one cannot interpret a p-value by itself. It requires a prior. More formal use of Bayesian thinking could be implemented even with a crude approach of declaring your prior probability that each protocol-specified subgroup might have a differential treatment effect. This does not completely solve the problem, but again, it minimizes post hoc confusion and arguments over what to believe is real or spurious.

Finally, in a subsequent blog, I will tackle the notion of subgroups
defined by post-randomization effects, which is another “No-No” in the orthodoxy
of statistical practice. Yet, this is what physicians do in treating patients
and ‘personalizing medicine.’ Until then, make it a stellar day.

**References**

- Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA 1991;266:93-98.
- Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in Medicine - Reporting of Subgroup Analyses in Clinical Trials. N Engl J Med 2007; 357:2189-2194.
- Brookes ST, Whitley E, Peters TJ, Mulheran PA, Egger M, Davey Smith G. Subgroup analyses in randomized controlled trials: quantifying the risks of false-positives and false-negatives. Health Technology Assessment 2001, 5(33).
- Rothwell PM. Subgroup analysis in randomized controlled trials: importance, indications, and interpretation. Lancet 2005; 365:176–86.
- Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses, BMJ 2010; 340:c117doi: 10.1136/bmj.c117.
- Burke, J F, Sussman, J B, Kent, D M, Hayward , R A. Three simple rules to ensure reasonably credible subgroup analyses. BMJ 2015;351:h5651 doi: 10.1136/bmj.h5651.
- Foster, J C, Taylor, J M G, Ruberg, S J. Subgroup identification from randomized clinical trial data. Stats in Med 2011, 30: 2867-2880.
- Ruberg, SJ, Shen, L. Personalized Medicine: Four Perspectives of Tailored Medicine. Stats in Biopharm Res 2015. 7:3, 214-229.
- https://www.accessdata.fda.gov/drugsatfda_docs/label/2009/019962s038lbl.pdf (accessed 27 Sep 2019).
- Lipkovich, I, Dmitrienko, A, D’Agostino Sr., RB. Tutorial in biostatistics: data‐driven subgroup identification and analysis in clinical trials. Stats in Med 2017, 36, 136-196.
- http://biopharmnet.com/subgroup-analysis-software/ (accessed 27 Sep 2019).
- Zhu, AX, et al. Ramucirumab versus placebo as second-line treatment in patients with advanced hepatocellular carcinoma following first-line therapy with sorafenib (REACH): a randomised, double-blind, multicentre, phase 3 trial. Lancet Onc 2015; 16, 859-870.
- Zhu, AX, et al. Ramucirumab after sorafenib in patients with advanced hepatocellular carcinoma and increased α-fetoprotein concentrations (REACH-2): a randomised, double-blind, placebo-controlled, phase 3 trial. Lancet Onc 2019; 20, 282-296.