Blog 21: Good News, Bad News, Worse News

Well, here I go again. I am back to blogging after a prolonged absence for a wide variety of personal and business reasons. Strap yourself in and stay tuned.

I was at the recent Joint Statistical Meetings (JSM), the largest gathering of statisticians on the planet, which brings together many professional societies from around the world, with the flagship being the American Statistical Association (ASA). I have a variety of blog topics to cover from that meeting, and I will start with some discussion of statistics and data science. I had multiple conversations with frustrated statisticians who had experienced what they perceived as inappropriate use of quantitative methods by professionals calling themselves data scientists. There were also sessions about how statisticians and data scientists can and should collaborate more as kindred professionals. While none of this is new, it made me realize a pattern in the conversations [Note 1: The pattern was detected by the neural network between my two ears, which you may like or distrust, depending on your view of my neural network and AI/ML!].

I put that pattern into three classifications – using regression trees (just kidding!) – that I describe as Good News, Bad News, and Worse News. First, the Good News. I will stick to medical/pharmaceutical situations and examples, since that is my area of direct expertise and where I have had multiple interactions with data scientists.

Good News

Scientists, in fact all people, like to hear good news. Someone discovered …

  • a new biological mechanism of disease;
  • a new biomarker that predicts some outcome;
  • a new drug that substantially reduces morbidity or mortality;
  • a heretofore undiscovered relationship between physician characteristics and prescribing behavior;
  • an association between hearing loss and dementia.

It’s exciting! Look what I/we found. Our scientific efforts have paid off in a new discovery. That is what science is all about – discovering new things. The more unexpected or novel or important the better.

In this age of vast data stores, there are so many opportunities to uncover new findings. And with indescribable computing power, it is relatively easy to plug such data into sophisticated algorithms to find patterns, relationships, associations, etc. in the data. It’s also rewarding. Those on the cutting edge of discovery get promoted, get funding, get attention, get praised. It creates a cycle: more actions and more analyses lead to more findings, and so the cycle continues.

You may notice that I did not refer to this as a virtuous cycle, though it very well could be. I omitted the word intentionally because as I have noted previously in other blogs and in the paragraph above: getting answers from large datasets is easy; assessing the quality of those answers is very difficult.

I refer you to a paper recently published by my former colleagues at Eli Lilly & Company and me in the Biometrical Journal (https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.202200164). In short, we simulated a vast array of realistic clinical trial datasets – 1200 distinct clinical trial datasets, to be exact – and posted a challenge on InnoCentive for Solvers to find the subgroup of exceptional responders inside those datasets. That is, using the covariates (i.e., features) provided, find the subspace of the entire data space in which the exceptional responders lie. For example, it might be that patients who are less than 50 years old and have a serum creatinine below 1.5 mg/dL are more likely to have a positive response to treatment. In the challenge, there were 40 covariates in each of the 1200 datasets, with different levels of complexity for defining the subgroup of exceptional responders. That is, the subgroup definition could be as simple as X22<40, or as complex as X9=2 and X21>55 and X33<37. [Note 2: The covariates did not have explicit descriptions and were merely labeled as X1 through X40. Some covariates were categorical and some were continuous. See the paper for details.]
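To make the setup concrete, here is a minimal sketch in Python of how a dataset with a threshold-defined subgroup of exceptional responders might be generated. This is not the simulation code from the paper or the challenge; the sample size, covariate distributions, rule and effect sizes below are all illustrative assumptions.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2023)
    n, p = 300, 40  # patients and covariates (sizes chosen for illustration only)

    # Covariates X1..X40: a mix of continuous and categorical, as in the challenge
    X = pd.DataFrame(
        {f"X{j}": rng.normal(50, 15, n) if j % 4 else rng.integers(0, 3, n)
         for j in range(1, p + 1)}
    )
    trt = rng.integers(0, 2, n)  # 1 = treated, 0 = control

    # Subgroup of exceptional responders defined by a simple threshold rule,
    # e.g., X22 < 40 (the rule and the effect size are hypothetical)
    subgroup = (X["X22"] < 40).astype(int)

    # Outcome: noise + a modest overall treatment effect + an extra effect in the subgroup
    y = rng.normal(0, 1, n) + 0.2 * trt + 1.0 * trt * subgroup

    data = pd.DataFrame({"trt": trt, "y": y}).join(X)
    # The Solver's task: given only trt, y and X1..X40, recover the rule X22 < 40.

A Solver sees only the assembled dataset; the generating rule is known only to the challenge designers, which is what makes it possible to score submissions objectively.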

One very interesting aspect of this entire exercise is that we generated 280 of the 1200 datasets with no subgroup of exceptional responders. That is, there was nothing but noise – random variation between the covariates and the outcome/response. Of the 748 interested parties who signed up for the challenge, the overwhelming majority found patterns in those “null” clinical trial datasets. They defined subgroups based on covariates even though the underlying simulation model had no relationship between covariates and clinical outcome measures. This is not surprising. Given a large enough dataset and enough variables, even if all variables are generated randomly with no relationship between them, there are likely to be spurious patterns in the data. So, when a data scientist once said to me, “Give me a large enough dataset and I guarantee that I will find the patterns in it,” I just shrugged my shoulders. So what! Anyone with a little bit of computing skill can do that. In fact, the bigger the dataset, the more likely you are to find a pattern.
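To see how easily “patterns” emerge from pure noise, consider this small sketch (the dimensions are again illustrative, not those of the challenge datasets): an outcome and 40 covariates generated independently of one another, followed by a naive screen for the most “promising” covariate.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n, p = 300, 40

    # A "null" dataset: outcome and covariates are independent by construction
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)

    # Screen every covariate against the outcome and keep the best-looking one
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
    print(f"smallest of the 40 p-values: {min(pvals):.3f}")

    # With 40 independent looks at pure noise, the chance that at least one
    # clears the conventional 0.05 bar is 1 - 0.95**40, roughly 87%.
    print(f"P(at least one 'finding' by chance) ~ {1 - 0.95**40:.2f}")

Run this a few times with different seeds and some covariate will “light up” in most runs, which is exactly the trap the null datasets in the challenge were designed to expose.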

The point is that when this happens in the real world with real data, it is tempting to declare success: “Look what I found!” Good news – there is something this data is telling us, and I sorted through all of it using AI or ML to find it! We need to take action!

Bad News

Let’s examine the other end of the spectrum – the Bad News.

First, what do I mean by “bad news”? One form of bad news is a new finding that does not support an established convention. Biennial breast cancer screening for women starting at age 40 was determined to generate more risks for women (e.g., unnecessary cost of screenings, false positive diagnoses leading to additional or inappropriate follow-up procedures) than waiting until age 50 to begin routine mammography. That set off a firestorm of publicity, public backlash and ongoing controversy in the healthcare system. Another example: hormone therapy for women, a bedrock of medical care for decades, was shown to have some serious long-term detrimental effects. Laws or public policy decisions to combat one societal problem often have unintended negative consequences that reverberate through the news or other public fora. Eggs are bad for you … so is caffeine! Omeprazole, pollution and a thousand other things are associated with dementia … or heart disease or depression or … whatever.

Watch out! Beware! Use caution! Warning!

The result is much the same as Good News. Researchers are rewarded for their findings – notoriety, prestige, perhaps promotions – and they request more funding to study the problem or the association further. How many times have you heard a story or read an article – scientific or otherwise – that ended with, “We solved that problem, and it is time to stop spending any more resources studying it”? Almost never. In many endeavors of public health, safety, law and order, equity, etc., there is always more to be done because the questions are difficult to answer.

[Note 3: In my personal experience, bad news can lead to the termination of resources and efforts. Most notably this occurs when a clinical trial of a new treatment fails, and the Sponsor discontinues that research program. These pharmaceutical sponsors are a business, and they have to carefully weigh how they invest their resources.]

What is interesting is that Bad News can be a lot like Good News, as noted above. Researchers get rewarded and there can be a lot of positive feedback and benefits that accrue to the researchers communicating the bad news.

Worse News

The worst news is neither good nor bad – it is just no news. It is reporting that, despite lots of time and effort, the data has not relented to our endless analytical torture and confessed an answer. This dataset, no matter how big it is, or how big you think it is, does not contain the right information to answer the question of interest. Or it is too complicated, too confounded, too messy to provide any reliable information. We analyzed the data left, right and center, top to bottom, and there were no trustworthy patterns, no subgroup of exceptional responders, no way to “rescue” this failed study. There is no model that accurately or reliably predicts COVID-19 infection, hospitalization, ICU admission, or death from COVID-19. In the case of COVID-19, there were hundreds of papers claiming to predict some aspect of the pandemic and patient outcomes, but I have yet to see a publication that says, “We analyzed all this data and couldn’t find a reliable model.” The vast majority of the time, the answer was “We developed an ML model that had high AUC and should be used in clinical practice.” Upon further review, none of these were deemed reliable or clinically useful (Wynants et al, https://www.bmj.com/content/369/bmj.m1328; Sperrin et al, https://www.bmj.com/content/369/bmj.m1464).

So, worse news is reporting, “Hey, there is nothing here. Let’s all go do something else.”

It’s hard to know how many experiments, clinical trials, observational studies, and exploratory analyses of customer response have failed. They simply do not see the light of day – in publications, press releases or even within an institution, be it private or public. It is the well-known “file drawer” problem: we simply do not know how much unsuccessful research is filed away and forgotten. Clinical trial registries have been established in the past two decades to help reduce this problem, so that people can track planned clinical trials and, if results are not published in a timely manner, the sponsors of the research can be nudged to get their results into the public domain. But in general, researchers do not say, “Hey, look at all this work we did, and by the way, we didn’t find anything.” Publishers and news agencies generally do not think of this as “news,” and so it goes unreported.

This is the worst news. There is no notoriety, no publicity, no additional funding, perhaps reduced chances for promotion. The researchers might even be tagged with the moniker, “that was a bad idea” or “they didn’t know what they were doing” or “they really messed up.”

The Conundrum

I perceive this as the ongoing tension between statisticians and data scientists (if I may use broad generalizations for these closely related disciplines – maybe the same discipline?). Statisticians seem to think that data scientists live in the Good News and Bad News arenas. Do an analysis – any analysis. Explore the data. Torture it until it confesses something. [Note 4: Nobel Laureate in Economics Ronald Coase was quoted as saying, “Torture the data long enough and they will confess to anything.”] There must be an answer in this big data. When a “finding” is produced, there is a rush to communicate it, whether internally within an institution or publicly. Hasty, heedless, unconstrained. Yes, we have answers!

Data scientists perceive statisticians as the proverbial “wet blanket.” Truth be told, so do many other professionals. Statisticians might be accused of living in the “Worse News” arena. When examining a “finding,” statisticians start by asking where the data came from. How was it collected, and what do the variables truly represent? They look for bias in the data. Then they ask about the analysis – what assumptions are being made? Is there over-fitting? How does one assess model fit and reproducibility? They conclude by asking how to communicate the “finding.” How confident are we of this finding? Is it worth acting on? Do the benefits of taking that action outweigh the negative or unintended consequences of the action? Rigorous, deliberate, cautious. No, let’s not get too carried away with this finding.
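As one concrete illustration of the over-fitting question, here is a hedged sketch (purely hypothetical data and model choices) of how apparent performance can look wonderful while cross-validated performance shows there is nothing to find:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(42)
    n, p = 200, 100                 # many covariates relative to the sample size
    X = rng.normal(size=(n, p))
    y = rng.integers(0, 2, n)       # binary outcome unrelated to X by construction

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # Apparent (resubstitution) performance looks spectacular ...
    train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

    # ... but cross-validated performance reveals there is no signal at all
    cv_auc = cross_val_score(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X, y, cv=5, scoring="roc_auc",
    ).mean()
    print(f"training AUC ~ {train_auc:.2f}, cross-validated AUC ~ {cv_auc:.2f}")

The questions statisticians ask are, in effect, demands for the second number rather than the first.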

Perhaps it is merely, or largely, a matter of observational data versus experimental, controlled data. The vast majority of the world’s data is observational, and data scientists have emerged with massive computing tools to handle enormous observational datasets, attempting to make sense of what useful information or answers might be in the data. Statistical history is one of rigorous mathematics and experimental design to provide credible answers to specific scientific questions. Of course, statistics is also foundational to epidemiology, which very often operates in the realm of observational data, and statisticians also work in the arena of observational data. Statistical tools for experimental and observational data have been around for a long time and have matured through decades of practice and methodological research.

I would prefer that more data scientists acknowledge when their work is exploratory, and when their data is observational, with all that implies about confounding, bias, etc. I would prefer that data scientists read and understand the history of data analysis using statistical methods for experimental and observational studies. [Note 5: I saw a data scientist promoting their amazing AI tool to find patterns in a large commercial dataset. Upon cursory review, the AI algorithm was merely stepwise logistic regression, a statistical technique that has been around for at least 70 years.]

I would prefer that statisticians accept that data can be messy but that it is still worth digging into. Most questions cannot be answered by randomized, controlled trials. I would prefer that statisticians be a little less mathematical and a little more open to exploratory data analysis, as John Tukey recommended in his 1962 article, “The Future of Data Analysis.”

I think statisticians and data scientists are in the same field – analyzing data to uncover important relationships in that data – hopefully ones that are valid cause-and-effect relationships. [More in my next blog.] I think we can benefit a whole lot from each other’s perspectives and ideas and approaches.

Personally, I have been involved with making more sense out of subgroup analysis by combining modern search algorithms from the data science world with statistical rigor for controlling false positive findings and adjusting for bias in the analysis. This has become known as “subgroup identification,” and there is a rich literature in the statistics community around it. Once again, I refer the reader to the recently published article on a platform for comparing subgroup identification methodologies (Ruberg et al, https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.202200164).

Hopefully more integration of efforts will continue, and we will all end in a better place – using data for the common good.

2 thoughts on “Blog 21: Good News, Bad News, Worse News”

  1. Thank you, Steve, for the great blog with valuable insights on both current data analytics culture and advances in methodology. Bridges between statistics and data science appear to be growing much larger, and I find myself continually traversing them and making connections. A few examples:

    1. The first step of your intuitive Virtual Twins approach (Foster, Taylor, Ruberg, 2011) is also known as S-Learners in machine learning causal inference for estimating individual causal treatment effects (https://academic.oup.com/jrsssa/article/185/3/1115/7068887). Additional assumptions (beyond standard statistical ones) are required to interpret them causally.

    2. Gradient boosted tree methods (XGBoost, LightGBM, CatBoost) remain very popular in data science competitions (e.g. Kaggle and DrivenData) and tend to outperform Random Forests.

    3. Modern neural nets (e.g., implemented in PyTorch or TensorFlow) are also great for estimating heterogeneous treatment effects and often outperform boosted trees.

    A thought on subgroup analysis: If an enhanced treatment effect is driven by a continuous covariate (as is the case in a substantial fraction of the 1200 simulated datasets in your nice InnoCentive project), then it seems “subgroup” kind of loses its meaning, and we would typically not see clear multimodality in individual treatment effects. This would imply that the real focus for applications in personalized medicine should be on individual causal treatment effects themselves and not on average population causal effects in discrete subgroups.


    1. Russ,
      Thank you for your thoughtful comments (as always). Taking each of your comments in turn …

      Yes, I think there are growing bridges. I still see some pretty big chasms. For example, a Data Scientist and VP at a large consulting firm serving the pharma industry said, “We do not need causation anymore. With big data, correlation is enough.” What?! Alarming.

      Thanks for the reference to S-Learners. Yes, I have always thought of subgroup identification (as distinct from subgroup analysis) as a combination of “DS tools” (search algorithms, black box models) and statistical rigor (control of false positive findings, adjustment for bias in treatment effect estimates in subgroups). There is lots of literature on subgroup identification since our original 2011 paper. I think it is a success story for the marriage between DS and Stats.

      I am agnostic to the methodology used – boosted trees, RF, other ML algorithms like NNs – and more interested in the properties of its performance, not on one dataset, but across many datasets. Hence our InnoCentive challenge involving 1200 datasets of varying dimensions.

      As for subgroup analysis with a continuous covariate … several thoughts come to mind. (a) The best example of this that I was closely involved with was ramucirumab for hepatocellular carcinoma while at Lilly. Without giving away inside information, I can point to the public literature on a “failed” trial (REACH: https://www.thelancet.com/journals/lanonc/article/PIIS1470-2045(15)00050-9/fulltext) with a continuous biomarker (alpha-fetoprotein) for which higher levels conferred a greater survival benefit. While there was no clear delineation or no-effect level, the company chose to do another enrichment study in patients with AFP>400. That study was wildly successful (REACH-2: https://www.thelancet.com/journals/lanonc/article/PIIS1470-2045(18)30937-9/fulltext). (b) Jason Hsu and I (along with other colleagues) published a paper on this problem (Thresholding a Continuous Biomarker … https://www.tandfonline.com/doi/abs/10.1080/19466315.2016.1206486?journalCode=usbr20). It’s a really interesting problem and harkens back to my days of dose response analysis. When there is a continuous dose response curve, and as such any dose has some effect, how do you select a dose? Answer: find the dose with the lowest clinically meaningful effect, and do a study that shows it is statistically significantly different from placebo. See some of my earlier papers on dose response, which I think you know about.

      I agree with your statement about individual causal treatment effects. The ICH E9(R1) guidance defines a treatment effect that way, which is good. That is why my current research efforts include the general topic of estimands and the particular topic of causal inference, most notably for patients who can adhere to their treatment (so-called Average Adherers Causal Effect – AdACE). It’s still an average, but perhaps on the right patients. (https://www.tandfonline.com/doi/abs/10.1080/19466315.2019.1700157?journalCode=usbr20)

      Anyway, this is a very long response, but I enjoyed writing it. I hope you enjoyed reading it (if you got this far)!

