Blog 22A: Statistics and Data Science – The Two Cultures

Preamble

I have written on this topic before (see Blogs 2, 3, 4, 17, 18 and 21), and this is a continuation and clarification of my thinking. This is a bit lengthy, so hang in there. I think you will enjoy reading it to the end.

As noted in Blog #21, I was at the recent Joint Statistical Meetings (JSM – the largest gathering of statisticians on the planet), which brings together many professional societies from around the world, with the American Statistical Association (ASA) as the flagship society. I am continuing with blog topics about statistics and data science. This always seems to be a topic of interest to statisticians and quite often a difficult, controversial, and sometimes contentious conversation. Thus, for obvious reasons, I will omit any mention of individuals and stick to MY perceptions, interpretations, and perspectives.

I have given talks on Stats and Data Science in many statistical forums, and I am continually trying to sharpen my thinking on the matter with the sincere hope for synergy if not convergence among these communities. I believe the reason for misunderstandings and friction between the two professions – when it does exist – is the two cultures, which have three underlying contrasts that I will elaborate here. But before I give my account of the two cultures, one must be cognizant of two significant tomes on the subject of Statistics and Data Science. They are Leo Breiman’s 2001 article in Statistical Science entitled “Statistical Modeling: The Two Cultures” and David Donoho’s (2017) more recent perspective in the Journal of Computational and Graphical Statistics entitled “50 Years of Data Science.” They are comprehensive, scholarly papers (Breiman’s has several Comments from very famous statisticians), rich in philosophy and epistemology. While they are a good basis for my own thinking, and at the risk of being self-aggrandizing, I would like to extend their perspectives and add my own view of the two cultures, with the hope that in doing so, a better bridge can be built between the Statistical and Data Science professions, or that we might recognize that the cultures are manifestations of the same essence.

Two Cultures Defined by Three Contrasts

I will distinguish the two cultures along the lines of data, analysis and interpretation – perhaps the three fundamental steps in the analysis of data. I believe the two cultures are characterized as follows:

                      Statistics                 Data Science
Data sources          Experimental Design        Observational data
Analysis approach     Modeling                   Fitting data
Interpretation        Definitive conclusions     Possible insights

The Two Cultures Based on Three Dimensions of Data, Analysis, and Interpretation

These three dimensions all work in concert, but I will parse them here before tying them together at the end of this blog. Of course, as with any generalizations, these three contrasts and what I write below are not uniformly true or valid, but I believe they are directionally correct and worthy of consideration.

Data Source

There are fundamentally two kinds of data sources – those that are collected under the control of the researcher (experimental data) and those collected in the course of natural circumstances or everyday life (observational data). One can argue this point since the sources of data have exploded in recent decades, but I believe this dichotomization is useful for many purposes and certainly in describing the perspectives of Statistics and Data Science. The key difference between experimental data and observational data is that experimental data is collected under controlled conditions in which each measurement is done precisely with plenty of context about the experiment, and observational data is taken “as is” without necessarily knowing why or how the data was collected or the context in which it is collected. This latter fact means that observational data most likely omits factors that influence the outcome variable of interest; such factors cannot be included in a model because they are unmeasured or unrecognized confounders.

OK, there is a LOT more that can be said here, but for the sake of brevity and the argument given here, I will leave it at that. Now, for some history in this context.

At its inception centuries ago, Statistics was involved with the summarization and analysis of observational data. Think of the work of John Graunt and his famous book Natural and Political Observations Made upon the Bills of Mortality (1662) and many other examples, including the work of Carl Friedrich Gauss, who in 1795 developed the theory of least squares to reduce measurement error in (observational) astronomical data.

In the last century, statistical theory and methods for experimental design emerged, spurred in great part by the work of Sir Ronald Fisher and colleagues at the Rothamsted Experimental Station in England, now known as Rothamsted Research. Fisher’s Statistical Methods for Research Workers (1925) stands out as an inflection point. The introduction of mathematical statistics into the design and analysis of agricultural experiments (1920’s) was followed by ever-increasing sophistication in industry (e.g., George Box and many others in the 1950’s) and simultaneously in medical research, with the first randomized controlled clinical trial designed by Sir Austin Bradford Hill in 1947 to test the effects of streptomycin in tuberculosis. [Note the number of “Sirs” here!]

Experimental design has historically been an important element of graduate statistical education and ongoing training. Controlled clinical trials and other controlled experiments fit into this category as well.

My observation (not asking for pardon on this pun) is that what is now called “data science” grew out of the business world. As data collection became easier through various electronic mechanisms (e.g., bar codes, internet addresses, etc.), and storage became easier and cheaper, businesses were able to accumulate indescribable amounts of data [Note: I say “indescribable,” though we do have terms like “terabyte” and “petabyte” and beyond; I maintain that the enormity of the concepts behind these words is almost inconceivable]. In the early days (circa 2000), data science activities were called “data mining,” and efforts were directed at searching this observational data for information by using queries and producing summary statistics or graphs of trends. As sophistication grew rapidly, data mining experts were using multivariate statistical techniques (e.g., cluster analysis, discriminant analysis, classification trees).

Personnel in IT Departments were the experts in the systems for capturing and storing large volumes of data. With a little ingenuity, they also became the “go-to” people for delivering query results (e.g., how many customers bought both potato chips and beer) as well as some averages or trends. Many business people did not know about Statistical Science as a profession or thought that Statistics was something done in research or was highly mathematical/theoretical. Statisticians were not often considered as possible partners in the analysis of business data. (Note: Again, my experience is from the pharmaceutical industry, but I think it is generalizable to other industries based on my attendance at some conferences.)

An example of my personal experience went like this:

Steve (as the Scientific Leader for Advanced Analytics) to Business Unit President: “I think there is a lot more we can be doing with the analysis of our business data. I’d like to discuss the possibilities with you or the appropriate members of your staff.”

BU President to Steve: “You do not need to worry about this. The IT Department has this all covered.”

In parallel, there was a commensurate explosion of biological and medical information. Genome sequencing and microarray studies generated unprecedented volumes of data, and electronic medical records were capturing many dimensions of patient data. The field of bioinformatics emerged early this century and sought to find patterns in such data as well as to link the molecular biology data with clinical outcomes. Again, these are observational data that are subject to confounding, measurement error, and many other anomalies and perturbations. There was a proliferation of “informatics” fields (medical informatics, business informatics, legal informatics …), and it appears that in more recent times these professionals/fields have described themselves as “data scientists.”

Within a decade, there was a resurrection of the notions and tools of machine learning and artificial intelligence to handle such large volumes of observational data and “learn” what information they may contain. This brings us to the current date with the hope and the hype of ML/AI.

I should also note that there were also forays by the business analytics side of data science into controlled experiments, which have become known as A/B testing. Though A/B testing dates from the early 1900’s, it did not play a prominent role in “data science” until the advent of more serious business analytics in the late 1990’s and early 2000’s. Of particular note for this blog is that concepts and practices in data science often have their origins in earlier statistical literature, as noted by David Donoho in his aforementioned 2017 paper.

It is worth noting that epidemiologists and economists have long dealt with the collection, analysis, and interpretation of observational data. There is a long history of methodology for dealing with the collection of health data and economic data, and ultimately the analysis and interpretation of such observational studies. This rich history of observational-data methodology also seems to have escaped the notice of data scientists.

Finally, in the realm of data, statisticians have long analyzed numbers or categories, which can also be reduced to numerical information. With the emergence of pictures, video, audio, and other analogue information, there has been a burgeoning field of data reduction – e.g., how to reduce a waveform or a vast set of pixels into quantitative information. Statisticians have been slow to engage in such efforts while the “data science” world has been pursuing this vigorously in what might be called “data engineering.”
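
To make this concrete, here is a minimal sketch (in Python, on made-up data) of what such data reduction might look like: collapsing a raw waveform into a handful of quantitative features that a downstream statistical model could use. The particular features are my own illustrative choices, not any standard.

```python
import numpy as np

def waveform_features(signal: np.ndarray, sample_rate: float) -> dict:
    """Reduce a raw 1-D waveform to a few quantitative summaries (illustrative only)."""
    centered = signal - signal.mean()
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    return {
        "mean": float(signal.mean()),
        "rms": float(np.sqrt(np.mean(centered ** 2))),        # overall amplitude
        "peak_to_peak": float(signal.max() - signal.min()),
        "dominant_freq_hz": float(freqs[spectrum.argmax()]),   # strongest oscillation
    }

# Example: a noisy 5 Hz sine wave sampled at 100 Hz for 2 seconds
t = np.linspace(0, 2, 200, endpoint=False)
x = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(waveform_features(x, sample_rate=100.0))
```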

Analysis

This is where I would like to make an important distinction between analytical approaches, one that is perhaps closest to what Leo Breiman wrote in the aforementioned 2001 article. It is the difference between (1) modeling data and (2) fitting equations to data.

  • Breiman used “data modeling” to describe the notion that there is an assumed model for which the goal is to estimate parameters of the model to understand the workings of Nature. Donoho changed this terminology slightly and called this “generative modeling.” That is, there is an underlying generative model that converts inputs (x’s) to outputs (y’s). The main goal is inference about parameters in the underlying model for which there is considerable statistical machinery that has been in place for decades with ever-new theory and methodology evolving and emerging continuously.
  • Breiman used the term “algorithmic modeling” to describe the process of fitting equations to data. In this arena, the analyst does not know the mechanism by which Nature is producing observations (y’s), and the goal of the analyst is to relate the inputs (x’s) to the outputs (y’s) via a well-fitting algorithm to make future predictions. Donoho referred to this as “predictive modeling.”
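
To make the contrast tangible, here is a minimal Python sketch on simulated data: the same inputs and outputs analyzed once in the generative/data-modeling spirit (posit a model, estimate and interpret its parameters) and once in the algorithmic/predictive spirit (fit a flexible algorithm and judge it by held-out prediction). The toy data-generating mechanism and the particular tools (statsmodels, scikit-learn) are my own choices for illustration.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))                                               # inputs (x's)
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=1.0, size=n)  # "Nature's" mechanism

# Generative/data modeling: write down a model, estimate its parameters, make inference
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())                          # coefficients, standard errors, confidence intervals

# Algorithmic/predictive modeling: fit a flexible algorithm, judge it by prediction accuracy
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", rf.score(X_te, y_te))  # predictive performance, not parameter inference
```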

With a firm foundation in experimental design and controlled experiments based on randomization, it is easy to see how statisticians fall squarely into the “data modeling” or “generative modeling” culture. Data is generated according to a defined experimental plan so that the factors influencing a response are known and controlled. Therefore, a statistical model can be clearly written, including all the factors and possible interactions, with analysis using ANOVA, response surface methodology, general linear models and various other regression models. Important variables are known and their quantities controlled (e.g., amount of fertilizer used or the dose of an experimental medication). Meaningful covariates might be included in the model (e.g., analysis of covariance – ANCOVA). Parameters of that model are estimated and inference made by well-established statistical methods.
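
As a small illustration of that workflow, here is a sketch of an ANCOVA on simulated trial-like data: the model is written down directly from the design (a randomized dose factor plus a baseline covariate), and estimation and F-tests come from standard, well-characterized machinery. The data, dose levels, and effect sizes are invented purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# A designed experiment: dose randomized to three arms, with a baseline covariate
rng = np.random.default_rng(4)
n_per_arm = 30
dose = np.repeat(["placebo", "low", "high"], n_per_arm)
baseline = rng.normal(50, 10, size=3 * n_per_arm)
true_effect = {"placebo": 0.0, "low": 3.0, "high": 6.0}   # invented effect sizes
response = (0.5 * baseline
            + np.array([true_effect[d] for d in dose])
            + rng.normal(0, 4, size=3 * n_per_arm))
df = pd.DataFrame({"dose": dose, "baseline": baseline, "response": response})

# ANCOVA: the model follows from the design; parameters are estimated and inference made
fit = smf.ols("response ~ C(dose, Treatment(reference='placebo')) + baseline", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))   # F-tests for the dose factor and the covariate
print(fit.params)                      # estimated treatment effects versus placebo
```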

This is all to say that there were generations of statisticians trained for decades in the concepts of experimental design, hypothesis testing and modeling based on such designed experiments with clear objectives and PROVEN (I mean mathematically proven) methods for unbiased estimation, Type 1 Error rate control, and precise confidence intervals. The use of randomization for making causal inference about the effects of the experimental intervention was paramount.

But very few scientific questions can be answered by randomized, controlled experiments. Think about understanding or quantifying the effects of smoking. Even in the 1950’s, when this was a legitimate medical and scientific question for which the answers were unknown, it was impossible and unethical to randomize some people to smoking for a long period of time (say 10 years) and another group to not smoking for the same period and measure clinical outcomes (death, cancer incidence, heart disease, etc.) to make causal inferential conclusions. The list of important societal questions that cannot be studied via randomized, controlled experiments far exceeds those that can, as noted by Breiman from his days as a consultant.

The fields of epidemiology and economics have recognized this for decades if not centuries. These fields have dealt with issues related to bias in observational data and with hidden confounding variables that influence the outcome but cannot be measured or even recognized. These fields have developed methodology – such as propensity scoring – that attempts to create similar groups for comparing different interventions. In other words, make observational data/experiments as much like randomized experiments as possible.
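
Here is a toy sketch of one common version of this idea: estimate each unit’s probability of treatment from measured confounders with logistic regression (the propensity score), then match treated units to controls with similar scores so that the comparison mimics a randomized one. The simulated data and the simple nearest-neighbor matching are mine for illustration only; real analyses involve far more care (overlap and balance diagnostics, sensitivity to unmeasured confounding, etc.).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=(n, 2))                                   # measured confounders
p_treat = 1 / (1 + np.exp(-(0.8 * x[:, 0] - 0.5 * x[:, 1])))
treated = rng.binomial(1, p_treat)                            # treatment depends on the confounders
y = 1.0 * treated + 1.0 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)  # true treatment effect = 1.0

# Step 1: estimate the propensity score P(treated | x)
ps = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]

# Step 2: match each treated unit to the control with the nearest propensity score
treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]
nearest = np.abs(ps[control_idx][None, :] - ps[treated_idx][:, None]).argmin(axis=1)
matches = control_idx[nearest]

naive = y[treated == 1].mean() - y[treated == 0].mean()
matched = (y[treated_idx] - y[matches]).mean()
print(f"naive difference: {naive:.2f}, matched estimate: {matched:.2f} (true effect 1.0)")
```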

It is my experience and observation that “data science” emerged in the IT or computer science world to support business analytics, but the IT professionals who called themselves data scientists were not trained in or aware of the rich history of statistical methodology and observational research methods (see David Donoho’s paper for many examples of this). This unfamiliarity led “data science” to reinvent the wheel, to relearn the hard lessons of past statistical and epidemiologic research, and in the worst case, to conduct inappropriate analyses to the chagrin of statisticians, epidemiologists, and economists (though I must admit I am MUCH MORE aware of the consternation among statisticians).

My experience: During my time at Lilly around 2010-2017 as the Scientific Leader of Advanced Analytics, my group was often asked to evaluate “analytics” vendors. One vendor visited and was touting their AI software … how their proprietary “algorithm” sorted through vast amounts of data and many variables to find the optimal prediction algorithm … how they had unique capabilities well beyond their competitors and what Lilly might have internally. It turned out they were simply using stepwise logistic regression, which I noted to them had been in the statistical literature for 50, if not 70, years and was widely available in commercial and open-source software. It didn’t stop them from hyping their tools/business.
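
For what it’s worth, that kind of “proprietary algorithm” amounts to something like the sketch below – forward stepwise selection wrapped around ordinary logistic regression, freely available in open-source software. This is not the vendor’s actual code, just an illustration of the technique on simulated data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Simulated "business" data: 20 candidate predictors, only a few informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4, random_state=0)

# Forward stepwise variable selection around plain logistic regression --
# decades-old ideas, available in any standard statistical toolkit
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=4, direction="forward", cv=5
)
model = make_pipeline(selector, LogisticRegression(max_iter=1000)).fit(X, y)
chosen = np.where(model.named_steps["sequentialfeatureselector"].get_support())[0]
print("selected predictor columns:", chosen)
```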

Lastly in the realm of analysis, I will relay a story that I use to distinguish statistics and data science. The story is based on an analogy of cooking but has direct translation to many data science projects I have observed and papers I have read.

Suppose I want to make dinner this evening. I decide I want some pasta and I am going to start with some alfredo sauce. I like alfredo sauce. I also like onions so I will add them. And chicken too. I like peanut butter so I will put some of that in. Then some BBQ sauce, blue cheese, mustard, cilantro, lima beans and mango chutney – all of which I like. They are all going into the pot.

Perhaps that sounds unusual if not crazy. Those ingredients do not seem to go together, and you may even be cringing as you think about eating it. I suspect you do not find this concoction to be appetizing. The fact is that I do not know how this sauce would taste; nor do you. Maybe it would taste good, maybe not. I suspect this combination of ingredients has never been created or tried before so how it might turn out is completely unknown (though your intuition and experience may give you some appropriately skeptical insights!).

Now go to a scientific paper that is using “data science” to explore data to find the hidden patterns in it. Patterns that would help us predict, say, the risk of progressing to Alzheimer’s Disease using blood biomarkers. There is such a paper published in a prestigious journal – Nature Medicine. It appears to be an analytical tour de force; the statistical methods used are excerpted here (the full description is much too long to reproduce):

“groups were defined primarily using a composite measure of memory performance”

“Metabolites defining the participant groups were selected using the least absolute shrinkage and selection operator (LASSO) penalty.”

“… metabolomic data from the untargeted LASSO analysis to build separate linear classifier models …”

“… used receiver operating characteristic (ROC) analysis to assess the performance of the classifier models …”

“… employed internal cross-validation …”

“The optimal value of the tuning parameter lambda, which was obtained by the cross-validation procedure, was then used to fit the model.”

“… matched … participants on the basis of age, sex and education level.”

“… used separate multivariate ANOVA (MANOVA) to examine discovery and validation group performance …”

“… used Tukey’s honestly significant difference (HSD) procedure for post hoc comparisons.”

“… quantitative profiling data was subjected to the nonparametric Kruskal-Wallis test … followed by Mann-Whitney U-tests for post hoc pairwise comparisons …. Significance was adjusted for multiple comparisons using Bonferroni’s method (P < 0.025).”

You see, each individual statistical manipulation of the data appears logical and useful – just like each ingredient on its own can be quite tasty and delicious. However, I argue that this collection of methods applied in this way has never been done before, and we have no idea about its operating characteristics when taken collectively and applied in this particular order. Does this collection of methods produce accurate answers? Unbiased? Can it identify the right predictors? What level of uncertainty would the analyst apply to the prediction? Using my intuition and experience (as with the cooking analogy), I was extremely skeptical of the findings published, even in such a prestigious journal! [Note: A subsequent, more comprehensive study of these blood biomarkers in patients at risk for progressing to Alzheimer’s Disease was not able to reproduce the original findings, and the original findings are of no use to the medical/scientific community … but as noted in my previous Blog #21, I suspect some of those researchers received additional funding, got promoted, and gained esteem even if several years later their findings turned out to be completely spurious.]
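
One way to make that skepticism concrete is to probe the operating characteristics of an ad hoc pipeline by running a scaled-down version of it on pure noise. The sketch below is my own construction, not the paper’s actual pipeline: it selects the most promising-looking “metabolites” using all of the data and only then “internally cross-validates” a classifier on them – a sequencing error that produces impressive-looking AUCs even when there is no signal at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pure noise: 40 "participants", 1,000 candidate "metabolites", labels unrelated to the data
rng = np.random.default_rng(3)
n, p = 40, 1000
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)

# The trap: screen for the most promising-looking features using ALL of the data ...
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
corr = np.abs(Xs.T @ ys) / n                  # correlation of each feature with the labels
top = np.argsort(corr)[-10:]                  # keep the 10 "best" features

# ... and only then run "internal cross-validation" on the pre-selected features
auc = cross_val_score(LogisticRegression(max_iter=1000), X[:, top], y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC on pure noise: {auc.mean():.2f}")  # typically well above 0.5
```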

To summarize, statisticians tend to be involved with analyzing data based on a model or experiment using methods that have well-defined operating characteristics based on mathematical proofs. Data scientists use many data manipulations or black box algorithms to evaluate data and make predictions with less consideration as to the properties of their results (at least to my knowledge). So, what’s the right approach? Not all important questions can be studied in controlled experiments. Not all analyses, even when using well-known methods of analysis, can be trusted when used in ad hoc ways. So, the “right” approach depends on the interpretation of the results, which I will discuss next and is related to the third dimension of the two cultures.

Data Interpretation

Given the previous discussions about data sources and data analysis cultures, the distinction in data interpretation follows quite naturally.

For Statistics, the interpretations are more definitive, especially when analyses are prespecified (i.e., prior to seeing the data), and more quantifiable in terms of their accuracy, false positive rate, etc. It’s all part of the statistical machinery. Now it is true that many scientific findings that emanate from rigorous statistical designs and analyses cannot be replicated (i.e., the so-called reproducibility crisis in science), but that is another topic for discussion related to the use of hypothesis testing and p-values versus Bayesian posterior probabilities for decision-making. See, for example, my publications (Détente and Reproducibility). So, statisticians should be a bit cautious about claiming any epistemological superiority over data science because of the rigor that they bring to the scientific process. In my decades of pharmaceutical drug development experience, I have observed many very promising Phase 2 clinical trial results from well-designed studies for an experimental treatment that fail in confirmatory Phase 3 trials. It is a well-known fact in drug development – more drugs fail than succeed despite the best designed research and statistical methods. Mother Nature has a fickle personality.

For Data Science, the interpretation needs to recognize the exploratory nature of much of the work, given the observational nature of the data and the ad hoc/black box data fitting approach. Even the best fitting, cross-validated, comprehensive machine learning algorithm based on “big data” may only be applicable to the dataset used in the data fitting exercise. Is it generalizable to other datasets, circumstances and contexts? Google Flu Trends failed to be extensible beyond the years that were used to build the prediction model. A machine learning model to predict acute kidney injury built on the US Veterans Affairs electronic medical records database performed poorly when used in a UK hospital (Note: no surprise there!). The vast majority of ML/AI algorithms for predicting COVID-19 infections, hospitalizations, or death failed in broader implementation because they were developed and validated on datasets from one institution, despite having highly touted AUCs or other measures of predictive accuracy.

Statisticians can be frustrated or even haughty when data scientists use age-old statistical methods to analyze data and claim novelty of their approach. Every analysis method these days seems to be touted as AI or ML. That’s how one gets attention. Worse yet is when data scientists use exploratory data analysis methods that have been used by statisticians for decades (see Donoho’s paper) and claim that this is their domain. There is a false dichotomy between data analysts/scientists and statisticians that was unfortunately highlighted in a Harvard Business Review (HBR) article, which portrays data scientists or analysts as doing quick-and-dirty exploratory analysis, with statisticians called in only when a potential finding needs more rigorous analytical methods. That HBR article states,

“Good [data] analysts have unwavering respect for the one golden rule of their profession: do not come to conclusions beyond the data (and prevent your audience from doing it, too). To this end, one way to spot a good analyst is that they use softened, hedging language. For example, not “we conclude” but “we are inspired to wonder”. They also discourage leaders’ overconfidence by emphasizing a multitude of possible interpretations for every insight.”

While this might appear to be good advice, it raises the obvious question, “How is this helpful if there are multiple interpretations for every insight from which we can draw no conclusions?” Furthermore, it is often too tempting to believe in what is found in exploratory analysis. As in my previous Blog #21, Good News, Bad News and Worse News, there is a driving compulsion to find something interesting in all that big data, and I have seen too many exploratory analyses of observational data where the caveats and golden rule stated above are lost in the translation and enthusiasm for pursuing an exciting finding/pattern or making an unfounded interpretation of the data.

It is worth noting that there is friction between statisticians and epidemiologists along similar fault lines. Statisticians doubt epidemiologists’ findings from observational studies and cite discrepancies and inconsistencies between observational findings and those from randomized, controlled trials. And yet, we know smoking is deleterious to one’s health from observational research done decades ago.

Data scientists can be frustrated with statisticians when they see an unwillingness to dig into complicated scientific and societal problems because the data is messy. Data scientists can be exasperated when statisticians pull out all the reasons why an analysis and interpretation should be viewed skeptically. Statisticians can be overly conservative (sometimes referred to as “wet blankets” or “data police”), and I have had data scientists tell me that exploratory or loose/ad hoc analyses (like the cooking example stated previously) are the nature of research.

Some Concluding Remarks

It is a difficult balance to know how to interpret analyses of complex experiments or observational datasets when assumptions need to be made or are violated. The quality of the data resides on a continuum from poor/confounded/biased to exquisitely pure from a well-designed and conducted experiment. But alas, even in very well-controlled, randomized, blinded clinical trials (the “gold standard”), not every patient follows the trial plan, takes their medication as prescribed, or stays in the study for its intended duration, thus leading to data quality issues. Similarly, the validity of all analyses depends on assumptions and models, which are abstractions of reality, and the validity exists on a continuum from highly accurate to reasonably plausible to hypothetical to black boxes. The interpretation of the veracity of the findings from any such combination of data and analysis depends on where the observer is on both of these continuums (or is it continui? I don’t know 😊). Two well-intentioned statisticians or data scientists can look at the same data and analysis and legitimately come to different conclusions about the degree of certainty in a finding.

Speaking as a statistician, the biggest issue that I have seen over the years is when data scientists live in “their” culture (observational data, exploratory analysis) but make the leap into “our” culture and portray findings as definitive conclusions. The digital medicine literature is rife with examples of diagnosis and prognosis (i.e., predictions) based on observational data and data fitting approaches using ML or AI that lead to failures when extended to other environments/institutions. [My next Blog will tackle the we/they (false) dichotomy of statistics and data science in more detail.]

The two cultures can actually be seen (as suggested earlier) as two manifestations or dimensions of the same goal – to use the best analysis approaches for extracting the best information possible from the data at hand. The data could be well-curated or messy. The analysis can be sophisticated or simple, using new computationally intensive techniques or old-fashioned methods from decades past. The results could be numerical or visual. It doesn’t matter, as long as the information extracted and the interpretation made are understood in the context of the continuum of data quality and analysis validity. OK, this has been long enough. I have found it a bit therapeutic to get my ideas organized, built on experience and (in part) the seminal papers of two other notable statisticians. I hope these insights expand on the work of Breiman and Donoho. There is more to say on this topic, so stay tuned for my next Blog.
