No. 17: Analytics, Data Science and Statistics – A Rose by Any Other Name …

  1. Background

This Blog also appears in Medium.com, the online publishing platform for social journalism. It has amateur and professional writers, and you can see and learn more at Medium.com. This Blog made it to the home page for Data Science and is also featured in “Towards Data Science,” which is a Medium publication online. I am also posting this Blog on Medium for the following reasons:

  • It has an established readership and broad reach; thus, increased visibility.
  • There are many people publishing on Medium in the arena of Data Science, and they frequently provide a view of Statistics and Statisticians that is inaccurate, in my opinion, based on my 40 years of experience. For example, in one post “Statistics for People in a Hurry,”

https://towardsdatascience.com/statistics-for-people-in-a-hurry-a9613c0ed0b

the author states, “Statistics is the science of changing your mind.” Subsequent discussion suggests the statistician’s job is to reject the null hypothesis.

  • I want to describe to the broad readership of “Towards Data Science” what I believe Statistics is all about and dispel misconceptions.

I am hoping that readers of this Blog will go to Medium and support my posts (on Medium there is a “clap” feature, but you can Like it here on my Blog as well or instead).

You can go to Medium through this link to see this Blog posted there,

Link to my article in Medium.

and if you are a statistician, perhaps see some other articles that might bewilder you. Whenever I write a Blog posted here on the general topic of Data Science, I will also post it on Medium “Towards Data Science.”

2. Introduction

I wanted to put another voice into the Medium on Data Science, Statistics, or, generally speaking, Analytics to broaden the perspective and add to the rich conversation on this topic. That is the purpose of this Blog as well. While my intent is to bring greater clarity to a world awash with data and analytics and its divergent lexicon, undoubtedly, I will also bring some confusion and discord to this expansive topic. Such is the state of affairs in the data and analytics arena. If nothing else, perhaps I can bring some better understanding, or at least a complementary understanding, to the emergent diversity of disciplines and practices related to data and analytics.

First, I believe most descriptions of analytics and data science are too narrow, and a more inclusive approach would benefit the community. In fact, since most data science is really about analysis of data, could we benefit from a more expansive name – data analytical science and data analytical scientists? Second, I believe that statistics as a disciplined is too frequently viewed as something outside data science and relegated only to rigorous analysis. Furthermore, statisticians are too frequently characterized as hypothesis testing, p-value generating number crunchers. For my first article for Medium, I will expand on the first topic and hopefully begin to dispel the second, though the latter will take additional articles.

So, where do I begin? Well, let’s begin at the beginning with some definitions

3. Analytics Writ Large

First, for me, I think of the term “analytics” as an expansive, inclusive terms. Analytics writ large means any use of mathematics to analyze data or build models to solve real world problems. It includes the people who use mathematical equations, algorithms, logic, computer code to do their work. That was my perspective as the Scientific Leader of Advanced Analytics at Eli Lilly & Company until I retired in 2017. I hired, worked with and managed professionals with a very wide variety of analytical degrees, including statistics (mathematical stats, biostats, business stats), mathematics, operations research, computational chemistry, economics, computer science and ML/AI. The complementary nature of these skills and capabilities was inspirational, and certainly the “whole” was greater than the sum of the “parts.”

There were other analytical disciplines in our company that I and others considered to be part of the “analytics community” for which I did not have direct responsibility but were in league with our Advanced Analytics group. These included:

  • epidemiologists and health outcomes researchers who work primarily in the observational data space to assess the natural history of disease and real-world outcomes of drug/biologic treatments;
  • pharmacometricians, who build pharmacokinetic and pharmacodynamic models of a drug’s interactions with an organism (animal and human) to predict optimal dosing regimens, drug-drug interactions and clinical effects of treatments;
  • systems biologists, who create mathematical models of disease at the tissue and cellular level in order to elucidate the potential positive and negative effects of new molecular entities and their biological mechanism of action;
  • bioinformaticists, who link information across genetics, disease and molecular biology databases to find what biochemical pathways may be involved in a disease process so that other scientists can design molecules, antibodies or other treatments to arrest disease;
  • econometricians, who build micro- and macro-economic models to describe our world and predict possible future outcomes.

The Analytics field is broad and deep!

The well-educated professionals who perform this wide array of data analytical endeavors each have a special tool set and knowledge base that inevitably stems from mathematics. Some know differential equations, some know probability distribution theory, some combine their math skills with deep biological or medical knowledge, some know business, finance or marketing, some know optimization algorithms, some know modeling and simulation, some know various deep learning approaches, some have superb coding skills.

Side Note 1: I must say that working in the pharmaceutical industry and leading a very talented group of advanced degree data analytical scientists was a mind-expanding experience and a real joy. Trust me, there were very few, if any, stereotypical nerds or back-office number crunchers in these ranks. These folks understood the math, engaged in the science/business questions, strategies and problems, interpreted their findings and communicated their conclusions in language that business partners could understand.

Side Note 2: I am using the phrase data analytical scientists because in the scientific world, the term “analytical scientist” is already taken by the biological/chemistry crowd. These are the people who perform sophisticated assays, for example, to determine the chemical constituents of a fluid (e.g. the concentration of drug X in plasma). Since they are routinely referred to analytical chemists or bioanalytical chemists, I will use the term data analytical scientists to distinguish the cast of disciplines noted above in my “big tent” of analytics.

Side Note 2a: The above distinction did not dawn on me in my early forays into the broader realm of analytics. When I first became the leader of Advanced Analytics at Lilly, I started getting phone calls from headhunters (as we all do) telling me about job opportunities at other pharma companies to lead their chemistry departments. Often puzzled, I would say, “You must have the wrong number or the wrong guy. I am not a chemist in any way, shape or form.” After a number of such phone calls (maybe I am a slow learner), I realized what was going on.

4. Statistics

Now, for full disclosure, I am a biostatistician by training (with the attendant mathematics along the way), though over decades of practical problem solving in the pharma industry, I consider my knowledge and perspective to go beyond the traditional or stereotypical view of statistics or statisticians.

OK. With my disclosure cleared and my definition of analytics in place, I will go to the next important definition since it seems to engender a lot of confusion and consternation in our present data and analytics world. What is Statistics? Although others have crafted new definitions of this age-old profession, especially in the context of “data science,” I am going to offer a different perspective on what statisticians have been doing for centuries. Again, I see statistics as a branch of mathematics, and I use that to contrast statistics to what I learned in mathematics in the following couplet.

Mathematics is the science of discerning what is true.

Statistics is the science of quantifying what is likely to be true.

Who can forget Tom Cruise railing at Jack Nicholson: “I want the TRUTH!” Apparently Tom was a statistician before he was a lawyer.

Though much could be said about these two statements, for brevity I will make just a couple of points. In short, mathematics is about proof, certainty, true/false. In that sense it lives in the set {0.1}. Statistics is about uncertainty, probability/likelihood and evidence that exists in the shades of gray between true and false. In that sense it lives on the interval (0,1).

The fundamental difference between mathematics and statistics captured on the unit interval.

The hallmark of statistics is the “quantification of uncertainty,” which can be expressed as a probability, a likelihood, a confidence or other such measures. This is the key and the challenge for all data analytical sciences. I say this because in our modern world if one has data (who doesn’t have access to gobs of data?!) and a computer (who doesn’t have access to incredible computing resources?!), then one can shove the data through the computer and it will give you an answer. It must give an answer. That’s what it does. That is easy. What is hard is assessing the quality of that answer. That requires understanding the source of the data and its potential biases, some of which may be quite subtle. It requires understanding the possible confounding factors in the analysis, model fitting and over-fitting, multiplicity and spurious correlations and more. An answer without such thoughtful considerations is just an answer that leaves one hanging with, “Well, what do I do with that?”

If you have access to data and you have a computer, getting an answer is easy.

Assessing the quality of that answer is very hard.

In my experience (keeping in mind that it is in the pharmaceutical industry directly and healthcare in general), whether it was research scientists and clinicians, sales and marketing leaders, finance directors or even Presidents of business units, they wanted to know, “How likely is what you found to be real?” More specifically, they might ask:

  • Researcher: “How sure are you that this genetic variant is actually the one that distinguishes patients who will respond versus those patients who do not respond? Should I invest the next $40,000,000 in a clinical trial to investigate or prove this?”
  • Marketer: “How sure are you that this investment in marketing campaign X will actually increase sales? If I invest the next $10,000,000 in these ads or that slogan or this sales pitch, what is the return on investment?”
  • Finance: “How likely is it that if we move $25,000,000 from project X to project Y, we will increase revenue overall? How accurate is your forecast?”

Embedded in all these questions is the desire for quantification of uncertainty – not the elimination of uncertainty which all my research or business partners understood and accepted. They needed some quantification about the upside or the downside of their decisions.

Now, not all decisions involve multi-million-dollar investments or consequences. When the stakes are high, the full force of data analytical disciplines and methods and tools should be brought to bear on the quantification of possible consequences for the research/business leader’s decision. When the investments are smaller and the consequences less dramatic, quicker and simpler approaches may suffice and answers may be more speculative. However, even in those situations, some qualitative assessment as to the plausibility or reliability of the answer should be conveyed … like “Hey boss, this is a real SWAG, but it’s the best we could do in 24 hours” or “We think this is a pretty good answer given our past experience with …”

This is the notion of “fit for purpose.” I have heard it said in several contexts and it applies here: “Neither seek nor avoid complexity.” If a simple data analytical approach gives a good answer, then so be it. But if the problem at hand is difficult, complex with large numbers of variables, then dig in for the full, in-depth analysis. The problem is when the former is taken as an easy way out for the latter … when it’s easy to get an answer and say, “Here’s some possible relationships in our data. I did my job. Your job is to figure it out.” My business partners would never have stood for that. They were counting on us to give some measure of reliability … some quantification of uncertainty to help guide their decision-making.

  • Summary

We all know that data science is more than just data. In fact, most of data science is about analyzing data, not just manipulating it (see Blog #3: We Won’t Get Fooled Again … Or Will We?). So, I prefer to use a more apt descriptor for what we do – data analytical science.

Most of Data Science is about analyzing data.

So, let’s use a more apt descriptor – data analytical science.

Perhaps this is a more apt description for what we do.

I like to think of data analytical science as a big tent. There are many professionals using some form of mathematics to manipulate and analyze data. I do mean professionals – those who are formally trained in data analysis methodologies and tools. Let’s face it, almost everyone can claim they analyze data. Who doesn’t put facts and figures into a spreadsheet and sort, summarize or visualize it in some way? I think of data analytical scientist as those who specialize in some form of data analysis that goes beyond the simple descriptive or elementary statistics (e.g. linear regression).

Think of data analytical science as a big tent.

Math, Stats, Comp Sci, OR, Epi, PK/PD, Econ, Bioinformatics.

Lastly, analysis is very easy. Interpretation is very hard. If we are to be professionals in data analytical science, then we owe it to our science/business partners to provide some quantification of uncertainty that underlies our findings, conclusions, recommendations or predictions. At a bare minimum, some qualitative assessment should be mandatory.

Analysis is very easy. Interpretation is very hard.

In his famous play Romeo and Juliet, Shakespeare wrote of Juliet saying to her beloved Romeo, “That which we call a rose / By any other name would smell as sweet.” There is a sweetness to ALL disciplines involved in the analysis of data to solve problems. That sweetness derives its elegance from the beauty and versatility of mathematics and its utility from data driven decisions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s