Drinking coffee might be associated with longer life … or shorter life.
It is difficult to overstate the ubiquity of big data and data science in our society. There are proclaimed and actual success stories promoted across the web (https://blog.storagecraft.com/real-data-analytics-success-stories/) as well as major flops (“The Parable of Google Flu: Traps in Big Data Analysis,” David Lazer et al., Science, 14 Mar 2014). It is helpful to understand a bit more about the emergence of the data science phenomenon so that we may clearly acknowledge where it does have value and which parts of it should be aligned with statistical science, or what I might call data analytical science.
It would be easy to focus on the many misapplications of statistics (i.e. of good data analytical science) that produce questionable results or even erroneous conclusions. I found a relevant statement in this regard in an article in the October 21, 2013 issue of The Economist (cover story, “How Science Goes Wrong”):
“Models which can be ‘tuned’ in many different ways give researchers more scope to perceive a pattern where none exists. According to some estimates, three-quarters of published scientific papers in the field of machine learning are bunk because of this ‘overfitting’, says Sandy Pentland, a computer scientist at the Massachusetts Institute of Technology.”
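Overfitting of this sort is easy to demonstrate for yourself. Here is a minimal sketch of my own (an illustration, not drawn from Pentland's work; the seed, sample size and polynomial degrees are arbitrary choices): fit polynomials of increasing degree to pure noise and watch the training fit improve steadily, even though there is no pattern to find.

```python
import numpy as np

# Illustrative sketch: fit polynomials of increasing degree to pure noise.
# The training fit improves even though there is no real pattern to find,
# i.e. the model "perceives a pattern where none exists."
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = rng.normal(size=20)          # pure noise: y has no relationship to x

r2s = []
for degree in (1, 4, 10):
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2s.append(1 - ss_res / ss_tot)   # training R^2 rises with degree
    print(f"degree {degree:2d}: training R^2 = {r2s[-1]:.2f}")
```

A flexible enough model will always "find" something in the training data; the question a statistician asks is whether that something would survive contact with new data.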
HOW and WHERE Did Data Science Arise?
So, what is data science as it is used today and how did it arise? My perspectives here have legitimate grounding in that I observed this in my business and other businesses around me, but I also acknowledge that there are other perspectives that may shed a different light on the subject. As with all social phenomena, large social movements arise in very complex ways and with many influences, and any attempt to distill such a movement into a simple story line does not do it full justice. Nonetheless, here is my perspective in a condensed narrative.
At the beginning of the millennium, the conduct of business through the internet, coupled with rapid advances in the capacity of computer storage, led to a greater than exponential growth of available data. This launched the era of Big Data. It fell to Information Technology (IT) professionals to set up computer systems for the collection, storage and retrieval of this information. With access to the data and the knowledge of how to manipulate it (so-called “data wrangling” or “data engineering,” terms that many now use as job titles), IT professionals were able to provide basic reports (i.e. descriptive statistics) to business leaders to help them understand what was happening with their business and where it was going. With the appearance of the landmark book, Competing on Analytics (Davenport and Harris, 2007), the business community had a clearer playbook for moving up the “analytics maturity curve” beyond descriptive statistics to greater levels of sophistication described as predictive analytics and optimization.
Now, many in the business community were completely unaware that there was a discipline of statistics; those who were aware thought that statistics was an obscure discipline relegated to scientific research, or an unrelated discipline associated with government statistics or simple sports accounting. [Space does not permit me to recount some rather amazing, perplexing conversations I had with business leaders about statistics and statisticians.] So, where we as a profession failed to capitalize on an enormous opportunity, the computer scientists stepped in. Nature abhors a vacuum. They began to turn their access to data into not only reporting but analysis … very often analysis without fundamental training in, or understanding of, data analytical science (i.e. statistical) principles. By 2010, the terms “data science” and “data scientist” were coming into vogue, and by 2013 the “new discipline of data science” was firmly entrenched in business (Wall Street Journal, Apr 12, 2013, “Data Science Emerges as New Discipline”). Of course, it has spread beyond the business world, and now many different people with many different backgrounds in many different fields identify themselves as Data Scientists, if for no other reason than to market themselves for job opportunities and perhaps higher salaries.
WHAT is a Data Scientist?
Many have struggled to describe what a data scientist is, and the WSJ article cited above does NOT list statisticians as making good data scientists, though it does acknowledge that mathematics and statistics are the field’s foundations. I captured the descriptions offered by 14 prominent, self-proclaimed data scientists (people who have written books or are regular bloggers on the topic) and created a word cloud of their descriptions, which is shown below.
Immediately apparent is the word “data,” for obvious reasons, but note that two of the next biggest words are “business” and “insight,” which supports my thesis above about the emergence of data science in the business world through IT. Upon closer inspection, one can see that “statistics” and “statistician” play a very minor role (appearing no larger than “computer”), even smaller than the word “someone,” indicating that the field is open to almost anyone! Words that are fundamental to the science of statistics, or data analytical science, such as uncertainty, probability, variability, prediction, design, error, confidence or inference, are absent or so small that one must look very closely to find them.
One particular data scientist, Bill Schmarzo, is the Chief Technology Officer of EMC Global Services, a prolific blogger and the author of two books on big data and data science. I have also seen him referred to as the Dean of Big Data (and sometimes, by association, the Dean of Data Science or a person with deep expertise in analytics). I met Bill in my former job, where he came to speak to a broad coalition of analytically minded people: statisticians, business analysts, engineers from manufacturing optimization, pharmacokineticists, epidemiologists and financial analysts.
Bill’s definition of a “data scientist” has morphed gradually over time. On 5 Jan 2015, Bill’s blog Thinking Like a Data Scientist – Part 1 stated, “The goal of the ‘thinking like a data scientist’ process is to identify, brainstorm and/or uncover new variables that are better predictors of business performance.” Of note is the phrase “business performance,” which again further supports my thesis on the origins of data science. (http://www.grroups.com/blog/thinking-like-a-data-scientist-part-i)
On 25 Dec 2017 (Christmas Day?!?!?), Bill posted a slightly different definition in a blog titled Redefined Thinking Like a Data Scientist Series: “Data science is about identifying variables and metrics that might be better predictors of performance.” (https://infocus.dellemc.com/william_schmarzo/refined-thinking-like-a-data-scientist-series/ )
There are two subtle changes of note: (1) he dropped the word “business,” implying that he was enlightened to the notion that analyzing data could go beyond the business world, which is an improvement in thinking (albeit perhaps a dangerous one); (2) “are better predictors” in the first definition became “might be better predictors.”
I told Bill that I completely agreed with this latter definition, which he shared at our business meeting, and that what I saw coming from self-proclaimed data scientists was exactly this mindset and approach. I also told him that I thought the discipline, defined this way, was not very useful. The key words are “might be.” I told Bill that I could never go to my management, whether business or scientific, with such a vague assessment. People want more than “might.” They want to know:
- How likely are these variables to be predictors?
- If I make a decision based on these variables that might be important, how likely am I to succeed or achieve my goals?
- What confidence is there in this finding of these variables as being important?
- What biases are there in the data that could make identification of these variables suspect?
The list goes on.
In fact, without even having any data in front of me, I can talk about what might be.
- Using SPF 50 sunscreen might be associated with skin cancer.
- Exposure to rosiglitazone might be associated with bladder cancer.
- A diet consisting of more fruit might help you live longer.
If you have access to data (anyone who wants to get it can these days) and a computer (everyone has one), then you can slam the data through a computer algorithm, and you WILL get an answer – what might be (see my blog on Association, Correlation and Causation). It is easy to get an answer; it is very difficult to assess the quality of that answer. I submit that data science, as routinely practiced today, is about getting answers. Data analytical science is about getting answers, BUT MORE IMPORTANTLY, assessing the quality of such answers (i.e. the reliability, validity, reproducibility, quantifying uncertainty, etc.). The data and the computer are agnostic. They don’t care what you are doing, what problem you are addressing (if any), what patterns you are trying to find or whether the answer is true (i.e. 1), false (i.e. 0) or somewhere in between (i.e. a probability between 0 and 1).
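To make this concrete, here is a small sketch of my own (purely illustrative; the algorithm, sample sizes and seed are my choices, not from any source): feed completely random features and coin-flip labels to a simple algorithm, here 1-nearest-neighbor, and it happily returns an “answer,” with perfect accuracy on the data it memorized and coin-flip accuracy on anything new.

```python
import numpy as np

# Illustrative sketch: an algorithm always returns an answer, even when
# the data contain nothing to find. Features are Gaussian noise and the
# labels are pure coin flips.
rng = np.random.default_rng(1)
n_train, n_test, d = 100, 200, 5
X_train = rng.normal(size=(n_train, d))
y_train = rng.integers(0, 2, n_train)   # labels: coin flips
X_test = rng.normal(size=(n_test, d))
y_test = rng.integers(0, 2, n_test)

def one_nn_predict(X_ref, y_ref, X):
    # 1-nearest-neighbor: label each point by its closest reference point
    d2 = ((X[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    return y_ref[d2.argmin(axis=1)]

train_acc = (one_nn_predict(X_train, y_train, X_train) == y_train).mean()
test_acc = (one_nn_predict(X_train, y_train, X_test) == y_test).mean()
print(f"training accuracy: {train_acc:.2f}")  # memorization: always 1.00
print(f"test accuracy:     {test_acc:.2f}")   # roughly 0.5, i.e. coin flipping
```

The algorithm did exactly what it was asked; it is the assessment of the answer's quality, not the production of the answer, that required any thought.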
In our present time, the biggest problem is not analyzing data, but rather interpreting data. That is the hard part, and the value that a data analytical scientist (i.e. statistician) can bring. It takes thoughtful consideration and deep knowledge about issues of bias, over-fitting, multiplicity, spurious correlation, etc. to understand the credibility of an answer … not to mention some understanding of the theory and assumptions that go into any algorithmic technique or statistical model.
It is here that I must note that statisticians do not always think clearly about such issues, and I must confess to my fair share of misinterpretations of data over my many years in the pharmaceutical industry. Probability, uncertainty or inferring what is true or false from data is extremely difficult. Mistakes are part of humanity, and I can only hope that my years of experience have taught me to be more thoughtful and have given me better insights into how to interpret data and analytical results.
It is easy to criticize data scientists for mistakes and approaches that ignore basic, fundamental statistical or probabilistic principles. However, they filled a void that many statisticians neglected, ignored or perhaps refused to fill: a willingness to tackle big, messy data, to provide approximate answers to important questions, and so on. I must look myself in the mirror when I give these criticisms and ask myself why I failed to act more decisively or aggressively over the last decade during the emergence of data science.
As Data Science programs arise in many universities, there are some laudable efforts by Statistics Departments to collaborate with them (e.g. at Purdue). However, this appears to be the exception rather than the rule. Even in the programs I have seen, the statistical elements of the Data Science curriculum merely teach more statistical methods without the critical thinking that is needed in today’s world awash with data. What is the source of the data, and what biases might it contain? How do I handle massive multiplicity issues when selecting amongst thousands of variables? How do I make causal interpretations of any findings?
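The multiplicity question can be made concrete with a short simulation of my own (the numbers are made up for illustration): screen a couple of thousand pure-noise predictors against a pure-noise outcome and, at the conventional 5% significance level, roughly 5% of them will look “significant” by chance alone.

```python
import numpy as np

# Illustrative multiplicity sketch: 2,000 candidate predictors, all pure
# noise, screened against an outcome that is also pure noise.
rng = np.random.default_rng(42)
n, p = 100, 2000
X = rng.normal(size=(n, p))      # candidate "predictors": all noise
y = rng.normal(size=n)           # outcome: also noise

# Pearson correlation of each column of X with y
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)

critical_r = 1.96 / np.sqrt(n)   # approximate two-sided 5% threshold
n_hits = int((np.abs(r) > critical_r).sum())
print(f"'significant' predictors found by chance: {n_hits} of {p}")
```

With no adjustment for the thousands of comparisons, on the order of a hundred noise variables clear the nominal threshold; naive variable selection hands you a long list of “predictors” that predict nothing.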
Just like me, I am sure you can identify many examples – some good and many bad – of how data has been used (or misused) to solve a problem in some arena of our society. We need to be optimistic about our data and analysis and simultaneously skeptical about our interpretations. I am reminded of the quote from H. L. Mencken (Prejudices: Second Series, 1920), “There is always a well-known solution to every human problem—neat, plausible, and wrong.”