
Discovery, Truth and Utility: Defining ‘Data Science’

Gregory Piatetsky-Shapiro knows a thing or two about extracting insight from data. He co-founded the first Knowledge Discovery and Data Mining workshop in 1989, which we briefly discussed in the second installment of this series of blogs. And he has been practicing and teaching more or less continuously ever since.

But what is it, exactly, that he has been practicing? Even Piatetsky-Shapiro might struggle to give you a consistent answer to that question, as this quote of his from 2012 hints:

Although the buzzwords describing the field have changed – from ‘knowledge discovery’ to ‘data mining’ to ‘predictive analytics’, and now to ‘data science’, the essence has remained the same – discovery of what is true and useful in mountains of data.

We like this quote a lot. Firstly, because it speaks to the fact that historically we have used at least four different terms – knowledge discovery, data mining, predictive analytics and data science – to describe substantially the same thing. The tools, techniques and technologies that we use continue to evolve, but our objective is basically the same.

And the second reason that we like this quote so much is that it contains three words that we think are key to understanding the analytic process.

Discovery. True. And Useful.

Let’s take each of these in turn.

Analytics is fundamentally about discovery. It’s about revealing patterns in data that we didn’t know existed – and extrapolating from them to try and know things that we otherwise wouldn’t know.

In fact, the analytic discovery process has more in common with research and development (R&D) than with software engineering. If we are doing it right, we should have a reasonably clear idea about the business challenges or opportunities that we are trying to address – for example, we may want to try and measure customer sentiment to establish if it is correlated with store performance and to understand which parts of the shopping experience we should try to improve to increase customer satisfaction. Or we might want to predict the failure of train-sets based on patterns in sensor data. But often we won't know which approach is likely to be most successful, whether the data available to us can support the desired outcome – or even whether the project is feasible at all. And that means – first and foremost – that whatever we call it, analytics is about experimentation. Repeated experimentation. As Foster Provost and Tom Fawcett put it in their (excellent) textbook Data Science for Business: “the results of a given step may change the fundamental understanding of the problem.” Traditional notions of scope and requirements are therefore often difficult to apply to analytics projects.
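To make that concrete, here is a minimal sketch in Python of the kind of experiment we have in mind: testing whether store-level customer sentiment is correlated with store performance. The data, column names and library choices (pandas, SciPy) are our own illustrative assumptions, not part of any particular project or product.

```python
# A hypothetical experiment: is store-level customer sentiment correlated
# with weekly sales? All numbers and column names are invented for illustration.
import pandas as pd
from scipy.stats import pearsonr

stores = pd.DataFrame({
    "store_id":        [1, 2, 3, 4, 5, 6],
    "sentiment_score": [0.62, 0.48, 0.71, 0.55, 0.80, 0.42],   # e.g. mined from reviews
    "weekly_sales":    [118000, 97000, 131000, 104000, 142000, 88000],
})

r, p_value = pearsonr(stores["sentiment_score"], stores["weekly_sales"])
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")

# With so few stores the p-value may not clear a sensible significance threshold --
# exactly the kind of result that sends us back to data collection or problem
# framing, rather than representing an outright "failure".
```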

Secondly, whilst many process models have been developed to try and codify the analytic process and so make it more reliable and repeatable – of which the Cross-Industry Standard Process for Data Mining (CRISP-DM) is probably the most successful and the most widely known – the reality is that analytics is an iterative, rather than a linear, process. We can’t simply execute each step of the process in turn and hope that insight will miraculously “pop” out of the end. An unsuccessful attempt at modelling, say, customer propensity-to-buy may cause us to revisit the data preparation step to create new metrics that we hope will be more predictive. Or it may cause us to realize that we are insufficiently clear in our understanding of the business problem – and require us to start over. One important outcome of all of this is that “failure” rates for analytics initiatives are high. Often, these “failures” really aren’t failures in the traditional sense at all – rather, they represent important learning about which approaches, tools and techniques are relevant to a particular problem. The industry refers to this as “fail fast”, although it might be more appropriate to call it a “learn quick” approach to analytics. But whatever we call it, this high failure rate has important consequences for the way we organize and manage analytic projects, which we will return to later in this series.
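In code, the iterative loop we are describing might look something like the sketch below: if a first modelling attempt falls short of a target, we go back to data preparation, derive additional features and try again. The dataset, the feature sets and the AUC target are all hypothetical, and this is only the spirit of CRISP-DM in miniature, not the standard itself.

```python
# A bounded "fail fast / learn quick" loop: evaluate, and if the result is not
# good enough, return to data preparation and widen the feature set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3, random_state=0)

features = X[:, :4]          # start with a deliberately limited feature set
target_auc = 0.80            # hypothetical business-agreed quality bar

for iteration in range(3):
    auc = cross_val_score(LogisticRegression(max_iter=1000), features, y,
                          scoring="roc_auc", cv=5).mean()
    print(f"iteration {iteration}: AUC = {auc:.3f}")
    if auc >= target_auc:
        break
    # Back to data preparation: derive/add more candidate features and retry.
    features = X[:, : 4 + 2 * (iteration + 1)]
```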

There are many ways in which data can mislead us rather than inform us. Sometimes we find results that appear to be interesting, but that are not statistically significant. We may conflate correlation with causality. Or we may be misled by Simpson’s paradox. Paradoxically, as Kaiser Fung points out in his book Numbersense, big data can get us into big trouble, by multiplying the number of blind alleys and irrelevant correlations that we can chase – and so causing us to waste precious time and organizational resources.
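Simpson’s paradox is easy to demonstrate with a toy example. In the sketch below, the numbers are made up purely to show the reversal: the trend inside each group is positive, yet the pooled data show a negative trend – exactly the kind of result that can mislead us if we never look below the aggregate.

```python
# Simpson's paradox on invented data: positive correlation within each group,
# negative correlation when the groups are pooled.
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 3 + ["B"] * 3,
    "x":     [1, 2, 3, 6, 7, 8],
    "y":     [11, 12, 13, 1, 2, 3],
})

print("Pooled correlation:", round(df["x"].corr(df["y"]), 2))   # negative
for name, g in df.groupby("group"):
    print(f"Group {name} correlation:", round(g["x"].corr(g["y"]), 2))  # +1.0 in both
```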

But something even more basic can also trip us up: data quality. The most sophisticated techniques, algorithms and analytic technologies are still hostage to the quality of our data.  If we feed them garbage, garbage is what they will give us in return.

We cannot automatically assume that data are “true” – in particular, because the data that we are seeking to re-use and re-purpose for our analytics project are likely to have been collected to serve very different purposes. Analytics of the sort that we are undertaking may never have been intended or foreseen. That is why the CRISP-DM model places so much emphasis on “data understanding”: it is important that we first establish whether the data that are available to us are “fit for purpose” – or whether we need to change our purpose, get better data, or both.
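As a rough illustration of that data understanding step, the sketch below profiles a small, hypothetical sensor table before any modelling: it counts missing readings, flags a legacy sentinel code masquerading as a valid value, and checks whether the data actually cover the period we care about. The column names and the -999 sentinel are invented for the example.

```python
# Quick, pre-modelling profiling of repurposed data: is it fit for our purpose?
import numpy as np
import pandas as pd

sensors = pd.DataFrame({
    "unit_id":     [101, 101, 102, 102, 103],
    "reading":     [0.7, -999.0, 0.9, np.nan, 1.1],   # -999 is a legacy "no reading" code
    "recorded_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01",
                                   "2024-01-03", "2024-01-02"]),
})

# Garbage-in checks: genuine gaps and sentinel codes pretending to be data.
print("missing readings: ", sensors["reading"].isna().sum())
print("sentinel readings:", (sensors["reading"] == -999.0).sum())

# Fit-for-purpose check: does the data cover the period the analysis needs?
print("coverage:", sensors["recorded_at"].min().date(), "to",
      sensors["recorded_at"].max().date())
```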

Defining data science

So how then, should we define data science? Spend 10 minutes with Google and you will find plenty of contradictory definitions. Our personal favorite is –

Data Science = Machine Learning + Data Mining + Experimental Method

It may lack mathematical rigor, but it’s short, sweet – and, if we say so ourselves – spot-on!



Author: Martin Willcox

Martin leads Teradata’s EMEA technology pre-sales function and organisation and is jointly responsible for driving sales and consumption of Teradata solutions and services throughout Europe, the Middle East and Africa. Prior to taking up his current appointment, Martin ran Teradata’s Global Data Foundation practice and led efforts to modernise Teradata’s delivery methodology and associated tool-sets. In this position, Martin also led Teradata’s International Practices organisation and was charged with supporting the delivery of the full suite of consulting engagements delivered by Teradata Consulting – from Data Integration and Management to Data Science, via Business Intelligence, Cognitive Design and Software Development.

Martin was formerly responsible for leading Teradata’s Big Data Centre of Excellence – a team of data scientists, technologists and architecture consultants charged with supporting Field teams in enabling Teradata customers to realise value from their Analytic data assets. In this role Martin was also responsible for articulating Teradata’s Big Data strategy to prospective customers, analysts and media organisations outside of the Americas. During his tenure in this position, Martin was listed in dataIQ’s “Big Data 100” as one of the most influential people in UK data-driven business in 2016. His Strata (UK) 2016 keynote can be found at: www.oreilly.com/ideas/the-internet-of-things-its-the-sensor-data-stupid; a selection of his Teradata Voice Forbes blogs can be found online here; and more recently, Martin co-authored a series of blogs on Data Science and Machine Learning – see, for example, Discovery, Truth and Utility: Defining ‘Data Science’.

Martin holds a BSc (Hons) in Physics & Astronomy from the University of Sheffield and a Postgraduate Certificate in Computing for Commerce and Industry from the Open University. He is married with three children and is a solo glider pilot, supporter of Sheffield Wednesday Football Club, very amateur photographer – and an even more amateur guitarist.


Author: Dr. Frank Säuberlich

Dr. Frank Säuberlich leads the Data Science & Data Innovation unit of Teradata Germany. Part of his responsibilities is to make the latest market and technology developments available to Teradata customers. Currently, his main focus is on topics such as predictive analytics, machine learning and artificial intelligence.
After studying business mathematics, Frank Säuberlich worked as a research assistant at the Institute for Decision Theory and Corporate Research at the University of Karlsruhe (TH), where he was already working on data mining questions.

His professional career included the positions of a senior technical consultant at SAS Germany and of a regional manager customer analytics at Urban Science International. Frank has been with Teradata since 2012. He began as an expert in advanced analytics and data science in the International Data Science team. Later on, he became Director Data Science (International).

