
Data science is a practice and a whole body of knowledge that allows scientists and practitioners to probe how different systems work. I understand this is not too revealing and probably not what you expected from a post that preaches to demystify data science, but bear with me and by the end of this article, I guarantee that you will have a good grasp of what data science is.

Without further ado, let’s begin. When I was researching this article, I started where I normally start: I ran a quick search on Google for the term “what is data science”. Lo and behold, to my surprise I got about 830,000 results. Some were more informative than others. However, the sheer number of results is enough to make someone wonder what this field that everyone keeps mentioning actually is.

Digging deeper into this search, here are some of the definitions I found.

According to IBM,

“Data science is a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations. Data science encompasses preparing data for analysis and processing, performing advanced data analysis, and presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.” [1]

Berkeley defines data scientists as data professionals who

“are able to identify relevant questions, collect data from a multitude of different data sources, organize the information, translate results into solutions, and communicate their findings in a way that positively affects business decisions.” [2]

Moreover, David Donoho, professor of Statistics at Stanford, defines data science as “the science of learning from data, with all that this entails.” [3]

Finally, Wikipedia tries to formulate a more encompassing definition of data science from multiple perspectives as

“Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.” [4]

Wonderful, right? I am not sure about you, but these definitions do nothing to clarify what data science is for me. If anything, they make it even more intriguing and opaque. So, I decided to go a different route and start from the beginning.

A historical excursion into data science

Data science was born a few decades ago, possibly centuries ago, when statistics and probability theory emerged. That moment marked a turning point in our approach to studying the world around us. We started to understand that things are not all deterministic but rather contain an element of chance. From that moment forward, we started to develop methods to extract meaning from practical observations of different phenomena. But the number of observations was still relatively small in comparison with the data available today.

Fast forward to the 1960s and 1970s, when computer science started to be used alongside statistics and database management systems were first introduced. This marked the beginning of the marriage between statistics and computer science. From this moment onwards, people started to collect and store vast amounts of data. A word of caution, though: those volumes pale in comparison with the capabilities of current systems, which see approximately 2.5 quintillion bytes of data created each day [5].

At this stage, the story splits into two narratives: a commercial and a science-centric story.

The commercial narrative

As data accumulated, business data exploded and the term big data analysis (another cryptic term, I know, which I will try to demystify in a different post) was coined in the early 1990s. When these large amounts of data started to pile up, people started to wonder if they could be of any use.

In parallel to the data explosion, other things were happening. Computers were becoming better and cheaper. We were able to process larger amounts of data at lower costs, and the prospects were looking great.

Moreover, as computers became widely available, more and more people turned their attention towards making them do cool things, so the development of new algorithms intensified.

Finally, the availability and capabilities of data collection mechanisms and devices, such as sensors, increased, which meant that more sources of information became available, leading in turn to even more data being collected.

Taken together, these events led businesses to employ statisticians, mathematicians and scientists to try to create value from the massive amounts of data they held. In other words, companies and governments realised that they could make something more out of the data they would store anyway.

From this point on, the rest is history. The world saw an increasing number of developments and applications centred around the use of mathematics, statistics, and computer science to process and analyse large data sets, from new and improved algorithms to fully automated decision-making systems.

The scientific narrative

The idea of data science first appeared in academic circles among statisticians during the 1960s, when the famous John Tukey made an open call to reform the rather conservative field of statistics [6]. He was concerned with where statistics was headed and pointed to an unrecognised science interested in learning from data. Following this, during the 1980s, John Chambers and Leo Breiman also campaigned for statisticians to expand their boundaries and focus more on data preparation and presentation versus modelling (Chambers) or on prediction versus classical inference (Breiman). Around the same time, Bill Cleveland introduced the name “Data Science” for the newly envisioned field [3].

This notion of data science as a science stemmed from the idea that academic statistics should focus more on learning from data. According to Donoho, these endeavours resulted in six different activity streams: (1) data exploration and preparation, (2) data representation and transformation, (3) computing with data, (4) data modelling, (5) data visualisation and presentation, (6) science about data science [3]. Let’s have a look at each of these streams.

Data exploration and preparation refers to the effort of exploring the data to sanity-check basic properties and reveal unexpected features, and of preparing it by removing anomalies and artefacts through operations like grouping, smoothing, or subsetting.
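To make this concrete, here is a minimal sketch of what exploration and preparation can look like in Python with pandas, on a small, hypothetical table of sensor readings (the column names and values are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings: a device id and a temperature value.
df = pd.DataFrame({
    "device": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "temperature": [21.3, 21.5, 250.0, 21.4, 19.8, 20.1, np.nan, 20.3],
})

# Exploration: sanity-check basic properties and look for unexpected features.
print(df.describe())
print(df.isna().sum())

# Preparation: remove anomalies and artefacts (an implausible reading, a missing value).
clean = df.dropna()
clean = clean[(clean["temperature"] > -40) & (clean["temperature"] < 60)]

# Grouping and smoothing: per-device averages and a rolling mean to reduce noise.
print(clean.groupby("device")["temperature"].mean())
print(clean["temperature"].rolling(window=3, min_periods=1).mean())

# Subsetting: keep only the readings from one device.
subset = clean[clean["device"] == "a"]
print(subset)
```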

Data representation and transformation represents the work required to convert certain types of data into defined mathematical structures which are subsequently used for modelling or analysis. For example, transforming a 2D image into a 1D vector.
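As a quick illustration of such a transformation, here is a tiny NumPy sketch that flattens a hypothetical, randomly generated 2D image into a 1D vector:

```python
import numpy as np

# A hypothetical 28x28 grayscale image: a 2D array of pixel intensities.
image = np.random.rand(28, 28)

# Flatten it into a 1D vector of length 784, the kind of representation
# many classical models (linear models, SVMs, etc.) expect as input.
vector = image.reshape(-1)
print(image.shape, "->", vector.shape)   # (28, 28) -> (784,)
```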

Computing with data groups practical and theoretical aspects of computer science, such as programming languages, coding, distributed and cloud computing, or big data technologies.

Data modelling is generally split into two parts. First, one can talk about generative models, where one proposes a stochastic model to explain the process generating the observed data, together with methods for inferring properties of how the data were generated. Second, there are predictive methods, which are judged by how well they predict on a given data set and how well they generalise to new, unseen data. This approach falls under modern machine learning.
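The contrast is easier to see in code. Below is a toy sketch on synthetic data, using SciPy for the generative, inferential view and scikit-learn for the predictive view; the data-generating process and the model choices are illustrative assumptions, not a recipe:

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: a linear trend plus Gaussian noise (an illustrative assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)

# Generative view: propose a stochastic model (linear trend + Gaussian noise)
# and infer its parameters and their significance.
fit = stats.linregress(x, y)
print("estimated slope:", fit.slope, "p-value:", fit.pvalue)

# Predictive view: fit a flexible model and judge it purely by how well
# it predicts new, unseen data.
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, random_state=0
)
model = RandomForestRegressor(random_state=0).fit(x_train, y_train)
print("R^2 on unseen data:", model.score(x_test, y_test))
```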

Data visualisation and presentation represents theoretical and practical aspects of visualising data and making inferences from its visual representations.

Science about data science refers to activities such as identifying, evaluating or documenting different analysis or processing workflows.

Therefore, the scientific narrative stems from a necessity that arose within academic statistics departments: as data became available, a few scientists pushed for the development of new methods and techniques to learn from the available data and to uncover the insights hidden in it.

The technical perspective

Data science, as the name implies, reunites two concepts: data and science.

Let’s talk first about data. The big revolution in the field came about when big data technology became mainstream. Today, data science relies on a myriad of enabling technologies from big data analytics to visualisation tools.

Below I will list the most common technologies enabling data science today. I will not go into too much detail on each one, as each deserves to be covered in a series of articles.

● Big data analytics

○ Leverage distributed computing and analytics to investigate data sets that have a large volume, high velocity, and high variety.

○ The data is no longer just a collection of numbers in rows and columns. Today we collect and analyse all types of data, from structured data in spreadsheet-like formats to unstructured data such as images, text, or sound waves.

● Distributed computing

○ A way of computing in which a task is broken down into smaller pieces that can be handled individually, possibly on machines that are geographically dispersed (see the sketch after this list). This is made possible by advances in cloud computing, which enable low-cost, scalable distributed computing technologies.

● Data analytics programmes

○ Once the data processing pipelines are set up, different methods are used to ask questions and/or to probe hypotheses about the behaviour of the system generating the data.

● Other technologies

○ Data infrastructure technologies that affect how data is shared, processed and consumed;

○ Data management technologies which define how structured and unstructured data are handled;

○ Visualisation technologies are critical for effectively communicating results to non-experts and stakeholders.
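To illustrate the “break the task into smaller pieces” idea behind distributed computing, here is a minimal, single-machine sketch using Python’s multiprocessing module; in a real distributed system the pieces would be shipped to separate machines by a framework such as Spark, but the principle is the same:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each piece is handled independently; in a true distributed system this
    # function would run on a separate machine instead of a local process.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Break the task into smaller pieces...
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    # ...handle the pieces in parallel, then combine the partial results.
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)  # same answer as sum(data), computed piecewise
```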

The science part of data science refers to what we do with the data. Traditionally, when people looked at data to find answers, they started with a hypothesis. Based on this hypothesis, they would decide what and how much data was needed to either confirm or refute it. This approach is driven by the hypothesis and by intent.

With the advent of big data, there was also a paradigm shift. Today we don’t always start with a hypothesis in mind. Instead, we let the data talk to us, to tell us its story. To this end, we employ different algorithms, statistical methods or mathematical models to discover the hidden patterns in the data. There is no hypothesis anymore; we simply try to find patterns without prior knowledge or experience.
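A small sketch may help contrast the two approaches. The data below are synthetic and the question (“do two groups differ?”) is invented purely for illustration; the first half follows the hypothesis-driven route, the second lets a clustering algorithm look for structure with no hypothesis at all:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

# Synthetic measurements from two hypothetical groups of customers.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=100)
group_b = rng.normal(loc=11.0, scale=2.0, size=100)

# Hypothesis-driven: we start with a specific question ("do the two groups
# differ?") and collect exactly the data needed to confirm or refute it.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t-test p-value:", p_value)

# Data-driven: no hypothesis up front; let an algorithm look for structure
# (here, clusters) in whatever data we already have.
observations = np.concatenate([group_a, group_b]).reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(observations)
print("discovered cluster sizes:", np.bincount(labels))
```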

Thus, data science is the combination of data, meaning everything related to data collection, storage, and processing, and science, meaning hypothesis testing or pattern discovery and recognition.

The process perspective

In addition to the historical and technical perspectives, you can also look at data science from a process point of view. One can consider data science as the process of turning raw data into meaningful knowledge or insights.

Taking this view, data science becomes a four-step process: (1) planning, (2) wrangling, (3) modelling, and (4) applying.

Planning

Planning is all about the initial phases of a project. This stage involves goal definition, problem framing, organising resources (tools, office space, and people), coordinating and allocating the work involved, and finally scheduling the project.

Usually, this phase requires a great deal of business knowledge to be transferred from stakeholders to the project team in addition to all the other activities.

Wrangling

Now that you know what you want to achieve, you move to the next phase: the data. During this phase you usually go through multiple iterations of getting data, cleaning data, exploring data, and then refining data.

Here, domain-specific knowledge and business acumen will play a critical role alongside your technical skills in determining what data you need, how and where to get it, and whether the data you have can be used to achieve your goal.
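As a rough idea of what one wrangling iteration can look like, here is a short pandas sketch; the file name, columns, and cleaning rules are hypothetical and would depend entirely on your project:

```python
import pandas as pd

# Get: hypothetical raw customer data pulled from a CSV export.
raw = pd.read_csv("customers.csv")

# Clean: drop duplicates, remove implausible values, normalise text fields.
clean = (
    raw.drop_duplicates()
       .query("age > 0 and age < 120")
       .assign(country=lambda d: d["country"].str.strip().str.upper())
)

# Explore: summaries and missing-value counts often send you back to cleaning.
print(clean.describe(include="all"))
print(clean.isna().sum())

# Refine: keep only the columns relevant to the goal defined during planning.
refined = clean[["customer_id", "age", "country", "monthly_spend"]]
```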

Modelling

This is the most fun part (at least for me). Here you get to showcase all your creativity and technical skills. During this phase, you create a model (e.g. SVMs, regressions, decision trees, Gaussian mixture models, artificial neural networks, etc.) to capture the observed behaviour of the system you are analysing.

Next, you validate the model to find out how well it generalises. In other words, you want to make sure that your model works on new and unseen data. If everything works well (in my experience it never does on the first try, but let’s assume for the sake of argument that it does), you move on to evaluate the model. Here you want to find out how well it fits the data and what its ROI is, so you can decide whether to implement this model in your live system.
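Here is a minimal validation-and-evaluation sketch with scikit-learn, using one of its bundled data sets and a logistic regression purely as an example model; the point is the held-out split and the cross-validation, not the specific model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A bundled example data set stands in for your project's data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out data the model will never see during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Validation: estimate how well the model generalises using cross-validation
# on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validated accuracy:", cv_scores.mean())

# Evaluation: a final check on truly unseen data before any deployment decision.
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```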

Finally, if any of these steps fail, you go back to the drawing board and refine your model until you either find a good model that captures the behaviour of your system or you give up, lose funding, or move to a different project.

Applying

You thought you were done? You created a model with decent generalisation and ROI and that’s it? I’m sorry to disappoint you, but your work doesn’t end here.

After you have created, validated, and evaluated the model, you need to present it to your stakeholders (managers, clients, etc). Then you deploy it, that is, you actually put the model into production so it starts creating value for the business or clients.

Next, you archive all your assets. You write documentation, comments, and other information required for someone who sees your model for the first time to understand, maintain, or upgrade it.

Finally, you keep monitoring the model. As you monitor and retrain the deployed model, you want to keep an eye on the data coming in, the robustness of the model, and any errors that might appear. If things go haywire, what do you do? I’m sure you’ve guessed it by now: you go back to the drawing board and check all the assumptions you’ve made during the previous steps.

The bare minimum to become a data scientist

In the previous sections, you saw that there are many moving parts and a great deal of specialised knowledge that might seem required to launch into data science. But that is not the case. When you first start, you don’t need to know all the tools and technologies available to you. All you need is a strong foundation.

That foundation sits on three pillars: (1) mathematics, (2) computer science, and (3) domain-specific knowledge. Let’s take a look at them.

Mathematics

I know, daunting. Most people have nightmares about it. However, to become a successful data scientist you cannot avoid maths. When I say this, I don’t mean you need to be a top pure mathematician. Far from it: all you need is a solid understanding of the main concepts in calculus & optimisation, linear algebra, probability theory, and statistics. If you master them, you will be capable of expanding your knowledge and understanding even the most complex algorithms available.

Computer science

Here things are similar. As a data scientist, you are more of a hacker. Most of your work will involve developing algorithms, mathematical models, or prototypes. Thus, you need to be proficient in a programming language (the most commonly used are Python and R). Moreover, you have to understand fundamental programming concepts such as data structures, algorithms, and different programming paradigms. Finally, you have to make databases your friends. All the data is stored somewhere, and that is very likely to be a relational or non-relational database. As such, if you want to access and work with data efficiently, you can’t get by without being proficient with databases.
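For instance, a large part of day-to-day work is simply pulling data out of a database before any analysis happens. Here is a minimal sketch using Python’s built-in sqlite3 module together with pandas; the database file, table, and columns are hypothetical:

```python
import sqlite3
import pandas as pd

# Pull aggregated data out of a relational database before any analysis starts.
# The database file, table, and columns are made up for this example.
conn = sqlite3.connect("sales.db")
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    WHERE order_date >= '2021-01-01'
    GROUP BY region
"""
df = pd.read_sql_query(query, conn)
conn.close()
print(df.head())
```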

Domain-specific knowledge

This is the toughest part if you ask me. Most of your work will be related to a specific domain, like predicting customer behaviour in telco, creating recommendations for supermarket shoppers, or securing systems from harmful attacks. To be successful at analysing those data, discovering insights, or creating models, you will need to understand the nature of the data available. In other words, you need to understand how the system you are analysing works, or at least know how others think it works. This gives you a massive edge when you analyse and model data because you will develop an intuitive understanding of the data. But you also need to be careful: although an intuitive understanding of the system helps, you must avoid falling into the cognitive bias trap. Thus, to be successful in data science, you need to understand the workings of the system you are scrutinising, but always remain objective about your analysis and the models you are developing.

In conclusion, data science is a field where we use computers and mathematical and statistical tools and techniques to make sense of the world around us. Furthermore, defining data science depends greatly on the point of view. For example, academics might see data science as learning from data, while businesspeople might look at it as big data analytics and the encompassing technologies. Moreover, we saw that data science sits at the union of mathematics, computer science, and domain-specific knowledge. Thus, data science is a looking glass that we hold in our “back pocket” and which we can always take out and point at different processes or systems to get insights into how they work or make predictions about their behaviour.


About the author

Andrei Luchici is an artificial intelligence and machine learning advisor, trainer, and mentor. He is passionate about helping people and companies on their road to AI. Andrei started his career in ML by conducting research into how biological cells migrate in vivo. Right after his PhD, he co-founded a boutique consulting company, Dacian Consulting, where he helps startups and enterprises on their road to AI.

Recently, Andrei co-founded the Center for Intelligent Machines to conduct open research, work on AI innovation projects, advise companies on AI strategy, and design and conduct educational programmes in the field.

References:

[1] https://www.ibm.com/cloud/learn/data-science-introduction

[2] https://ischoolonline.berkeley.edu/data-science/what-is-data-science/

[3] http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

[4] https://en.wikipedia.org/wiki/Data_science

[5] https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/

[6] https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-33/issue-1/The-Future-of-Data-Analysis/10.1214/aoms/1177704711.full