
When you set out on a journey, do you charge ahead with no idea of your destination or the best route? The likely answer is no, and the same applies to analysing a data set. For data teams, exploratory data analysis (EDA for short) is critical in analysing and actioning big data. Much like how you’d consider different travel modes and routes, exploratory analysis empowers data scientists to uncover the best data models and hypotheses to test.

Exploratory data analysis is defined as the set of techniques and tools a data science team uses to understand a new data set’s characteristics and patterns and to form potential hypotheses. It’s a foundational part of every data project: preparing and exploring the data is commonly estimated to take around 80% of a data scientist’s time.

What is the purpose of exploratory data analysis?

There are many reasons why data teams perform exploratory analysis, but they ultimately boil down to efficiency and insight. EDA allows a data scientist to quickly get a handle on a new data set, including the work needed to prepare it for deeper analysis and the models most suited to extracting insights.

While undertaking EDA, data scientists can begin to assess how the data connects, allowing them to check pre-existing assumptions and draft hypotheses to guide later examination. This early understanding also informs the choice of preliminary models.

During exploratory analysis, the quality and accuracy of data are checked, ensuring the best possible results later on and building stakeholder trust in the data. This is vital as bad data costs organisations, on average, U.S. $12.9 million per year due to inaccurate results, missed revenue opportunities and decreased productivity.

Indeed, video game software development company Unity lost an estimated U.S. $110 million due to low-quality data. Its platform ingested bad data from a large customer, which hurt the accuracy of its machine-learning-driven Audience Pinpointer tool. This not only affected growth but also cost the company money to remedy the algorithm, and the news contributed to a 37% drop in Unity’s share price. Would the bad data have been spotted during a robust exploratory analysis? Most likely. It underscores why taking the time to do an initial analysis of new data can save a lot of time, resources and stress later on.

Exploratory data analysis techniques

Given that exploratory analysis is, as its name implies, an exploration of your data, it should be little surprise that there are many different techniques your data team can use for a successful EDA, and your chosen steps will differ with each data set. However, some common exploratory data analysis steps repeat for every project.

Step 1: Understand your EDA data sources

Knowing where your data comes from is the first step in assessing its accuracy and the work needed to prepare it for analysis. These sources can include databases, spreadsheets, text files, or web scraping tools.
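In Python, for instance, a minimal sketch of pulling data from a few common source types with pandas might look like this; the file names, connection string and URL are hypothetical placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# Spreadsheet and flat-file sources (hypothetical file names)
orders_csv = pd.read_csv("orders_2023.csv")
orders_xlsx = pd.read_excel("orders_2023.xlsx", sheet_name="orders")

# Relational database source (hypothetical connection string and table)
engine = create_engine("postgresql://user:password@localhost:5432/sales")
orders_db = pd.read_sql("SELECT * FROM orders", engine)

# Web source: read_html pulls simple HTML tables from a page (hypothetical URL)
web_tables = pd.read_html("https://example.com/orders-report")

# Tag each frame with its origin so provenance survives later merging
for name, frame in {"csv": orders_csv, "xlsx": orders_xlsx, "db": orders_db}.items():
    frame["source"] = name
```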

Step 2: Check the data structure

Understand the data set’s general shape and format. If consolidating different data types, consider how you can combine the data sets so that an algorithm can analyse them. By the end of this stage, your data should be organised into a format that’s easy to work with, such as a table.
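Continuing in pandas, a rough sketch of this stage might check each source’s shape and types, harmonise them and stack them into a single table; the column names and values here are made up for illustration.

```python
import pandas as pd

# Two hypothetical extracts of the same data with slightly different schemas
orders_csv = pd.DataFrame({
    "order_id": [1, 2],
    "order_date": ["2023-01-05", "2023-01-09"],
    "amount": [120.0, 75.5],
})
orders_db = pd.DataFrame({
    "order_id": [3, 4],
    "order_ts": ["2023-01-11", "2023-01-14"],
    "amount": [33.0, 210.0],
})

# Check the general shape and column types of each source
for name, frame in {"csv": orders_csv, "db": orders_db}.items():
    print(name, frame.shape)
    print(frame.dtypes)

# Harmonise column names and types, then stack into one easy-to-work-with table
orders_db = orders_db.rename(columns={"order_ts": "order_date"})
for frame in (orders_csv, orders_db):
    frame["order_date"] = pd.to_datetime(frame["order_date"])

orders = pd.concat([orders_csv, orders_db], ignore_index=True)
orders.info()
```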

Step 3: Assess the data’s quality

At this point, look for any missing values, duplicate data, unwanted entries or other common data quality issues that may derail your future data analysis.
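A few one-liners in pandas cover most of these checks; the table below is hypothetical and seeds in the kinds of problems you’d be hunting for.

```python
import numpy as np
import pandas as pd

# Hypothetical combined table with deliberate quality problems
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount": [120.0, 75.5, 75.5, np.nan, -999.0],   # -999 used as a sentinel
    "region": ["North", "South", "South", None, "north"],
})

print(orders.isna().sum())                           # missing values per column
print(orders.duplicated().sum())                     # fully duplicated rows
print(orders["region"].value_counts(dropna=False))   # inconsistent or unwanted entries

# Typical clean-up: drop duplicates, convert sentinel values, normalise categories
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].replace(-999.0, np.nan)
orders["region"] = orders["region"].str.title()
```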

Step 4: Understand the content of the data

Once you’ve cleaned, formatted and checked your data, you can run preliminary analyses to understand more about the data’s features and values, and how they relate to one another across your data sets. Depending on the amount and type of data, you may use univariate, bivariate and multivariate analysis. Spatial, text and time series analysis are other options.
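As a sketch of what these analyses can look like in pandas (using randomly generated stand-in data), univariate, bivariate and multivariate views might be as simple as:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical cleaned table of 500 orders
orders = pd.DataFrame({
    "amount": rng.gamma(2.0, 50.0, size=500),
    "items": rng.integers(1, 6, size=500),
    "region": rng.choice(["North", "South", "East"], size=500),
})

# Univariate: summarise one column at a time
print(orders["amount"].describe())
print(orders["region"].value_counts())

# Bivariate: one variable against another
print(orders.groupby("region")["amount"].mean())

# Multivariate: pairwise correlations across the numeric columns
print(orders[["amount", "items"]].corr())
```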

You may bring in data visualisation tools and techniques here to help you explore patterns and relationships in the data. This can also help set stakeholder expectations and gain buy-in for deeper analysis.
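A quick visual pass might use seaborn’s pairplot, which draws pairwise scatter plots and per-column distributions in one grid; the data here is again randomly generated for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
orders = pd.DataFrame({
    "amount": rng.gamma(2.0, 50.0, size=300),
    "items": rng.integers(1, 6, size=300),
    "region": rng.choice(["North", "South"], size=300),
})

# Pairwise scatter plots and distributions, coloured by a category, give a fast
# visual read on patterns and relationships before any formal modelling
grid = sns.pairplot(orders, hue="region")
grid.savefig("orders_pairplot.png")
```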

Exploratory analytics is an iterative process that you may repeat several times to fully understand and process the data. Since around 80% of a data scientist’s time is spent preparing data, don’t be alarmed if your data team spends a significant amount of time completing EDA; it’ll likely save time and resources in the future.

Exploratory data analysis tools

Similar to how EDA steps are unique to each data set, no one tool will be used consistently across all data projects. Your data science team’s mindset towards EDA will have more influence on the success of your data analysis than the tools they choose to use. An open and curious approach to data creates the right culture for EDA to thrive.

Some tools and chart types you may come across during the EDA process include the following (a few of them are sketched in code after the list):

Box plot: useful for demonstrating the locality, spread and skewness of two or more groups of numerical data.

Histogram: helpful to understand the distribution of quantitative data within user-defined ranges.

Multi-vari chart: a visual way to assess variation in your data.

Run chart: this shows trends and patterns in your data over time.

Pareto chart: combining bars and a line, this chart shows the frequency of categories in descending order alongside their cumulative total, allowing you to prioritise.

Scatter plot: dots represent values and can help to picture the relationship between two variables.

Stem-and-leaf plot: each value is split into a ‘stem’ (usually the leading digit or digits) and a ‘leaf’ (usually the last digit). It can help you understand the shape of the data.

Parallel coordinates: this line plot will help you analyse the relationships between multiple variables.

Odds ratio: this measures the strength of an association between two events occurring. An exploratory data analysis example would be the likelihood of a customer in one demographic making a repeat purchase versus another customer in a different demographic.

Targeted projection pursuit: this is an interactive data exploration technique specific to high-dimensional data which aims to find features or patterns of interest.

Heat map: this shows how the variables in a data set correlate with one another.

Bar chart: you can use this to compare data in different categories.

Horizon graph: this compact, two-dimensional visualisation layers a time series into coloured bands, helping you compare data over a continuous period.

Interactive versions of some of these graphs and charts also exist and can be used to toggle between different data sets or initial insights.
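To make a few of the charts above concrete, here is a minimal Python sketch (matplotlib and seaborn assumed, with made-up data) of a histogram, a box plot, a correlation heat map and an odds ratio calculation.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
# Hypothetical data: purchase behaviour for two customer demographics
data = pd.DataFrame({
    "demographic": rng.choice(["A", "B"], size=400),
    "amount": rng.gamma(2.0, 40.0, size=400),
    "repeat_purchase": rng.choice([0, 1], size=400, p=[0.6, 0.4]),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: distribution of a quantitative column across binned ranges
axes[0].hist(data["amount"], bins=20)
axes[0].set_title("Purchase amount")

# Box plot: locality, spread and skewness of groups of numerical data
sns.boxplot(data=data, x="demographic", y="amount", ax=axes[1])
axes[1].set_title("Amount by demographic")

# Heat map: pairwise correlation between the numeric variables
sns.heatmap(data[["amount", "repeat_purchase"]].corr(), annot=True, ax=axes[2])
axes[2].set_title("Correlation")
fig.savefig("eda_charts.png")

# Odds ratio: strength of association between demographic and repeat purchase
counts = pd.crosstab(data["demographic"], data["repeat_purchase"])
odds_a = counts.loc["A", 1] / counts.loc["A", 0]
odds_b = counts.loc["B", 1] / counts.loc["B", 0]
print("Odds ratio (A vs B):", odds_a / odds_b)
```

An odds ratio well above or below 1 would suggest the two demographics differ in how likely they are to make a repeat purchase.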

There are several languages and tools a data science team can employ during EDA, from programming languages such as Python and R to platforms such as JMP, KNIME, Orange and Weka. The selection will depend on the nature of your data set and what you’re trying to interrogate, as well as the tools commonly used in your business and your data scientists’ knowledge.

How EDA fits into your team’s workflows

You can expect to perform EDA in some form within every data project. The techniques, tools and resources used will change depending on the size and quality of your data set, what the data will be used for, and how it aligns with wider business initiatives and goals. If the data is going to feed into a business-critical algorithm, as in Unity’s case, you’ll want the team to leave no stone unturned in their exploratory analysis. If the data is being analysed to quickly group employees for a company trip, your EDA is likely to be much shorter. Sometimes it will be better to do a quick round of data preparation and exploration before moving swiftly on to modelling.

It’s also worth remembering throughout exploratory data processes that not everything that appears interesting in your data needs to be followed up on. Many well-intentioned data projects have suffered from scope creep thanks to misplaced curiosity.

Key takeaways

Most exploratory data analyses will take time, but that time will be well spent. The long-term benefits of first preparing and understanding your data, then deciding on the best course of action, cannot be overstated. This foundational activity helps you tailor investigations, align analysis with business goals, uncover hidden truths and build trust in your data.