Low-quality, fragmented, or insufficient data can significantly harm the performance of your AI applications.
This is especially true for applications that rely on machine learning (ML).
Data augmentation addresses this by generating more diverse data from existing datasets, helping you build more accurate and reliable machine learning models.
Learn more about data augmentation, the techniques used, and how it adds value to your AI-based projects.
What Is Data Augmentation? Definition and Real-World Example
IBM defines data augmentation as the use of “pre-existing data to create new data samples that can improve model optimisation and generalisability”.
To understand what this means in real life, we’ll take the example of a business finance application.
Example:
- A finance team is seeking to implement an app to identify patterns in customer behaviour that may indicate fraudulent activity.
- As it stands, the app’s un-augmented training dataset is imbalanced: cases of fraud are underrepresented. This raises the risk of the application’s ML algorithms becoming biased towards the majority class of legitimate transactions and failing to detect fraud effectively.
- More generally, there is an insufficient variety of fraud examples in the dataset for the app to become familiar with generalisable fraud patterns.
- The creation of augmented data can help to address these challenges. Using the (relatively few) pre-existing data entries relating to fraud as a starting point, the team generates additional fraudulent transactions. These new examples cover a broader range of feature combinations (for instance, varied transaction amounts, customer actions, locations, times, and device types); a minimal sketch follows this list.
- Having been exposed to a wider range of data, the application now has a broader understanding of what fraudulent activity can look like, enabling it to deliver more useful and reliable results.
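To make this concrete, here is a minimal sketch in Python of how a team might oversample the fraud class. The function name, the 5% noise scale, and the assumption that `fraud_rows` is a 2-D NumPy array of numeric features are all illustrative, not a prescribed method:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def augment_fraud_rows(fraud_rows: np.ndarray, n_new: int, noise_scale: float = 0.05) -> np.ndarray:
    """Create new fraud-like rows by perturbing the numeric features of real ones."""
    # Pick existing fraud rows (with replacement) to act as templates
    templates = fraud_rows[rng.integers(0, len(fraud_rows), size=n_new)]
    # Add small Gaussian noise scaled to each feature's spread, so that
    # transaction amounts, times, and similar values vary around realistic levels
    noise = rng.normal(0.0, noise_scale * fraud_rows.std(axis=0), size=templates.shape)
    return templates + noise
```

Categorical features such as device type or location would need separate handling (for example, resampling category values), and in practice libraries such as imbalanced-learn offer more principled oversampling methods such as SMOTE.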
Also Read: What is Data Drift and How Can You Detect it?
What Is the Difference between Augmented Data and Synthetic Data?
Data augmentation involves the creation of new data by applying changes to existing data. By contrast, synthetic data is created from scratch: entirely new instances that may resemble real data but are not directly copied or transformed from it.
These two techniques for enhancing datasets are often used in conjunction. In the example given above, you might expand your dataset partly through data augmentation (for instance, by copying data relating to existing transactions and swapping some of the feature values). At the same time, you might create entirely new instances covering scenarios that are not yet represented in the dataset, as sketched below.
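As a minimal sketch of the distinction, assuming a toy transaction record with made-up feature names and distributions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Augmentation: start from a real transaction and perturb or swap feature values
real_txn = {"amount": 120.0, "hour": 14, "device": "mobile"}
augmented_txn = {
    **real_txn,
    "amount": real_txn["amount"] * rng.uniform(0.8, 1.2),  # perturb an existing value
    "device": str(rng.choice(["mobile", "desktop"])),      # swap a feature value
}

# Synthesis: build an entirely new transaction from assumed feature distributions
synthetic_txn = {
    "amount": float(rng.lognormal(mean=4.0, sigma=1.0)),
    "hour": int(rng.integers(0, 24)),
    "device": str(rng.choice(["mobile", "desktop", "tablet"])),
}
```

The augmented record is traceable back to a real one; the synthetic record is not derived from any single existing row.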
What Are Common Data Augmentation Techniques?
The right data augmentation techniques depend on the nature of the data you are seeking to transform (e.g. image, text, or tabular data). Common data augmentation techniques across these various data categories include the following:
Visual Data Augmentation
Visual ML applications are usually required to identify and differentiate between different types of objects. Feeding the model a range of augmented images helps it to become more effective at recognising image patterns in real-world scenarios, taking into account variables such as brightness levels, orientations, backgrounds, and object distortions. Augmented visual data is often produced using the following types of transformations (a combined pipeline is sketched after the list):
- Geometric transformation. Rotating, flipping, scaling, or shearing images to simulate different positions and orientations.
- Colour adjustment. Altering the brightness, contrast, and hue of images to provide examples of various lighting conditions.
- Cropping and padding. Zooming in or out of existing images to simulate distance and scale, and to add context.
- Noise addition. Deliberately adding elements of blur, graininess, or other distortions to provide the model with examples of poor-quality or noisy images.
- Elastic deformation. Stretching, compressing, or warping images (useful for training models to recognise handwriting).
- Random erasure. Erasing random parts of existing images to simulate image occlusions or instances of partial missing data.
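As a minimal sketch, assuming the torchvision library is available, several of these transformations can be chained into a single augmentation pipeline (the specific parameter values are illustrative):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                          # geometric: rotation
    transforms.RandomHorizontalFlip(p=0.5),                         # geometric: flip
    transforms.ColorJitter(brightness=0.3, contrast=0.3, hue=0.1),  # colour adjustment
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),       # cropping and scale
    transforms.GaussianBlur(kernel_size=3),                         # noise / blur
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                               # random erasure (tensor-only)
])

# augmented_tensor = augment(pil_image)  # each call yields a differently transformed variant
```

Because the transforms are randomised, applying the pipeline repeatedly to the same source image yields many distinct training examples.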
Text Data Augmentation
Text data augmentation enables natural language processing (NLP) models to become more effective at tasks such as text classification, machine translation, and sentiment analysis. Common text augmentation methods include the following (two are sketched in code after the list):
- Sentence shuffling. Shuffling the order of sentences or clauses in a text data extract to familiarise the model with variations in word order.
- Paraphrasing. Rewording an extract, often with entirely different vocabulary, while preserving its meaning.
- Deletion / insertion. Inserting or removing random words from text extracts to simulate user text entry errors, while keeping the context intact.
- Back translation. Translating a text data extract into another language, and translating it back again. This often results in slightly different sentence structures, while preserving the original context.
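As a minimal sketch using only the Python standard library, random deletion and sentence shuffling might look like this (the sentence splitting is deliberately naive and assumes non-empty input):

```python
import random

random.seed(0)

def random_deletion(text: str, p: float = 0.1) -> str:
    """Drop each word with probability p to simulate text entry errors."""
    words = text.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept or [random.choice(words)])  # keep at least one word

def sentence_shuffle(text: str) -> str:
    """Shuffle sentence order within an extract (naive split on full stops)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."
```

Back translation, by contrast, typically relies on an external machine translation model or API rather than a few lines of standalone code.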
Audio Data Augmentation
The following techniques are often used to train models in activities such as speech recognition, audio classification, and sound detection (a combined sketch follows the list):
- Pitch shifting. Altering the pitch of audio data samples without altering their speed, so as to simulate a range of different tones.
- Speed variation. Altering the speed of audio (useful for speech tempo recognition).
- Noise addition. Inserting background noise or reverb into existing audio data extracts to simulate a range of real-world conditions.
- Volume adjustment. Increasing or decreasing the volume of audio extracts to simulate a range of recording or transmission conditions.
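A minimal sketch, assuming the librosa library and a mono clip `y` sampled at rate `sr`, might produce one variant per technique (the parameter values are illustrative):

```python
import numpy as np
import librosa

def augment_audio(y: np.ndarray, sr: int) -> dict:
    """Return several augmented variants of a single audio clip."""
    rng = np.random.default_rng(0)
    return {
        "pitch_shifted": librosa.effects.pitch_shift(y, sr=sr, n_steps=2),  # up two semitones, same speed
        "speed_varied": librosa.effects.time_stretch(y, rate=1.1),          # 10% faster, same pitch
        "noisy": y + 0.005 * rng.standard_normal(len(y)),                   # added background noise
        "quieter": 0.5 * y,                                                 # volume adjustment
    }
```

Each variant can then be fed to the model alongside the original clip.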
Time Series Data Augmentation
Time series data refers to a sequence of data points or observations collected or recorded at specific intervals, and it is particularly valuable when training models for forecasting and predictive analytics applications. Exposure to a wide range of time series data enables models to become more effective at identifying trends, predicting fluctuations that occur at specific intervals (seasonality), interpreting cyclic patterns, and identifying irregularities. Techniques for time series data augmentation include the following (several are sketched in code after the list):
- Time warping. Stretching or compressing segments of an existing time series to simulate changes in speed. For example, a business might experience a slow period, followed by a sharp increase in sales. Time warping can help train an enterprise financial planning and analytics application to recognise these shifts and predict future trends more accurately.
- Window slicing. Creating smaller time segments from an existing time series in order to provide additional simulations of specific snapshots in time.
- Jittering. Adding small amounts of random noise to the existing time series, teaching the model to become more robust when confronted with input variations.
- Seasonality simulation. Using existing time series data as a starting point to create further examples of seasonal patterns, which enables the model to become better at recognising these over time.
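Several of these techniques reduce to a few lines of NumPy. The sketch below assumes a one-dimensional `series` array; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def jitter(series: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Jittering: add small random noise scaled to the series' spread."""
    return series + rng.normal(0.0, sigma * series.std(), size=series.shape)

def window_slices(series: np.ndarray, window: int, step: int) -> list:
    """Window slicing: cut a long series into (possibly overlapping) segments."""
    return [series[i:i + window] for i in range(0, len(series) - window + 1, step)]

def time_warp(series: np.ndarray, factor: float) -> np.ndarray:
    """Time warping: stretch (factor > 1) or compress (factor < 1) by interpolation."""
    new_x = np.linspace(0, len(series) - 1, int(len(series) * factor))
    return np.interp(new_x, np.arange(len(series)), series)
```

Seasonality simulation is usually more model-specific, for example adding a sine component with a chosen period to an existing series.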
The Business Case for Data Augmentation
As a rule, the greater the variety and volume of high-quality data you expose your ML application to, the better it performs.
In reality, however, few organisations can assemble the ‘perfect’ training dataset from existing data alone. The creation of augmented data addresses this. By taking the data you already have and using it to generate a rich seam of additional samples, you can significantly improve model performance and reliability.