Behrad Babaee, Technology Evangelist at Aerospike, examines how organisations can manage the need to hold on to data.
Businesses rely on data more than ever before. Whether it is used to make decisions in the boardroom, provide information to customers and suppliers, or simply keep the wheels of business moving by passing data between systems – data matters. Back in 2021, the amount of data generated worldwide was estimated at 79 zettabytes (one zettabyte is a 1 followed by 21 zeros, in bytes) and was expected to double by 2025, according to Statista.
Whilst the need to analyse and report on data has always been important, artificial intelligence and machine learning are taking this to a new level, enabling businesses to review vast amounts of data, identify relationships between seemingly unrelated data points and drive competitive advantage.
AI data gluttony
In 2017, everyone was talking about Google DeepMind’s AlphaGo and AlphaZero and their abilities at Go and chess. By 2019, DeepMind’s AlphaStar was beating 99.8% of human players at StarCraft II and had achieved Grandmaster status. In the last two years we’ve seen huge advances in the abilities of Large Language Model (LLM) AIs with the arrival of ChatGPT, Google Bard and others, as well as Generative AI tools such as Midjourney and DALL-E 2, producing breathtaking images from user prompts.
A recent study by McKinsey & Co estimates that generative AI could add $2.6 to $4.4 trillion a year to the global economy in productivity gains, based on the 63 use cases it analysed. If generative AI were embedded in software currently used for other functions, McKinsey says that estimate could be roughly doubled.
It’s impossible to say what AI will enable us to do in business over the next decade, but one absolute certainty is that data will be at the heart of making it possible. This is because AI needs not only data to act on, but also a huge data set to be trained on if its recommendations are to be treated with high confidence and trust by users.
The data problem - what to keep and what to discard
With so much data in play, there’s anxiety about what to keep online, what to store offline, and what to just discard – and this creates a big problem for infrastructure managers. If, in five years, an AI tool could give you insights that improve the top or bottom line of your business, can you really risk throwing anything away?
Naturally, the cost of storing data rises with volume, and with the level of access and the services you need. At Aerospike, our experience is that data in a typical organisation doubles every two years – that is five doublings in a decade, so it will be 32 times more voluminous. With approximately 50% of data infrastructure costs spent on storage, that could get very expensive, even if you’re using ‘inexpensive’ local storage like spinning disks or tape.
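To make the doubling arithmetic concrete, here is a minimal Python sketch of the projection. The two-year doubling period comes from the text above; the 100 TB starting volume and the function names are purely illustrative assumptions, not figures from Aerospike.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplier on data volume after `years` if volume doubles
    every `doubling_period_years` (two years, per the text above)."""
    return 2 ** (years / doubling_period_years)


def project_volume(start_tb: float, years: int,
                   doubling_period_years: float = 2.0) -> list[float]:
    """Projected volume in TB at the end of each year, from year 0 to `years`."""
    return [start_tb * growth_factor(y, doubling_period_years)
            for y in range(years + 1)]


if __name__ == "__main__":
    # Hypothetical starting point: 100 TB held today.
    for year, tb in enumerate(project_volume(100.0, 10)):
        print(f"year {year:2d}: {tb:8.1f} TB")
```

Calling `growth_factor(10)` gives 32.0, matching the figure quoted above. Changing the doubling period or the starting volume shows how sensitive a ten-year storage budget is to those two assumptions.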
Things to consider today
There’s no catch-all answer that will apply to every business, but there are things you can start thinking about today that will help you prepare for data growth:
- Location, location, location – One way to address the challenge is to think about where data is stored. Data that is kept on premises is unlikely to be cheaper – or more secure – than if it were stored in the cloud. Some may argue that offline physical storage is an inexpensive alternative, but it can be difficult to access when needed, degrades over time, and is at risk of theft, loss or damage.
- Optimise – Optimise how your applications work with data, and indeed the databases themselves. More data typically means slower applications, but the way those applications and data stores are built can have a significant impact on costs, processing speeds and ultimately the sustainability of data lakes and AI applications. Graph databases, for example, are significantly more efficient in terms of space, speed and energy requirements than traditional relational databases, especially if you work with unstructured data.
- Deduplicate – How much duplication exists in your data stores? This will have a significant impact on the cost of storing the data and the performance of any application using it. According to Statista, only 10% of the data in the global datasphere is unique, leaving 90% as duplicated data. It’s unlikely that any one application has this level of duplication, but where duplication is necessary, efficiencies will need to be found to reduce the size of the duplicated data and store it cost-effectively.
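As a rough illustration of the deduplication point, the sketch below estimates how much duplication exists in a directory tree by grouping files on a SHA-256 hash of their contents. It is a minimal example under stated assumptions, not a description of how any particular product deduplicates data: it works on whole files only, the function names are hypothetical, and real data stores usually deduplicate at block or record level.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each content digest to the files sharing it, keeping only
    digests that occur more than once under `root`."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def reclaimable_bytes(duplicates: dict[str, list[Path]]) -> int:
    """Bytes that would be freed by keeping one copy from each duplicate group."""
    return sum(paths[0].stat().st_size * (len(paths) - 1)
               for paths in duplicates.values())
```

Running `find_duplicates(Path("some/data/dir"))` (the path is a placeholder) and passing the result to `reclaimable_bytes` gives a first, conservative estimate of what duplication is costing in storage before any decision is made about what to keep.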
Store it, don’t hoard it
The data challenge will not go away, but with careful planning, and an honest assessment of data sets and how they will grow, infrastructure managers can make choices that will ensure data is stored in an efficient, cost-effective, secure and sustainable way. The key to success is to accept that “cheap storage” where you can hoard data is not the answer: hoarding only creates a bigger problem for yourself in the future.
Careful preparation today will mean data is cleansed, stored at minimal cost and ready to be used with future technologies in a heartbeat.