A Grand Failure At MGM
David Walker is puzzled by what some cloud vendors claim is acceptable database functionality
When one of the world’s largest retailers decided to expand its offering by opening a new online marketplace to third-party sellers, an already compelling consumer proposition was immediately strengthened.
For a while things went swimmingly. But alarm bells started to ring when it realised its popular (allegedly cloud-native and cloud-perfect) database architecture was misfiring.
Specifically, it was telling customers they’d snaffled the last item in stock when it was already in someone else’s basket. Worse, the system hadn’t spotted the duplication in time.
The organisation was understandably frustrated after making an expensive back end database change, having been assured that this new form of ‘business’ database was going to be completely fine for its global ecommerce requirements.
What irritated the firm even more was that it had written a whole set of apps to support the newly expanded e-marketplace, including ways to list, search for, and display products.
So, a lot of work had gone into making its new virtual market as attractive and usable as possible. But the database failed to deliver the level of bullet-proof reliability required in the core plank of the system, the product catalogue.
To be super-techie for just a second, the company needed the application itself to do all the rollbacks necessary to maintain consistency, and also to handle multi-row inserts so that any item was always where it should be.
A lot of jiggery-pokery (or in developer terms, “complex compensatory transaction processing”) therefore had to be added to keep on top of any inconsistencies. That was because the database didn’t really do this, even though older generations of business (relational) databases could. Even worse, if you promise a customer the last item in stock and you don’t have it, you have to buy it from someone else, replace it, and hope the buyer doesn’t notice. All of which costs more money.
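For readers who want to see what the database is supposed to do on the application's behalf, here is a minimal sketch of a multi-row change with automatic rollback. It uses Python's built-in sqlite3 module on a single machine purely as a stand-in for a transactional SQL database; the table names, column names, and the reserve_last_item function are illustrative assumptions, not the retailer's actual schema or code.

```python
import sqlite3


def reserve_last_item(conn, sku, buyer):
    """Write the order row and decrement stock as one unit (hypothetical helper)."""
    try:
        with conn:  # commits if the block succeeds, rolls back on any exception
            # Row 1: record the order first.
            conn.execute(
                "INSERT INTO orders (sku, buyer) VALUES (?, ?)", (sku, buyer)
            )
            # Row 2: take the unit out of stock, but only if one is left.
            cur = conn.execute(
                "UPDATE stock SET qty = qty - 1 WHERE sku = ? AND qty > 0", (sku,)
            )
            if cur.rowcount == 0:
                # Nothing left: raising here makes the database undo the order row.
                raise LookupError("out of stock")
        return True
    except LookupError:
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL)")
    conn.execute("CREATE TABLE orders (sku TEXT NOT NULL, buyer TEXT NOT NULL)")
    conn.execute("INSERT INTO stock VALUES ('laptop', 1)")
    conn.commit()

    first = reserve_last_item(conn, "laptop", "angela")
    second = reserve_last_item(conn, "laptop", "aashvi")

    qty = conn.execute("SELECT qty FROM stock WHERE sku = 'laptop'").fetchone()[0]
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    return first, second, qty, orders


if __name__ == "__main__":
    print(demo())
```

The second buyer's order row is written and then rolled back by the database, so the application needs no compensating logic of its own. A distributed SQL database aims to give the same guarantee across many servers, which this single-node sketch does not attempt to show.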
Clearly, this was a very serious cloud database issue. What to do?
Eventual consistency: not worth the wait?
I’m not sharing this story because I want to single out the database as inadequate—though you’ve probably guessed that I have something to do with the open source distributed SQL database that finally solved the consistency issues of this huge brand.
No: I’m sharing this because some strange ideas are being circulated in the enterprise cloud world right now regarding what key database terms like ‘reliability’ and ‘consistency’ really mean in today’s world.
Many users require these properties, but some vendors seem keen to underplay their importance. A key example is the weaker guarantee the NoSQL people call ‘eventual consistency.’ It was not sufficient for this use case, nor is it for many others.
Why? A lot of the time we’re told that consistency only really matters for banks. That the database feature ‘really’ only needs to be factored in if you’re moving money from Angela’s branch in Boston, to Aashvi’s mobile banking app in Uttar Pradesh, for example.
But consistency is important, and for a lot more than just current accounts. Be it money in an account, a box of cornflakes, or an Apple laptop in your warehouse, it can only ever be in stock or in somebody’s order. It can’t be in both, and it can’t be in neither. If you have a flaky database, that’s what it’s going to try and persuade you is the case.
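To make the "in both places at once" problem concrete, here is a deliberately simplified toy model, not a description of any real NoSQL product. It assumes two hypothetical replicas that each accept writes against their own local copy of the stock level and only exchange their order lists afterwards; the Replica class and simulate_oversell function are invented for illustration.

```python
class Replica:
    """Hypothetical replica that accepts sales locally and syncs later."""

    def __init__(self, stock):
        self.stock = stock   # this replica's local view of units available
        self.orders = []     # orders accepted by this replica

    def sell(self, buyer):
        # The check only sees the local copy, not what other replicas have sold.
        if self.stock > 0:
            self.stock -= 1
            self.orders.append(buyer)
            return True
        return False


def simulate_oversell():
    total_units = 1                     # one laptop really exists
    replica_a = Replica(total_units)    # both replicas start from the same state
    replica_b = Replica(total_units)

    sold_a = replica_a.sell("angela")   # accepted: replica A sees 1 in stock
    sold_b = replica_b.sell("aashvi")   # also accepted: replica B has not synced

    # The replicas "eventually" converge by merging their order lists.
    merged_orders = replica_a.orders + replica_b.orders
    return sold_a, sold_b, len(merged_orders), total_units


if __name__ == "__main__":
    print(simulate_oversell())
```

Both sales succeed, so after the merge there are two orders for one physical item. Someone now has to run compensating logic after the fact, which is exactly the extra work a transactionally consistent database is meant to spare the application.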
So, you think you only need distributed SQL if you have a widely distributed customer base—i.e., customers in multiple countries? Hmm. Well, if you have a global customer base, then distributed SQL should 100% be on your radar… but there’s a misunderstanding (deliberate or not) about what ‘distributed’ means.
It does NOT just mean physically distributed, i.e., geographically dispersed or in different regions of the world! In modern cloud database terms, ‘distributed’ just means shared between multiple servers, so there isn’t a single point of failure that can knock you offline.
By the same token, database resilience means you don't want your service to go down if your cloud provider has a brownout, or if your data centre is hit by a meteorite. If you have three different servers sitting in the same room and the power goes out in that room, you’re out of action and are not resilient.
Seen from this perspective, database resilience is absolutely not just for banks. A telco or a technology company active in multiple markets wants its services to be available globally and around the clock. So does a sparky start-up offering an amazing fast food app, a videogame company, or a streaming music service.
To be fair, NoSQL databases will eventually catch up. And for many use cases, they’re fine. But more and more enterprises are finding, like the global retailer I am talking about, that they want more than a so-called ‘eventual consistency’ offering.
NoSQL was a great first stab at all this—but time’s moved on
The reality is that eventual consistency isn’t sufficient in a lot of cases. Distributed SQL is good because it gives you resilience together with the consistency that my (genuine) big marketplace retailer couldn’t get with their other database. And this is a feature any serious online business will need.
Enterprises must be resilient: they cannot afford the cost (and reputational damage) of their services going down. However, part of the cloud database world has, for 10 years or so, been pretending that when you carve up a unified database and spread it about so that no single server has the only copy of any part of that data, it can still offer you transactional consistency.
The other truth we must face is that although NoSQL databases are a great first-generation attempt to put databases into the cloud, the tricks and hacks they require to make it (more or less) work mean they threw away the transactional baby with the monolithic bathwater.
My takeaway for the CIO is this: you adopted NoSQL because it works well in that cloud environment, and the idea of many little machines and no single point of failure does deliver you a kind of resilience.
However, there are lots of situations, workloads, and use cases which do require true transactional consistency. If true transactional consistency is what your business needs, distributed SQL is likely to be the only way to achieve it.