The COVID-19 pandemic has caused significant disruption to businesses. As of today, those operating in the UK are encouraged to go back to working from home, just weeks after the government urged people to go back to the office. With rules changing at a very quick pace – not just in the UK but all over the world – life is feeling pretty chaotic.
However, in technology, "chaos" needn't always be a negative word. The COVID-19 pandemic makes quite a compelling case for chaos engineering, particularly as people become more dependent on the online world.
What is chaos engineering?
Chaos engineering is a concept that businesses can apply to ensure resiliency. One of our favourite definitions is that by Gremlin, a Chaos Engineering as a Service company, which describes it as "a disciplined approach of identifying potential failures before they become outages." In practice, teams purposefully try to break their own systems to see how those systems fare under turbulent conditions in production.
These practice runs allow businesses to make the necessary tweaks to keep outages, failures, and downtime at bay. As a result, companies can be more confident in their complex systems when chaotic and unexpected conditions do strike.
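To make the idea of a practice run a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `FlakyService` stands in for a real dependency, `fetch_with_fallback` is an assumed resilience mechanism (retry, then fall back to cached data), and `run_experiment` injects failures at a chosen rate and counts how requests were served. It is not taken from Gremlin or any other tool.

```python
import random


class FlakyService:
    """Hypothetical stand-in for a dependency that fails at a configurable rate."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def fetch(self):
        # Inject a failure with probability `failure_rate`.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "fresh data"


def fetch_with_fallback(service, retries=2, fallback="cached data"):
    """Assumed resilience mechanism: retry a few times, then serve a fallback."""
    for _ in range(retries + 1):
        try:
            return service.fetch()
        except ConnectionError:
            continue
    return fallback


def run_experiment(failure_rate, requests=1000, seed=0):
    """Send `requests` calls through the injected failures and tally the outcomes."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    service = FlakyService(failure_rate, rng)
    results = [fetch_with_fallback(service) for _ in range(requests)]
    return {
        "fresh": results.count("fresh data"),
        "fallback": results.count("cached data"),
    }


if __name__ == "__main__":
    for rate in (0.0, 0.3, 1.0):
        print(rate, run_experiment(rate))
```

The hypothesis being tested is that every request still gets an answer, fresh or cached, however often the dependency fails. If a run showed unanswered requests, that would be one of the tweaks to make before real turbulence arrives.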
The concept of chaos engineering originates from the teams over at Netflix. In 2010/11, Greg Orzell, a cloud architect and engineer at the company, was eager to bolster their resilience testing. In turn, Greg and his peers created Chaos Monkey, "a tool that randomly disables [their] production instances to make sure [they] can survive this common type of failure without any customer impact." The tool works by looking at a company's servers and randomly terminating one every day. A bit destructive, yes, but it's exactly the kind of assessment you need to carry out to see where your business is fragile.
In the Netflix Tech Blog, they write that "[b]y running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice."
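The behaviour described above can be sketched in a few lines of Python. To be clear, this is not Netflix's implementation: the fleet is a plain in-memory dictionary rather than real cloud instances, and the names `in_business_hours`, `pick_victim`, and `unleash_monkey`, as well as the 9-to-5 weekday window, are assumptions made for illustration.

```python
import random
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour and end_hour (assumed window)."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances, rng):
    """Choose one running instance at random, or None if nothing is running."""
    running = [name for name, state in instances.items() if state == "running"]
    return rng.choice(running) if running else None


def unleash_monkey(instances, now, rng):
    """Terminate one random running instance, but only while engineers are around."""
    if not in_business_hours(now):
        return None
    victim = pick_victim(instances, rng)
    if victim is not None:
        instances[victim] = "terminated"
    return victim


if __name__ == "__main__":
    # Simulated fleet; a real tool would query a cloud provider instead.
    fleet = {"web-1": "running", "web-2": "running", "web-3": "running"}
    tuesday_afternoon = datetime(2020, 9, 22, 14, 0)
    print("terminated:", unleash_monkey(fleet, tuesday_afternoon, random.Random()))
    print(fleet)
```

The business-hours check is the part that reflects the quote: failures are triggered when people are standing by to learn from them, not at 3 am on a Sunday.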
Chaos engineering has since enjoyed more widespread adoption and is slowly becoming mainstream, to the extent that, as with Gremlin above, it can now be purchased as a service.
The approach works well as a preventative and proactive measure. Yes, you might already have all the right tools in place to remedy an outage, but why repair when you can mitigate?
Chaos engineering during the pandemic
As we know, the world has become more digitally dependent than ever. Businesses have upped their online presence and bolstered their online services and solutions, but many have done so without upping their online resilience (or at least, their certainty of it).
Many online platforms and services experienced a sudden surge in usage during the pandemic, especially back when well over 100 countries were enforcing lockdowns at once. As a result, many online services and internet networks experienced outages (surprisingly, Netflix was also one of them, though they did not comment as to whether it was caused by a surge in usage). Since being online is the lifeblood of businesses right now, it's more important than ever for companies to identify their weaknesses and address them.
Every second of downtime is costly to a business, not only in revenue, but also in reputation. That's the case whether you're a giant tech conglomerate or a small local business. Therefore, the case for chaos engineering is ever more compelling as businesses do their best to stay afloat in these – yes, I'm going to say it – unprecedented times.