Modern businesses operate in an always-on, always-available digital environment. That's possible due to an ever-growing stack of cloud-hosted tools and platforms. Together, they enable the delivery of on-demand customer services and product types that were all but impossible in the pre-digital age. However, reliance on cloud-based tools also creates business vulnerabilities. In some cases, all it takes is a brief outage of a small third-party component of a tech stack to bring a company to its knees. That reality demands a whole new approach to incident response that extends well beyond IT staff and incorporates dedicated business continuity processes. Here's how to rethink incident response to include detection, diagnosis, and response to third-party cloud outages.

The Transition to Third-Party Dependence

As recently as a decade ago, most enterprises depended predominantly on a self-hosted, often on-site tech stack. Today, few businesses invest in such a setup. In its place, they've come to rely on sprawling, interconnected multi-cloud service deployments. Consequently, countless mission-critical functions, ranging from payment systems and identity management to collaboration tools and file storage, depend on third-party, geographically dispersed data centers.

Fortunately, most businesses design their stacks to take advantage of their multi-cloud nature to provide fail-safes through redundancy. However, that only goes so far. Only the most essential services typically enjoy full redundancy; otherwise, costs would skyrocket, negating the advantages offered by cloud services. As a result, most companies remain vulnerable to business disruptions stemming from individual service-provider outages. Worse still, maintaining visibility into the consequences of such outages remains challenging.

The Problem With Traditional Incident Response Models

The classic incident response models that businesses have depended on for decades follow a predictable pattern. They begin with problem detection, move to isolation of affected systems, and proceed to fixes that restore normal operation. Unfortunately, reliance on third-party cloud providers often means affected businesses no longer control anything beyond the first step. That means in-house IT teams have no power to influence the recovery process. Once they detect trouble, all they can do is try to engage redundancies or monitor the progress of the third-party provider's resolution attempts. This underscores the need for targeted updates to the traditional incident response model.

The Requirements of a Modern Incident Response Plan

The bedrock of any modern, cloud-aware incident response plan is a multi-pronged monitoring effort. It's not enough to merely watch the status pages of the third-party providers your business depends on. Often, those don't show signs of trouble until long after an incident begins. Instead, it's necessary to depend on a blend of signals, including synthetic performance checks, distributed telemetry, and user behavior. That blend can provide critical early warnings that enable prompt activation of contingency plans.
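To make the idea of blending signals concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the ProviderSignals fields, the thresholds, and the "two signals must agree" rule are hypothetical and would need tuning against your own baselines.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median


@dataclass
class ProviderSignals:
    """Signals collected for one third-party provider over a short window."""
    status_page_ok: bool              # what the provider's status page claims
    probe_latencies_ms: list[float]   # successful synthetic probes from several regions
    probe_failures: int               # synthetic probes that timed out or errored
    user_reports: int                 # user complaints seen in the same window


def early_warning(signals: ProviderSignals,
                  baseline_ms: float = 200.0,
                  latency_factor: float = 3.0,
                  failure_ratio: float = 0.3,
                  report_threshold: int = 5) -> bool:
    """Return True when the blended signals suggest provider trouble.

    All thresholds are illustrative assumptions, not recommended values.
    """
    total_probes = len(signals.probe_latencies_ms) + signals.probe_failures
    slow = (bool(signals.probe_latencies_ms)
            and median(signals.probe_latencies_ms) > baseline_ms * latency_factor)
    failing = total_probes > 0 and signals.probe_failures / total_probes >= failure_ratio
    noisy_users = signals.user_reports >= report_threshold

    # Warn if the provider admits a problem, or if at least two
    # independent signals agree that something is wrong.
    votes = sum([slow, failing, noisy_users])
    return (not signals.status_page_ok) or votes >= 2


window = ProviderSignals(status_page_ok=True,
                         probe_latencies_ms=[900.0, 1000.0, 1100.0],
                         probe_failures=0,
                         user_reports=6)
print(early_warning(window))  # True: slow probes and user reports agree
```

Requiring agreement between two signals is one possible way to cut false alarms while still firing before the status page changes. A single noisy signal, such as one region's probe failing, would not raise a warning on its own in this sketch.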

The next critical component of a modern incident response plan is accurate dependency mapping. This allows staff to know at a glance which services any given third-party failure will affect. It helps determine incident severity and enables quicker, more accurate notifications to affected business groups.
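A dependency map can be as simple as a lookup table that is walked when a provider fails. The sketch below assumes a hypothetical DEPENDS_ON table with invented service names; a real map would come from your own service inventory.

```python
from __future__ import annotations

from collections import deque

# Hypothetical map: each key depends directly on the listed items.
# Third-party providers appear only on the right-hand side.
DEPENDS_ON = {
    "checkout": ["payments-api", "identity-provider"],
    "customer-portal": ["identity-provider", "file-storage"],
    "invoicing": ["checkout"],
    "team-chat": ["collab-suite"],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(affected_by("identity-provider", DEPENDS_ON)))
# ['checkout', 'customer-portal', 'invoicing']
```

Note that invoicing shows up even though it never calls the identity provider directly. Following indirect dependencies like this is what lets staff judge severity and notify every affected business group, not only the obvious ones.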

The third pillar of a modern incident response plan is detailed contingency workflows. These should include offline alternatives for critical functions, backup communication methods, and pre-vetted tool alternatives for temporary use. This can help your business minimize the effect of an outage by keeping important workflows moving forward.

Using Human Behavior as an Outage Signal

Surprisingly, human behavior can yield a more sensitive outage signal than system logging and direct monitoring. Even a small hiccup in processing can cause an end user to experience difficulties. That means users may report dozens of problems before logs accumulate enough errors to trigger an alert. For example, users may notice a slowdown in a collaboration tool or a loss of cloud syncing, prompting them to check with coworkers to corroborate their experience. Monitoring that type of communication can be valuable and may even yield critical context that aids in problem diagnosis. For instance, if collaboration-platform chat logs show an uptick in users asking "is Google down?", it may point to trouble within Google Workspace or even a DNS issue. That can give response teams a clue about where to look to flag a problem and begin an incident response.
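As a rough illustration of mining chat for this kind of signal, the Python sketch below counts messages shaped like "is X down" and flags any name that crosses a threshold. The pattern, the function names, and the threshold of three mentions are all assumptions; a simple pattern like this will also produce false positives on ordinary sentences, so treat any hit as a prompt to investigate, not as proof of an outage.

```python
import re
from collections import Counter

# Assumed pattern: "is <name> down/broken/slow", case-insensitive.
DOWN_PATTERN = re.compile(
    r"\bis\s+([\w .-]+?)\s+(?:down|broken|slow)\b", re.IGNORECASE
)


def outage_mentions(messages):
    """Count how often each service name appears in an 'is X down' question."""
    counts = Counter()
    for text in messages:
        match = DOWN_PATTERN.search(text)
        if match:
            counts[match.group(1).strip().lower()] += 1
    return counts


def suspected_services(messages, min_mentions=3):
    """Return names mentioned at least `min_mentions` times, most frequent first."""
    return [name for name, count in outage_mentions(messages).most_common()
            if count >= min_mentions]


recent_chat = [
    "is Google down?",
    "Is google down for anyone else",
    "hey, is Google down or is it just me",
    "lunch at noon?",
    "is Slack slow today",
]
print(suspected_services(recent_chat))  # ['google']
```

In practice, the messages would come from a recent time window, and the threshold would scale with normal chat volume. Any monitoring of employee communication should also follow your organization's privacy policies.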

Differentiating Between Outages and Attacks

It's worth noting that proper incident responses diverge sharply depending on the nature of the problem at hand. A cloud provider showing increased latency could stem from little more than a routing issue. However, it could also be the first sign of an unfolding cyberattack. The former calls for engaging continuity contingency plans; the latter calls for an immediate defensive response and a potential lockdown of critical systems. This makes it vital to establish clear criteria to help IT staff differentiate between the two incident classes. Doing so ensures they trigger the proper response pathway and prevents unnecessary or disproportionate action.
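One way to write such criteria down is as an explicit decision rule that staff can read and audit. The Python sketch below is only an example of the shape this might take: the four indicators and the rule that any attack sign wins over outage signs are hypothetical choices, not vetted security guidance.

```python
from dataclasses import dataclass
from enum import Enum


class Pathway(Enum):
    CONTINUITY = "engage continuity contingency plans"
    DEFENSIVE = "start defensive response"
    INVESTIGATE = "gather more evidence"


@dataclass(frozen=True)
class Indicators:
    """Hypothetical yes/no observations collected during triage."""
    provider_confirms_incident: bool   # provider has acknowledged a problem
    other_customers_affected: bool     # reports from outside your organization
    failed_login_spike: bool           # unusual volume of failed sign-ins
    unexpected_config_changes: bool    # changes nobody on staff can account for


def choose_pathway(ind: Indicators) -> Pathway:
    """Pick a response pathway from the observed indicators."""
    attack_signs = ind.failed_login_spike or ind.unexpected_config_changes
    outage_signs = ind.provider_confirms_incident or ind.other_customers_affected

    # Assumed policy: any attack sign triggers the defensive pathway,
    # even when outage signs are also present.
    if attack_signs:
        return Pathway.DEFENSIVE
    if outage_signs:
        return Pathway.CONTINUITY
    return Pathway.INVESTIGATE


observed = Indicators(provider_confirms_incident=True,
                      other_customers_affected=True,
                      failed_login_spike=False,
                      unexpected_config_changes=False)
print(choose_pathway(observed).value)  # engage continuity contingency plans
```

Keeping a third "investigate" outcome is a deliberate choice in this sketch: when neither set of signs is present, forcing a decision would risk exactly the unnecessary or disproportionate action the criteria are meant to prevent.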

Making Digital Resilience a Business Priority

As digital dependency grows, planners need to recognize that even thoughtfully crafted incident response plans aren't enough. They're only a single component of what must be a much larger digital resilience effort. Resilience initiatives must include updated business continuity plans, tabletop outage simulations, and communication dry runs. These help make digital resilience a core competency and ensure long-term business stability, even as digital dependency continues to escalate.

The Takeaway

The bottom line is that the myriad advantages created by multi-cloud deployments make them a business necessity in the current environment. However, planning for the risks they come with is essential. Beginning with sensible updates to incident response plans makes for an excellent first step. With sustained effort, they can form the basis of a functional digital-resilience capability that every modern firm needs to survive and thrive.