By Terry Bernstein, Director of Product Management, NS1
When outages occur that take out the online services we rely on so heavily, they typically dominate the news headlines. While there are many possible causes of outages, it's not uncommon for the DNS (domain name system) to be the root cause of the incident. That’s because DNS is a building block for internet connectivity. Every device connected to the internet relies on it.
It is not just people at home streaming a movie on Netflix or joining a Teams or Zoom call who depend on DNS, but entire banking systems, government networks, critical national infrastructure and enterprise IT environments across the world. Because so much depends on it, when something goes wrong with the DNS, layer upon layer of systems start to break, leading to a cascading chain of failures.
DNS risk factors
There are multiple risk factors, but two rise to the top: human error, usually in the form of misconfiguration, and malicious activity.
Misconfigurations are surprisingly common and are often responsible for taking DNS servers offline. The issue may be in the DNS system itself, such as when someone “fat-fingers” an incorrect IP address, or the problem may occur in the network infrastructure. For example, a network engineer may accidentally withdraw the BGP announcements for the organisation’s self-hosted DNS servers.
Human error may also be one of the root causes that enables malicious activity. For example, a DNS administrator may be socially engineered into giving up their access credentials, or they may accidentally misconfigure access permissions and give an unauthorised employee greater access. Whatever the root cause, once malicious actors are in the DNS system, they can wreak havoc: they have the power to direct the users (employees, contractors, customers, etc.) of every application to wherever they want. And if they are skilled, they can do this in a way that is tough to identify.
When it comes to DDoS attacks, malicious actors have increased their focus on targeting DNS, recognising the impact it has. For the perpetrators, the success of a DDoS attack, in which DNS servers are flooded with queries until they become unavailable to users, is measured not just in how damaging it is for the victim, but in the notoriety it achieves. A successful DDoS attack on DNS can bring down all of an organisation’s internet-facing websites and applications.
Unfortunately, the complex and distributed nature of modern infrastructure makes the DNS more vulnerable. As cyber-attackers seize every opportunity to exploit vulnerabilities, attacks against the DNS control plane and the DNS caching hierarchy are likely to grow, rather than diminish, until organisations implement appropriate measures to protect their domains.
What we hear about all too often in the news are the external-facing components of DNS – those that enable publicly accessible systems to run smoothly. But these are only part of the story. Increasingly, huge enterprise environments support even higher levels of internet traffic to enable microservices, access to APIs and internal applications. Given this growing appetite for availability, and the damage and disruption caused by outages, it is clear that far from being redundant technology, the DNS should be prioritised at all costs.
Protecting DNS to prevent outages
Protecting DNS to ensure that it is always available requires a whole-system approach. There is no “magic bullet”; rather, teams need to consider a combination of approaches that encompasses automation, redundancy, security, and the creation of a strong incident response team.
Let's take these one by one. First, automation.
One of the best ways to avoid human error is to automate common tasks and to build verification steps into the process. Automation will also enable fast recovery to a known good state should something untoward occur. Using tooling like HashiCorp Terraform, DNS administrators can treat DNS updates in the same way that development engineers treat code. That is, they can create the proposed change, have it reviewed and approved by another engineer (as with a pull request), and then execute a Terraform update to push the change into production. If something goes awry, the previous state, as captured in a Terraform configuration file, can be restored.
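As an illustration of the kind of verification step described above, here is a minimal Python sketch of a check that could run before a proposed change is approved. The record layout (a dictionary with "type", "value" and "ttl" keys), the function name validate_record and the TTL limits are assumptions made for this example; they do not come from Terraform or any particular DNS provider.

```python
import ipaddress

# Hypothetical TTL bounds for this sketch; choose limits that suit your own policy.
MIN_TTL = 30
MAX_TTL = 86400


def validate_record(record):
    """Return a list of problems found in one proposed DNS record."""
    problems = []
    rtype = record.get("type")
    value = record.get("value", "")
    ttl = record.get("ttl")

    if rtype in ("A", "AAAA"):
        try:
            addr = ipaddress.ip_address(value)
        except ValueError:
            # Catches mistyped addresses such as 192.0.2.300
            problems.append(f"{value!r} is not a valid IP address")
        else:
            expected = 4 if rtype == "A" else 6
            if addr.version != expected:
                problems.append(f"{rtype} record needs an IPv{expected} address")
            if addr.is_loopback or addr.is_unspecified:
                problems.append(f"{value} is not a usable public address")

    if not isinstance(ttl, int) or not MIN_TTL <= ttl <= MAX_TTL:
        problems.append(f"TTL {ttl!r} is outside {MIN_TTL}-{MAX_TTL}")

    return problems
```

A check like this would typically run automatically on every proposed change, so that a mistyped address is rejected before a human reviewer even sees it.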
Redundancy of the DNS resolution network is a best practice for improving DNS availability. This means that a secondary DNS network, or multiple DNS systems, are in place with separate infrastructures. These systems should all use Anycast routing so that queries can be routed to any server on any of the redundant networks, and every one of them should be able to answer every DNS query. Organisations should assume there will be failures.
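One simple way to keep this requirement honest is to check, as part of routine automation, that a domain's nameservers really do span more than one independent provider. The Python sketch below assumes you maintain your own mapping of nameserver hostnames to provider labels; the function names and the example hostnames are hypothetical.

```python
from collections import defaultdict


def group_by_provider(nameservers):
    """Group nameserver hostnames by the provider that operates them.

    `nameservers` maps hostname -> provider label (a mapping you maintain).
    """
    groups = defaultdict(list)
    for host, provider in nameservers.items():
        groups[provider].append(host)
    return dict(groups)


def has_provider_redundancy(nameservers, min_providers=2):
    """True if the nameservers are spread across enough separate providers."""
    return len(group_by_provider(nameservers)) >= min_providers


# Example with made-up hostnames: two providers, so the check passes.
example = {
    "ns1.provider-a.example": "provider-a",
    "ns2.provider-a.example": "provider-a",
    "ns1.provider-b.example": "provider-b",
}
redundant = has_provider_redundancy(example)
```

This only confirms that the configuration names more than one provider. It does not prove the providers share no underlying infrastructure, which still needs to be established separately.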
Security should be another key element of a DNS availability strategy. You should start by ensuring that all of the basic controls are in place, such as two-factor authentication, restricting access to those who need it, and logging everything. You should also regularly review those logs to detect any anomalies that might indicate a misconfiguration caused by human error, or malicious activity. DNSSEC can be used to help prevent external attacks that try to change DNS answers once they’ve left your servers, but before they get to your users. With DNSSEC, the integrity of DNS records is protected through digital signing, with verification carried out by recursive resolvers. It allows DNS clients to verify that they are receiving accurate DNS information rather than fake data from cyber-attackers.
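Regular log review can be partly automated. The Python sketch below flags two simple anomalies in a DNS change log: changes made by users who are not on an approved list, and an unusually high number of changes by a single user. The log entry format (dictionaries with "user", "action" and "record" keys), the function name and the threshold are assumptions for illustration, not the format of any real product.

```python
from collections import Counter


def flag_anomalies(entries, approved_users, max_changes_per_user=20):
    """Return human-readable findings for a list of DNS change-log entries."""
    findings = []
    changes = Counter()

    for entry in entries:
        user = entry["user"]
        changes[user] += 1
        if user not in approved_users:
            findings.append(
                f"change by unapproved user {user}: "
                f"{entry['action']} {entry['record']}"
            )

    # A burst of changes from one account may indicate a runaway script
    # or compromised credentials.
    for user, count in changes.items():
        if count > max_changes_per_user:
            findings.append(
                f"{user} made {count} changes (limit {max_changes_per_user})"
            )

    return findings
```

Findings from a check like this are a prompt for a human to investigate, not proof of an attack.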
Centralise incident response
Even in the best-prepared scenarios the worst can happen. The final piece in the jigsaw of preparedness is to establish a strong incident response team. By creating a ‘virtual war room’ that can centralise operations, organisations can consolidate everything on a single channel with one person pre-appointed to act as the coordinator.
Perhaps the most important message for security teams is not to delay. DNS is not the layer in the tech stack that gets all the plaudits or generates all the excitement, but ignoring it has the potential to put any organisation in peril.