“Set it and forget it” is the approach that most network teams follow with their authoritative Domain Name System (DNS). If DNS is working and if end users are finding network connections to revenue-generating applications, services, and content, then administrators will generally say that you shouldn’t mess with success.
It’s unfortunate that the reliability of DNS often causes us to take it for granted. It’s easy to write DNS off as a background service precisely because it performs so well. Yet this very “set it and forget it” strategy often creates blind spots for network teams by leaving performance and reliability issues undiagnosed. When those undiagnosed issues pile up or go unaddressed for a while, they can easily metastasize into a more significant network performance problem.
The reality is that, like any machine or system, DNS requires the occasional tune-up. Even when it seems to be working well, specific DNS errors require attention so minor issues don’t flare up into something more consequential.
We came up with a few pointers for network teams on what to look for when they’re troubleshooting DNS issues. All of these are based on data you can see in DNS Insights, a new feature of NS1 Managed DNS which helps identify misconfigurations, monitor traffic patterns, and pinpoint security challenges before they impact your network.
Set baseline DNS metrics
No two networks are configured alike. No two networks have the same performance profile. Every network has quirks and peculiarities that make it unique. That’s why it’s important to know what’s “normal” for your network before trying to diagnose any issues.
DNS data can give you a sense of average query volume over time. For most businesses, this is going to be a relatively stable number. There will probably be seasonal variations (especially in industries like retail), but these are usually predictable. Most businesses see gradual increases in query volume as their customer base or service volume grows, but this also generally follows a set pattern.
It’s also important to look at the mix of query volume. Is most of your DNS traffic to a certain domain? How steady (or volatile) is the mix of DNS queries among various back-end resources? The answers to these questions are going to be different for every enterprise, and may change based on network team decisions on issues like load balancing, product resourcing, and even delivery costs.
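As a rough illustration of what a baseline can look like in practice, here is a minimal Python sketch. It assumes you can export daily query totals from your DNS provider. The sample numbers, the function names, and the three-standard-deviation threshold are all illustrative assumptions, not recommended values.

```python
from statistics import mean, stdev

def baseline(daily_counts):
    """Return (mean, standard deviation) for a list of daily query counts."""
    return mean(daily_counts), stdev(daily_counts)

def is_anomalous(count, avg, sd, threshold=3.0):
    """Flag a count more than `threshold` standard deviations from the mean."""
    if sd == 0:
        return count != avg
    return abs(count - avg) / sd > threshold

# Hypothetical daily query totals for one zone (illustrative numbers only).
history = [10_200, 9_800, 10_500, 10_100, 9_900, 10_300, 10_000]
avg, sd = baseline(history)

print(is_anomalous(10_400, avg, sd))  # a busy but ordinary day
print(is_anomalous(25_000, avg, sd))  # far outside the usual range
```

A single mean and standard deviation ignores the seasonal patterns mentioned above, so a real baseline would probably compare like with like (the same weekday, or the same season in the previous year).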
Monitor NXDOMAIN responses
A high volume of NXDOMAIN responses is a clear indication that something’s wrong. That said, it’s normal to return at least some NXDOMAINs for “fat finger” queries, standard redirect errors, and user-side issues that are likely outside of a network team’s control.
NS1’s recent Global DNS data report shows that NXDOMAINs usually account for between 3% and 6% of queries, for one reason or another. Anything at or near that range is probably to be expected in a “normal” network set-up.
When NXDOMAINs climb into double digits, something bigger is probably happening. The nature of the pattern matters, though. A slow but steady increase in NXDOMAIN responses probably points to a long-standing misconfiguration whose errors grow along with overall traffic volume. A sudden spike in NXDOMAINs could be either a localized (but highly impactful) misconfiguration or a DDoS attack.
The key is to keep a steady eye on NXDOMAIN responses as a percentage of overall query volume. Deviation from the norm is usually a clear sign that something is not right - then it becomes a question of why it’s not right and how to fix it. In most cases, a deeper dive into the timing and characteristics of the abnormal uptick will provide clues about why it’s happening.
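To make the percentage concrete, here is a small Python sketch that computes the NXDOMAIN share from a list of response codes and sorts it into three bands. The 6% and 10% cut-offs simply mirror the figures quoted above; the function names and sample data are assumptions for illustration, and your own baseline should replace the defaults.

```python
from collections import Counter

def nxdomain_rate(rcodes):
    """Return the fraction of responses whose RCODE is NXDOMAIN."""
    if not rcodes:
        return 0.0
    return Counter(rcodes)["NXDOMAIN"] / len(rcodes)

def classify(rate, expected_max=0.06, alert_at=0.10):
    """Sort an NXDOMAIN rate into a rough band (thresholds are assumptions)."""
    if rate <= expected_max:
        return "normal"
    if rate < alert_at:
        return "watch"
    return "investigate"

# Hypothetical sample of 100 responses.
sample = ["NOERROR"] * 90 + ["NXDOMAIN"] * 8 + ["SERVFAIL"] * 2

rate = nxdomain_rate(sample)
print(f"{rate:.1%} -> {classify(rate)}")
```

Running the same calculation per hour or per day, rather than once, is what shows whether you’re looking at a gradual climb or a sudden spike.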
NXDOMAIN responses aren’t always a bad thing. In fact, they could represent a potential business opportunity. If someone’s trying to query a domain or subdomain of yours and coming up empty, that could be an indication that it’s a domain you should buy or start using.
Watch out for exposure of internal DNS data
One particularly concerning type of NXDOMAIN response is caused by misconfigurations which expose internal DNS zone and record data to the internet. Not only does this kind of misconfiguration weigh on performance by creating unnecessary query volume, it’s also a significant security issue.
Stale URL redirects are often the cause of exposed internal records. In the upheaval of a merger or acquisition, systems sometimes get pointed at properties which fade away or are repurposed for other uses. The systems are still publicly looking for the right connection, but not finding the expected answer. The smaller the workload, the more likely it is to go unnoticed.
Pay attention to geography
If you set a standard baseline for where your traffic is coming from, it’s easier to discover anomalous DDoS attacks, misconfigurations, and even broader changes in usage patterns as they emerge. A sudden uptick in traffic to a specific regional server is a different kind of issue than a broader increase in overall query volume. Tracking your DNS data by geography helps identify which kind of issue you’re facing, and ultimately provides clues on how to deal with it.
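One way to tell a regional spike from a broad increase is to compare each region’s growth against the typical growth across all regions. The Python sketch below does that with per-region query counts. The region labels, the sample counts, and the `factor` of 2 are hypothetical choices made for the example.

```python
from statistics import median

def growth_by_region(baseline, current):
    """Return current/baseline query ratio for each region in the baseline."""
    return {
        region: current.get(region, 0) / count
        for region, count in baseline.items()
        if count > 0
    }

def regional_outliers(baseline, current, factor=2.0):
    """List regions growing more than `factor` times the median growth."""
    growth = growth_by_region(baseline, current)
    if not growth:
        return []
    typical = median(growth.values())
    return sorted(r for r, g in growth.items() if g > factor * typical)

# Hypothetical query counts per region (illustrative numbers only).
usual = {"NA": 5000, "EU": 3000, "APAC": 2000}
regional_spike = {"NA": 5100, "EU": 3000, "APAC": 6900}
broad_increase = {"NA": 10000, "EU": 6000, "APAC": 4100}

print(regional_outliers(usual, regional_spike))  # one region stands out
print(regional_outliers(usual, broad_increase))  # everything grew together
```

Using the median as the reference point means a uniform doubling of traffic flags nothing, while a surge confined to one region does.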
Check SERVFAILs for misconfigured alias records
Alias records are a frequent source of misconfigurations, and deserve regular audits in their own right. We’ve found that an increase in SERVFAIL responses - whether a sudden spike or a gradual increase - can often be traced back to problems with alias records.
NOERROR NODATA? Consider IPv6
NXDOMAIN responses are pretty straightforward - the domain name wasn’t found. Things get a little more nuanced when you see the response come back as NOERROR but you also see that no answer was returned. There’s no dedicated response code for this situation (RFC 2308 describes NODATA as a pseudo response code); it’s usually known as a NOERROR NODATA response, and you can recognize it by an answer count of “0”. NOERROR NODATA means that the name was found, but it has no record of the type that was requested.
If you’re seeing a lot of NOERROR NODATA responses, in our experience the resolver is usually looking for an AAAA record. In that case, we’ve found that adding support for IPv6 usually fixes the problem.
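A quick way to check whether AAAA lookups are behind your NOERROR NODATA responses is to group them by requested record type. The Python sketch below assumes query log entries with hypothetical `rcode`, `answer_count`, and `qtype` fields; your logging format will differ. It is also a simplification: it treats every NOERROR response with an empty answer section as NODATA, which could include referrals.

```python
from collections import Counter

def is_nodata(response):
    """True when the RCODE is NOERROR but the answer section is empty."""
    return response["rcode"] == "NOERROR" and response["answer_count"] == 0

def nodata_by_qtype(responses):
    """Count NOERROR NODATA responses per requested record type."""
    return Counter(r["qtype"] for r in responses if is_nodata(r))

# Hypothetical log entries; the field names are assumptions for illustration.
log = [
    {"qname": "www.example.com", "qtype": "A", "rcode": "NOERROR", "answer_count": 1},
    {"qname": "www.example.com", "qtype": "AAAA", "rcode": "NOERROR", "answer_count": 0},
    {"qname": "api.example.com", "qtype": "AAAA", "rcode": "NOERROR", "answer_count": 0},
    {"qname": "example.com", "qtype": "MX", "rcode": "NOERROR", "answer_count": 0},
    {"qname": "old.example.com", "qtype": "A", "rcode": "NXDOMAIN", "answer_count": 0},
]

counts = nodata_by_qtype(log)
print(counts.most_common())
```

If AAAA dominates the result, missing IPv6 records are the likely culprit; if another type leads, the cause is probably elsewhere.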
DNS cardinality and security implications
In the world of DNS, cardinality refers to the number of resolvers associated with a single DNS record. The simpler the DNS query, the lower the cardinality. DNS records with high cardinality tend to be part of more complex transactions that involve multiple servers.
Measuring DNS cardinality is important because it can be an indicator of malicious activity. Specifically, an increase in DNS query cardinality can indicate a random label attack or probing of your infrastructure at a mass level. If you’re seeing an increase in resolver cardinality all of a sudden, it’s likely an indication of a botnet attack.
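As a minimal sketch of how resolver cardinality might be tracked, the Python below counts distinct resolver addresses per queried name in two time windows and lists the names whose count jumped. It uses the queried name as a stand-in for the record. The `(resolver_ip, qname)` input shape, the function names, and the `factor` of 3 are assumptions for illustration, and the addresses come from the reserved documentation ranges.

```python
from collections import defaultdict

def resolver_cardinality(queries):
    """Map each queried name to its number of distinct resolvers."""
    resolvers = defaultdict(set)
    for resolver_ip, qname in queries:
        resolvers[qname].add(resolver_ip)
    return {qname: len(ips) for qname, ips in resolvers.items()}

def cardinality_jumps(previous, current, factor=3.0):
    """List names whose resolver count grew more than `factor` times."""
    return sorted(
        name
        for name, count in current.items()
        if name in previous and count > factor * previous[name]
    )

# Hypothetical (resolver_ip, qname) pairs from two time windows.
earlier = [
    ("192.0.2.1", "www.example.com"),
    ("192.0.2.2", "www.example.com"),
    ("192.0.2.1", "www.example.com"),
    ("192.0.2.1", "api.example.com"),
]
later = [(f"198.51.100.{i}", "www.example.com") for i in range(1, 8)]
later.append(("192.0.2.1", "api.example.com"))

before = resolver_cardinality(earlier)
after = resolver_cardinality(later)
print(cardinality_jumps(before, after))
```

This version ignores names that only appear in the later window. In a random label attack those brand-new names may be the more telling signal, so a fuller check would probably count them as well.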
Learn more about DNS Insights, NS1’s powerful new observability feature.
A version of this post originally appeared in DZone.