Why DNS Can Be Difficult to Benchmark

DNS, it turns out, can be quite difficult to properly benchmark. At NS1 we employ a few testing methods, and each has benefits and drawbacks. The first uses Pulsar, our latency-based routing engine, together with a JavaScript tag we embed in web pages all over the internet (see the source code of Imgur's home page for an example), which measures performance from the end user's actual browser to our network. The second method is to use third party monitoring tools, mainly Catchpoint, which we've found is far and away the best.

A number of clients take the DIY approach, which we appreciate, but a lot of the tools out there can be misleading or difficult to use. For example, well respected companies like Gomez and AlertSite cache DNS responses even though their documentation says they do not! This matters when testing popular production domains against test domains that see little or no traffic. We've had to prove the hard way (i.e., by generating queries and analyzing the traffic with tcpdump) on more than one occasion that Akamai does not offer 1ms DNS resolution globally versus our ~30ms global time.

SolveDNS and Dnsperf are good examples of public, third party tools that look good on the surface: easy to digest graphs, and they even review us quite favorably: for many months we’ve been the best performing DNS provider. However, closer inspection of the data and methodology reveals some pretty significant flaws.

With SolveDNS, their direct tests against our nameservers show our uptime for the last six months at 100%. Yet, interestingly, at the same time they seem to think that Dyn has had quite a lot of downtime over the same period!

On the one hand we're happy that they're reporting we've had 100% uptime, and that at 8.67ms we're ranked as the fastest of the dedicated DNS providers. But it's clearly not the case that Dyn was at ~98%: if that math were accurate, Dyn would have been down for roughly three and a half full days during the last six months, which would be a business-ending outage. These numbers unfortunately speak to the efficacy, or lack thereof, of SolveDNS's testing methodology.

First, uptime: because DNS runs over UDP, which is lossy, it's statistically all but impossible for 100% of queries to be answered on the first try. Indeed, if you take a look at the Catchpoint numbers on some of the tests we routinely run, you'll find that Dyn, UltraDNS, and NS1 all have slightly less than 100% "uptime", which for DNS really means "was this query answered on the first attempt?"
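A quick back-of-the-envelope sketch makes the point. The 0.1% per-query loss rate and the query count below are illustrative assumptions, not measured figures for any provider:

```python
# Back-of-the-envelope: even a tiny UDP loss rate makes a perfect
# first-try answer streak vanishingly unlikely at scale.
loss_rate = 0.001   # ASSUMED probability a single query or response is dropped
queries = 10_000    # queries observed by a monitor over some period

# Probability that every single query is answered on the first try,
# treating losses as independent events.
p_all_first_try = (1 - loss_rate) ** queries

# Expected number of queries that need at least one retry.
expected_retries = loss_rate * queries

print(f"P(all {queries} answered first try) = {p_all_first_try:.2e}")
print(f"Expected queries needing a retry: {expected_retries:.0f}")
```

Even at 99.9% per-query delivery, the odds that a monitor sees ten thousand consecutive first-try answers are tiny, which is why an honest measurement never shows a flat 100%.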

All major, multihomed DNS providers are subject to minor, localized outages, but because we're all anycasting, our uptime on the whole is effectively 100% worldwide. A query might go unanswered on the first try because of local congestion at an ISP, or because a fiber cut prevents traffic from reaching our nodes in a region. We have loads of monitoring in place to alert us to these conditions, and when we observe a reachability issue we use traffic engineering to remove the unreachable POP, perhaps in exchange for increased latency for users in the region. That's a tradeoff we're more than willing to make, because uptime is far and away the most important metric for all of us.

Regarding latency, SolveDNS says in their report that "there are several DNS services that offer speeds of below 10 ms." As a global figure that's physically impossible, and it's another indicator that they don't quite understand how the Internet fundamentally works. The speed of light in fiber is limited to roughly 0.6c, so in a 10ms round trip a signal can only reach a server about 558 miles away and return. Achieving 10ms response times globally would require something like 15,000 geographically distributed nodes placed all around the world. The big three managed DNS providers (NS1, Dyn, and UltraDNS) all have between 18 and 25 locations worldwide, and adding more actually tends to make response times slower, because BGP is not very smart. Change SolveDNS's statement to "speeds below 10ms in many major markets" and it becomes a bit more accurate.
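The 558-mile figure falls straight out of the physics. A quick sanity check, using the vacuum speed of light and the ~0.6 velocity factor typical of optical fiber:

```python
# How far away can a server be if light in fiber (~0.6c) must complete
# the full round trip within a 10 ms budget?
C_VACUUM_MILES_PER_SEC = 186_282   # speed of light in vacuum, statute miles/s
FIBER_FACTOR = 0.6                 # typical velocity factor of optical fiber
rtt_seconds = 0.010                # 10 ms round-trip budget

# Total distance light covers in fiber during the round trip,
# then halve it for the maximum one-way reach.
round_trip_miles = C_VACUUM_MILES_PER_SEC * FIBER_FACTOR * rtt_seconds
max_one_way_miles = round_trip_miles / 2

print(f"Max server distance for a 10 ms RTT: {max_one_way_miles:.0f} miles")
```

And that ignores serialization, queuing, and processing delay entirely, so the real-world radius is even smaller.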

The reason they think we're responding within 8.67ms globally is that their monitoring nodes are not very distributed: they're only testing from seven locations worldwide: Los Angeles, Dallas, New York, San Francisco, London, Amsterdam and Singapore. Guess where we happen to have POPs? In each of these markets we have servers that are physically just a few hundred miles away (definitely within 558 miles) from all of their testing nodes. In some cases we might even be in the same data centers. Because of this proximity problem, their data is just not a good indicator of real world performance for any of the major DNS providers, who are all deployed in similar markets because there are so many end users there.
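To see how easily a tester lands inside that radius, here's a hypothetical check using the standard haversine formula and public city coordinates. The scenario (a monitoring node in San Francisco, a POP in Los Angeles) is an illustrative assumption, not a statement about where any provider's POPs actually sit:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in statute miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# HYPOTHETICAL: testing node in San Francisco, POP in Los Angeles.
sf_to_la = haversine_miles(37.7749, -122.4194, 34.0522, -118.2437)
print(f"SF -> LA: {sf_to_la:.0f} miles")
```

At roughly 350 miles, the pair sits comfortably inside the 558-mile light-speed radius, so sub-10ms readings from that node say nothing about global performance.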

With regard to Dnsperf, they're subject to similar constraints, but their methodology is definitely better: they test from 13 locations, which gives much better distribution, and they explain more about what data they keep and what they throw away. What's most instructive about Dnsperf, though, is the information on the top right hand side of their page: this was a side project Dmitriy built for jsDelivr when he was looking to replace Akamai and find the provider with the best possible real-world performance. The end results speak for themselves:

$ whois jsdelivr.com | grep -i name
  Domain Name: JSDELIVR.COM
  Name Server: DNS1.P07.NS1.NET
  Name Server: DNS2.P07.NS1.NET
  Name Server: DNS3.P07.NS1.NET
  Name Server: DNS4.P07.NS1.NET

They've actually written a blog post about why they chose NS1 as their provider. Ultimately they agreed that the objective data generated by Catchpoint's 380 global nodes presented a more accurate picture of our performance and uptime than what they were able to generate on their own. What's really key is Catchpoint's ability to measure from eyeball networks and ISPs as opposed to just relying on measurements taken from infrastructure providers and tier 1 carriers. Those measurements can, of course, be useful, but what you really care about is the resolution performance your actual end users get, and Catchpoint is great as it allows us to measure both.

On to Catchpoint: experience testing can be a very useful tool with their platform (incidentally, we integrate with them directly: you can beam that information into our platform to make routing decisions, mark a server as down, etc.). To accurately test raw DNS response performance, make sure you're running a "Direct" test for a domain via its authority. This doesn't take into account things like SRTT, but for the most part that's okay. It's also important to avoid generating and querying random subdomains on a wildcard record, because in certain scenarios that can bust caches or trip anti-DDoS measures. (We can make wildcard records significantly more performant if that's important to you; we don't do it by default because wildcards are a common DDoS vector.)
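For the curious, here's a rough sketch of what a "direct" authoritative test does under the hood, using only the Python standard library. The packet layout follows the standard DNS wire format; the IP in the usage comment is a documentation placeholder, and a real test would point at one of the tested domain's actual authorities:

```python
import random
import socket
import struct
import time

def build_query(name, qtype=1):
    """Build a minimal DNS query packet (QTYPE 1 = A, QCLASS 1 = IN).

    The flags word is left at zero, so RD (recursion desired) is unset:
    when querying an authority directly you want its answer, not recursion.
    """
    tid = random.randint(0, 0xFFFF)
    # Header: id, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack(">HHHHHH", tid, 0x0000, 1, 0, 0, 0)
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    return header + qname + b"\x00" + struct.pack(">HH", qtype, 1)

def time_direct_query(server_ip, name, timeout=2.0):
    """Send one query straight at an authority and time the response in ms.

    Returns None if the reply is lost or late, i.e. not answered on the
    first try: the event the "uptime" numbers above are really counting.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        start = time.perf_counter()
        sock.sendto(build_query(name), (server_ip, 53))
        sock.recv(512)
        return (time.perf_counter() - start) * 1000.0
    except socket.timeout:
        return None
    finally:
        sock.close()

# Usage (placeholder IP; substitute one of your domain's real authorities):
# rtt_ms = time_direct_query("198.51.100.53", "example.com")
```

Note that a one-shot probe like this measures a single network path at a single moment; it takes hundreds of distributed vantage points, Catchpoint-style, before the numbers mean anything.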

The bottom line is that we are consistently one of the top performing DNS networks when it comes to raw response times. That's only part of the story, though, and this is where intelligence and the quality of the answer we return start to come into play. For properties doing large amounts of DNS traffic, only a small fraction of requests actually reach our name servers, because most are answered from ISP resolver caches, and it's there that we provide a significantly faster experience for end users. Time to first byte is probably the best way to think about it: it doesn't matter how fast we spit out an answer if we send a single user on Time Warner in Southern California to New York because our Geo-IP database is wrong; every single user behind that resolver is going to have an additional 70ms added onto their round trip to your servers.

We're so much better at this because we architected NS1 from the ground up with performance and traffic routing in mind. This ranges from kernel-level optimizations that map IPs to CPU cores to ensure cache locality, up the stack to things like edns-client-subnet support, RUM data that lets us programmatically optimize our Geo-IP database, and innovative tech like linked records that let you skip secondary lookups to CDNs. Even if you have just a single facility, we can still perform inbound route optimization for you by routing your end users to your primary datacenter via the specific upstream transit provider that's currently offering the best performance from the end user's local ISP.
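To make the edns-client-subnet piece concrete, here's a sketch of the EDNS0 OPT pseudo-record a resolver appends to a query to forward the client's subnet, following the wire format in RFC 7871. The /24 prefix and the TEST-NET address below are illustrative choices, not anything specific to our implementation:

```python
import struct

def ecs_opt_rr(client_ip, source_prefix=24, udp_payload=4096):
    """Build an EDNS0 OPT pseudo-record carrying an IPv4 CLIENT-SUBNET
    option (RFC 7871). Appended to the additional section of a query so
    the authority can answer for the end user's network rather than the
    resolver's location.
    """
    # ADDRESS is truncated to only the bytes covered by the source prefix.
    addr = bytes(int(o) for o in client_ip.split("."))[: (source_prefix + 7) // 8]
    # FAMILY=1 (IPv4), SOURCE PREFIX-LENGTH, SCOPE PREFIX-LENGTH=0 in queries.
    ecs_data = struct.pack(">HBB", 1, source_prefix, 0) + addr
    # Option code 8 = edns-client-subnet, then option length and option data.
    option = struct.pack(">HH", 8, len(ecs_data)) + ecs_data
    # OPT RR: root name (0x00), TYPE=41 (OPT), CLASS reused as the advertised
    # UDP payload size, TTL reused as extended flags (0), then RDLENGTH+RDATA.
    return b"\x00" + struct.pack(">HHIH", 41, udp_payload, 0, len(option)) + option

# TEST-NET-3 address, used purely for illustration.
rr = ecs_opt_rr("203.0.113.45")
```

The key design point is that only the prefix travels, not the full client IP, which is how the authority gets enough geographic signal to pick the right answer without the resolver leaking the user's exact address.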

While we're confident that the speed with which we answer DNS queries is on par with (and about to eclipse) the other industry leaders worldwide, that's only one part of the equation. What's potentially much more important, but harder to measure across providers right now, is the fact that the accuracy of the answers we return is significantly better than what any other provider can offer. This is truly what sets NS1 apart from the rest of the world when it comes to intelligent DNS and traffic management.