Customers occasionally ask whether caching by DNS resolvers defeats the purpose of DNS-based load balancing, failover, and perhaps DNS traffic management in general. This article addresses those questions.
First, let’s quickly review how DNS traffic management works and why caching might make it less effective. On NS1, the foundation of our traffic management is the metadata we can associate with DNS records. For example, the A record for nycdatacenter.example.com might have metadata that indicates:
- Its geographic location (US east coast NYC)
- Whether that location is available (UP/DOWN)
- How many active connections it is currently serving.
Equivalent metadata can be maintained for a west coast data center A record – ladatacenter.example.com.
The next step in DNS traffic management is setting up the rules for using the metadata to choose between the two A records when responding to a query for example.com. We have the Filter Chain for that. The rules might be:
- Eliminate any location with UP/DOWN status = DOWN.
- Sort the records in order of geographic proximity to the requester.
- If the connection count at the closer location is below a defined threshold, send that answer to the DNS resolver.
- If the connection count exceeds the threshold, apply a weighting to the responses in order to divert some traffic to the data center that is less busy but further away.
In short, we use DNS to make sure users are never sent to an unavailable data center and are directed to the geographically closer one, and we use dynamic load balancing to prevent one data center from becoming overloaded while the other has capacity to spare.
That is all good, but what happens if DNS resolvers answer requests from their cache and rarely forward queries to our authoritative servers? If most end-user DNS requests never reach our nameservers, how effective can DNS traffic management be?
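The four rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Filter Chain itself: the `Answer` fields, the `choose_answer` function, the `far_weight` parameter and the coordinates are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Answer:
    """One A record plus the metadata described above (hypothetical shape)."""
    name: str
    lat: float
    lon: float
    up: bool
    connections: int


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def choose_answer(answers, client_lat, client_lon, threshold, far_weight=0.5, rng=random):
    # Rule 1: eliminate any location whose status is DOWN.
    candidates = [a for a in answers if a.up]
    if not candidates:
        return None
    # Rule 2: sort by geographic proximity to the requester.
    candidates.sort(key=lambda a: distance_km(client_lat, client_lon, a.lat, a.lon))
    nearest = candidates[0]
    # Rule 3: the nearest site is within its connection threshold, so use it.
    if nearest.connections <= threshold or len(candidates) == 1:
        return nearest
    # Rule 4: the nearest site is too busy, so divert a share of traffic
    # (far_weight, an assumed tuning knob) to the sites further away.
    others = len(candidates) - 1
    weights = [1.0 - far_weight] + [far_weight / others] * others
    return rng.choices(candidates, weights=weights, k=1)[0]


nyc = Answer("nycdatacenter.example.com", 40.71, -74.01, up=True, connections=100)
la = Answer("ladatacenter.example.com", 34.05, -118.24, up=True, connections=50)
# A requester near Boston is sent to NYC while NYC is up and under its threshold.
print(choose_answer([nyc, la], 42.36, -71.06, threshold=500).name)
```

Note that the decision is made per query at the authoritative server, which is exactly why resolver caching matters: a cached answer bypasses all four rules until it expires.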
DNS resolvers cache and reuse the answer received from an authoritative nameserver for the amount of time specified in the record itself: the time to live (TTL). TTLs are configurable, and for effective DNS traffic management shorter is better, up to a point. For records where time-sensitive traffic management is needed (such as failover records), we recommend setting the TTL between 5 and 60 seconds. The lower bound exists because most resolvers will not accept a TTL of less than 5 seconds from the authoritative server and will instead substitute a much longer TTL of their own choosing. It is possible to set a TTL of zero to tell the resolver never to cache the record, but in practice this rarely works as expected.
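To make the caching behavior concrete, here is a toy model of a resolver that honors TTLs. It is a simplified sketch, not how any particular resolver is implemented: the `TtlCache` class, its `fetch` callback and the injectable `clock` are hypothetical, and real resolvers add details such as minimum and maximum TTL clamping.

```python
import time


class TtlCache:
    """Toy resolver cache: reuse an answer until its TTL expires."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch          # callable(name) -> (answer, ttl_seconds)
        self._clock = clock          # injectable so the example can fake time
        self._entries = {}           # name -> (answer, expires_at)
        self.upstream_queries = 0    # queries that reached the authoritative side

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]          # still fresh: answer from cache
        answer, ttl = self._fetch(name)
        self.upstream_queries += 1
        self._entries[name] = (answer, now + ttl)
        return answer


# Fake clock and a fake authoritative server whose answer changes (a failover).
now = [0.0]
authoritative = {"example.com": "192.0.2.10"}
cache = TtlCache(lambda name: (authoritative[name], 30), clock=lambda: now[0])

first = cache.resolve("example.com")           # cache miss, goes upstream
authoritative["example.com"] = "198.51.100.20" # failover happens at the source
now[0] = 29.0
stale = cache.resolve("example.com")           # still the old answer from cache
now[0] = 30.0
fresh = cache.resolve("example.com")           # TTL expired, new answer fetched
```

In this run the resolver keeps handing out the old address until the 30-second TTL runs out, and only two of the three lookups reach the authoritative side.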
Remember, too, that there is a good reason for wanting answers to come from resolver cache rather than from the authoritative nameserver. Responses from cache reach the end user about 60 ms faster than ones that come from the authoritative nameserver, and if you use a DNS provider that is less performant than NS1, the speed delta is greater. So you want a TTL that meets your traffic management objective without sending an excessive number of queries to the authoritative nameserver.
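A rough way to reason about this trade-off is to estimate how the TTL changes the cache hit ratio at a single resolver. The sketch below assumes queries for the name arrive at that resolver as a Poisson process and that the TTL countdown starts when the resolver fetches the record; under those assumptions each fetch is followed by, on average, rate × TTL cache hits. The function names are hypothetical, and the 60 ms penalty is simply the figure quoted above.

```python
def cache_hit_ratio(queries_per_second, ttl_seconds):
    """Expected fraction of lookups answered from cache at one resolver.

    Assumes Poisson arrivals and a TTL timer that starts on each cache miss.
    """
    load = queries_per_second * ttl_seconds   # expected hits per fetch
    return load / (1.0 + load)


def mean_extra_latency_ms(queries_per_second, ttl_seconds, miss_penalty_ms=60.0):
    """Average latency added per lookup by trips to the authoritative server."""
    return (1.0 - cache_hit_ratio(queries_per_second, ttl_seconds)) * miss_penalty_ms


for ttl in (5, 30, 60):
    ratio = cache_hit_ratio(queries_per_second=1.0, ttl_seconds=ttl)
    extra = mean_extra_latency_ms(queries_per_second=1.0, ttl_seconds=ttl)
    print(f"TTL {ttl:>2}s: hit ratio {ratio:.3f}, mean extra latency {extra:.1f} ms")
```

Under this model, a resolver that sees even one query per second for a name serves the large majority of lookups from cache at any TTL in the recommended range, so shortening the TTL costs relatively little on busy resolvers.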
You will also increase the percentage of queries that reach the authoritative server (rather than being answered from the resolver cache) when you configure EDNS(0) CLIENT-SUBNET on your records. With this feature, the resolver includes the end user’s /24 subnet in the query it sends to the authoritative nameserver. Its purpose is to allow more accurate geo-routing, because the location of the end user is used rather than the location of the DNS resolver. The side effect is that resolvers create a separate cache entry for each /24 subnet they see from end users. So rather than a single cache entry for “example.com”, there is one entry for every unique /24 among the source IP addresses of incoming requests. Empirical data indicate that EDNS(0) CLIENT-SUBNET increases query counts on the authoritative DNS by about a factor of four. But to be clear, the only reason to use it is to improve the accuracy of DNS geo-routing. It won’t make the information held in resolver cache more “real time”.
Summarizing, DNS-based traffic management involves balancing and optimizing two factors:
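The effect on the resolver cache can be illustrated by looking at the cache key. In the sketch below, which is a simplified model rather than actual resolver code, the helper names `cache_key` and `count_cache_entries` are hypothetical, the addresses are documentation ranges, and only IPv4 with a /24 prefix is considered.

```python
import ipaddress


def cache_key(qname, client_ip=None, prefix_len=24):
    """Cache key for a lookup; includes the client's subnet when one is supplied."""
    if client_ip is None:
        return (qname.lower(), None)
    # strict=False masks the host bits, e.g. 203.0.113.5/24 -> 203.0.113.0/24
    subnet = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return (qname.lower(), str(subnet))


def count_cache_entries(queries, use_client_subnet):
    """Number of distinct cache entries a resolver would hold for these lookups."""
    keys = set()
    for qname, client_ip in queries:
        keys.add(cache_key(qname, client_ip if use_client_subnet else None))
    return len(keys)


queries = [
    ("example.com", "203.0.113.5"),
    ("example.com", "203.0.113.77"),   # same /24 as the first query
    ("example.com", "198.51.100.9"),
    ("example.com", "192.0.2.1"),
]
without_subnet = count_cache_entries(queries, use_client_subnet=False)
with_subnet = count_cache_entries(queries, use_client_subnet=True)
```

Here the four lookups share a single cache entry without client subnet and occupy three entries with it. Each extra entry has to be fetched from the authoritative server separately, which is where the additional query volume comes from.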
- You want users to be answered from resolver cache as much as possible.
- You want the DNS information in resolver cache to be current with respect to the metadata it reflects.
With a 30-second TTL, the DNS information in resolver cache is on average 15 seconds old. Is that current enough for DNS traffic management to be effective? The answer depends in part on which metadata we are talking about:
- Geo metadata is not time dependent. It is a static attribute of the A record, so you don’t need a short TTL to do effective geo-routing.
- UP/DOWN is probably the most time-sensitive of all the metadata we support. When a site goes down, you don’t want to send users to it. With monitoring, the DNS “knows” a site is down because it stops responding to monitoring requests (PING, TCP, HTTP). Most customers monitor at 20-second intervals, so on average our DNS knows the site is down within 10 seconds of the failure. With a 30-second TTL, resolvers will on average stop sending users to that site within 15 seconds of the metadata being updated. So about 50% of incoming users will be routed away from the failed site within 25 seconds of the outage, and close to 100% within 1 minute of its start.
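The failover arithmetic above can be checked with a small model. It assumes the outage starts at a uniformly random moment within the 20-second monitoring interval, the next check detects it immediately, each user’s resolver holds a cache entry with a uniformly random remaining lifetime of up to the TTL, and every resolver honors the TTL. The function name `fraction_rerouted` is hypothetical; the formula is the cumulative distribution of the sum of two independent uniform delays.

```python
def fraction_rerouted(t, monitor_interval=20.0, ttl=30.0):
    """Fraction of users steered away from a failed site t seconds after it fails.

    Total delay = detection delay ~ U(0, monitor_interval)
                + remaining cache lifetime ~ U(0, ttl), assumed independent.
    """
    a, b = sorted((monitor_interval, ttl))   # a is the shorter of the two delays
    if t <= 0:
        return 0.0
    if t >= a + b:
        return 1.0
    if t <= a:
        return t * t / (2 * a * b)
    if t <= b:
        return (t - a / 2) / b
    return 1.0 - (a + b - t) ** 2 / (2 * a * b)


for seconds in (10, 25, 40, 50):
    print(f"{seconds:>2}s after the outage: {fraction_rerouted(seconds):.0%} rerouted")
```

Under these assumptions the model gives exactly half of users rerouted at 25 seconds and all of them by 50 seconds (one full monitoring interval plus one full TTL), which agrees with the figures above. Resolvers that ignore TTLs, and caching in browsers or operating systems, would lengthen the tail in practice.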
There is debate in the industry as to whether resolvers actually respect TTLs. The studies that have been done suggest that about 90% to 95% of resolvers do. The other proof, of course, is in the pudding. As many of our customers have found, intelligent DNS traffic management is an excellent way to do load balancing, failover, and performance-based application routing. It works, and it doesn’t require adding an inline, cloud-based layer 7 approach, which adds cost and complexity to your application delivery stack.
For more information about DNS traffic management, check out these resources:
- Global Load Balancing Data Sheet
- Global Load Balancing White Paper