What is DNS Failover?
DNS failover helps websites or network services remain accessible in the event of an outage. The Domain Name System (DNS) is the protocol used to translate human-readable hostnames into IP addresses. By providing two or more IP addresses in a DNS record, each representing an identical server, you can move traffic from a failing server to a live, redundant server.
Related DNS Concepts
These basic concepts will help you understand how DNS can perform failover.
A DNS query is a request sent by a DNS client (a web browser, application or network device) that wants to connect to a remote system via a hostname. The DNS client contacts a DNS Resolver to request an IP address for that hostname. The DNS Resolver tries to locate a DNS server that holds the correct IP address for the required hostname. When it finds one, it obtains the IP address or other required details and resolves the query by returning the DNS record to the client.
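From the client's point of view, the whole exchange above is a single lookup call. The following is a minimal sketch using Python's standard `socket` module; the function name `resolve_ipv4` and the default port are illustrative assumptions, and the call simply hands the hostname to whatever resolver the operating system is configured to use.

```python
import socket


def resolve_ipv4(hostname, port=80):
    """Ask the system's configured resolver for the IPv4 addresses of a hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (ip, port) pair.
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

Calling `resolve_ipv4("www.example.com")` returns a list of address strings. If the hostname's A record holds several addresses, the list contains all of them.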
Authoritative Name Server
A DNS server that is responsible for managing the DNS zone for a specific domain or subdomain. It is called “authoritative” because it contains the correct, up-to-date IP addresses, and other information, for hostnames under the domain.
A DNS Resolver starts resolving a query by looking in its local cache for the required IP address, or the address of the authoritative name server for the required host. Failing that, it performs a recursive query, starting from the Internet’s root DNS server, until it finds the authoritative name server.
DNS A Record
An A record is a type of DNS record, stored in a zone file on a DNS name server. A zone file is a text file which contains all the DNS information for a domain or subdomain. The A record simply records the IP address mapped to a hostname, like this:
www A 22.214.171.124
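To make the record structure concrete, here is a small sketch that parses a line in the simplified `name A address` form shown above. Real zone files also allow TTL and class fields, comments and other record types, none of which this handles; the names `ARecord` and `parse_a_record` are illustrative assumptions.

```python
import ipaddress
from typing import NamedTuple


class ARecord(NamedTuple):
    name: str
    address: ipaddress.IPv4Address


def parse_a_record(line):
    """Parse a simplified 'name A address' line into an ARecord."""
    fields = line.split()
    if len(fields) != 3 or fields[1].upper() != "A":
        raise ValueError(f"not a simplified A record: {line!r}")
    name, _rtype, address = fields
    # IPv4Address raises a ValueError subclass if the address is malformed.
    return ARecord(name, ipaddress.IPv4Address(address))
```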
DNS Load Balancing with Round Robin
Round robin load balancing is done within an A record, by assigning multiple IP addresses to the same host. The DNS client tries the first IP address, and if it does not respond, waits 30 seconds for a timeout, and then tries the next address in the list.
www A 126.96.36.199
www A 188.8.131.52
www A 184.108.40.206
Round robin DNS load balancing is inherently limited, because it depends on a timeout on the client side, and doesn’t take into account availability, load or latency, so the user might be routed to a dead or suboptimal destination.
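Many DNS servers spread load across such a record set by rotating the order of the addresses from one response to the next. The sketch below assumes that cyclic behavior; the class name `RoundRobinAnswers` is hypothetical and does not correspond to any particular server's implementation. Note that nothing in it looks at availability, load or latency, which is exactly the limitation described above.

```python
from collections import deque


class RoundRobinAnswers:
    """Rotate the order of an A record's addresses on each query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        current = list(self._addresses)
        # Move the first address to the end so the next query
        # sees a different address at the head of the list.
        self._addresses.rotate(-1)
        return current
```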
Next-generation DNS services like NS1 add multiple answers to a single A record, with metadata attached to each answer, reflecting the load, geolocation and other relevant parameters of different servers. The decision about which server to route the user to is made at the DNS level, not on the client side. This eliminates the timeout, and enables the DNS server to select the optimal destination for the current user. Learn more about NS1’s DNS Global Server Load Balancing.
At every stage in the DNS process, the DNS Client, DNS Resolver, and DNS Name Servers can cache DNS responses they received in the past.
For example, once a DNS Resolver has been asked for the IP address of “www.example.com” and has gone through a recursive query to obtain the correct IP address, it stores that address in its cache. The next time a client asks for the same hostname, the resolver provides the same IP address from cache.
To prevent cache staleness, DNS records contain a parameter called Time to Live (TTL). It is assumed that different components in the DNS process will only retain cached DNS records for the specified TTL period.
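The caching behavior described above can be sketched as a small TTL-aware cache. This is an illustration only, not how any specific resolver is implemented; the class name `TTLCache` and the injectable `clock` parameter (there to make the behavior easy to test) are assumptions.

```python
import time


class TTLCache:
    """Cache DNS answers and drop them once their TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # hostname -> (addresses, expires_at)

    def put(self, hostname, addresses, ttl_seconds):
        expires_at = self._clock() + ttl_seconds
        self._entries[hostname] = (list(addresses), expires_at)

    def get(self, hostname):
        entry = self._entries.get(hostname)
        if entry is None:
            return None
        addresses, expires_at = entry
        if self._clock() >= expires_at:
            # The record is stale: forget it so the caller re-queries.
            del self._entries[hostname]
            return None
        return addresses
```

A `None` result means the caller has to go back to the resolver or name server. Until that happens, the cache keeps handing out the old address, even if the server behind it has failed.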
How Traditional DNS Failover Works
DNS failover can work on the client side or on the server side. In either mode, a DNS A record must be defined with more than one IP address (known as DNS A record failover). The first IP address should point to the default, production server, and the other IP addresses should point to identical (or frequently synchronized) redundant servers.
Client Side DNS Round Robin Failover
A browser or network device will typically recognize that more than one IP address is provided for the same hostname. If the first IP does not respond, it will wait 30 seconds and then try the next IPs in the list in order. This enables failover, because if a server is down, a client will eventually move on to another IP and reach a redundant server.
WARNING: Client side failover is not a recommended option. Client side failover might enable DNS attacks such as DNS rebinding and DNS pinning. In addition, it is not compatible with all browsers and operating systems, and can cause unpredictable behavior in cache-control headers.
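For illustration only, given the warning above, the client-side behavior can be sketched as a loop that tries each address in turn and moves on after a timeout. The function name `connect_with_failover` and the injectable `connect` parameter are assumptions made for this example; the 30-second default mirrors the timeout mentioned in the text.

```python
import socket


def connect_with_failover(addresses, port, timeout_seconds=30.0,
                          connect=socket.create_connection):
    """Try each address in order and return the first successful connection."""
    last_error = None
    for ip in addresses:
        try:
            return connect((ip, port), timeout=timeout_seconds)
        except OSError as exc:
            # Covers timeouts and refused connections; try the next address.
            last_error = exc
    raise ConnectionError(f"no address responded: {list(addresses)}") from last_error
```

In the worst case the user waits one full timeout per dead server before reaching a live one, which is the main practical cost of this approach.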
Server Side Automatic DNS Failover with Redundancy
To implement failover on the server side, you’ll need to monitor all the servers listed in the DNS records—the primary server and additional redundant servers. As soon as a server goes down, the DNS server should automatically switch the DNS A record to list the IP address for the working server first.
When DNS resolvers come back to request the IP address for the site, they receive the updated IP address, and route the user to the redundant server.
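The monitoring-and-reordering step can be sketched as follows. This is a simplified illustration under stated assumptions: `tcp_check` is a hypothetical health check that treats a server as up if it accepts a TCP connection, and `order_by_health` only computes the new record order. Publishing that order to the DNS server depends on the provider or server software and is not shown.

```python
import socket


def tcp_check(ip, port=80, timeout_seconds=2.0):
    """Hypothetical health check: up if the server accepts a TCP connection."""
    try:
        with socket.create_connection((ip, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


def order_by_health(addresses, is_healthy=tcp_check):
    """Return the addresses with working servers listed first."""
    status = {ip: is_healthy(ip) for ip in addresses}
    # sorted() is stable, so the configured priority is kept within
    # the healthy group and within the unhealthy group.
    return sorted(addresses, key=lambda ip: not status[ip])
```

For example, with the primary server down, `order_by_health(["192.0.2.10", "192.0.2.11"], is_healthy=...)` would put `192.0.2.11` first, and the primary returns to the head of the list once it passes its check again.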
There are several DNS providers that offer DNS failover as a managed service, including monitoring.
3 Limitations of Traditional DNS Failover
#1: Only Updates Once Every TTL Cycle
The basic limitation of traditional DNS failover is that it only takes effect when the Time to Live (TTL) for the host’s DNS record expires. Until that point, the old record will be stored in local cache along the DNS resolution path, and users will continue to be referred to the failed server.
This issue can largely be resolved by defining a lower TTL. At NS1 we recommend defining a TTL of 30 seconds (lower than that will place unnecessary load on the DNS server, and may not be obeyed by DNS resolvers in some cases).
With TTL = 30s, allowing 10 more seconds for monitoring to pick up the failure, 50% of DNS resolvers will update with the new DNS record within 25 seconds of failure. Close to 100% of users will be routed within a minute.
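One way to arrive at these figures is the simple model sketched below. It is an assumption, not a derivation stated in the text: at the moment of failure, each resolver's cached record has a remaining lifetime spread evenly between 0 and 30 seconds, and the DNS record is changed 10 seconds after the failure. A resolver whose cache expires before the change re-caches the old answer for another full TTL.

```python
import math
import statistics


def update_time(remaining_ttl, ttl=30.0, detection_delay=10.0):
    """Seconds after the failure at which a resolver first fetches the new record."""
    if remaining_ttl >= detection_delay:
        # The cache expires after the record was changed.
        return remaining_ttl
    # The cache expires before the change, so the old answer is
    # cached again for one or more further TTL cycles.
    extra_cycles = math.ceil((detection_delay - remaining_ttl) / ttl)
    return remaining_ttl + extra_cycles * ttl


# Evenly spaced remaining TTLs approximate a uniform spread of cache ages.
samples = [update_time(i * 30.0 / 3000) for i in range(3000)]
median_update = statistics.median(samples)  # about 25 seconds
worst_case = max(samples)                   # just under 40 seconds
```

Under this model, half of the resolvers pick up the new record within about 25 seconds, and all of them do so within 40 seconds, which is consistent with the one-minute figure above. Resolvers that ignore short TTLs fall outside the model.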
#2: No Failure Detection
Traditional DNS servers, on their own, are not capable of detecting failure. This makes it necessary to run external monitoring of all servers participating in the failover, and “intervene” by changing DNS records when a server goes down. This is a cumbersome process, and even when done automatically, it is complex to implement and has several points of failure.
#3: Not Aware of Load, Geography or Service Capabilities
Traditional DNS failover is not aware of the current load on different servers. For example, if there are two backup servers, and the main server goes down, there is no easy way to determine which of the remaining two servers has less load, and redirect traffic to it.
Additionally, in many web applications, users need to be redirected to a data center or endpoint that is closer to their geographical location, or that provides the services or capabilities they need. Traditional DNS servers cannot do this because they have no information about the location or capabilities behind each IP address.
Even if you could implement load or geography awareness yourself—for example, using an external monitoring service—these parameters are dynamic. If you redirect all traffic to the least loaded server, and it becomes overloaded, you would need to reset DNS records again and wait for the changes to be propagated.
Fast, Efficient DNS Failover with Next-Generation DNS Services
NS1 is a next-generation DNS service that provides a globally anycasted DNS service, maintaining 24 global POPs with direct access to Tier 1 Internet Service Providers and hundreds of Gbps of capacity at all times.
Unlike traditional DNS infrastructure, NS1 provides an instant response with no propagation delays. It leverages a global network of DNS name servers based on proprietary technology, which can communicate changes much more quickly than the traditional DNS server, BIND.
NS1 can solve two of the three problems posed by traditional DNS failover:
- Health checks—while traditional DNS has no failure detection, NS1 performs health checks on all resources and routes DNS clients to an available resource, propagating changes instantly.
- Awareness of resource status and capabilities—NS1 collects metadata about the current load, geographical location, network latency, and other important parameters of your resources (which traditional DNS is not aware of) and routes DNS clients to the resource that will provide the optimal experience.
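The idea of choosing an answer at the DNS level from health and metadata can be sketched generically as shown below. This is not NS1's API or algorithm; the `Answer` fields, the region labels and the selection rule (healthy first, then same region, then lowest load) are all assumptions chosen to illustrate the concept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    address: str
    region: str    # hypothetical label, e.g. "eu" or "us"
    load: float    # 0.0 (idle) to 1.0 (saturated)
    healthy: bool  # result of the most recent health check


def choose_answer(answers, client_region):
    """Pick one answer for a client using health, region and load metadata."""
    candidates = [a for a in answers if a.healthy]
    if not candidates:
        raise LookupError("no healthy answers available")
    # Prefer answers in the client's region, then the least loaded one.
    return min(candidates, key=lambda a: (a.region != client_region, a.load))
```

Because the choice is made per query from current metadata, a failed or overloaded resource stops being handed out as soon as its status changes, subject to the caching discussed next.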
DNS caching and TTL are still an issue, but setting a TTL value of 30 seconds will allow 50% of users to be redirected within 25 seconds of failure, and close to 100% of users within a minute.
Improve Business Reliability and Security with NS1
Network and application infrastructures are foundational to business technology, and must be highly reliable, secure and adaptable to ensure business success. Learn more about how NS1 can help you build business resilience.