It’s undeniable that operating enterprise networks and applications is growing in complexity. Migration projects that move production applications and workloads across on-premises and hybrid cloud environments are multiplying. The number of software updates being deployed is increasing as more enterprise development teams adopt agile, DevOps, and CI/CD practices. The need to maintain site reliability and user experience in the face of these changes puts enormous pressure on network and operations teams to improve their efficiency and effectiveness.
In this blog we explore how applying intelligent traffic steering can simplify and de-risk these operational tasks. While most networking teams are familiar with using basic traffic steering policies such as round-robin to improve application performance, the addition of real-time infrastructure data and configurable logic for traffic steering delivers new options for improving productivity.
Introduction to NS1 Filter Chain
The key to operational efficiency is being able to embed intelligent decision-making capabilities within the networking infrastructure itself. To make a decision you need data and logic.
With network and application infrastructure, the data required to make decisions is a mix of fairly static information (e.g., where is this resource located geographically?) and dynamic information (e.g., is the resource available? is it approaching capacity overload?). Pulling dynamic data from existing monitoring tools via APIs is the best way to ensure that decisions are made with real-time information.
The innovative approach NS1 takes is to associate metadata with each potential DNS answer. That metadata is obtained automatically through our APIs and integrations with solutions such as AWS CloudWatch or Datadog.
The decision-making logic depends on the business requirements of the application and the available information. Most of the time the logic involves sorting the list of DNS responses according to specific criteria (e.g., sort the list of available web servers by capacity) or removing items from the list (e.g., remove web servers marked as unavailable from the list). At NS1 we call these logical units filters.
When NS1 receives a DNS query, we first look up the list of potential answers and then apply the Filter Chain to that list before serving a response. Each filter takes a list of DNS answers as its input, performs a single action using the associated metadata (e.g., sorting or removing answers), and then passes the modified list to the next filter in the chain. The last filter selects the answer based on its criteria and returns it as the best possible answer for your user.
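The pipeline described above can be sketched as ordinary function composition. This is an illustrative model only, not NS1's actual implementation; the filter names, metadata fields, and answer format are assumptions chosen to mirror the examples in the text.

```python
# Each "filter" takes a list of candidate answers (with metadata) and
# returns a possibly shorter or reordered list. The chain is applied
# left to right, and the last filter picks the final answer.

def up_filter(answers):
    """Remove answers whose monitoring metadata marks them as down."""
    return [a for a in answers if a["meta"].get("up", True)]

def sort_by_capacity(answers):
    """Prefer answers with the fewest active connections (most headroom)."""
    return sorted(answers, key=lambda a: a["meta"].get("connections", 0))

def select_first(answers):
    """Final filter: keep only the single best answer."""
    return answers[:1]

def apply_chain(answers, chain):
    for f in chain:
        answers = f(answers)
    return answers

answers = [
    {"answer": "203.0.113.10", "meta": {"up": True, "connections": 72}},
    {"answer": "203.0.113.20", "meta": {"up": False, "connections": 5}},
    {"answer": "203.0.113.30", "meta": {"up": True, "connections": 31}},
]
chain = [up_filter, sort_by_capacity, select_first]
best = apply_chain(answers, chain)
# The down server is dropped, then the least-loaded healthy server wins.
```

Because each filter is single-purpose, swapping or reordering filters changes the steering policy without rewriting any one of them.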
This combination of data and logic enables you to implement custom decision-making logic very easily by chaining together simple, single-purpose algorithms. Having this capability at the DNS layer empowers you with more flexibility in steering enterprise users to application resources than ever before.
1) Improve site reliability with live data and logic
Site reliability is a critical operational metric for any digital business or service. Traditionally, availability problems are handled through an incident response process: the monitoring solution sends an alert to the IT service desk, the operations team triages the event and escalates it to the appropriate teams for resolution, and those teams then gather data, troubleshoot the problem (which may involve several rounds of finger-pointing), and implement a fix. All of this takes time, during which customers remain impacted by the issue. A common measure of operational efficiency is MTTR (mean time to repair), because the longer it takes to repair a business-impacting issue, the more damage it causes the business.
NS1 Filter Chain enables network teams to automatically route users to alternative application resources until the problem is resolved, without reconfiguring Layer 3-4 routing or physical network devices. When a backend server becomes unavailable, the monitoring solution sends data to NS1 through our API, updating that server's metadata to mark it unavailable. The UP filter checks the availability metadata and removes the IP address from the list of answers. As a result, users are automatically steered to alternatives while IT Operations works to fix the problem. Similarly, when the problem is resolved and the monitoring solution recognizes the resource is back online, NS1's metadata is updated, the resource becomes a viable option again, and user traffic returns to the original routing patterns.
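As a rough sketch of that monitoring-to-DNS feedback loop, a hook could flip the `up` metadata flag on one answer and push the updated record to NS1's REST API. The endpoint path and header are based on NS1's public API, but the exact payload shape here is an assumption; consult the NS1 API documentation (in practice, NS1 data feeds are the preferred mechanism for high-frequency updates).

```python
import json
import urllib.request

NS1_API = "https://api.nsone.net/v1"
API_KEY = "YOUR_NS1_API_KEY"  # placeholder; supply a real key

def mark_answer(record, ip, up):
    """Set the 'up' metadata flag on the matching answer in a record payload."""
    for answer in record["answers"]:
        if answer["answer"] == [ip]:
            answer.setdefault("meta", {})["up"] = up
    return record

def push_record(zone, domain, rtype, record):
    """POST the updated record back to NS1 (payload shape is an assumption)."""
    req = urllib.request.Request(
        f"{NS1_API}/zones/{zone}/{domain}/{rtype}",
        data=json.dumps(record).encode(),
        headers={"X-NSONE-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

record = {
    "answers": [
        {"answer": ["203.0.113.10"], "meta": {"up": True}},
        {"answer": ["203.0.113.20"], "meta": {"up": True}},
    ]
}
# Monitoring detected a failure on .20: mark it down locally, then push.
mark_answer(record, "203.0.113.20", up=False)
# push_record("example.com", "www.example.com", "A", record)  # needs a valid key
```

When the monitor sees the server recover, the same hook sets `up` back to `True` and traffic returns to its original pattern.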
From a user perspective, the application is still available, the business impact is minimized, and IT Operations has breathing room to resolve the issue.
2) Prevent issues proactively with global load balancing
We’ve seen how the UP filter can react quickly to infrastructure issues; now let’s look at how we can proactively prevent both availability and performance issues with our global load balancing capabilities.
The ability to intelligently balance load across multiple data centers or colocation facilities is particularly important if you have hard limits on how much capacity a specific location can handle. For example, if you have 10,000 users and experience a sudden 50% increase, having an efficient way to route those new users to data centers with spare capacity can prevent overloads and performance brownouts that negatively impact the business. Essentially, the closer a location gets to its high watermark, the less traffic you want to send there – in other words, you shed the load to other locations.
We do this by bringing in live capacity data (such as the number of active connections) through our APIs and integrations, and configuring high and low watermarks in the Filter Chain. Setting the high watermark at 100 connections tells our system to stop serving new traffic to that facility entirely. Setting the low watermark at 80 connections tells our system to start reducing the amount of new traffic being sent. As the monitored number of connections creeps up toward 100, more and more traffic is sent to other locations. Conversely, as the connections slide downward, less and less traffic is sent elsewhere.
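The watermark behavior described above amounts to a ramp between the two thresholds. This is a minimal sketch of one plausible interpolation (a linear ramp); NS1's actual shed-load algorithm may differ.

```python
def shed_fraction(connections, low, high):
    """Fraction of new traffic to shed to other locations, given the
    current connection count and the low/high watermarks (linear ramp).
    """
    if connections <= low:
        return 0.0  # under the low watermark: accept all new traffic
    if connections >= high:
        return 1.0  # at or over the high watermark: stop new traffic
    return (connections - low) / (high - low)

# Using the watermarks from the text (low=80, high=100):
shed_fraction(70, 80, 100)   # 0.0  - plenty of headroom
shed_fraction(90, 80, 100)   # 0.5  - halfway between watermarks
shed_fraction(100, 80, 100)  # 1.0  - fully shed to other locations
```

The same function run in reverse explains the recovery behavior: as connections fall back below the low watermark, the shed fraction returns to zero and the location resumes taking its full share.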
3) Mitigate operational risk with blue/green deployments
Another way to improve operational efficiency is to take the risk out of migrating traffic between two environments. Instead of having a singular cut-over date where all users are directed to the new environment, you can ramp up the percentage of traffic being directed to the new environment in stages. This capability is useful in a variety of scenarios:
- Rolling out a new version of software or service
- Migrating an existing application to a new environment (e.g., from on-premise to cloud)
- Canary testing software changes in production environments, usually with a small group of users who are unaware of the testing
With Filter Chain, we can define which subnets, geographies, or networks are steered to the new environment and what percentage of traffic goes where. To ensure that subsequent requests from users do not randomly flip between the blue and green environments, we’ve added “stickiness” to the filter logic. Once the filter transitions a user to the green environment, their subsequent DNS queries will always be steered to the green environment. This ensures a consistent application experience for users during the transition.
Risk is mitigated because any problems with the new environment will affect a much smaller percentage of the user base. Once you’ve ensured that everything functions properly, transitioning the remaining traffic in stages is easy because you have an orchestration system in place. Essentially, you can drain the traffic from one location to another by changing the weights over time.
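One common way to get both weighting and stickiness is deterministic hashing of the client identifier. The sketch below is an assumption about how such logic could work, not NS1's implementation: hashing a client subnet yields a stable bucket, so the same client always receives the same answer for a given weight, and raising the weight only moves clients from blue to green, never back.

```python
import hashlib

def choose_environment(client_subnet, green_percent):
    """Deterministically map a client subnet to 'blue' or 'green'.

    Hashing the subnet gives each client a stable bucket in [0, 100),
    so repeated queries from the same client always land in the same
    environment ("stickiness"), while green_percent controls what share
    of clients see the green environment.
    """
    digest = hashlib.sha256(client_subnet.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "green" if bucket < green_percent else "blue"

# Same subnet, same weight -> same answer every time.
choose_environment("198.51.100.0/24", 10)
# Raising green_percent over time drains traffic from blue to green
# without ever flipping an already-migrated client back to blue.
```

Because a client's bucket is fixed, stepping the weight from 10 to 50 to 100 performs the staged drain described above: each increase migrates a new slice of clients while everyone already on green stays there.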
4) Avoid application availability issues with fast DNS propagation
The more ephemeral application and infrastructure resources become, the more important it is from an operational perspective to have DNS records accurately reflect the changes created by automated deployment and auto-scaling. If a DNS server half a world away hasn’t been updated to reflect that service B has been deployed on IP addresses released by the removal of service A, all sorts of application problems will occur. From a user’s perspective, the application is unavailable, and the transient nature of DNS propagation issues can result in a lot of wasted troubleshooting time and effort.
NS1’s fast DNS propagation ensures that changes are globally visible within seconds – not minutes or hours. This is something that legacy DNS/DHCP/IP Address Management appliances still struggle to do.
5) Get up to speed quickly with a single API
Our platform codebase is the same across our managed and enterprise solutions, and the API operates the same way across both public-facing and internal networks and applications. Operational skills are therefore readily transferable – there’s no need to learn three or four different vendors’ APIs to support your entire environment.
The efficiency gains from applying smart traffic steering capabilities can add up to a significant level of success for enterprise network and application teams. With NS1 Enterprise DDI, the same operational capabilities and techniques leveraged by cloud-scale application providers can be readily applied to enterprise networks and internal applications. By modernizing your DNS, DHCP, and IP address management technology, we help you do more by working smarter.