Chaos engineering, the practice of intentionally introducing problems to identify points of failure in systems, has become an important component in delivering high-performing, resilient enterprise applications. Intentionally injecting “chaos” into controlled production environments can reveal system weaknesses and enable engineering teams to better predict and proactively mitigate problems before they present a significant business impact.
As a provider of DNS and traffic management solutions, NS1 has exceptionally high standards for performance and uptime. We support application delivery for many of the most highly trafficked sites in the world. Customers rely on our solutions to drive revenue and operations, so it is imperative that we consistently stress test our systems to ensure they deliver. Our work, both internally and in collaboration with site reliability engineers (SREs) and application delivery teams at customer organizations, has demonstrated that chaos engineering is an important use case for traffic management.
The Modern Traffic Stack & Points of Leverage
Although the exact infrastructure varies, we have recognized clear patterns and common characteristics in a modern traffic stack. Our customers are working in highly distributed and complex environments with multiple clouds or CDNs, often with an edge touchpoint, and they are using containers or private cloud orchestration technology. We have found that these modern environments all have points of leverage where traffic automation and intelligent steering can have a profound impact on performance and resiliency. Examples of these leverage points include DNS lookup, edge termination and traffic handling, origin termination and traffic handling, service mesh or internal load balancing, and origin egress.
At NS1, we work with our customers to apply logic that steers or manipulates traffic based on business policies, driven by real-time data and telemetry. We are essentially “pulling levers” to steer traffic in order to boost performance, control costs or route around problems to avoid downtime. Chaos is another application of the leverage exposed by NS1’s traffic management tools: pulling DNS, internal, egress and other levers to create potential failures as a way to test and observe the impact on systems. For example, this could mean shifting traffic away from a Kubernetes cluster to confirm that the application remains functional, or testing how a system routes around failures or responds to DDoS attacks. Injecting chaos at the global steering level helps stress test globally distributed systems and expose macro-level failure modes. As traffic and SRE teams continue to vertically integrate their traffic stacks, combining global chaos with chaos injected at the service mesh layer will help teams improve the end-to-end resiliency of application delivery.
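To make the idea of "pulling a lever" concrete, here is a minimal sketch of a chaos experiment at the steering layer, written as a local simulation. It assumes a simple weighted steering policy; the endpoint names, the weights, and the functions drain, pick_endpoint and simulate are hypothetical illustrations and are not part of any NS1 API. In a real experiment the weights would live in the traffic management platform, and the observations would come from production telemetry rather than a random number generator.

```python
import random
from collections import Counter


def drain(weights, target):
    """Return a copy of the steering weights with one endpoint set to zero.

    This models the chaos "lever": traffic is shifted away from the target
    while the original policy is left untouched so it can be restored.
    """
    if target not in weights:
        raise KeyError(f"unknown endpoint: {target}")
    drained = dict(weights)
    drained[target] = 0
    return drained


def pick_endpoint(weights, rng):
    """Choose an endpoint at random, in proportion to its positive weight."""
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise RuntimeError("no endpoints left to receive traffic")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]


def simulate(weights, requests, seed=0):
    """Count how many simulated requests each endpoint would receive."""
    rng = random.Random(seed)
    return Counter(pick_endpoint(weights, rng) for _ in range(requests))


if __name__ == "__main__":
    # Hypothetical endpoints and weights, chosen only for illustration.
    policy = {"k8s-us-east": 50, "k8s-eu-west": 30, "cdn-fallback": 20}

    print("baseline:", dict(simulate(policy, 10_000)))

    # Chaos experiment: shift all traffic away from one cluster.
    experiment = drain(policy, "k8s-us-east")
    print("drained: ", dict(simulate(experiment, 10_000)))
```

Comparing the two distributions shows how much additional load the remaining endpoints would need to absorb, which is the kind of question a drain experiment is meant to answer before a real outage forces it.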
Recently, I had the privilege of speaking about modern, vertically integrated application traffic management and the role of the traffic stack in chaos engineering at the Chaos Community Day event, organized by my friend and respected industry peer Casey Rosenthal. The speaker lineup was phenomenal, including visionaries Nora Jones and Kris Nova. The NS1 team published a summary that includes some of the highlights. We look forward to continued discussions within the community about chaos engineering and how teams are using this practice to build more resilient systems. To learn more about NS1’s role, please reach out to us.