The NS1 operations team runs the robust network that powers our Managed DNS product line. We’re the ones who operate the twenty-six global points of presence which answer DNS queries from some of the world’s largest and most consequential companies. Given the critical nature of DNS to all of NS1’s customers (and by extension, the internet at large), we’re obsessed with resilience and uptime.
Delivering a reliable network through the inevitable system upgrades, platform changes, and sheer volume of growth that NS1 experiences in a typical year requires a laser focus on our network architecture. We have to make sure that any network change - however minor - won’t introduce any risk to our ability to deliver.
NS1’s NetBox journey
Back in 2019, the operations team realized that basic information about the NS1 network was scattered across a bunch of different tools. We used many different sources of information to describe our network architecture - Ansible inventories, DNS entries, waitron configs, Ubersmith, Terraform plans, AWX, Google Sheets, Confluence documents, and more. None of these information silos were connected to one another. We lacked a single network source of truth to provide relevant information about all our machines and devices - things like MAC addresses, locations, types, interfaces, spares, elevation, and how they were connected.
We found that this balkanized view of the network ended up creating a lot of work for the team. That extra work took two forms:
First, we found ourselves manually filling in the same baseline information about our network into various management tools. Whenever we wanted to make a change, the tool would ask, “what existing configuration do you want to change?”
Second, we were constantly juggling integrations between management tools and our various data repositories. For every global change, we had to make calls to data holdings that were scattered across different places, often in different formats. Managing all of those integrations was a huge pain in the neck.
What we wanted to do was automate our network management processes. Instead of manually entering network information over and over again, we wanted network tools to automatically populate the relevant data from a single place. Instead of building and managing integrations between network tools and multiple data stores (with varying data formats), we wanted everything to flow from one place, in the same format.
What we wanted
At a basic level, we needed a consolidated network source of truth - a data source that could provide authoritative information about every element of the network and the connections between those elements. We needed something that would document every piece of hardware - servers, switches, racks, power supplies, and more. We also wanted to document virtual assets like IP addresses.
This network source of truth had to present information about the network in a format that any tool could digest without manual effort. That meant a robust API.
We also wanted an easy way to populate the system with network information. That meant using an API to call on all our legacy data stores to scrape information.
The best things in life are free. Ideally, we wanted to use an open-source tool as our source of truth. Or at the very least, we didn’t want to pay too much.
Then there was a grab bag of smaller features that we wanted: single sign-on, change tracking with histories, tagging, read/write access controls, TLS/SSL, IPv4 and IPv6 support, and a pleasing user interface.
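To make the API requirement concrete, here is a minimal Python sketch of what "any tool can digest it" looks like against NetBox's REST API. It builds a request for the devices at one site and reads device names out of the paginated JSON that NetBox returns. The URL, token, site slug, and function names are placeholders invented for illustration, not NS1's actual setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: replace with a real NetBox URL and API token.
NETBOX_URL = "https://netbox.example.com"
API_TOKEN = "0123456789abcdef"


def build_device_query(site_slug: str, limit: int = 50) -> Request:
    """Build (but do not send) a GET request listing the devices at one site."""
    query = urlencode({"site": site_slug, "limit": limit})
    return Request(
        f"{NETBOX_URL}/api/dcim/devices/?{query}",
        headers={
            "Authorization": f"Token {API_TOKEN}",
            "Accept": "application/json",
        },
        method="GET",
    )


def device_names(response_body: str) -> list[str]:
    """Pull device names out of a NetBox-style paginated JSON response."""
    return [device["name"] for device in json.loads(response_body)["results"]]
```

Passing the request to urllib.request.urlopen would send it. Every consumer reads the same "results" list in the same format, which is the property we were after.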
Comparing available tools
When we started doing market research, NetBox jumped out at us as the clear choice. It’s free and open source. It’s widely used, with a large and active community. It has a well-documented API with lots of functionality built-in. It documents everything we wanted to document, from data center hardware to IP addresses.
NS1 was already using Ubersmith for a wide variety of billing and data center management functions. We considered extending the data center management side to become our source of truth. We quickly found, however, that doing so would have ripple effects on our billing systems. We decided that it would be better to decouple billing functions from our larger network automation drive. Ubersmith is also a paid service - we wanted something cheaper (or free).
We also considered using Ansible inventories as our source of truth. On the surface, it made sense - we were already using Ansible as an automation framework in certain areas, so it seemed natural for those automation tasks to draw from a native data repository. Ansible is also free and open source. Yet when we looked into the operational side of things, it turned out that Ansible wasn’t the best fit. For example:
We found that Ansible inventories require constant updates - a process that was surprisingly manual.
We discovered that Ansible isn’t always an authoritative source of truth. When several work streams are happening in parallel, Ansible opens up multiple pathways to update the inventory. This can lead to both data conflicts (where competing work streams have to be reconciled) and gaps (where parallel efforts don’t lead to a comprehensive picture).
Finally, queries against Ansible inventories introduce a lot of latency into network operations. We found that NetBox lookups take roughly half the time.
How NS1 uses NetBox
Once we decided to go with NetBox, it only took a few weeks to scrape our data stores, validate all our information, and build out the connections to our various network tools. Now just about all of our operational network infrastructure is documented in NetBox. There are a few backup hardware elements (most of which are sitting in drawers and not plugged in) that we didn’t add to the inventory. We also found a few low-level IP assets that weren’t worth the trouble of documenting in the NetBox IPAM system. Everything else is in there.
With NetBox in place as our network source of truth, we’ve been able to scale our network operations and accomplish more in less time through the use of automation. All of that manual input of network data has vanished. Our ongoing integration work is minimal - just occasional API connections between new network tools and our NetBox repository.
Here are a few examples of how we use NetBox to manage the NS1 network through automation:
NS1’s peering API consumes switch and router information directly from NetBox.
We use NetBox to manage all of our IP allocations and assignments.
We’ve also built a Terraform script to document hardware and virtual asset provisioning in NetBox. Whenever a new device or virtual asset is added to the network, Terraform automatically adds the IP address to NetBox through an API call.
We perform periodic automated data scrapes of our infrastructure providers for comparison against what we have documented in NetBox. This ensures that resources provisioned outside of normal workflows are also reflected in our source of truth.
Terraform automatically tags devices or groups of devices for use in Ansible scripts. Those tags are also documented in NetBox as device features, allowing us to automate actions based on groups of devices that share the same tag.
Ansible generates templates for device configurations and uses playbooks to push those configurations out across the network. Before those configurations are deployed, NetBox checks them for accuracy against the actual state of the network. This ensures that every configuration matches reality, preventing errors that could lead to downtime.
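The periodic comparison between provider scrapes and NetBox described above comes down to a diff between two inventories. The Python sketch below assumes both sides have already been reduced to a hostname-to-primary-IP mapping. The Drift class, the find_drift function, and the sample data are hypothetical and only illustrate the idea, not our production tooling.

```python
from dataclasses import dataclass


@dataclass
class Drift:
    missing_from_netbox: set[str]            # seen at the provider, not documented
    missing_at_provider: set[str]            # documented, but not seen at the provider
    mismatched: dict[str, tuple[str, str]]   # name -> (NetBox value, provider value)


def find_drift(netbox: dict[str, str], provider: dict[str, str]) -> Drift:
    """Compare hostname -> primary IP maps from NetBox and from a provider scrape."""
    shared = netbox.keys() & provider.keys()
    return Drift(
        missing_from_netbox=provider.keys() - netbox.keys(),
        missing_at_provider=netbox.keys() - provider.keys(),
        mismatched={
            name: (netbox[name], provider[name])
            for name in sorted(shared)
            if netbox[name] != provider[name]
        },
    )


# Made-up sample data using documentation-range addresses.
netbox_view = {"dns01": "192.0.2.10", "dns02": "192.0.2.11", "dns03": "192.0.2.12"}
provider_view = {"dns01": "192.0.2.10", "dns02": "192.0.2.99", "dns04": "192.0.2.13"}
report = find_drift(netbox_view, provider_view)
```

Anything in missing_from_netbox is a resource that was provisioned outside the normal workflow and still needs to be documented.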
We’re currently working on a new use case around device availability. Using NetBox-documented server information, we’re going to run scripts to discover which devices have a BGP-ready flag. Servers that carry that flag in NetBox will be automatically pushed into service through a provisioning procedure.
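As a rough illustration of that discovery step, the Python sketch below filters NetBox-style device records for a BGP-ready flag. It assumes the flag is stored as a boolean custom field named bgp_ready and that devices already in service have the status "active". Both are assumptions made for the example, as are the function name and sample records.

```python
def ready_for_service(devices: list[dict]) -> list[str]:
    """Return names of devices flagged BGP-ready that are not yet active."""
    selected = []
    for device in devices:
        flags = device.get("custom_fields") or {}
        status = (device.get("status") or {}).get("value")
        # Only an explicit True counts; a missing or null flag is treated as not ready.
        if flags.get("bgp_ready") is True and status != "active":
            selected.append(device["name"])
    return sorted(selected)


# Made-up records shaped like entries in a NetBox device list response.
sample_devices = [
    {"name": "dns03", "status": {"value": "staged"}, "custom_fields": {"bgp_ready": True}},
    {"name": "dns01", "status": {"value": "active"}, "custom_fields": {"bgp_ready": True}},
    {"name": "dns02", "status": {"value": "staged"}, "custom_fields": {"bgp_ready": False}},
    {"name": "dns04", "status": {"value": "planned"}, "custom_fields": None},
]
candidates = ready_for_service(sample_devices)
```

The resulting list would then be handed to whatever provisioning procedure brings those servers into service.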
The value of NetBox
We see the day-to-day value of NetBox every time we run an automated script in Terraform or Ansible. Manually inputting network information, managing a web of integrations - we don’t miss any of that. When you add up the time we used to spend on all that drudgery, it probably amounts to hundreds of hours saved every week.
We’re also far less concerned with fat-finger errors and gaps in deployments - NetBox checks everything and ensures that whatever we put into production is error-free. On a network that’s critical to thousands of businesses and covers this much of the internet, the value of that kind of security is difficult to overestimate.
At a strategic level, NetBox is the cornerstone of our network automation drive. We’ve built a more responsive, reliable, secure network on the foundation of NetBox’s source of truth. Our internal developers and engineers rely on network automation to increase the velocity of their own efforts, which lets NS1 develop and release products faster and gives us a long-term advantage over the competition.