At NS1, we’re big fans of Catchpoint, the performance monitoring company that’s also one of our awesome customers. On the one hand, we deliver global DNS and traffic management for Catchpoint’s products. On the other, we lean heavily on their globally distributed monitoring nodes to understand the behavior of our own systems, especially when it comes to tuning our hyper-optimized anycasted Managed DNS network. Catchpoint’s nodes show us how networks around the world are reaching our anycasted IP space, which helps us pinpoint localized BGP issues, identify problem carriers or providers, and plan performance improvements – constantly, as the data flows in.
And “constantly” is a good description for how we like to consume our data. NS1’s entire platform is built on the idea that data drives better decisions in real time. That notion goes beyond the capabilities we offer our customers to drive DNS routing, and into how we operate our own platform. Every service, server, subsystem, and tool – every component of NS1’s infrastructure and operation – is instrumented, measured, monitored, and analyzed constantly, and we’re always thinking about what to instrument and measure next.
To keep an eye on things, we use several tools to collect and gain insight into data about our platform’s performance. We combine external tools like Catchpoint with heaps of internal technology we’ve developed to match the scale and breadth of our infrastructure.
One of the systems that’s critical to our operational mentality is OpenTSDB, a powerful open source time series database deployed atop Hadoop’s HBase. OpenTSDB is a repository for literally millions of NS1’s metrics, from system, server, and network telemetry to deep DNS traffic analytics. Our stack uses OpenTSDB as the store for customer-facing metrics exposed through our APIs and UIs, for internal operational dashboards (which we usually build with Grafana), for high-frequency pattern matching and alerting (often crafted with Bosun), and for many other applications.
Alerting in particular is a fascinating area in a system as distributed as NS1’s. Alert too often, and you’re drowned in a deluge of noisy network flaps: in a platform at global scale, we see lots of false positives, and even when an issue is real, our systems mostly automate around hiccups. Alert too rarely, though, and you miss the problems that matter. We’re in the mission-critical path for our customers and failure is not an option, so we can’t ignore real potential issues. In a system like ours, that makes measuring, monitoring, and alerting on the delivery of our actual service the most important piece of all.
A key path to enabling good alerting on our services, then, is to plug the best data about how effectively we’re delivering DNS globally into the powerful dashboarding and alerting frameworks we’ve built so we can get real-time visibility and notice potential issues instantly. So, we built a tool for that.
It’s nothing big or complicated, but because it might be useful to others, we’ve open sourced our Catchpoint-to-OpenTSDB bridge: just a simple server that listens for data from Catchpoint’s Push API and shoves it into OpenTSDB. Internally, we’ve hooked the raw data into our Grafana dashboarding to get a great global view of performance and reachability issues in our Managed and Dedicated DNS networks, and we’ve also plugged in Bosun to quickly generate alerts on aggregate data by continent, by country, and for specific problematic ISPs.
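To give a feel for how little is involved, here is a minimal Python sketch of the idea. It is not the code we open sourced. The payload fields (`timestamp`, `node`, `country`, `isp`, `metrics`), the `catchpoint.` metric prefix, and the listening port are all hypothetical stand-ins, since the real Push API payload depends on the template you configure in Catchpoint. The only part taken from a real interface is the shape of the data points, which follows OpenTSDB's HTTP `/api/put` endpoint (metric, timestamp, value, tags).

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Assumption: OpenTSDB is reachable locally on its default port.
OPENTSDB_PUT_URL = "http://localhost:4242/api/put"


def clean_tag(value):
    """OpenTSDB tag values allow only letters, digits, '-', '_', '.' and '/'."""
    return re.sub(r"[^A-Za-z0-9\-_./]", "_", str(value))


def to_datapoints(payload):
    """Flatten one (hypothetical) push payload into OpenTSDB data points."""
    tags = {key: clean_tag(payload[key]) for key in ("node", "country", "isp")}
    return [
        {
            "metric": "catchpoint." + name,
            "timestamp": payload["timestamp"],
            "value": value,
            "tags": tags,
        }
        for name, value in sorted(payload["metrics"].items())
    ]


def send_to_opentsdb(datapoints, url=OPENTSDB_PUT_URL):
    """POST a batch of data points to OpenTSDB as a JSON array."""
    body = json.dumps(datapoints).encode("utf-8")
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.status


class PushHandler(BaseHTTPRequestHandler):
    """Accept one pushed JSON document per POST and forward it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            send_to_opentsdb(to_datapoints(payload))
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # malformed or unexpected payload
        except OSError:
            self.send_response(502)  # OpenTSDB unreachable or rejected the write
        else:
            self.send_response(204)
        self.end_headers()


def serve(host="0.0.0.0", port=8080):
    """Run the bridge until interrupted (port 8080 is an arbitrary choice)."""
    HTTPServer((host, port), PushHandler).serve_forever()
```

Calling `serve()` starts the listener. Putting node, country, and ISP into tags rather than into the metric name is what lets a dashboard or alerting rule aggregate the same metric by any of those dimensions later.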