Every catastrophic website failure of the past few years features similar apology language.
Here’s Ticketmaster after a rush to buy Taylor Swift tickets crashed their “Verified Fan” site:
Never before has a Verified Fan on sale sparked so much attention – or uninvited volume. This disrupted the predictability and reliability that is the hallmark of our Verified Fan platform.
Here’s the Treasury Department’s response when mass purchases of I-series bonds led to a site crash:
We are currently experiencing unprecedented requests for new accounts and purchases of I Bonds…We continue to balance these efforts with our commitment to the overall integrity of the 20-year-old system and protecting the personal identity and financial assets of our customers.
Here’s U.S. CTO Todd Park after heavy demand sank the healthcare.gov launch in 2013:
These bugs were functions of volume. Take away the volume, and it works.
Don’t blame us! Nobody could have planned for this much internet traffic!
That excuse might have worked the first couple of times a site went down due to high volume from a single incident. But in this day and age, shouldn’t we expect huge surges of internet traffic from genuinely popular things?
This comment from Maria Catalano, a 72-year-old Massachusetts resident who found herself unable to sign up for a COVID vaccine due to high website volume, pretty much sums it up:
Anyone with any knowledge would know that when you put a million people on one site, one day, it is going to crash. Didn’t they think about that ahead of time?
Sudden spikes in internet traffic are no longer surprising. What’s surprising is that businesses and government agencies continue to be caught off guard, particularly when they know they’re supporting something with mass appeal.
Fear of a catastrophic DDoS attack was supposed to be motivation enough for network teams to add capacity to their DNS infrastructure and prioritize resilient, redundant system architectures. Those are justified fears - DDoS attacks continue to rise, posing a significant threat to websites everywhere.
Yet, the steady stream of network outages from legitimate traffic shows that many key websites don’t have the strategic depth they need. If a site can’t handle an expected increase in regular visitors, how can we expect it to handle an unexpected attack from bots and malicious actors?
How much capacity is enough?
Of course, there’s a flip side to this argument that deserves consideration. How much capacity is enough? Or, more to the point, how much idle capacity is worth paying for? It’s all well and good to say, “Company X should have known better,” but the network team at company X also has to justify the budget for resilient capabilities they may never use.
It all comes down to risk tolerance. Plenty of enterprises have made the decision that they would rather risk a catastrophic downtime event than pay for resilient, redundant capacity.
In an ideal world, these decisions would be guided by data. According to a recent EMA study, each minute of downtime costs an average of $12,000 - an hourly rate of $720,000. Per the Uptime Institute, over 60% of outages cost more than $100,000 in 2022 - a 40% increase over two years earlier.
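To put those figures in perspective, here’s a quick back-of-the-envelope calculation using the cost estimate cited above (the per-minute rate is the EMA average, not a universal constant - plug in your own numbers):

```python
# Back-of-the-envelope downtime cost, using the EMA average cited above.
COST_PER_MINUTE = 12_000  # USD per minute of downtime (EMA study average)

def downtime_cost(minutes):
    """Estimated cost of an outage lasting the given number of minutes."""
    return COST_PER_MINUTE * minutes

print(downtime_cost(60))   # a one-hour outage: 720000
print(downtime_cost(240))  # a four-hour incident: 2880000
```

Even a modest incident dwarfs the annual cost of a secondary DNS contract, which is the comparison that should anchor the risk-tolerance discussion.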
Unfortunately, the decision to invest in resilient capacity is rarely made after a concrete, strategic, pros-versus-cons discussion. It’s usually made by default when IT investment priorities are set. Resilience is rarely sexy. It’s easy to cut when other urgent operational priorities are right in front of you.
The blast radius of a downtime event is rarely considered when risk tolerance decisions are made by default. IT managers and finance people are usually the ones cutting resilience or capacity budgets, but a large-scale downtime incident impacts other departments disproportionately. From marketing and PR to senior leaders and operations managers, a lot of people can get called on the carpet when something goes wrong.
The value of resilient DNS
It’s tempting to think of resilient or redundant systems as capacity that sits unused until absolutely necessary - just chewing up money until an incident happens. But in the case of DNS, that usually isn’t true.
Secondary DNS systems are often put into active service right alongside a primary provider. Whether it’s load balancing, geographical distribution, or taking advantage of traffic steering functionality, there are plenty of reasons to split DNS traffic between two (or more!) providers. Having more than one active DNS system is about more than just resilience - it’s also about using “best of breed” features that may differ from vendor to vendor.
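The split-traffic pattern above can be sketched in a few lines. This is a minimal illustration, not a real resolver: the provider objects are hypothetical stand-ins for queries against each vendor’s nameservers (in practice you’d use a DNS library), and the placeholder answer is a TEST-NET address. The point is the shape of the logic - spread queries across providers in normal operation, and fall through to the next provider when one fails:

```python
import random

class ProviderDown(Exception):
    """Simulates a provider outage (timeout, SERVFAIL, etc.)."""

def make_provider(name, healthy=True):
    """Hypothetical per-provider lookup function for illustration only."""
    def query(qname):
        if not healthy:
            raise ProviderDown(f"{name} unreachable")
        return "192.0.2.1"  # placeholder answer (TEST-NET-1 address)
    return query

def resolve(qname, providers):
    """Try providers in random order; fall back to the next on failure.

    Randomizing spreads load across active providers day to day, while
    the fallback loop supplies resilience when one provider is down.
    """
    order = list(providers)
    random.shuffle(order)
    last_err = None
    for query in order:
        try:
            return query(qname)
        except ProviderDown as err:
            last_err = err
    raise last_err  # every provider failed

# Usage: primary provider down, secondary still answers.
providers = [make_provider("primary", healthy=False),
             make_provider("secondary")]
print(resolve("www.example.com", providers))  # 192.0.2.1
```

Real multi-provider setups push this logic down into the delegation itself - NS records at the registrar point at both vendors, so recursive resolvers do the retrying - but the failure-handling shape is the same.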
On the financial side of the equation, with today’s cloud-based, pay-as-you-go sales models (like the one NS1 uses), you’re never paying for truly idle capacity. If a large-scale event happens, you leverage the enormous depth and pooled resources of a large vendor without having to purchase those resources yourself. You’ll pay to service all those inbound requests, but at least your system won’t crash. It’ll keep answering queries (and generating revenue), so you can weather the storm.
There really aren’t any excuses left for DNS-related downtime. Whether it’s a DDoS attack or a spike in legitimate traffic, every network and IT team should be prepared for unusual activity. It’s not a question of if but when. History shows us time and again that spikes in traffic can come from many sources - both expected and unexpected.
When those traffic spikes come, what will you have to say? Will you be apologizing for your lack of capacity? Or will you simply be explaining how your resilient systems came out on top?
Learn more about NS1 Dedicated DNS - the ultimate option for DNS resilience.