In our digital-first world, downtime or outages of your business-critical applications can severely impact your business. According to Gartner, 98% of companies reported that a single hour of downtime costs over $100,000; one-third reported that an hour costs their business $1 million to $5 million. WarnerMedia, for example, recently had to delay the virtual premiere of the hotly anticipated Zack Snyder’s Justice League when the underlying platform supporting the premiere experienced an outage, a tough blow for an industry already struggling to adapt to virtual events.

So how do you protect your business from the adverse effects of downtime or outages within your critical infrastructure? In our experience, some common causes of an outage can be mitigated by:

Building redundancy into critical infrastructure
Automating failover mechanisms
Approaching updates and deployments strategically

Build redundancy into critical infrastructure

Given the complexity of today’s application environments and the demand for faster innovation and deployment, it’s not so much a question of if you’ll experience downtime or an outage, but when. That’s why it is critical to build redundancy into your most critical infrastructure, so when issues arise, you have backup options. For example, implement a multi-CDN strategy, redundant DNS providers, multiple cloud providers, and so on.

Think back to the 2016 Dyn outage as an example. Dyn fell victim to a massive DDoS attack, which took down some of the most heavily trafficked applications in the world with it. The lessons of the 2016 outage didn’t seem to stick; recently, Carnegie Mellon researchers found that 84.8% of the top 100,000 websites still relied upon a single DNS provider, leaving them vulnerable to another outage. Many still rely upon a single cloud provider as well, as evidenced by the number of sites that went down when AWS experienced an outage last fall, and when Microsoft Azure experienced an outage a few weeks ago.

With backup options in place like redundant DNS or multi-cloud, you can steer away from the affected provider, and minimize the impact to your business.
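
As a rough illustration of the single-provider risk described above, the sketch below checks whether a list of nameserver hostnames spans more than one DNS provider. It is a minimal sketch under stated assumptions: the hostnames are hypothetical, and grouping by the last two DNS labels is only a heuristic that will misclassify some domains (for example, those under multi-part suffixes such as co.uk).

```python
from collections import defaultdict


def providers_for(nameservers):
    """Group NS hostnames by a rough provider key.

    Assumption: the last two DNS labels identify the provider. This is a
    heuristic for illustration, not a reliable way to identify vendors.
    """
    groups = defaultdict(list)
    for ns in nameservers:
        labels = ns.rstrip(".").lower().split(".")
        groups[".".join(labels[-2:])].append(ns)
    return dict(groups)


def has_redundant_dns(nameservers):
    """Return True if the nameservers appear to span two or more providers."""
    return len(providers_for(nameservers)) >= 2


# Hypothetical nameserver sets for two zones.
single_provider = ["ns1.provider-a.example.", "ns2.provider-a.example."]
dual_provider = ["ns1.provider-a.example.", "ns1.provider-b.example."]

print(has_redundant_dns(single_provider))
print(has_redundant_dns(dual_provider))
```

In practice the hostname list would come from the NS records of your own zone; how you fetch them depends on your DNS tooling and is left out here.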

Automate failover mechanisms

To get the full benefits of building out redundant infrastructure, look for a global traffic management solution with automated failover and disaster recovery mechanisms. This is what enables you to automatically redirect traffic seamlessly to a healthy endpoint if another is experiencing issues. Your application delivery environment can change quickly, and without automated routing policies you are still vulnerable to outages or downtime.

For example, at the beginning of 2021 the East Coast of the U.S. suffered a major internet outage. The culprit? A Verizon fiber cable was damaged in Brooklyn, affecting companies with nearby application endpoints. While most services were back online within a few hours, the outage hit in the middle of the workday, heavily disrupting remote work and school for people throughout the Northeast. With automated failover in place, traffic bound for affected endpoints can be redirected to healthy ones elsewhere instead of waiting for a repair.

Most global traffic management solutions enable you to direct traffic across your infrastructure; however, a solution that includes automated and intelligent traffic steering capabilities will significantly improve performance and reliability while helping to control cost. Look for an API-first solution that integrates with leading observability and monitoring tools such as AppDynamics, ThousandEyes, or Datadog. This will allow you to detect outages or downtime in real time, and in turn leverage smart failover to healthier resources. This results in better availability and more uptime, and keeps your business online even during outages.
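
To make the idea of automated failover concrete, here is a minimal sketch of the decision logic: health-check results feed a failure counter, and traffic goes to the most preferred endpoint that is still considered healthy. Everything in it is an assumption for illustration, including the endpoint names, the priority scheme, and the threshold of three consecutive failures; a real traffic management solution would expose its own policies and APIs.

```python
from dataclasses import dataclass

# Assumed value: consecutive failed checks before an endpoint is marked down.
FAILURE_THRESHOLD = 3


@dataclass
class Endpoint:
    name: str
    priority: int      # lower number = preferred
    failures: int = 0  # consecutive failed health checks


def record_check(endpoint, ok):
    """Update the failure counter with the latest health-check result."""
    endpoint.failures = 0 if ok else endpoint.failures + 1


def is_healthy(endpoint):
    return endpoint.failures < FAILURE_THRESHOLD


def pick_endpoint(endpoints):
    """Return the preferred healthy endpoint.

    If every endpoint looks unhealthy, fall back to the full list rather
    than returning nothing, since the monitoring itself may be at fault.
    """
    healthy = [e for e in endpoints if is_healthy(e)]
    candidates = healthy or endpoints
    return min(candidates, key=lambda e: e.priority)


# Hypothetical endpoints in two regions.
primary = Endpoint("us-east", priority=0)
backup = Endpoint("us-west", priority=1)
pool = [primary, backup]

for _ in range(FAILURE_THRESHOLD):
    record_check(primary, ok=False)  # simulated failed probes
print(pick_endpoint(pool).name)      # traffic shifts to the backup

record_check(primary, ok=True)       # primary recovers
print(pick_endpoint(pool).name)      # traffic returns to the primary
```

Requiring several consecutive failures before failing over is one way to avoid flapping on a single lost probe; the right threshold depends on how often checks run and how quickly you need to react.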

How Optimizely Ensures Reliability

Learn how Optimizely, a digital experience platform, ensures a consistently reliable end-user experience for their customers across the globe.

Minimize potential issues in routine maintenance and deployments

Another common source of outages and downtime is issues with routine maintenance or deployments. In some cases, this is due to human error. Minimize the potential for issues by automating routine tasks and maintenance. Look for API-first solutions that enable automation and orchestration through CI/CD tools and automation workflows.

Other times, unforeseen issues arise during deployment of upgrades to infrastructure. For example, Cloudflare experienced an outage back in 2019 that they attributed to a bad software deployment. Phased rollouts of changes can mitigate the impact of issues with a deployment. That way, if there is an issue with a deployment, you can roll the change back before it affects your entire user base. With the right tools in place, such as an intelligent, automated traffic management solution, you can ramp up changes gradually and realize blue-green deployments.
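
A phased rollout of the kind described above can be sketched as two small pieces: a stable way to decide which users see the new version at a given percentage, and a rule for advancing or rolling back the percentage based on observed errors. The ramp steps, the 2% error threshold, and the user IDs below are assumptions chosen for illustration, not recommended values.

```python
import hashlib

# Assumed rollout schedule (percent of users) and rollback threshold.
RAMP_STEPS = [1, 5, 25, 50, 100]
MAX_ERROR_RATE = 0.02


def bucket(user_id):
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def serves_new_version(user_id, percent):
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id) < percent


def next_percent(current, error_rate):
    """Advance to the next ramp step, or roll back to 0 on high errors."""
    if error_rate > MAX_ERROR_RATE:
        return 0
    for step in RAMP_STEPS:
        if step > current:
            return step
    return current  # already fully rolled out


print(next_percent(5, error_rate=0.001))   # healthy: advance to the next step
print(next_percent(25, error_rate=0.10))   # unhealthy: roll back
print(serves_new_version("user-123", 100)) # everyone is included at 100%
```

Hashing the user ID keeps each user in the same bucket across requests, so a user who sees the new version at 5% keeps seeing it as the rollout widens.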

The likelihood of your business being affected by an outage remains high: a recent ThousandEyes report found that there were 388 global outage events between March 7th and 12th alone, with 30% of those occurring during business hours. There are a variety of reasons your company could suffer an outage, such as DDoS attacks, infrastructure issues, and more. With businesses increasingly operating as “digital-first”, it’s more important than ever to build resilience into your critical IT infrastructure to mitigate the impact of an outage.
