Recent outage underscores the importance of network redundancy and secondary DNS
Robinhood, a popular mobile trading app, experienced a complete outage earlier this week, leaving users unable to access their accounts or make trades. A spokesperson for the company stated that the outage was due to a breakdown of infrastructure that allows systems to communicate with each other. Further reports from the company attributed the outage to high trade volumes triggering a DNS failure.
The impact of a DNS outage is felt across the entire organization and its community of users. When DNS fails, websites, applications and online services become unavailable, bringing operations, revenue and brand reputation down with them.
Robinhood isn’t the first company to experience a DNS outage during a peak traffic time. In November 2019, Amazon experienced a 13-minute outage that cost the company an estimated $2.6 million. That same week, Apple experienced a 10-hour outage that the company attributed to a DNS configuration “blunder”. If a DNS blunder can take out two of the world’s largest and most well-established technology companies, it is no surprise that a similar occurrence could happen to a smaller, albeit well-respected and innovative, app provider.
Robinhood’s growth strategy and subsequent multi-billion-dollar valuation were partially dependent on word-of-mouth marketing and user reviews.
This outage left the company’s 10 million accounts without access to their investments during two volatile trading days, which predictably impacted their customer satisfaction. And while the company has demonstrated a commitment to improving customer experience, the damage may already be done. One user is quoted in the New York Times as saying “For me, the moment they get up I am going to try to get out and switch out to someone else.”
To help stem the loss of irate customers, Robinhood is offering $15 per account to users. Though this seems like a small sum to those who watched their investments lose considerable value this week, it could cost the company as much as $150 million if all 10 million accounts make a claim.
How Does DNS Failure Happen?
In the case of Robinhood, so many people were trying to access the trading application that its DNS became overloaded and was unable to respond to those “requests” for content and data. But outages can also be caused by configuration and software errors, faulty equipment or cyberattacks. Beyond the obvious impact on brand reputation, a DNS outage can result in substantial financial loss, legal implications and, in extreme cases, a push for industry-wide regulation.
Preventing a DNS Disaster
How can other online trading platforms, or any company that depends on uptime and positive user experiences, prepare for spikes in traffic and avoid a catastrophic outage at a critical moment? These companies can implement the following best practices to minimize the risk of downtime and improve resiliency:
- Build in redundancy at the DNS layer: Redundant DNS involves deploying a second DNS network that does not share the same infrastructure (servers, networks and data centers) as the first. This may involve two separate providers or a single provider managing two independent DNS networks under a single pane of glass. DNS redundancy ensures that if one DNS network comes under duress, the other absorbs the queries for the pair so that none go unanswered. Organizations that deploy always-on, redundant DNS networks for their domains can prevent outages and recover much faster.
- Monitor system performance continuously: Monitoring the health and response times of infrastructure and applications is a key part of system resilience. Measuring how long an application’s API call takes or the response time of a core database, for example, can provide early indications of what’s to come and allow IT teams to get in front of these obstacles. To increase the success of these programs, companies should define SLAs for different sub-applications and systems, and then monitor to ensure they remain in line.
- Leverage modern, anycast DNS: “Cloud-first” computing environments require modern DNS with the speed and flexibility to scale with infrastructure in response to demand, and an API-first architecture that supports integrations to automate infrastructure management for improved resiliency. DNS should also leverage a resilient, anycast network so that DNS requests are dynamically diverted to an available server when there are global connectivity issues.
- Automate DNS management processes: Companies can reduce manual errors and improve resiliency by automating DNS management and embedding intelligent decision-making capabilities within the networking infrastructure itself. These organizations can also implement data-driven traffic steering to route around congested or overloaded infrastructure. In these scenarios, dynamic data can be pulled from monitoring apps using APIs so that steering decisions are made with real-time information.
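To make the first two practices concrete, here is a minimal Python sketch of the failover and monitoring logic, not a description of how Robinhood or any DNS provider actually implements it. Every name in it (Provider, resolve_with_failover, breaches_sla, the example hostname and address) is hypothetical, and the two "providers" are plain functions standing in for independent DNS networks. In a real deployment, redundancy is usually achieved by publishing nameservers from both networks for the domain and letting resolvers retry between them; the sketch only illustrates the idea that a second network answers when the first cannot, while response times are recorded for comparison against an SLA.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


class ResolutionError(Exception):
    """Raised when none of the configured providers returns an answer."""


@dataclass
class Provider:
    """A stand-in for one independent DNS network (hypothetical model)."""
    name: str
    resolve: Callable[[str], str]  # assumed interface: hostname -> IP address
    latencies: List[float] = field(default_factory=list)
    failures: int = 0

    def average_latency(self) -> Optional[float]:
        """Mean response time in seconds, or None if nothing was measured."""
        if not self.latencies:
            return None
        return sum(self.latencies) / len(self.latencies)


def resolve_with_failover(hostname: str,
                          providers: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (address, provider name).

    Any exception from a provider is treated as a failed lookup, counted,
    and the next provider is tried.
    """
    for provider in providers:
        start = time.monotonic()
        try:
            address = provider.resolve(hostname)
        except Exception:
            provider.failures += 1
            continue
        provider.latencies.append(time.monotonic() - start)
        return address, provider.name
    raise ResolutionError(f"no provider could resolve {hostname}")


def breaches_sla(provider: Provider, max_avg_latency_s: float) -> bool:
    """True if the provider's average response time exceeds the SLA target."""
    average = provider.average_latency()
    return average is not None and average > max_avg_latency_s


def overloaded_primary(hostname: str) -> str:
    """Simulates a primary DNS network that is overloaded and times out."""
    raise TimeoutError("primary DNS network is overloaded")


def healthy_secondary(hostname: str) -> str:
    """Simulates a secondary network answering from a static table."""
    records = {"app.example.com": "192.0.2.10"}  # documentation-range address
    return records[hostname]


if __name__ == "__main__":
    primary = Provider("primary", overloaded_primary)
    secondary = Provider("secondary", healthy_secondary)
    address, used = resolve_with_failover("app.example.com",
                                          [primary, secondary])
    print(address, used, primary.failures)
```

Because the failure counts and latencies are kept per provider, the same data could feed an alert or a traffic-steering decision, for example by moving a provider that breaches its SLA to the end of the list.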
The full impact of this outage is yet to be seen, and the hope is that the company can bounce back by reassuring its customers that this was a one-time event. Implementing these measures would help Robinhood back up those assurances. The outage is also a learning opportunity for any company that depends on uptime for a strong customer experience: it can act before suffering a similar event.