If Grubhub has become one of your go-to apps on your phone over the last year, you’re not alone. With stay-at-home orders and social distancing in place, takeout provided a necessary break from cooking and a way to safely support favorite local restaurants.
Because of this, Grubhub saw a sharp increase in users: in the third quarter of 2020 alone, Grubhub had a 68% year-over-year increase in sales. The surge in users was also spread out geographically, a departure from past periods of fast growth. Before 2020, traffic increases were typically concentrated in urban areas; this time, many of the new users were in suburban areas.
Shifts of this magnitude, such as sharp increases in demand and changing user traffic patterns, can seriously strain IT infrastructure. Grubhub, however, was able to adapt quickly, in large part because of long-term investments in business resilience. That turned what could have been a challenging year into an opportunity to gain new users and increase revenue.
In a recent fireside chat, Grubhub’s Alex Trevino, Technical Lead for the SRE team, and Andrew Blum, Sr. Site Reliability Engineer, shared some of the ways Grubhub builds resiliency into their teams, workflows, and infrastructure.
Fireside Chat: Grubhub on Business Resilience
For an in-depth look at how Grubhub builds resilient infrastructure, watch our recent fireside chat with their team. Or, keep reading for key takeaways from the conversation.
Resilient infrastructure starts with your teams and workflows
First and foremost, building resilient infrastructure depends on a well-managed team. Grubhub’s IT teams regularly assess their systems and processes. Because those systems are constantly evolving, the teams evaluate accumulated technical debt, the metrics they track, overall design, and ways to improve performance.
For example, Grubhub uses their off-peak season (i.e., when the weather is warmer and more people go out for meals) to review their systems for scale. As Trevino says, "Preparing for higher traffic and building resiliency is a continuous exercise, part of day-to-day culture here. One of the things that we do, since our business is more active when the weather is colder, leading up to Labor Day, we go through the exercise of reviewing all of our systems to make sure that we're scaled appropriately."
Additionally, they update and change critical infrastructure strategically to avoid downtime and disruptions. They hide new features behind A/B tests, feature flags, and similar mechanisms before rolling them out to a wider audience. They also start small and make incremental changes during rollouts. For example, when migrating to NS1 for Managed and Dedicated DNS, they started with their lowest-impact domains, then worked up to business-critical ones after confirming that all systems were working smoothly.
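To make the "start small, then widen" idea concrete, here is a minimal sketch of a percentage-based feature flag. It is not Grubhub's implementation; the function names, the flag name, and the hashing scheme are assumptions chosen for illustration. Each user is hashed into a stable bucket, so raising the rollout percentage only ever adds users and never flips anyone back.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99.

    Hypothetical helper: hashing the flag name together with the user ID
    keeps a user's bucket stable for one flag but independent across flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return rollout_bucket(flag_name, user_id) < rollout_percent
```

With a scheme like this, a team could expose a change to 1% of users, watch the metrics, and then raise the percentage in steps, knowing that everyone who already had the feature keeps it.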
Build redundancy into critical infrastructure for resiliency
According to Trevino, a key area of focus for building business resiliency is designing with failure in mind: “We try our best to ensure that whenever we design any system or workflow, there is no single point of failure. We’re also building systems that are loosely coupled and have self-healing capabilities. When there is a dependency in any of our systems, we try to design them in such a way that we just experience service degradation rather than outright failure.”
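As a rough illustration of "service degradation rather than outright failure," the sketch below wraps a call to a dependency and falls back to a simpler result when that dependency is unavailable. The recommendation scenario, the function name, and the fallback list are all hypothetical; they are not taken from Grubhub's systems.

```python
from typing import Callable, List, Tuple

# Hypothetical static fallback used when personalization is unavailable.
POPULAR_RESTAURANTS: List[str] = ["popular-1", "popular-2", "popular-3"]


def recommendations_with_fallback(
    fetch_personalized: Callable[[], List[str]],
) -> Tuple[List[str], bool]:
    """Return (recommendations, degraded).

    If the personalized source fails, serve a generic list instead of
    failing the whole request, and report that the response is degraded.
    """
    try:
        return fetch_personalized(), False
    except Exception:
        # Broad catch is deliberate in this sketch: any failure in the
        # dependency should degrade the response, not break it.
        return list(POPULAR_RESTAURANTS), True
```

Returning the `degraded` flag alongside the result lets the caller log or alert on the fallback path, so a quiet degradation does not go unnoticed.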
This philosophy dictates all aspects of their infrastructure. Redundancy is built in across their environment: multiple data centers, multiple payment processors, SMS message providers, email systems, and so on.
They consider three levels of high availability throughout their infrastructure:
- Multi-active setup
- Active/passive setup with automated failover
- Active/passive setup with manual failover
Where possible, they use a multi-active setup so every component is handling live traffic. For example, they added redundancy at the DNS layer by using both NS1’s Dedicated DNS and Managed DNS in what they consider a multi-active configuration. Both serve live traffic at the same time, with records synchronized automatically across the two environments. Because the two are independent of each other, however, the other remains available if one goes down.
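The multi-active idea can be sketched in a few lines: every endpoint takes a share of live traffic, and an unhealthy one is simply skipped. This is only an illustration of the pattern, not a description of how NS1 or DNS resolvers select name servers; the class name and endpoint labels are assumptions.

```python
import itertools
from typing import Dict, Iterator, List


class MultiActivePool:
    """Round-robin over endpoints that all serve live traffic."""

    def __init__(self, endpoints: List[str]) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._healthy: Dict[str, bool] = {name: True for name in self._endpoints}
        self._cycle: Iterator[str] = itertools.cycle(self._endpoints)

    def mark(self, endpoint: str, healthy: bool) -> None:
        """Record the latest health state for one endpoint."""
        self._healthy[endpoint] = healthy

    def next_endpoint(self) -> str:
        """Return the next healthy endpoint, skipping any that are down."""
        for _ in range(len(self._endpoints)):
            candidate = next(self._cycle)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy endpoints available")
```

Because both endpoints are already carrying traffic, losing one requires no promotion step: the remaining endpoint just receives all requests.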
When they use an active/passive setup, they implement automated failover and regularly test it to make sure it works as expected. For some systems where automated failover is less mature, they continue to use manual failover while testing ways to automate it with confidence.
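For the active/passive case, automated failover usually comes down to a health check plus a rule for when to switch. The sketch below assumes a simple rule, a fixed number of consecutive failed checks, and leaves failback as a manual step; the class name, the threshold, and the "primary"/"standby" labels are illustrative assumptions rather than details from the conversation.

```python
class FailoverController:
    """Promote the standby after N consecutive failed health checks."""

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.active = "primary"
        self._consecutive_failures = 0

    def record_health_check(self, primary_ok: bool) -> str:
        """Feed in one health-check result and return the active side."""
        if self.active != "primary":
            # Already failed over; failing back is left as a manual step here.
            return self.active
        if primary_ok:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self.active = "standby"
        return self.active
```

Requiring several consecutive failures avoids failing over on a single dropped check, and the same class can be driven with simulated results to test the failover path regularly.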
Because Grubhub invested in their infrastructure on an ongoing basis, they were well positioned to meet the challenges of 2020. In a turbulent year, they experienced record growth in both regular users and revenue, and reliable back-end infrastructure was a large contributor to that growth.