Roblox had become a massively popular gaming platform for children even before stay-at-home orders hit in March 2020. With children now spending more time at home than ever, usage has surged: the platform's user base grew by 85% in 2020, and hours spent on the platform grew by 124%.
While record growth resulted in revenue nearly doubling in 2020 compared to 2019, it also presented challenges for Roblox’s IT team. Before the pandemic hit, the team was in the middle of implementing larger infrastructure investments such as migrating to a new DNS provider, building out a custom load balancing system, and introducing a new intelligent traffic steering solution. When lockdowns were enacted in mid-March 2020 and demand for the platform skyrocketed overnight, the team quickly pivoted to meet demand while ensuring an optimal user experience.
Adam Mills, Principal Traffic Engineer at Roblox, recently shared with us how he managed these competing priorities while avoiding outages, downtime or lag, as well as what’s next for his team.
INS1GHTS Session: Geotargeting Players to Meet Stay-at-Home Demand
For a more in-depth look at how Roblox navigated skyrocketing user demand in 2020, watch Adam Mills' session for INS1GHTS 2020. Or, keep reading for key takeaways from the presentation.
How Roblox Has Invested in IT Infrastructure
Supporting a global, multiplayer platform that allows for user-generated games and content while providing the best possible player experience requires robust IT infrastructure. Reliability is especially important to the team - as Mills puts it, “How do you explain to a 9-year-old that Roblox just isn’t working?” To avoid this issue, Roblox follows a set of guiding infrastructure principles:
- Build a globally available hybrid cloud to serve players
- Focus on bare metal, using cloud providers where it makes sense, since large compute resources in the cloud can be expensive over time
- Enhance the player experience by prioritizing reliable access and fast game starts
A key area of focus for the team in 2019 was improving the game start speed for users. They identified the following requirements for reducing game start time:
- Get TCP/TLS as close to the players as possible via distributed edge
- Increase speed of their 3-way handshake
- Reuse hardware where possible
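To see why moving TCP/TLS termination closer to players matters, a rough back-of-the-envelope sketch helps. The TCP handshake costs one network round trip before data can flow, and a full TLS handshake adds one more round trip with TLS 1.3 or two with TLS 1.2. The function name and the round-trip times below are illustrative assumptions, not figures from Roblox.

```python
def connection_setup_ms(rtt_ms, tls_round_trips=1):
    """Estimate time spent on handshakes before the first request can be sent.

    rtt_ms: round-trip time between the player and the server terminating TCP/TLS.
    tls_round_trips: 1 for a full TLS 1.3 handshake, 2 for a full TLS 1.2 handshake.
    """
    tcp_round_trips = 1  # the TCP handshake completes after one round trip
    return (tcp_round_trips + tls_round_trips) * rtt_ms


# Hypothetical round-trip times: a distant origin versus a nearby edge PoP.
distant_origin = connection_setup_ms(rtt_ms=150)
nearby_edge = connection_setup_ms(rtt_ms=20)
print(distant_origin, nearby_edge)
```

Because every handshake round trip is multiplied by the player's distance to the terminating server, shortening that distance reduces setup time proportionally.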
To meet that goal, the Roblox IT team decided key infrastructure investments would need to include upgrading their DNS, building a scalable load balancing system in-house, and implementing an intelligent traffic routing solution.
The first step in their infrastructure improvement process was migrating to a new DNS provider that would allow them to target and steer traffic on a more granular level. They decided to migrate to NS1 since it would allow them to introduce latency-based targeting for static and dynamic content, had a high uptime record, and was highly programmable, with a large number of APIs and integrations.
Next, they built a scalable load balancing system. They opted to build one in-house using a common ECMP, L4, L7 pattern. They used HAProxy and GitHub's open-source load balancer, and deployed the same architecture within their data centers for simplicity.
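In an ECMP (equal-cost multi-path) tier, routers typically spread connections across several equal-cost next hops by hashing packet header fields, so that every packet of a given flow reaches the same L4 node. The following is a minimal sketch of that idea only; the node names and the choice of hash are assumptions for illustration, not a description of Roblox's implementation.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Choose a next hop for a flow by hashing its 5-tuple.

    flow: (src_ip, src_port, dst_ip, dst_port, protocol)
    next_hops: list of equal-cost L4 load balancer nodes
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical L4 nodes and a sample flow using documentation IP ranges.
l4_nodes = ["l4-a", "l4-b", "l4-c"]
flow = ("203.0.113.7", 51544, "198.51.100.10", 443, "tcp")
chosen = pick_next_hop(flow, l4_nodes)
print(chosen)
```

Plain modulo hashing remaps many flows when the set of nodes changes, which is one reason production L4 tiers tend to use more stable schemes such as consistent or rendezvous hashing.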
Migrating to NS1 and implementing a new load balancing system resulted in improved edge deployments, peering and horizontal scalability, as well as improved observability and monitoring capabilities. Roblox was now also well prepared to mitigate the effects of outages or downtime, as they could automatically route around lost capacity to one of their other 19 PoPs.
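Routing around lost capacity can be pictured as walking a preference-ordered list of PoPs and skipping any that fail a health check. The sketch below assumes hypothetical PoP names and a simple health map; it is not NS1's API or Roblox's actual logic.

```python
def route(preferred_pops, healthy):
    """Return the first healthy PoP in preference order, or None if all are down."""
    for pop in preferred_pops:
        if healthy.get(pop, False):
            return pop
    return None


# Hypothetical preference order for one group of players.
preference = ["pop-chicago", "pop-dallas", "pop-ashburn"]
health = {"pop-chicago": False, "pop-dallas": True, "pop-ashburn": True}
print(route(preference, health))
```

Here the preferred PoP is marked unhealthy, so traffic falls through to the next entry in the list.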
The next infrastructure investment was to implement latency-based targeting using Pulsar, NS1’s intelligent traffic steering solution. However, since Roblox focuses on start time more heavily than other streaming platforms, Pulsar’s data sampling mechanism needed to be customized for better accuracy. While the NS1 Pulsar team built a custom sampling mechanism to accurately measure game start times (a key component of successful latency-based targeting for Roblox), Roblox used geotargeting as a temporary workaround. Latency improved, since traffic was no longer routed solely through the U.S. By March 2020, the Pulsar team had begun to roll out the customized sampling mechanism, and Roblox prepared to switch over to latency-based targeting via the Pulsar platform.
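The difference between the two steering approaches can be sketched in a few lines. Geotargeting answers from a static location-to-PoP mapping, while latency-based targeting answers from measurements. Everything below (PoP names, the country mapping, the sample values, and the use of a median) is an assumption for illustration and does not reflect how Pulsar is implemented.

```python
from statistics import median

# Hypothetical static mapping used for geotargeting.
GEO_MAP = {"DE": "pop-frankfurt", "JP": "pop-tokyo", "US": "pop-ashburn"}
DEFAULT_POP = "pop-ashburn"


def geo_target(country_code):
    """Pick a PoP from the player's country alone."""
    return GEO_MAP.get(country_code, DEFAULT_POP)


def latency_target(samples_ms):
    """Pick the PoP with the lowest median measured latency."""
    return min(samples_ms, key=lambda pop: median(samples_ms[pop]))


# Hypothetical latency samples (ms) collected from players in Germany.
samples = {
    "pop-frankfurt": [48, 52, 50],
    "pop-amsterdam": [31, 35, 33],
    "pop-ashburn": [110, 120, 115],
}
print(geo_target("DE"), latency_target(samples))
```

In this made-up data the geographically "obvious" PoP is not the fastest one, which is the kind of case where measured latency can outperform a static map.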
Pivoting to Meet Surging Demand
And then stay-at-home orders hit. When lockdowns began, the team saw exponential growth - their slowest days outpaced their previous “peak” days. In the span of two weeks, Roblox had to double their infrastructure footprint to meet the sudden increase in user traffic. They also had to ensure their infrastructure could handle large-scale, livestreamed events like One World: Together at Home.
To meet these challenges, they leaned on their existing geotargeting capabilities with NS1 to offload their US-based traffic into an adjacent data center while scaling. Additionally, the Roblox IT team took advantage of their newly built latency-based targeting capabilities to offload the traffic in a single pass. As demand and usage stabilized throughout 2020, the Roblox IT team once again turned their focus to long-term infrastructure changes.
So what’s next for Roblox? According to Mills, a key area of focus for the team is data privacy - specifically, ensuring that user data is collected in compliance with COPPA to protect their young user base. And as the company moves forward with plans for an IPO, these key investments in IT infrastructure that allow them to scale quickly and provide an optimal user experience will continue to pay dividends.