Taboola is a content discovery network that gives online users personalized recommendations, shown at the bottom of a web page, about what to read next. It does this on-premises, out of 10 data centers with about 10,000 servers in total. Each second, Taboola handles a mind-boggling 1.5 million requests with just under one terabyte of direct access. Taboola serves its personalized recommendations on 3 billion web pages per day.
Taboola's on-premises setup has six front-end and three back-end data centers. Each back-end data center provides access to three availability zones.
What Could Go Wrong?
Data centers need to be kept cold. When they overheat, things quickly go downhill. Taboola experienced this firsthand when the temperature in one of its data centers rose to over 91 degrees. The result was a crisis event with full downtime for its back-end services, and it took over 10 hours to fully recover.
The Taboola IT team really connects with General George S. Patton’s famous quote, "The test of success is not what you do when you are on top. Success is how high you bounce when you hit the bottom."
After this incident, Taboola’s IT team thought about how they could better manage crises. How could their crisis management team fix the problem as quickly as possible? How high could they bounce when they hit rock bottom? In the end, they settled on automation and created a web service button that immediately alerts all 50 teams to a crisis.
But then they thought, what next?
They decided to set their priorities by looking at how other high-stress fields, such as pilots and the FAA, set theirs. For a pilot, the first priority is always to fly the aircraft. The order is to aviate first, then navigate, and then communicate.
Taboola wanted to repurpose this concept for crisis management during IT downtime events. To do this, they started by asking: What needs to happen first? How can the services be brought back online quickly? Who needs to know what's going on? How do you learn from the event so that the service comes back up for the end-users?
Watch the Full Replay of This Session from INS1GHTS2021: Build the Better Future
Watch Ariel Pisetzky’s full session, When the Firefighters Come Knocking. For more INS1GHTS sessions, visit our replay hub.
The 3 Priorities
Move Forward (Navigate)
These three priorities came in handy six months ago when Taboola experienced a major event within one of its availability zones at a data center. The on-site employee headed to the data center and immediately asked whether the service was back up. How can we automate best at that moment?
The priority is to get people in, as they are committed to the business. This requires all hands on deck, followed by measures such as reducing the capacity for features and search sizes, or whatever works best in your IT shop. Finally, ensure that the service is back up with as much business logic as possible. It needs to stay operational while the team continues to fix what's broken.
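As a rough illustration of serving with reduced capacity while repairs continue, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the ServingConfig fields, the thresholds, and the degrade function are hypothetical and do not describe Taboola's actual system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServingConfig:
    # All fields are hypothetical stand-ins for "features and search sizes".
    max_results: int        # how many candidates a request may search
    personalization: bool   # an expensive optional feature
    experiments: bool       # A/B test traffic


NORMAL = ServingConfig(max_results=20, personalization=True, experiments=True)


def degrade(config: ServingConfig, available_percent: int) -> ServingConfig:
    """Return a reduced config for the share of capacity still available.

    available_percent is an integer from 0 to 100. The thresholds below
    are illustrative assumptions, not recommended values.
    """
    if not 0 <= available_percent <= 100:
        raise ValueError("available_percent must be between 0 and 100")
    if available_percent >= 90:
        # Close to full capacity: keep serving normally.
        return config
    # Shrink the search size in proportion to capacity, but never below 1.
    scaled = max(1, config.max_results * available_percent // 100)
    return replace(
        config,
        max_results=scaled,
        experiments=False,  # drop non-essential traffic first
        personalization=config.personalization and available_percent >= 50,
    )


print(degrade(NORMAL, 60))
print(degrade(NORMAL, 30))
```

The point of the sketch is the ordering: the service keeps answering requests with a smaller search size and fewer optional features, rather than staying down until everything is repaired.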
Just keep moving forward. Who is taking care of the system? Who chooses to test for A or B? Bad choices will happen, but it's ok as long as choices continue to be made and pushed forward. Keep pushing up. If the team is large, split them into task forces. This allows for rotations if the outage takes a long time.
Communications should be pushed both internally and between the teams who are working. Having a Zoom channel allows everyone to offer their insights while making it easier for the crisis manager to continue making choices.
Taboola set up this automation as an automatic war room service that opens the Zoom room and sends out the all-hands-on-deck pager alert along with the Zoom link so everyone can join. It also sets up an automatic track creator board. This automation gives service owners good visibility into what's happening in the underlying infrastructure.
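The flow of such a war room service can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the WarRoom record and the open_war_room function are hypothetical, and the three callables stand in for whatever meeting, board, and paging tools a team actually uses. No real Zoom or pager API is called here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WarRoom:
    """Record of what the automation set up for one incident (hypothetical)."""
    incident: str
    meeting_link: str
    board_link: str
    paged_teams: List[str] = field(default_factory=list)


def open_war_room(
    incident: str,
    teams: List[str],
    create_meeting: Callable[[str], str],    # returns a join link
    create_board: Callable[[str], str],      # returns a board link
    page_team: Callable[[str, str], None],   # (team, message)
) -> WarRoom:
    # 1. Open the meeting room first so there is a link to share.
    meeting_link = create_meeting(incident)
    # 2. Create the tracking board for the incident.
    board_link = create_board(incident)
    room = WarRoom(incident, meeting_link, board_link)
    # 3. Page every team with both links in a single message.
    message = f"ALL HANDS: {incident}. Join: {meeting_link} Board: {board_link}"
    for team in teams:
        page_team(team, message)
        room.paged_teams.append(team)
    return room


# Usage with stub integrations; the URLs are placeholders.
sent = []
room = open_war_room(
    "availability zone outage",
    ["network", "storage"],
    create_meeting=lambda name: "https://meet.example/war-room",
    create_board=lambda name: "https://board.example/incident",
    page_team=lambda team, msg: sent.append((team, msg)),
)
print(room.paged_teams)
```

Passing the integrations in as callables keeps the orchestration testable without touching real services, which matters for a tool that is only used when things are already going wrong.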
How high did Taboola bounce when they hit this rock bottom moment? There were minimal client operations disruptions and none for the end-users. Getting operations back up and running took only 3 hours, a 70% reduction in downtime vs. their previous incident.