SRE and infrastructure engineering are about allocating adequate time to do project work that improves the long-term sustainability of our services. But what do we reward SREs for doing? Does your company have a culture of "not invented here" or the converse of "ask the consultants to design it for us"?
We need to step back and ensure that we are doing the most efficient thing with our time as SREs. The main way that we can improve sustainability is empowering other engineers, especially non-SREs, to understand our systems, rather than for the sake of our resumes and/or glory.
At INS1GHTS 2021: Build the Better Future, Liz Fong-Jones, Developer Advocate at Honeycomb, shared how companies can achieve better observability in their systems in a way that’s sustainable and scalable for their teams.
Keep reading for highlights from the session, or watch the full session replay below for a deeper dive.
Challenges to Achieving Observability
What Does Observability Mean?
So what is meant by observability, exactly? Observability means you are able to understand what is happening inside your systems based on the external data from the system.
Observability also goes beyond break/fix. It affects all areas of the software development life cycle, such as:
Confirming quality of code before it is checked in
Ensuring release cycles are predictably short
Understanding user changes
And understanding which areas of your system still need work
And, it goes beyond the data. It’s not just logs, traces, and metrics. Instead, it’s a broader socio-technical capability. Can you instrument your code easily? Can you add debug events into a data storage system that can handle high volume at low cost? And most importantly, can we answer the questions that we have by using our production data? Can we actually query it and get answers?
Observability in Software
Software developers spend a lot of effort trying to understand their product users. But, focusing on the user can be really hard if you can’t see the full scope of your production environment.
To solve this problem, many engineers turn to developing tools in-house. Often, this can lead to unintended consequences like increasing technical debt. This in turn increases complexity within the system and makes it harder to achieve full observability in a scalable manner.
Because at the end of the day, all new code becomes technical debt. It becomes something you have to maintain and update.
However, there is a better way to achieve observability of your systems at scale.
How to Achieve Observability in Production Systems
Start with your end goal, which as a site reliability engineer or platform engineer is to achieve the appropriate degree of reliability and scalability. That means setting surface level objectives that measure the level of performance your users are experiencing - and make sure that your objectives help you understand when performance constraints are being violated. At the same time, it’s important to make sure that this work is manageable for your team, and scales as your system grows.
It’s equally important to ensure you have the right degree of observability into your systems. This can include support for debugging novel cases in production, rather than waiting weeks to understand what is happening in your staging environment.
Watch the Full Replay of This Session from INS1GHTS2021: Build the Better Future
Check out Liz Fong-Jones’ session, Tradeoffs on the Road to Observability. For more INS1GHTS sessions, visit our replay hub.
Best Practices to Observability
So when designing a highly observable system, it’s important to ensure humans are not your scaling limit (i.e., not relying upon one expert). In other words, avoid making a system that would require cloning your best experts to scale.
And, to get the most value out of your system, consider designing it in a way that all your teams can use it, so there is a shared understanding of what is happening in production.
Achieving these two objectives does not necessarily require writing all new software. It does, however, require a shift in mindset: celebrating outcomes and problems you solve rather than focusing on the volume of work or written code needed to achieve this outcome. A great way to make this shift is to share the problem solving process itself - what your teams tried, what worked v. what didn’t, and identifying how your solution can be reusable across the organization. And, most importantly, identifying how your solution will save time rather than creating more work.
In other words, highlight and reward the work on your team that helps avoid creating complexity.
To do this, consider reusing from some of the following sources when building tools:
Open source software
Tools / code from other teams within your organization
3rd party vendors - if it’s not a strategic advantage for your company to build something in-house, could you use a vendor to save time?
Thinking through the following questions can help you decide whether to build from scratch, reuse existing tools, or look to a vendor:
What problem are you trying to solve?
And more importantly, what is the problem your customers / users are facing?
How will this help us achieve our larger goal of enabling our developers to move fast, without breaking production?
How will this help us achieve larger business objectives?
Who will run and maintain this tool?
The end result is you are able to move fast at scale, without requiring heroic acts from your team. As the DORA metrics show, with the right approach you are able to deploy multiple times per day, make changes on the same day, restore service within an hour, and keep your change failure rate under 15% - all of which allows you to move much faster as an organization.