Chaos engineering is the practice of experimenting on a system to better understand how that system will handle the unexpected. Experimentation reveals the system’s safety boundaries, which leads to better stability. Unfortunately, modern systems are far too complex to model fully, and that complexity can let a system drift toward the danger zone. Two primary myths surround chaos engineering.
Misconceptions About Chaos Engineering
The first is that to implement chaos engineering, an organization needs to be a “Wizard Level 10” type of company, such as Netflix or Google. This myth was born out of Netflix’s digital transformation from the data center to the cloud. Netflix had to find a way to be more reliable and used chaos engineering to build a more stable system. That process forged a path for other organizations in the health care, finance, and banking spaces to try chaos engineering for themselves.
Check out Courtney Nash's session, The Prerequisites for Chaos Engineering. For more INS1GHTS sessions, visit our replay hub.
The second myth takes the form of a question: why would we inject more chaos into our system? Chaos engineering is more than that. It is about examining the confusion that already exists in the system, buried in its complexity, and about seeing and understanding what is going on within it. An experiment should either confirm or refute a hypothesis. In this way, chaos engineering borrows heavily from clinical trials, in which treatments are tested until they are fit for public use. For example, experimentation within the medical community was crucial to the rapid development of COVID-19 vaccines.
The Four Prerequisites of Chaos Engineering
Courtney Nash of Verica joined us at INS1GHTS 2021: Build the Better Future to break down and examine what she calls “The 4 Prerequisites of Chaos Engineering.”
The first is instrumentation: the ability to detect degraded states in a system. Can you tell if it’s up or down? Is it working correctly or not? Metrics that monitor this are necessary, and this is the easiest prerequisite to implement.
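As a rough illustration of what “detecting a degraded state” can mean in practice, the sketch below classifies a batch of request samples as healthy, degraded, or down. The `Sample` type, the `classify` function, and the threshold values are all hypothetical and chosen only for this example; real thresholds depend on the system being measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observed request (hypothetical shape, for illustration only)."""
    ok: bool
    latency_ms: float


def classify(samples, max_error_rate=0.05, max_p95_ms=500.0):
    """Return 'unknown', 'down', 'degraded', or 'healthy' for a batch of samples.

    The thresholds are assumed example values, not recommendations.
    """
    if not samples:
        return "unknown"  # no data means we cannot tell up from down

    error_rate = sum(1 for s in samples if not s.ok) / len(samples)
    if error_rate == 1.0:
        return "down"  # every request failed

    # Approximate 95th-percentile latency using a simple sorted-index lookup.
    latencies = sorted(s.latency_ms for s in samples)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]

    if error_rate > max_error_rate or p95 > max_p95_ms:
        return "degraded"  # still serving, but outside the assumed limits
    return "healthy"
```

The point of the three-way split is that “up or down” alone is not enough: a system that answers slowly or fails a fraction of requests is neither, and that middle state is what an experiment needs to be able to see.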
The second is social awareness among everyone involved in, and directly impacted by, the system. It’s important for people to know in advance what experiments are being run, so that animosity and resistance do not build up. Handled badly, this can make it challenging to experiment down the line. Being upfront from the beginning to get consensus is crucial.
The third is expectations: there needs to be a reasonable expectation that the hypothesis will hold. A reasonable assumption is crucial to successful chaos engineering, and there is little value in a theory that cannot be tested or supported. Be realistic about what is known about the system; if something is already known to be broken, can it be fixed first? Ideally, pick a system you want to understand better, or a hyper-specific problem, so that you learn something from the experiment.
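One way to keep a hypothesis testable is to write it down as a measurable statement before the experiment runs. The sketch below is a minimal example of that idea; the `Hypothesis` type, the `evaluate` function, and the metric name are hypothetical and not taken from the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A testable expectation about one metric (hypothetical shape)."""
    description: str
    metric: str
    max_allowed: float


def evaluate(hypothesis, observations):
    """Compare observed metric values against the hypothesis.

    `observations` maps metric names to lists of measured values.
    Returns 'untestable' when the metric was never measured, which is
    the case the text warns about: a theory that cannot be tested.
    """
    values = observations.get(hypothesis.metric)
    if not values:
        return "untestable"
    if max(values) <= hypothesis.max_allowed:
        return "confirmed"
    return "refuted"


# Usage: state the expectation first, then check it against measurements.
h = Hypothesis(
    description="Error rate stays at or below 1% while one replica is offline",
    metric="error_rate",
    max_allowed=0.01,
)
print(evaluate(h, {"error_rate": [0.0, 0.004, 0.002]}))
```

Either a confirmed or a refuted result teaches something about the system; only the untestable case is wasted effort.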
The fourth is alignment to respond: agreement to act on what the experiment shows, whether or not the hypothesis is upheld, to learn from it, and to identify an area of the system that chaos engineering can change. Take what’s known and apply it. The tricky part of this prerequisite is that teams are already so busy that adding another project on top of everything can be overwhelming. Teams need to understand the benefits of applying the results so that everyone is on the same page. Chaos engineering requires above-the-line thinking and work. It requires not just technical infrastructure but cultural infrastructure too, and investing in cultural infrastructure will significantly benefit your teams.
How to Put Chaos Engineering into Practice
Given how complex (and therefore easily breakable) IT infrastructure has become, there is growing interest in, and confidence about, adopting chaos engineering: 82.8% plan to get started in 2021.
An easy way to get started in your organization is to try out game-day scenarios or hackathons. Get your team in a room and make them responsible for examining a system. Then shut off a component of that system and record what the team learns from the experiment. These exercises can be run safely in staging environments rather than in production. Trial runs like these are a great way to explore what chaos engineering can do, and holding a retrospective with your team afterward helps everyone become more comfortable with it.
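The game-day loop described above (shut off one component, observe, record, restore) can be sketched against a toy in-memory model. Everything here is hypothetical: `ToySystem` stands in for a staging environment, and its rule that a request succeeds when either the primary or the replica is up is an assumption made only so the example has something to measure.

```python
class ToySystem:
    """Stand-in for a staging system with two redundant components."""

    def __init__(self):
        self.enabled = {"primary": True, "replica": True}

    def handle_request(self):
        # Assumed behavior: a request succeeds if at least one component is up.
        return self.enabled["primary"] or self.enabled["replica"]


def run_game_day(system, component, requests=10):
    """Disable one component, send requests, and return a record of the outcome.

    The component is always re-enabled afterward, even if a request raises.
    """
    if component not in system.enabled:
        raise KeyError(f"unknown component: {component}")

    system.enabled[component] = False
    try:
        results = [system.handle_request() for _ in range(requests)]
    finally:
        system.enabled[component] = True  # restore the system after the experiment

    return {
        "component_disabled": component,
        "requests": requests,
        "failures": results.count(False),
    }


# Usage: the returned record is the raw material for the team retrospective.
record = run_game_day(ToySystem(), "primary")
print(record)
```

Restoring the component in a `finally` block mirrors the habit of always leaving the environment as it was found, which matters even in staging.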
For more, watch Courtney’s full INS1GHTS 2021 session here.