Chaos engineering.
Jan 30, 2023

CSO Perspectives is a weekly column and podcast where Rick Howard discusses the ideas, strategies and technologies that senior cybersecurity executives wrestle with on a daily basis.

Chaos engineering.

I love the name, “chaos engineering.” When I hear it, the name conjures some kind of buccaneering spirit; a swashbuckling attitude that I associate with pirates like the Dread Pirate Roberts in “The Princess Bride” or, for you old timers out there, Errol Flynn in “Captain Blood.” The idea came on the scene around the same time DevOps started to get popular. I feel like both ideas are cut from the same skull-and-crossbones black cloth used to make pirate flags.

DevOps didn’t really hit my radar screen until about 2015 or so when I discovered Gene Kim’s book, “The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win.”1 The idea originated with Patrick Debois2 in 2009 but it took a while for the concept to percolate through the IT channels and eventually arrive in the security domain. Kim’s book was my introduction to the concept. When I heard about it, I thought DevOps, or DevSecOps, would be the new disruptive idea that would fundamentally change how the security community operates. 

Before DevSecOps, most security practitioners didn’t do much software development. The ones who did wrote their own tools to accomplish some task, but they weren’t generally part of the organization’s development team. After DevSecOps became a thing, say over the last five years or so, I thought security people would transition to becoming software developers first and security experts second. That hasn’t materialized, at least not yet. There are probably many reasons for this. 

One reason might be that, traditionally, the infosec community doesn’t consider automation to fall within the purview of the security professional. That has been a giant mistake in first-principle thinking. Because of that error, the IT community has sprinted away from the security community in pursuing advanced software development techniques, or DevOps, and one of those techniques is chaos engineering.

Chaos engineering for automation and resilience.

Chaos engineering is the resilience discipline of controlled stress-test experimentation in CI/CD (continuous integration / continuous delivery) environments to uncover systemic weaknesses. Chaos engineers build hypotheses around expected software behavior, design small-footprint (tiny blast radius) experiments that vary steady-state behavior (like bandwidth and CPU use), and then run those experiments in production systems to learn about unknown design flaws. Admittedly, this is an advanced tactic for a first-principle strategy; it isn't for small, medium, and even some large companies. But if your organization provides global digital services that absolutely have to be running 24/7 without any downtime, then you likely have a team of chaos engineers somewhere performing these experiments.
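To make that more concrete, here is roughly what an experiment definition might look like. This is a minimal illustrative sketch in Python, not any particular vendor's or Netflix's format; the field names and example values are assumptions invented for illustration.

# Illustrative only: a hypothetical, declarative chaos experiment definition.
# Real tooling (Chaos Monkey, Gremlin, Chaos Toolkit, etc.) has its own formats;
# this just shows the shape of hypothesis + steady-state metric + tiny blast radius.
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    hypothesis: str           # the expected steady-state behavior, written down up front
    steady_state_metric: str  # what we measure (e.g., error rate, p99 latency)
    variable: str             # what we perturb (e.g., CPU, bandwidth, an instance)
    blast_radius: str         # how tightly the experiment is scoped
    abort_condition: str      # when to stop the experiment immediately

cpu_experiment = ChaosExperiment(
    hypothesis="Checkout error rate stays below 1% during CPU pressure",
    steady_state_metric="checkout_error_rate",
    variable="CPU stress on one instance for five minutes",
    blast_radius="one instance, one availability zone, roughly 1% of traffic",
    abort_condition="error rate exceeds 1% or p99 latency exceeds 800 ms",
)
print(cpu_experiment)

The point is that the hypothesis, the steady-state metric, the blast radius, and the abort condition are all written down before anything is perturbed.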

To understand why these global service providers require chaos engineering, you must first accept the fact that we no longer live in a linear, digital world. When the internet emerged as a useful business tool (1990s), things were pretty simple. We didn’t think so at the time, but compared to today, that world was Kindergarten. If you changed one thing in that world, you pretty much knew what was going to happen. But today’s IT environments are systems of systems. We’re in PhD land here. They are complicated and most of us have no idea how they actually work and what the real dependencies are between all the software modules deployed on all of our data islands. 

According to Rosenthal, Jones, and Aschbacher in their book, “Chaos Engineering: System Resiliency in Practice,”3 “A change to input of a linear system produces a corresponding change to the output of the system. Nonlinear systems have output that varies wildly based on changes to the constituent parts.” It’s like that old chestnut: when a butterfly flaps its wings in China, you might end up with a hurricane in the Gulf of Mexico. When the hard drive of a system running a non-essential monitoring app in a North American AWS region fails and somehow causes a system-wide outage, that’s the kind of nonlinearity I’m talking about.

These systems are complicated, and humans can’t possibly understand all the permutations in their heads. Software engineers think they know, and DevOps and SRE teams write linear regression tests for things they assume to be true. But those teams don’t learn anything new by doing so. They test properties of the system that are already known, like previously corrected defects and boundary conditions of the main features of a product.

Rosenthal, Jones, and Aschbacher say that these kinds of linear regression tests “... require that the engineer writing the test knows specific properties about the system that they are looking for.” Chaos engineering, in contrast, is the pursuit of the unknown. Its experiments don’t replace linear regression tests; they solve a different problem by uncovering unknown and, as yet, undiscovered design faults.

Chaos engineering is built on the scientific method. DevOps teams develop a hypothesis around steady-state behavior and run experiments in production to see if the hypothesis holds. If they discover a difference in steady state between the control group and the experimental group on production systems, then they have learned something new. If not, they have gained more confidence in their hypothesis. They use techniques to minimize the blast radius on the production system and monitor the entire experiment carefully to ensure there will be no catastrophic effect, but they have to be on the production system to do it. 
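As a toy sketch of that control-versus-experiment comparison, consider the Python below. The get_error_rate() function is a hypothetical stand-in for a real metrics query; it simulates measurements here so the example runs on its own.

# Sketch of comparing steady state between a control group and an experimental
# group while a fault is injected. All numbers are simulated for illustration.
import random
import statistics

def get_error_rate(group: str) -> float:
    # Hypothetical stand-in for querying a monitoring system; the experimental
    # group is simulated as slightly degraded while the fault is active.
    base = 0.002
    degradation = 0.001 if group == "experiment" else 0.0
    return base + degradation + random.uniform(0.0, 0.0005)

def hypothesis_holds(tolerance: float = 0.002, samples: int = 20) -> bool:
    # Hypothesis: the experimental group's error rate stays within `tolerance`
    # of the control group's error rate for the duration of the experiment.
    control = statistics.mean(get_error_rate("control") for _ in range(samples))
    experiment = statistics.mean(get_error_rate("experiment") for _ in range(samples))
    print(f"control={control:.4f} experiment={experiment:.4f}")
    return (experiment - control) <= tolerance

if hypothesis_holds():
    print("Steady state held: more confidence in the hypothesis.")
else:
    print("Steady state diverged: we just learned something new.")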

When I have talked about resilience as a first-principle strategy in the past, I have always pointed to Netflix as the poster child for this new tactic. I said that Netflix routinely runs an app, like Chaos Monkey, that randomly destroys pieces of their customer-facing infrastructure, on purpose, so that their network architects understand resilience engineering down deep in their core.
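The core of that idea can be sketched in a few lines. To be clear, this is not Netflix's actual implementation, just an illustration of the concept; the instance list and the terminate() function are hypothetical placeholders for a real inventory and a real cloud API call.

# Sketch of the Chaos Monkey concept: during business hours, occasionally pick
# one instance at random from an opted-in group and terminate it.
import random
from datetime import datetime

OPTED_IN_INSTANCES = ["web-001", "web-002", "web-003", "api-001", "api-002"]

def terminate(instance_id: str) -> None:
    # Hypothetical stand-in for an infrastructure or cloud provider API call.
    print(f"terminating {instance_id}")

def maybe_unleash_the_monkey(probability: float = 0.2) -> None:
    now = datetime.now()
    # Only act on weekdays during working hours, when engineers are around to respond.
    if now.weekday() >= 5 or not (9 <= now.hour < 17):
        return
    if random.random() < probability:
        terminate(random.choice(OPTED_IN_INSTANCES))

maybe_unleash_the_monkey()

The constraints matter as much as the randomness: opted-in instances only, business hours only, so that people are on hand when something breaks.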

When I first learned about this technique, I was stunned by the audacity, and seeming recklessness, of the approach. In my past career, I would never destroy parts of my production system on purpose for an experiment. I might do it by mistake, but never on purpose. In hindsight, as I have learned more about the subject, that’s not exactly how chaos engineering works. It’s audacious for sure, but the Netflix chaos engineering system is mature, and their DevOps teams have been developing the practice since 2008. They have learned how to do this and their experts wouldn’t recommend that newbies to the idea start by destroying parts of their production system. You have to ease into it.

History of chaos engineering.

Chaos engineering began back in 2008 with a couple of delivery failures at Netflix.4 The company was transitioning from a DVD-mailing company to a streaming company. The Netflix leadership team very publicly announced its commitment to adopt AWS cloud services and abandon its own data centers. This was a big idea since Amazon had just rolled the service out two years before, and it wasn’t yet what anybody would call mature.

The Netflix precipitating event was a database failure that prevented the company from delivering DVDs to their customers for three days. That obviously wasn’t resilient. Furthermore, that Christmas in 2008, AWS suffered a major outage that prevented Netflix customers from using the new streaming service. In response to these incidents, Netflix engineers developed their first chaos engineering product in 2010, called Chaos Monkey, which helped them counter the vanishing-instance problem caused by the AWS outage. With that success in their pocket, Netflix began building their own chaos engineering team and wondered if they could scale it. If they could fix the small-scale vanishing-instance problem, could they do the same at the vanishing-region scale?

In fairness, Netflix wasn’t the only company thinking along these lines. Back in 2006, Google site reliability engineers (SREs) established their own Disaster Recovery Testing program (DiRT) to intentionally insert failures into their internal systems to discover unknown risks. But their cool name for it (DiRT) wasn’t as hip as the Netflix name (Chaos Monkey) and it didn’t catch on. The idea was similar though.5

By 2011, Netflix began adding new failure modules that provided a more complete suite of resilience features. Those modules eventually became known as the Netflix Simian Army and included members with colorful names like Latency Monkey, Conformity Monkey, and Doctor Monkey, just to name three. There are many more.

Netflix shared the source code for Chaos Monkey on GitHub in 2012, and by 2013, other organizations started playing with the idea. By 2014, Netflix had created a new employee role (Chaos Engineer) and begun working on ways to reduce the blast radius of planned injected failures. By 2016, Netflix had an entire team of Chaos Engineers working on the Simian Army. By this time, there was a small but growing contingent of companies experimenting with the idea too (like Capital One, Google, Slack, Microsoft, and LinkedIn).

What does chaos engineering have to do with automation and resilience?

Traditionally, linear regression tests, SRE and DevOps teams, and IT resilience in general have been the purview of the CIO. There are definite divisions of labor for resilience, though: the CIO handles the DevOps piece, and the CSO needs to be part of that team. But I’m making the case that chaos engineering is something that should be owned by the CSO. Who better to discover potential unknown systemic failures that might impact production or the ability to recover quickly from an event? The CIO handles the known stuff. In terms of first principles, the CSO’s job description should be to discover unknown faults in the system that could cause material damage.

According to Rinehart and Shortridge in their book, “Security Chaos Engineering,” traditional security programs orbit around failure avoidance.6 Infosec teams design and implement people, process, and technology policies intended to prevent the organization from getting anywhere near a disaster. In contrast, they say that failure is where an infosec team learns the most. I agree. If you can build these small experiments that uncover potential systemic failure, that might be the most valuable thing an infosec team does.

Rinehart and Shortridge say that this mindset changes the infosec team’s focus away from building a purely defensive posture and towards something that is adaptive. Instead of seeking defensive perfection, pursue the ability to handle failure gracefully. And if that sounds familiar, it’s pretty close to the definition I have been using for resilience, which I got from the Swedish academic team of Björck, Henkel, Stirna, and Zdravkovic: “… the ability to continuously deliver the intended outcome despite adverse cyber events.”7 It also implies that, especially at this scale, graceful handling of failure will happen at the infrastructure-as-code level.

Rinehart and Shortridge recommend that the infosec community move away from security theater (a concept made famous by one of cybersecurity’s thought leaders, Bruce Schneier).8 This is the idea that infosec teams perform work that creates the perception of improved security but, in reality, doesn’t add much. One example could be the purchase of an anti-phishing product that delivers approved phishing email messages to employees to train them not to click bad URLs. Another is building an insider threat program designed to prevent employees from taking their old PowerPoint slides with them to their next job. In the big scheme of things, are those kinds of security theater programs as impactful as discovering a previously unknown fault in the organization’s system design that could cause catastrophic failure? The notion is worth considering.

Specifically with respect to traditional security, however, Rinehart and Shortridge suggest that you could apply chaos engineering to other more typical security functions like red teaming. Instead of turning loose the Red Team to find some hole in the defensive posture, we could instead develop a hypothesis around how the organization should react to a specific attack sequence, for example, Wicked Panda. If we treat red-teaming exercises as a science experiment with a hypothesis that defines how we think the organization will react to a Wicked Panda attack, we might learn something new. If that’s true, we could expand this kind of thinking to all sorts of traditional security tasks like container security, CI/CD pipeline security, security monitoring, incident response, and so forth. You might say you’re already doing those things. But what I’m suggesting is a subtle shift away from rudimentary tests of the system with things we already know about and towards the more advanced scientific method designed to uncover the things we don’t already know.
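As a sketch of what that shift might look like, here is a detection hypothesis written as a small test in Python. The run_simulated_technique() and alert_fired() functions are hypothetical stand-ins for whatever attack-simulation tooling and alerting or SIEM API an organization actually uses; the technique name and host are made up for the example.

# Sketch of a security chaos experiment: a hypothesis about detection,
# tested by running a benign simulation and watching the alert pipeline.
import time

def run_simulated_technique(technique: str, host: str) -> None:
    # Hypothetical: kick off a benign, scoped simulation of one attack technique.
    print(f"simulating '{technique}' on {host}")

def alert_fired(technique: str) -> bool:
    # Hypothetical: query the alert pipeline for a matching detection.
    return False  # stand-in; a real check would query the SIEM or SOAR platform

def detection_hypothesis(technique: str, host: str, timeout_s: int = 600) -> bool:
    # Hypothesis: the SOC pipeline raises an alert within `timeout_s` seconds
    # of this technique running on an in-scope host.
    run_simulated_technique(technique, host)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if alert_fired(technique):
            return True
        time.sleep(1)  # polling interval shortened for the sketch
    return False

held = detection_hypothesis("credential dumping simulation", "test-host-01", timeout_s=5)
print("hypothesis held" if held else "hypothesis failed: we just learned something new")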

That said, chaos engineering is not for everybody. It’s another tactic that we might use to reduce the probability of material impact due to a cyber event, another arrow in our quiver for building our resilience program alongside the other arrows like crisis planning, incident response, backups, and encryption. The concept is probably a bridge too far for most small to medium-sized organizations that struggle to find the resources just to keep the lights on. But for big Silicon Valley companies that deliver services around the world (the Netflixes, the Googles, the LinkedIns, etc.) and for most Fortune 500 companies, chaos engineering is something to consider. Indeed, many of these companies may already be on this path.

References.

1 Kim, G., Behr, K., Spafford, G., 2014. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win [Book]. URL https://www.goodreads.com/book/show/17255186-the-phoenix-project

2 Staff, n.d. What is DevOps - Explained [Website]. New Relic. URL https://newrelic.com/devops/what-is-devops

3 Rosenthal, C., Jones, N., 2020. Chaos Engineering: System Resiliency in Practice. O’Reilly Media. 

4 Staff, 2018. Chaos Monkey at Netflix: the Origin of Chaos Engineering [WWW Document]. Gremlin. URL https://www.gremlin.com/chaos-monkey/the-origin-of-chaos-monkey/ (accessed 12.6.22).

5 Bort, J., 2016. Meet Kripa Krishnan, Google’s queen of chaos. Insider.

6 Shortridge, K., Rinehart, A., 2023. Security Chaos Engineering. O’Reilly Media.

7 Björck, F., Henkel, M., Stirna, J., Zdravkovic, J., 2015. Cyber Resilience – Fundamentals for a Definition, in: New Contributions in Information Systems and Technologies. Springer International Publishing, Cham, pp. 311–316.

8 Glaskowsky, P., 2008. Bruce Schneier’s new view on Security Theater. CNET.