Resilience Case Study: Chaos Engineering.

By Rick Howard

Jun 27, 2022

CSO Perspectives is a weekly column and podcast where Rick Howard discusses the ideas, strategies and technologies that senior cybersecurity executives wrestle with on a daily basis.

“Pass on what you have learned. Strength, mastery, hmm…but weakness, folly, failure also. Yes: failure, most of all. The greatest teacher, failure is.” – Jedi Master Yoda, The Last Jedi

When I was a young Captain back in 1985, Army personnel, in its infinite wisdom, assigned me as the communications officer to the 3/19th Field Artillery battalion located at Fort Polk, Louisiana. Ronald Reagan was the President at the time, in his second term, and the United States military hadn’t seen any real combat since the end of the Vietnam War in 1973. It was a decent time to be in the military because nobody was shooting at us (Bosnia was another seven years away) and President Reagan convinced Congress to give us a lot of training money.

One of the key ways you got promoted back in those days was doing well at the National Training Center (NTC) at Fort Irwin, California. The Army would ship entire divisions, in my case the 5th Mechanized Infantry Division, to the California desert to do battle with a world class operational force (OpFor) who emulated the Soviet Army, the ultimate Red Team. It didn’t matter what rank you were, private to general, if your unit did well in that two-week training exercise, there was a positive mark on your promotion potential.

I was assigned to the Division’s field artillery battalion specifically to help them succeed at that exercise. Back then, communications was largely a matter of line-of-sight radios; giant things called AN/GRC 47s that were about the size of the old IBM personal computer but much heavier. You would slap them into racks inside of personnel carriers or on the back of jeeps attached to these long (about eight feet) whip antennas. But, they were expensive and you didn’t have spares sitting around to replace them if they failed. And let me tell you, during a two-week field exercise in the middle of summer deep in the Mojave Desert, those things with their solid state electronics failed a lot. And compared to an infantry battalion that might have one or two channels operational during a battle, the field artillery had at least 10. That meant that out of a 500-man battalion, just about 80 radios had to be operational at all times. If any one of them failed at the wrong time, it might mean that the battalion would tank the exercise. It was a no-win scenario for my unit and perhaps for my career. With no spares to speak of, what was a poor communications officer to do?

Well, I cheated.

By the time I joined the 3/19th, I had been stationed at Fort Polk for about three years. I knew people (it was a small place). I went around to all of my buddies who weren’t participating in the exercise and borrowed their radios. I got one from Joe, two from Larry, a handful from Sue, and shipped them all out to the NTC with the rest of my equipment. During the exercise, when a radio would fail, I would just replace it with a spare and send the broken radio back to the rear for repair. With the quick turn-around from the repair shop (about a day) and the spares I had on hand, everybody always had a working radio. At the end of the exercise, the NTC training evaluators singled me out specifically in the after action review as managing some of the best communications systems they had seen at NTC. (They didn’t know about the spares).

Looking back through rose-colored glasses at that time, you know I was thinking about how Captain Kirk defeated the Kobayashi Maru test in my favorite Star Trek movie, “The Wrath of Khan.” He defeated the character test of a no-win scenario by cheating. I'm just saying.

And by the way, I kept track of all of those radios (the serial numbers, operational status, locations) with my very first personal computer, an Apple IIc, complete with color card and the VisiCalc spreadsheet software. Man, I am old.

And you may be wondering why I told that very long story about my dinosaur days in the U.S. Army. What could broken radios have to do with senior executives considering cybersecurity first principles in 2021? I'm glad that you asked.

This is all about resilience: “… the ability to continuously deliver the intended outcome despite adverse cyber events.” That definition comes from a paper published in 2015 by some Stockholm University researchers. It’s exactly what I was trying to do at the NTC and exactly the opposite of what the Colonial Pipeline company did when it was attacked with ransomware by the cyber criminal group, DarkSide (See last week’s essay and podcast: “Resilience Case Study: Colonial Pipeline attacks of 2021 (Cyber Sandtable)”).

In the traditional cybersecurity sense, especially defending against ransomware, most infosec practitioners think that resilience is some combination of a good backup plan, a robust encryption program for material data, and for bonus points, a mature failover system. In other words, if the cyber criminals encrypt your data, recover from your well tested backup files. If they also steal your data in a double extortion scheme to sell to third parties, make sure that your material data is encrypted. Finally, if they destroy key production systems, have a hot standby system ready to go. Or, you could just pay the ransom and pray that the bad guys hold up their end of the bargain.

All of those tactics have been established as best practice for at least a decade and maybe even longer. But some of the big Silicon Valley companies (like Netflix, Google, LinkedIn, Microsoft, and others) have taken the idea of resilience to the next level. Their leadership has embraced the notion of something called Chaos Engineering. It’s the concept that instead of waiting for your systems to fail (which they will for any set of relatively complex systems) and hoping for the best, they instead cause them to fail on purpose to observe if the resilience system in place actually behaves the way that they think it should. Chaos Engineering is resilience planning at the next level and something we should all be studying.

What is Chaos Engineering?

In order to understand what Chaos Engineering is, we have to first accept the fact that we no longer live in a linear, digital world. When the internet emerged as a useful business tool (say in the mid-1990s), things were pretty simple. We didn’t think so at the time, but compared to today, that world was Kindergarten. If you changed one thing in that world, you pretty much knew what was going to happen. Today’s IT environments are systems of systems. We are in PhD land here. They are complicated and most of us have no idea how they actually work, what the real dependencies are between software modules. It’s like that old chestnut that when a butterfly flaps its wings in China, you might end up with a hurricane in the Gulf of Mexico. When the hard drive of a system running a non-essential monitoring app in an AWS region in North America fails but somehow causes a system wide failure, this is what I'm talking about.

These systems are complicated and humans can’t possibly understand all the permutations in their heads. Software engineers think they know, and DevOps and SRE teams write linear regression tests for things they assume to be true. But those teams don’t learn anything new by doing so. They test properties of the system that are already known, like previously corrected defects or boundary conditions or the main features of a product. According to Casey Rosenthal and Nora Jones in their book on Chaos Engineering, these kinds of linear regression tests, “... require that the engineer writing the test knows specific properties about the system that they are looking for.” Chaos Engineering in contrast is the pursuit of the unknown. According to Rosenthal and Jones, it’s “the facilitation of experiments to uncover systemic weaknesses” that we had no idea existed before. Chaos experiments don’t replace linear regression tests. They solve a different problem by uncovering unknown and, as yet, undiscovered design faults.

Chaos engineering is built on the scientific method. DevOps teams develop a hypothesis around steady-state behavior and run experiments in production to see if the hypothesis holds. If they discover a difference in steady state between the control group and the experimental group on production systems, then they have learned something new. If not, they have gained more confidence in their hypothesis. They use techniques to minimize the blast radius on the production system and monitor the entire experiment carefully to ensure no catastrophic effect, but they have to be on the production system to do it.
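To make the hypothesis-and-experiment loop concrete, here is a minimal sketch in Python. It is a toy simulation, not anything Netflix or the book authors published: the function names, the 99% baseline success rate, and the 2% tolerance are all assumptions chosen for illustration. The steady-state hypothesis is "the request success rate in the experimental group stays within tolerance of the control group, even when a dependency is knocked out."

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control_rate: float
    experimental_rate: float
    hypothesis_holds: bool


def handle_request(rng, fault_injected, has_fallback):
    """Simulate one request; True means the user got a good response."""
    if fault_injected and not has_fallback:
        return False  # the dependency is down and nothing covers for it
    return rng.random() < 0.99  # assumed baseline: about 1% of requests fail anyway


def success_rate(rng, requests, fault_injected, has_fallback):
    ok = sum(handle_request(rng, fault_injected, has_fallback)
             for _ in range(requests))
    return ok / requests


def run_experiment(has_fallback, requests=10_000, tolerance=0.02, seed=7):
    """Compare steady state between a control group and an experimental group."""
    rng = random.Random(seed)
    control = success_rate(rng, requests, False, has_fallback)
    experimental = success_rate(rng, requests, True, has_fallback)
    holds = abs(control - experimental) <= tolerance
    return ExperimentResult(control, experimental, holds)


if __name__ == "__main__":
    print(run_experiment(has_fallback=True))   # hypothesis holds: more confidence
    print(run_experiment(has_fallback=False))  # hypothesis fails: something learned
```

If the two groups diverge, the team has found a weakness it did not know about. If they don't, the team has one more piece of evidence that the resilience design works as intended.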

Typical Chaos Engineering experiments might involve throttling bandwidth down to zero or spiking CPU percentage so that the work instance can’t perform the steady state behavior. In the resilience essay I wrote last year, I pointed to Netflix as the poster child for this new tactic. I said that Netflix routinely runs an app, called Chaos Monkey, that randomly destroys pieces of their customer facing infrastructure, on purpose, so that their network architects understand resilience engineering down deep in their core. When I first learned about this technique, I was stunned by the audacity, and seeming recklessness, of the approach. In my past career, I would never destroy parts of my production system on purpose for an experiment. I may do it by mistake, but never on purpose. In hindsight, as I have learned more about the subject, that’s not exactly how Chaos Engineering works. It’s audacious for sure, but the Netflix Chaos engineering system is mature and their DevOps teams have been developing the practice since 2008. They have learned how to do this and their experts wouldn’t recommend that newbies to the idea start by destroying parts of their production system. You have to ease into it.
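One way to picture "easing into it" is a fault injector that only touches a small share of calls (the blast radius) and shuts itself off if users start getting hurt. The sketch below is a hypothetical illustration in Python. The `FaultInjector` class, its parameters, and the thresholds are my assumptions for the example, not the design of Chaos Monkey or any other real tool.

```python
import random


class FaultInjector:
    """Hypothetical helper: fail a small share of calls, stop if users get hurt."""

    def __init__(self, blast_radius, abort_error_rate, min_calls=100, seed=1):
        self.blast_radius = blast_radius          # share of calls exposed to the fault
        self.abort_error_rate = abort_error_rate  # halt injection above this error rate
        self.min_calls = min_calls                # wait for some data before judging
        self.rng = random.Random(seed)
        self.calls = 0
        self.errors = 0
        self.aborted = False

    def call(self, primary, fallback):
        """Run primary(), sometimes replacing it with an injected timeout."""
        self.calls += 1
        inject = not self.aborted and self.rng.random() < self.blast_radius
        try:
            if inject:
                raise TimeoutError("injected: bandwidth throttled to zero")
            return primary()
        except TimeoutError:
            result = fallback()
            if result is None:  # the fallback did not save the request
                self.errors += 1
            self._check_abort()
            return result

    def _check_abort(self):
        enough_data = self.calls >= self.min_calls
        if enough_data and self.errors / self.calls > self.abort_error_rate:
            self.aborted = True  # kill switch: no more faults get injected


def run(injector, fallback, total_calls=1000):
    """Drive simulated traffic through the injector and return it for inspection."""
    for _ in range(total_calls):
        injector.call(lambda: "personalized list", fallback)
    return injector
```

With a working fallback (say, `lambda: "popular titles"`), the injected faults never reach the user and the experiment runs to completion. With a fallback that returns nothing, the error rate climbs past the threshold and the injector aborts itself, which is the behavior you want before pointing anything like this at production.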

The bottom line, according to Aaron Rinehart and Kelly Shortridge in their own book, is that Chaos Engineering is “The identification of security control failures through proactive experimentation to build confidence in the system’s ability to defend against malicious conditions in production.” It’s not randomly breaking stuff in production to see what happens.

History of Chaos Engineering.

As with many disruptive security ideas (Google’s adoption of Zero Trust after it got hit by several different Chinese APT groups in 2010, and the emergence of the DevOps movement after giant and expensive development projects run under the old Waterfall model failed in the 1980s), Chaos Engineering began with failure, back in 2008, with a couple of delivery failures at Netflix. The company was transitioning from a mail-a-movie-DVD-to-its-customers company to a deliver-the-movie-via-streaming company. The Netflix leadership team very publicly announced its commitment to adopt AWS cloud services and abandon its own data centers. This was a big idea since Amazon had rolled out the service just two years before and nobody would have called it mature yet.

The Netflix move-to-the-cloud precipitating event was a database failure in 2008 that prevented the company from delivering DVDs to their customers for three days. That obviously didn’t meet my Stockholm University resilience criteria. Further, that Christmas in 2008, AWS suffered a major outage that prevented Netflix customers from using the new streaming service. In response, Netflix engineers developed their first Chaos Engineering product in 2010, called Chaos Monkey, that helped them counter the vanishing instance problem caused by the AWS outage. With that success in their pocket, Netflix began building their own Chaos Engineering team and wondered if they could scale it. If they could fix the small scale vanishing instances problem, could they do the same at the vanishing region scale?

In fairness, Netflix wasn’t the only company thinking along these lines. Back in 2006, Google site reliability engineers (SREs) established their own Disaster Recovery Testing program (DiRT) to intentionally insert failures into their internal systems to discover unknown risks. But their cool name for it (DiRT) wasn’t as cool as the Netflix name (Chaos Monkey) and it didn’t catch on. The idea was similar though.

By 2011, Netflix began adding new failure modules that provided a more complete suite of resilience features. Those modules eventually became known as the Netflix Simian Army and include colorful names like Latency Monkey, Conformity Monkey, and Doctor Monkey, just to name three. There are many more.

In 2012, Netflix shared the source code for Chaos Monkey on GitHub and by 2013, other organizations started playing with the idea (like Capital One). By 2014, Netflix created a new employee role (Chaos Engineer) and began working on ideas of reducing the blast radius of planned injected failures. By 2016, Netflix had an entire team of Chaos Engineers working on the Simian Army. By this time, there was a small but growing contingent of Silicon Valley companies experimenting with the idea.

What does Chaos Engineering have to do with security?

I hear what you’re thinking. “Rick, this all sounds interesting and all of that, but what does it have to do with security? This sounds like something the CIO needs to worry about.” I don’t disagree that, traditionally, linear regression tests, SRE and DevOps teams, and IT resilience have generally been the purview of the CIO. But, if you buy into the entire cybersecurity first principle idea, resilience is a key and essential strategy to prevent material impact to an organization due to a cyber event. It’s as important as zero trust, intrusion kill chain prevention, and risk forecasting.

There are definite divisions of labor for resilience though. The CIO is handling the DevOps piece of that and the CSO needs to be part of the team. But I’m making the case that Chaos Engineering is something that should be owned by the CSO. Who better to discover potential unknown systemic failures that might impact production or the ability to recover from an event quickly? The CIO handles the known stuff. The CSO’s job description should be to discover unknown faults in the system that will cause material damage.

According to Rinehart and Shortridge, traditional security programs orbit around failure avoidance. Infosec teams design and implement people, process, and technology policy designed to prevent the organization from getting anywhere near a disaster. In contrast, they say that failure is where an infosec team learns the most. I agree. If you can build these small experiments that uncover potential systemic failure, that might be the most valuable thing an infosec team does.

Rinehart and Shortridge say that this mindset changes the infosec team’s focus away from building a purely defensive posture and towards something that is adaptive. Instead of seeking defensive perfection, pursue the ability to handle failure gracefully. They recommend that the infosec community move away from security theater (a concept made famous by one of cybersecurity’s thought leaders, Bruce Schneier). This is the idea that infosec teams perform work that creates the perception of improved security. One example of this could be the purchase of an anti-phishing product that delivers approved phishing email messages to employees to train them not to click bad URLs. Another is building an insider threat program designed to prevent employees from taking their old PowerPoint slides with them to their next job. In the big scheme of things, are those kinds of security theater programs as impactful as discovering a previously unknown fault in the organization’s system design that could cause catastrophic failure? The notion is worth considering.

Specifically with respect to traditional security, however, Rinehart and Shortridge suggest that you could apply the Chaos Engineering idea to things like red teaming. Instead of turning loose the Red Team to find some hole in the defensive posture, we could instead develop a hypothesis around how the organization should react to a specific attack sequence, say one of Panda Bear’s. If we treat red-teaming exercises as a science experiment with a hypothesis that defines how we think the organization will react to a Panda Bear-like attack, we might learn something new. If that’s true, we could expand this kind of thinking to all sorts of traditional security tasks like container security, CI/CD pipeline security, security monitoring, incident response, and so forth. You might say you’re already doing those things. But what I’m suggesting is a subtle shift away from rudimentary tests of the system with things we already know about and towards the more advanced scientific method designed to uncover the things we don’t already know.
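Treating a red-team exercise as an experiment can be as simple as writing the hypothesis down before the exercise and scoring it afterward. Here is a small Python sketch of that bookkeeping. The attack steps and control names are invented for illustration and are not taken from Rinehart and Shortridge or from any real adversary playbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    attack_step: str        # what the red team will do
    expected_control: str   # the control we believe will catch it


def evaluate(hypotheses, observed):
    """Return the hypotheses that the exercise disproved.

    `observed` maps each attack step to the set of controls that actually fired.
    """
    return [h for h in hypotheses
            if h.expected_control not in observed.get(h.attack_step, set())]


# Hypothetical attack sequence and controls, for illustration only.
hypotheses = [
    Hypothesis("phishing email delivered", "mail gateway quarantine"),
    Hypothesis("credential reuse on VPN", "MFA challenge"),
    Hypothesis("lateral movement via SMB", "EDR alert"),
]
observed = {
    "phishing email delivered": {"mail gateway quarantine"},
    "credential reuse on VPN": {"MFA challenge"},
    "lateral movement via SMB": set(),  # nothing fired: we learned something new
}

disproved = evaluate(hypotheses, observed)
for h in disproved:
    print(f"Gap: expected '{h.expected_control}' to catch '{h.attack_step}'")
```

The point of the structure is the shift in mindset: the red team isn't hunting for any hole it can find, it is testing a stated belief about how the organization will respond, and every disproved hypothesis is a previously unknown weakness.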

That said, Chaos Engineering is not for everybody. It’s just another tactic that we might use to reduce the probability of material impact due to a cyber event. It’s another arrow in our quiver to build our resilience program alongside the other arrows like crisis planning, incident response, backups, and encryption. The concept is probably a bridge too far for most small to medium sized organizations who struggle to find resources just to keep the lights on. But, for big Silicon Valley companies that deliver services from around the world (the Netflixes, the Googles, the LinkedIns, etc.) and for most Fortune 500 companies, Chaos Engineering is something to consider. Indeed, many of these companies are already far down the path.

Chaos Engineering timeline.

2006

Google’s DiRT (Disaster Recovery Testing) program was founded by site reliability engineers (SREs) to intentionally instigate failures in critical technology systems and business processes in order to expose unaccounted for risks.

2008

Netflix made a very public display of moving from the datacenter to the cloud.
In August, they reacted to a major database corruption event in the datacenter which left Netflix unable to ship DVDs for three days.
On Christmas Eve, AWS suffered a rolling outage of elastic load balancers (ELBs) across regions. Since Netflix’s control plane ran on AWS, customers were not able to choose videos and start streaming them.

2010

The Netflix Engineering Tools team created Chaos Monkey in response to Netflix’s move from physical infrastructure to cloud infrastructure provided by Amazon Web Services, and the need to be sure that a loss of an Amazon instance wouldn’t affect the Netflix streaming experience.

2011

The Simian Army added additional failure injection modes on top of Chaos Monkey that would allow testing of a more complete suite of failure states, and thus build resilience to those as well.

2012

Netflix shared the source code for Chaos Monkey on GitHub.

2013

Capital One starts Chaos Engineering with something called “Blind Resiliency Testing.”

2014

Netflix decided they would create a new role: the Chaos Engineer.
In October, Netflix announced Failure Injection Testing (FIT), a new tool that built on the concepts of the Simian Army, but gave developers more granular control over the “blast radius” of their failure injection.

2015

Netflix established a Chaos Engineering Team.
Casey Rosenthal created a community of practice and organized “Chaos Community Day” in the autumn, held in Uber’s office in San Francisco. Attendees came from Netflix, Google, Amazon, Microsoft, Facebook, DropBox, WalmartLabs, Yahoo!, LinkedIn, Uber, UCSC, Visa, AT&T, NewRelic, HashiCorp, PagerDuty, and Basho.

2017

LinkedIn began its Chaos Engineering program with Project Waterbear.

2018

Gremlin launches Chaos Conf, the first large-scale conference dedicated to Chaos Engineering. In just two years, the number of attendees would grow by nearly 10x and include experts from software, retail, finance, delivery, and many other industries.
Slack experiments with its Disasterpiece Theater, running more than twenty exercises.

2020

AWS adds Chaos Engineering to the reliability pillar of the AWS Well-Architected Framework (WAF).
AWS announces Fault Injection Simulator (FIS), a fully managed service for natively running chaos experiments on AWS services.

2021

Gremlin publishes the first ever State of Chaos Engineering report. The report shows how the practice of Chaos Engineering has grown among organizations, key benefits of Chaos Engineering, how often top performing teams run chaos experiments, and more.

References.

01 JUN 2020:

CSOP S1E9: Cybersecurity first principles - resilience

Hash Table Guests: None
Link: Podcast (9)
Link: Transcript
Link: Essay

08 JUN 2020:

CSOP S1E10: Cybersecurity first principles - DevSecOps

Hash Table Guests: None
Link: Podcast (10)
Link: Transcript
Link: Essay

02 AUG 2021:

CSOP S6E3: Pt 1 - Cybersecurity first principles - backups.

Hash Table Guests: None
Link: Podcast (66)
Link: Transcript
Link: Essay and Podcast

09 AUG 2021:

CSOP S6E4: Pt 2 - Cybersecurity first principles - backups.

Hash Table Guests: TBD
Jerry Archer, Sallie Mae CSO (5)
Jaclyn Miller, NTT CISO (1)
Link: Podcast (67)
Link: Transcript
No Essay

06 JUN 2022:

CSOP S9E6: Resilience Case Study: Colonial Pipeline attacks of 2021 (Cyber Sandtable).

Hash Table Guests: N/A
Link: Podcast (93)
Link: Transcript
Link: Essay and Podcast

“Bruce Schneier’s New View on Security Theater,” by Peter Glaskowsky, CNET, 10 April 2008.

“Chaos Engineering,” by arvindpdmn, Govindpathak, and devbot5S, Devopedia Foundation, 28 December 2021.

“How Chaos Engineering Can Help DevSecOps Teams Find Vulnerabilities,” by Peter Wayner, CSO Online, 19 January 2022.

“Chaos Engineering: Open-sourcing Netflix’s chaos generator, Chaos Monkey,” by Cloud_Freak, Medium, 8 September 2019, Last Visited 30 April 2020.

“Chaos Engineering: Site reliability through controlled disruption,” by Mikolaj Pawlikowski, Published by Manning, 16 March 2021.

“Chaos Engineering: System Resiliency in Practice,” by Casey Rosenthal, Nora Jones, and Nathan Aschbacher, Published by O'Reilly Media, 28 April 2020.

“Chaos Engineering: The History, Principles, and Practice,” by Tammy Butow, Gremlin.com, 5 May 2021.

“Chaos Monkey & Chaos Gorilla,” by Shimon Brathwaite, Secjuice, 13 December 2020.

“Chaos Monkey at Netflix: The Origin of Chaos Engineering,” by Gremlin.com, 17 October 2018.

“Cyber Resilience – Fundamentals for a Definition,” by Fredrik Björck, Martin Henkel, Stockholm University, Janis Stirna, Jelena Zdravkovic, Stockholm University, Article in Advances in Intelligent Systems and Computing, January 2015.

“DevOps Case Study: Netflix and the Chaos Monkey,” by C. Aaron Cois, SEI Blog, 30 April 2015.

“Netflix Chaos Monkey Upgraded,” Netflix TechBlog, Medium, 19 October 2016.

“Security Chaos Engineering,” by Aaron Rinehart and Kelly Shortridge, Published by O'Reilly Media, December 2020.

“Security Chaos Engineering Helps You Find Weak Links in Your Cyber Defenses before Attackers Do,” by Veronica Combs, TechRepublic, 1 February 2021.

“Security Chaos Engineering: How to Security Differently,” by Aaron Rinehart, Verica, 3 March 2021.

“State of Chaos Engineering 2021,” by Kolton Andrus, Gremlin, 2021.

“The Netflix Simian Army,” by Yury Izrailevsky and Ariel Tseitlin, Medium, Netflix TechBlog, 19 July 2011.