CSO Perspectives (Pro) 6.27.22
Ep 81 | 6.27.22

Resilience Case Study: Chaos Engineering.

Transcript

Rick Howard: When I was a young captain back in 1985, Army personnel, in its infinite wisdom, assigned me as the communications officer to the 3rd of the 19th Field Artillery Battalion located at Fort Polk, La. Ronald Reagan was the president at the time, in his second term, and the United States military hadn't seen any real combat since the end of the Vietnam War in 1973. It was a decent time to be in the military because nobody was shooting at us. Bosnia was another seven years away, and President Reagan convinced Congress to give us a lot of training money.

Rick Howard: One of the key ways you got promoted back in those days was to do well at the National Training Center, or NTC, at Fort Irwin, Calif. The Army would ship entire divisions - in my case, the 5th Mechanized Infantry Division - to the California desert to do battle with a world-class operational force, or OPFOR, who took pride in emulating the Soviet army, the ultimate red team. It didn't matter what rank you were, private to general. If your unit did well at that two-week training exercise, leadership put a positive mark on your promotion potential. 

Rick Howard: I was assigned to the division's Field Artillery Battalion, specifically to help them succeed in that exercise. Back then, communications was largely a matter of line-of-sight radios, giant things called AN/GRC-47s. They were about the size of the old IBM personal computer but much heavier. You would slap them into racks inside of personnel carriers or on the back of Jeeps attached to these long whip antennas, about 8 feet. 

Rick Howard: But they were expensive, and you didn't have spares standing around to replace them if they failed. And let me tell you, during a two-week field exercise in the middle of summer deep in the Mojave Desert, those things, with their solid-state electronics, failed a lot. And compared to an infantry battalion that might have one or two channels operational during a battle, the Field Artillery had at least 10. That meant that, out of a 500-man battalion, just about 80 radios had to be operational at all times. If any one of them failed at the wrong time, it might mean that the battalion would tank the exercise. It was a no-win scenario for my unit and perhaps for my career. With no spares to speak of, what was a poor communications officer like me to do? 

Rick Howard: Well, I cheated. By the time I joined the 3rd of the 19th, I had been stationed at Fort Polk for about three years. I knew people. It was a small place. I went around to all my buddies who weren't participating in the exercise and borrowed their radios. I got one from Joe and two from Larry and a handful from Sue and shipped them all out to the NTC with the rest of my equipment. During the exercise, when a radio would fail, I would just replace it with a spare and send the broken radio back to the rear for repair. With that quick turnaround from the repair shop - about a day - and the spares I had on hand, everybody always had a working radio. 

Rick Howard: At the end of the exercise, the NTC training evaluators singled me out in the after-action review for managing some of the best communications systems they had ever seen. Looking back - through, admittedly, some rose-colored glasses - I was thinking at the time about how Captain Kirk defeated the Kobayashi Maru test in my favorite "Star Trek" movie, "The Wrath of Khan." 

(SOUNDBITE OF FILM, "STAR TREK II: THE WRATH OF KHAN") 

Kirstie Alley: (As Saavik) Sir, may I ask you a question? 

William Shatner: (As Captain Kirk) What's on your mind, Lieutenant? 

Kirstie Alley: (As Saavik) The Kobayashi Maru, sir. 

William Shatner: (As Captain Kirk) Are you asking me if we're playing out that scenario now? 

Kirstie Alley: (As Saavik) On the test, sir, will you tell me what you did? I would really like to know. 

William Shatner: (As Captain Kirk) Lieutenant, you are looking at the only Starfleet cadet who ever beat the no-win scenario. 

Kirstie Alley: (As Saavik) How? 

William Shatner: (As Captain Kirk) I reprogrammed the simulation so it was possible to rescue the ship. 

Kirstie Alley: (As Saavik) What? 

Merritt Butrick: (As David Marcus) He cheated. 

William Shatner: (As Captain Kirk) I changed the conditions of the test - got a commendation for original thinking. I don't like to lose. 

Kirstie Alley: (As Saavik) Then you never faced that situation, faced death. 

William Shatner: (As Captain Kirk) I don't believe in the no-win scenario. 

Rick Howard: Captain Kirk, played by the indomitable William Shatner, defeated the Kobayashi Maru test - a test of character that placed cadets in no-win scenarios - by cheating. I'm just saying. Oh, and just as an aside, when Paramount rebooted the franchise with Chris Pine playing Kirk many years later, the director, J.J. Abrams, dramatized how Kirk did it. In a deleted scene that didn't make the theatrical release, Abrams shows Kirk inserting a piece of malware at just the right time to change the Kobayashi Maru simulation so that he can win. 

(SOUNDBITE OF TV SHOW, "HOME IMPROVEMENT") 

Tim Allen: (As Tim Taylor) Oh yeah (laughter). 

Rick Howard: And by the way, I kept track of all of those radios - the serial numbers, operational status, locations, those kinds of things - with my very first personal computer, an Apple IIc, complete with color card and the VisiCalc spreadsheet. Man, I am old. 

(SOUNDBITE OF TV SHOW, "HOME IMPROVEMENT") 

Tim Allen: (As Tim Taylor) Oh, no. 

(LAUGHTER) 

Rick Howard: And you may be wondering why I told you that very long story about my dinosaur days in the U.S. Army. What could broken radios have to do with senior executives considering cybersecurity first principles in 2021? I'm glad that you asked. This is all about resilience; quote, "the ability to continuously deliver the intended outcome despite adverse cyber events," end quote. That definition comes from a paper published in 2020 by some Stockholm University researchers. It's exactly what I was trying to do at the NTC and exactly the opposite of what the Colonial Pipeline Company did when it was attacked with ransomware by the cybercriminal group DarkSide. I talked all about that in our last episode. 

Rick Howard: In the traditional cybersecurity sense, though, especially defending against ransomware, most infosec practitioners think that resilience is some combination of a good backup plan, a robust encryption program for material data and, for bonus points, a mature failover system. In other words, if the cybercriminals encrypt your data, restore it from your well-tested backups. If they also steal your data in a double-extortion scheme to sell to third parties, make sure that your material data is already encrypted. Finally, if they destroy key production systems, have a hot standby system ready to go. Or you could just pay the ransom and pray that the bad guys hold up their end of the bargain. 

Rick Howard: All of those tactics have been established as best practice for at least a decade, and maybe even longer. But some of the big Silicon Valley companies, like Netflix, Google, LinkedIn, Microsoft and others, have taken the idea of resilience to the next level. Their leadership has embraced the notion of something called chaos engineering. It's the concept that, instead of waiting for your systems to fail - which, by the way, they will in any relatively complex environment - and hoping for the best, these teams cause systems to fail on purpose to observe whether the resilience measures in place actually behave the way they think they should. Chaos engineering is resilience planning at the next level and something we should all be studying. 

Rick Howard: My name is Rick Howard, broadcasting from the CyberWire's secret sanctum sanctorum studios, located underwater, somewhere along the Patapsco River near Baltimore Harbor. And you're listening to "CSO Perspectives," my podcast about the ideas, strategies and technologies that senior security executives wrestle with on a daily basis. 

Rick Howard: In order to understand what chaos engineering is, we have to first accept the fact that we no longer live in a linear digital world. When the internet emerged as a useful business tool - say, the mid-1990s - things were pretty simple. We didn't think so at the time, but compared to today, that world was kindergarten. If you changed one thing in that environment, you pretty much knew what was going to happen. 

Rick Howard: But today's IT environments are systems of systems. We are in Ph.D. land here. They are complicated, and most of us have no idea how they actually work - like, what are the real dependencies between software modules? It's like that old chestnut that when a butterfly flaps its wings in China, you might end up with a hurricane in the Gulf of Mexico. When the hard drive of a system running a nonessential monitoring app in one AWS region in North America fails and somehow causes a system-wide outage, that's what I'm talking about. 

(SOUNDBITE OF TV SHOW, "HOME IMPROVEMENT") 

Tim Allen: (As Tim Taylor) Oh, no. 

(LAUGHTER) 

Rick Howard: And humans can't possibly understand all the permutations in their head. Software engineers think they know, and DevOps and SRE teams write linear regression tests for things they assume to be true, but those teams don't learn anything new by doing so. They test properties of the system that are already known, like previously corrected defects or boundary conditions or the main features of a product. According to Casey Rosenthal and Nora Jones in their book on chaos engineering, these kinds of linear regression tests, quote, "require that the engineer writing the test knows specific properties about the system that they are looking for," end quote. 

Rick Howard: Chaos engineering, in contrast, is the pursuit of the unknown. According to Rosenthal and Jones, it's, quote, "the facilitation of experiments to uncover systemic weaknesses," end quote. Chaos experiments don't replace linear regression tests; they solve a different problem by uncovering unknown and as-yet-undiscovered design faults. Chaos engineering is built on the scientific method. DevOps teams develop a hypothesis around steady-state behavior and run experiments in production to see if that hypothesis holds. If they discover a difference in steady state between the control group and the experimental group on production systems, then they have learned something new. If not, they have gained more confidence in their hypothesis. They use techniques to minimize the blast radius on the production system and monitor the entire experiment carefully to ensure no catastrophic effect, but they have to be on the production systems to do it. Typically, chaos engineering experiments might involve throttling bandwidth down to zero or spiking CPU utilization so that the instance under test can't perform its steady-state behavior. 
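
To make that experiment structure concrete, here is a minimal sketch of a chaos experiment loop. The health-check probe and the fault-injection functions are hypothetical placeholders invented for illustration; this shows the method described above - steady state, hypothesis, small blast radius, rollback - not any team's actual tooling.

```python
import random

# Hypothetical steady-state metric: the fraction of synthetic requests that
# succeed. In a real experiment this probe would hit your production service;
# the random stand-in here just makes the sketch runnable.
def probe_success_rate(samples: int = 50) -> float:
    successes = sum(1 for _ in range(samples) if random.random() < 0.99)
    return successes / samples

def inject_latency() -> None:
    # Placeholder for the real fault injection, e.g., adding network latency
    # to a small subset of hosts - the "blast radius" of the experiment.
    pass

def remove_latency() -> None:
    # Placeholder for rolling back the injected fault.
    pass

ABORT_THRESHOLD = 0.95  # if steady state degrades past this, stop and roll back

# 1. Measure steady state in a control group before touching anything.
control = probe_success_rate()
print(f"control steady state: {control:.3f}")

# 2. Hypothesis: injecting latency into a small subset of hosts does NOT
#    change the steady-state success rate that customers experience.
inject_latency()
try:
    experimental = probe_success_rate()
    print(f"experimental steady state: {experimental:.3f}")
    if experimental < ABORT_THRESHOLD:
        print("hypothesis falsified: steady state degraded - we learned something new")
    else:
        print("hypothesis held: more confidence in the system's resilience")
finally:
    # 3. Always roll back, whatever the outcome, to limit the blast radius.
    remove_latency()
```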

Rick Howard: In the Resilience podcast I did last year, I pointed to Netflix as the poster child for this new tactic. I said that Netflix routinely runs an app called Chaos Monkey that randomly destroys pieces of their customer-facing infrastructure on purpose so that their network architects understand resilience engineering down deep in their core. When I first learned about this technique, I was stunned by the audacity and, seemingly, recklessness of the approach. In my past career, I would never destroy parts of my production system on purpose for an experiment. I may do it by mistake but never on purpose. 

Rick Howard: In hindsight, as I learned more about the subject, that's not exactly how chaos engineering works. It's audacious, for sure, but the Netflix chaos engineering system is mature, and their DevOps teams have been developing the practice since 2008. They have learned how to do this, and their experts wouldn't recommend that newbies to the idea start by destroying parts of their production systems. You kind of have to ease into it. The bottom line, according to Aaron Rinehart and Kelly Shortridge in their own book, is that chaos engineering is, quote, "the identification of security control failures through proactive experimentation to build confidence in the system's ability to defend against malicious conditions in production," end quote. It's not randomly breaking stuff in production to see what happens. 
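
For a sense of how constrained that "random" termination really is, here is a highly simplified sketch in the spirit of the Chaos-Monkey idea: only instances explicitly opted in are eligible, only one is terminated per run, and dry-run is the default. This is not Netflix's actual tool; the boto3 calls are standard AWS SDK calls, but the opt-in tag name and the dry-run default are assumptions made for this illustration.

```python
import random
import boto3

def pick_victim(ec2_client, dry_run: bool = True) -> None:
    # Only consider running instances that have explicitly opted in via a tag.
    # The tag name "chaos-opt-in" is an assumption for this sketch.
    resp = ec2_client.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instances = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not instances:
        print("no opted-in instances found; nothing to do")
        return

    # Terminate at most one instance per run to keep the blast radius small.
    victim = random.choice(instances)
    print(f"selected {victim} for termination (dry_run={dry_run})")
    if not dry_run:
        ec2_client.terminate_instances(InstanceIds=[victim])

if __name__ == "__main__":
    pick_victim(boto3.client("ec2"), dry_run=True)
```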

Rick Howard: As with many disruptive security ideas - like Google's adoption of zero trust after it got hit by several different Chinese APT groups in 2010, or the emergence of the DevOps movement after giant and expensive development projects run with the old waterfall model kept failing in the 1980s - chaos engineering began back in 2008 with a couple of delivery failures at Netflix. The company was transitioning from a mail-a-movie-DVD-to-its-customers company to a deliver-the-movie-via-streaming company. The Netflix leadership team very publicly announced its commitment to adopt AWS Cloud Services and abandon its own data centers. That was a big idea, since Amazon had rolled out the service only two years before, and it wasn't what anybody would call mature yet. 

Rick Howard: The precipitating event for the Netflix move to the cloud, though, was a database failure in 2008. It prevented the company from delivering DVDs to its customers for three days. That, obviously, didn't meet my Stockholm University resilience criteria. Further, that Christmas, in 2008, AWS suffered a major outage that prevented Netflix customers from using the new streaming service. In response, Netflix engineers developed their first chaos engineering product in 2010, called Chaos Monkey, which helped them counter the vanishing-instance problem caused by the AWS outage. With that success in their pocket, Netflix began building their own chaos engineering team and wondered if they could scale the idea. If they could fix the small-scale vanishing-instance problem, could they do the same at the vanishing-region scale? 

Rick Howard: Now, in fairness, Netflix wasn't the only company thinking along these lines. Back in 2006, Google's site reliability engineers, or SREs, established their own disaster recovery testing program, or DiRT, as they called it, to intentionally insert failures into their internal systems to discover unknown risk. But their cool name for it, DiRT, wasn't as cool as the Netflix name, Chaos Monkey, and it didn't catch on. The idea was similar, though. 

Rick Howard: By 2011, Netflix began adding new failure-injection modules that provided a more complete suite of resilience features. These modules eventually became known as the Netflix Simian Army and included colorful names like Latency Monkey, Conformity Monkey and Doctor Monkey, just to name three. There are many more. In 2012, Netflix shared the source code for Chaos Monkey on GitHub, and by 2013, other organizations, like Capital One, started playing with the idea. By 2014, Netflix created a new employee role, Chaos Engineer, and began working on ideas to reduce the blast radius of planned injected failures. By 2016, Netflix had an entire team of chaos engineers working on the Simian Army. By this time, there was a small but growing contingent of Silicon Valley companies experimenting with the idea. 

Rick Howard: I hear what you're saying. You're saying, Rick, that all sounds interesting and all of that, but what does it have to do with security? This sounds like something the CIO needs to worry about. I don't disagree that, traditionally, linear regression tests, SRE and DevOps teams and IT resilience have generally been the purview of the CIO. But if you buy into the entire cybersecurity first principle idea, resilience is a key and essential strategy to prevent material impact to an organization due to a cyber event. It's as important as zero trust, intrusion kill chain prevention and risk forecasting. There are definite divisions of labor for resilience, though. The CIO is handling the DevOps piece of that, and the CSO needs to be part of that team, but I'm making the case here that chaos engineering is something that should be owned by the CSO. Who better to discover potential unknown systemic failures that might impact production or the ability to recover from an event quickly? The CIO handles the known stuff. The CSO's job description should be to discover unknown faults in the system that will cause material damage. 

Rick Howard: According to Rinehart and Shortridge, traditional security programs orbit around failure avoidance. Infosec teams design and implement people, process and technology policies intended to keep the organization from getting anywhere near a disaster. In contrast, they say that failure is where an infosec team learns the most, and I agree. If you can build these small experiments that uncover potential systemic failures, that might be the most valuable thing an infosec team does. Rinehart and Shortridge say that this mindset shifts the infosec team's focus away from building a purely defensive posture and toward something that is adaptive. Instead of seeking defensive perfection, pursue the ability to handle failure gracefully. 

Rick Howard: They recommend the infosec community move away from security theater, a concept made famous by one of cybersecurity's thought leaders, Bruce Schneier. It's the idea that infosec teams perform work that creates the perception of improved security without doing much to actually improve it. One example could be the purchase of an anti-phishing product that delivers approved phishing email messages to employees to train them not to click bad URLs. Another is building an insider threat program designed to prevent employees from taking their old PowerPoint slides with them to their next job. In the big scheme of things, are those kinds of security theater programs as impactful as discovering a previously unknown fault in the organization's system design that could cause catastrophic failure? The notion is worth considering. 

Rick Howard: Specifically, with respect to traditional security, however, Rinehart and Shortridge suggest that you could apply the chaos engineering idea to things like red-teaming. Instead of turning the red team loose to find some hole in the defensive posture, we could instead develop a hypothesis around how the organization should react to a specific attack sequence - say, one run by the panda bears. If we treat red-teaming exercises as science experiments, with a hypothesis that defines how we think the organization will react to a panda-bear-like attack, we might learn something new. If that's true, we could expand this kind of thinking to all sorts of traditional security tasks, like container security, CI/CD pipeline security, security monitoring, incident response and so forth. You might say you're already doing those things, but what I'm suggesting is a subtle shift away from rudimentary tests of the system with things we already know about and toward the more advanced scientific method designed to uncover the things we don't already know. 
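
As one concrete illustration of treating a security control as a hypothesis rather than turning the red team loose, here is a hedged sketch in the spirit of the security chaos experiments Rinehart and Shortridge describe. The hypothesis, the chosen port number, the detection window and the alert-checking function are all assumptions made for this example; the alert check is a placeholder you would point at your real monitoring or SIEM API.

```python
import socket
import time

# Hypothesis: "if an unexpected service starts listening on a high port on
# this host, our monitoring raises an alert within five minutes."
UNEXPECTED_PORT = 31337          # assumption: a port nothing legitimate uses
DETECTION_WINDOW_SECONDS = 300   # five minutes, per the hypothesis

def alert_was_raised(port: int) -> bool:
    # Placeholder: query your SIEM or alerting API for an alert mentioning
    # this host and port. Returning False here means "no alert observed."
    return False

def run_experiment() -> None:
    # Inject the condition: open a listener on the unexpected port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("0.0.0.0", UNEXPECTED_PORT))
    listener.listen(1)
    print(f"injected condition: listening on port {UNEXPECTED_PORT}")
    deadline = time.time() + DETECTION_WINDOW_SECONDS
    try:
        detected = False
        while time.time() < deadline:
            if alert_was_raised(UNEXPECTED_PORT):
                detected = True
                break
            time.sleep(15)  # poll the alerting system every 15 seconds
        if detected:
            print("hypothesis held: the control detected the unexpected listener")
        else:
            print("hypothesis falsified: no alert - a previously unknown gap")
    finally:
        # Roll back the injected condition regardless of the outcome.
        listener.close()
        print("listener closed; experiment cleaned up")

if __name__ == "__main__":
    run_experiment()
```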

Rick Howard: That said, chaos engineering is not for everybody. It's just another potential tactic that we might use to reduce the probability of material impact due to a cyber event. It's another arrow in our quiver to build our resilience program alongside the other arrows, like crisis planning, incident response, backups and encryption. The concept is probably a bridge too far for most small- to medium-sized organizations, though, which struggle to find resources just to keep the lights on. But for big Silicon Valley companies that deliver services from around the world - the Netflixes, the Googles, the LinkedIns, etc. - and for most Fortune 500 companies, chaos engineering is something to consider. Indeed, many of these companies are already far down that path. 

Rick Howard: And that's a wrap - not only for this episode, but for the entire "CSO Perspectives" season. Thanks for coming along with me on this journey. In this season, we've covered a little history of infosec, software bill of materials, single sign-on, two-factor authentication, software-defined perimeter, intelligence sharing, the Colonial Pipeline attacks of 2021 and this last episode on chaos engineering. Whew. That's quite a run. But for the next few weeks, the thousands of interns we have here at the CyberWire and I are taking July off in order to put the finishing touches on Season 10, which starts on 1 August. You don't want to miss that. And by the way, we're going to hit our 100th episode in Season 10. How cool is that? 

(SOUNDBITE OF TV SHOW, "HOME IMPROVEMENT") 

Tim Allen: (As Tim Taylor) Oh, yeah (laughter). 

(APPLAUSE) 

Unidentified Person: Whoo hoo (ph). 

Rick Howard: In the meantime, as always, if you agree or disagree with anything I have said, hit me up on LinkedIn or Twitter, and we can continue the conversation there. Or, if you prefer, drop a line to csop@thecyberwire.com. That's C-S-O-P, the at sign, the CyberWire - all one word - dot com. And if you have any questions you would like us to answer here at "CSO Perspectives," send a note to the same email address, and we will try to answer them in the show. 

Rick Howard: The CyberWire's "CSO Perspectives" is edited by John Petrik and executive produced by Peter Kilpe. Our theme song is by Blue Dot Sessions, remixed by the insanely talented Elliott Peltzman, who also does the show's mixing, sound design and original score. And I am Rick Howard. Thanks for listening.