CSO Perspectives (Pro) 1.30.23
Ep 97 | 1.30.23

Chaos engineering.

Transcript

(SOUNDBITE OF KLAUS BADELT AND HANS ZIMMER'S "HE'S A PIRATE")

Rick Howard: You're listening to "He's a Pirate" from the "Pirates of the Caribbean" soundtrack, written by Klaus Badelt in 2003, because that music kind of reminds me of what we're talking about today, chaos engineering. I love that name. When I hear it, the name conjures some kind of buccaneering spirit and swashbuckling attitude that I associate with old pirate movies like the Dread Pirate Roberts in "The Princess Bride"... 

(SOUNDBITE OF FILM, "THE PRINCESS BRIDE") 

Chris Sarandon: (As Prince Humperdinck) To the death. 

Cary Elwes: (As Westley) No. To the pain. 

Chris Sarandon: (As Prince Humperdinck) I don't think I'm quite familiar with that phrase. 

Cary Elwes: (As Westley) I'll explain, and I'll use small words that you'll be sure to understand, you warthog-faced buffoon. 

Rick Howard: ...Or, for you old-timers out there, Errol Flynn in "Captain Blood." 

(SOUNDBITE OF FILM, "CAPTAIN BLOOD") 

Basil Rathbone: (As Levasseur) Wait. You'll not take her while I live. 

Errol Flynn: (As Peter Blood) Then I'll take her when you're dead. 

Rick Howard: The idea of chaos engineering came on the scene around the same time DevOps started to get popular. I feel like both ideas are cut from the same skull-and-crossbones black cloth used to make pirate flags. DevOps didn't really hit my radar screen until about 2015, when I discovered Gene Kim's book, "The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win." DevOps itself originated with Patrick Debois in 2009, but it took a while for the concept to percolate through the IT channels and eventually arrive in the security domain. Kim's book was my introduction to the concept. When I heard about it, I thought DevOps or DevSecOps would be the new, disruptive idea that would fundamentally change how the security community operates. Before DevSecOps, most security practitioners didn't do much software development. The ones who did wrote their own tools to accomplish some task, but they weren't generally part of the organization's development team. After DevSecOps became a thing, say, over the last five years or so, I thought security people would transition to becoming software developers first and security experts second. That hasn't materialized, at least not yet. 

Steve Winterfeld: Rick, not even close. Come on. 

(SOUNDBITE OF TV SHOW, "HOME IMPROVEMENT") 

Tim Allen: (As Tim Taylor) Oh, no. 

Rick Howard: There are probably many reasons for this. One reason might be that, traditionally, the InfoSec community doesn't consider automation to fall within the purview of the security professional. That has been a giant mistake in first-principle thinking. Because of that error, the IT community has sprinted away from the security community in pursuing advanced software development techniques, or DevOps. And one of those techniques is the topic for today, chaos engineering. 

Rick Howard: My name is Rick Howard, and I'm broadcasting from the CyberWire's secret sanctum sanctorum studios, located underwater somewhere along the Patapsco River near Baltimore Harbor, Md., in the good old US of A. And you're listening to "CSO Perspectives," my podcast about the ideas, strategies and technologies that senior security executives wrestle with on a daily basis. 

Rick Howard: Chaos engineering is the resilience discipline of controlled stress-test experimentation in CI/CD, or continuous integration and continuous delivery, environments to uncover systemic weaknesses. Chaos engineers build hypotheses around expected software behavior, design a small-footprint experiment - they call it a tiny blast radius - that varies steady-state behavior, like bandwidth and CPU use, and then run that experiment in production systems to learn about unknown design flaws. Admittedly, this is an advanced tactic for the resilience first-principle strategy. It isn't for small, medium or even some large companies. But if your organization provides global digital services that absolutely have to be running 24 by 7 without any downtime, then you likely have a team of chaos engineers somewhere performing these experiments. To understand why these global service providers require chaos engineering, you first must accept the fact that we no longer live in a linear digital world. When the internet emerged as a useful business tool, say, in the 1990s... 

Steve Winterfeld: Oh, my God. Here we go - more history. 

Rick Howard: ...Things were pretty simple. We didn't think so at the time, but compared to today, that world was kindergarten. If you changed one thing in that world, you pretty much knew what was going to happen. But today's IT environments are systems of systems. We're in Ph.D. land here. They are complicated, and most of us have no idea how they actually work and what the real dependencies are between all the software modules deployed on all of our data islands. According to Rosenthal, Jones and Aschbacher in their book "Chaos Engineering: System Resiliency in Practice," a change to the input of a linear system produces a corresponding change to the output of the system. Nonlinear systems have output that varies widely based on changes to the constituent parts. It's like that old chestnut - when a butterfly flaps its wings in China, you might end up with a hurricane in the Gulf of Mexico. 

Rick Howard: When the hard drive of a system running a nonessential monitoring app in an AWS region in North America fails and somehow causes a systemwide failure, that's what I'm talking about. The FAA, the Federal Aviation Administration, just experienced this in January of 2023. A report from CNBC says that a contractor unintentionally deleted one or more database files while working to correct synchronization between a live primary database and its backup, causing a failure in the NOTAM system, the Notice to Air Missions system, which grounded commercial flights across the U.S. Yikes. 

(SOUNDBITE OF TV SHOW, "HOME IMPROVEMENT") 

Tim Allen: (As Tim Taylor) Oh, no. 

Rick Howard: These systems are complicated, and humans can't possibly understand all the permutations in their head. Software engineers think they know, and DevOps and SRE teams, the site reliability engineering teams, write linear regression tests for things they assume to be true. But those teams don't learn anything new by doing so. They test properties of the system that are already known, like previously corrected defects and boundary conditions of the main features of a product. Rosenthal, Jones and Aschbacher say that these kinds of linear regression tests require that the engineer writing the test knows specific properties about the system that they're looking for. Chaos engineers don't replace linear regression tests. They are trying to solve a different problem by uncovering unknown and, as yet, undiscovered design faults. 

(SOUNDBITE OF TV SHOW, "THE TWILIGHT ZONE") 

Rod Serling: You unlock this door with the key of imagination. Beyond it is another dimension. 

Rick Howard: Chaos engineering is built on the scientific method. DevOps teams develop a hypothesis around steady-state behavior and run experiments in production to see if the hypothesis holds. If they discover a difference in steady state between the control group and the experimental group on production systems, then they've learned something new. If not, they have gained more confidence in their hypothesis. They use techniques to minimize the blast radius on the production system and monitor the entire experiment carefully to ensure there will be no catastrophic effect. But they have to be on the production system to do it. 
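To make that structure concrete, here is a minimal sketch in Python of the experiment loop described above: state a steady-state hypothesis, keep the blast radius tiny, inject one fault, and compare an experimental group against a control group with an explicit abort threshold. The metric, the threshold, the host name and the fault-injection hook are illustrative assumptions, not Netflix's implementation; a real experiment would read live production telemetry and lean on mature tooling with automated safeguards.

```python
# Hypothetical chaos experiment sketch -- all names and numbers are illustrative.
import random
import statistics

def measure_p99_latency_ms(group: str) -> float:
    """Stand-in for sampling a real steady-state metric for one traffic slice."""
    return 120.0 + random.gauss(0, 5)  # placeholder telemetry: ~120 ms baseline plus noise

def inject_fault(hosts: list[str]) -> None:
    """Stand-in for a small-blast-radius fault, e.g., terminating a single instance."""
    print(f"injecting fault into {hosts}")  # real tooling would perform the injection

def run_experiment() -> None:
    # 1. Hypothesis: p99 latency stays within 10% of baseline when one instance dies.
    control = [measure_p99_latency_ms("control") for _ in range(30)]
    baseline = statistics.mean(control)
    abort_threshold = baseline * 1.10  # explicit limit keeps the blast radius tiny

    # 2. Vary real-world conditions on one small slice of production.
    inject_fault(hosts=["prod-host-17"])  # hypothetical single host

    # 3. Observe the experimental group and compare it with the control group.
    experimental = [measure_p99_latency_ms("experimental") for _ in range(30)]
    observed = statistics.mean(experimental)

    if observed > abort_threshold:
        print(f"hypothesis disproved: {observed:.1f} ms exceeds {abort_threshold:.1f} ms -- we learned something new")
    else:
        print(f"hypothesis holds: {observed:.1f} ms is within the threshold -- confidence increased")

if __name__ == "__main__":
    run_experiment()
```

Either outcome is useful: a hypothesis that holds buys confidence, and one that fails surfaces a previously unknown weakness, which is the whole point of running the experiment on production.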

Rick Howard: When I've talked about resilience as a first-principle strategy in the past, I've always pointed to Netflix as the poster child for this new tactic. I said that Netflix routinely runs an app, like Chaos Monkey, that randomly destroys pieces of their customer-facing infrastructure on purpose so that their network architects understand resilience engineering down deep in their core. When I first learned about this technique, I was stunned by the audacity and seeming recklessness of the approach. In my past career, I would never destroy parts of my production system on purpose for an experiment. I might do it by mistake, but never on purpose. In hindsight, as I've learned more about the subject, that's not exactly how chaos engineering works. It's audacious, for sure. But the Netflix chaos engineering system is mature, and their DevOps teams have been developing the practice since 2008. They've learned how to do this, and their experts wouldn't recommend that newbies to the idea start by destroying parts of their production system. You have to kind of ease into it. 

Rick Howard: After the break, we'll take a look at the history of chaos engineering, how we got here and how it all fits into our first-principle strategy of resilience. Come right back. 

Rick Howard: Chaos engineering began back in 2008. 

Steve Winterfeld: Rick, nobody cares about the history. Let's move on. 

Rick Howard: Netflix experienced a couple of delivery failures when they were transitioning from a DVD mailing company to a streaming company. The Netflix leadership team very publicly announced its commitment to adopt AWS cloud services and abandon its own data centers. This was a big idea since Amazon had rolled out the service only two years before, and it wasn't yet what anybody would call mature. The precipitating event at Netflix was a database failure that prevented the company from delivering DVDs to their customers for three days. 

(SOUNDBITE OF FILM, "SPACEBALLS") 

John Candy: (As Barf) That's going to leave a mark. 

Rick Howard: That Christmas in 2008, AWS suffered another major outage that prevented Netflix customers from using the new streaming service. In response to these incidents, Netflix engineers developed their first chaos engineering product in 2010 called Chaos Monkey. That helped them counter the vanishing instance problem caused by the AWS outage. With that success in their pocket, Netflix began building their own chaos engineering team and wondered if they could scale the process. If they could fix the small-scale vanishing instances problem, could they do the same at the vanishing region scale? 

Rick Howard: In fairness, Netflix wasn't the only company thinking along these lines. Back in 2006, Google site reliability engineers established their own disaster recovery testing program, or DiRT, to intentionally insert failures into their internal systems to discover unknown risk. But their name for it, DiRT, wasn't as hip as the Netflix name Chaos Monkey, and it didn't catch on. The idea was similar, though. 

Rick Howard: By 2011, Netflix began adding new failure modules that provided a more complete suite of resilience features. These modules eventually became known as the Netflix Simian Army and included colorful names like Latency Monkey, Conformity Monkey and Doctor Monkey, just to name three. There are many more. Netflix shared the source code for Chaos Monkey on GitHub in 2012. And by 2013, other organizations started playing with the idea. By 2014, Netflix created a new employee role, chaos engineer, and began working on ideas for reducing the blast radius of planned, injected failures. By 2016, Netflix had an entire team of chaos engineers working on the Simian Army. By this time, though, there was a small but growing contingent of companies experimenting with the idea, too, like Capital One, Google, Slack, Microsoft and LinkedIn. 

Rick Howard: So what does chaos engineering have to do with automation and resilience, two cybersecurity first principles? I'm glad you asked. Traditionally, linear regression tests, SRE and development teams, and IT resilience have been the purview of the CIO. There are definite divisions of labor for resilience, though: the CIO handles the DevOps piece, and the CSO needs to be part of the team. But I'm making the case here that chaos engineering is something that should be owned by the CSO. Who better to discover potential unknown systemic failures that might impact production or the ability to recover from an event quickly? The CIO handles the known stuff, but in terms of first principles, the CSO's job description should be to discover unknown faults in the system that could cause material damage. 

Rick Howard: According to Rinehart and Shortridge in their book "Security Chaos Engineering," traditional security programs orbit around failure avoidance. Infosec teams design and implement people, process and technology policies intended to prevent the organization from getting anywhere near a disaster. In contrast, they say that failure is where an infosec team learns the most, and I agree. If you can build these small experiments that uncover potential systemic failure, that might be the most valuable thing an infosec team does. Rinehart and Shortridge say that this mindset shifts the infosec team's focus away from building a purely defensive posture and towards something that is more adaptive. Instead of seeking defensive perfection, pursue the ability to handle failure gracefully. And if that sounds familiar, it's pretty close to the definition I've been using for resilience, the one I got from the Swedish academic team of Bjorck, Henkel, Stirna and Zdravkovic - quote, "the ability to continuously deliver the intended outcome despite adverse cyber events," end quote. It also implies that, at this scale, graceful handling of failure will happen at the infrastructure-as-code level. 

Rick Howard: Rinehart and Shortridge recommend that the infosec community move away from security theater, a concept made famous by one of cybersecurity's renowned thought leaders, Bruce Schneier - an old boss of mine, by the way. This is the idea that infosec teams perform work that creates the perception of improved security but, in reality, doesn't add much. One example could be the purchase of an anti-phishing product that delivers approved phishing email messages to employees to train them not to click bad URLs. Another is building an insider threat program designed to prevent employees from taking their old PowerPoint slides with them to their next job. In the big scheme of things, are those kinds of security theater programs as impactful as, say, discovering a previously unknown fault in the organization's system design that could cause catastrophic failure? The notion is worth considering. 

Rick Howard: Specifically with respect to traditional security, however, Rinehart and Shortridge suggest that you could apply chaos engineering to other, more typical security functions like red teaming. Instead of turning the red team loose to find some hole in the defensive posture, we could develop a hypothesis around how the organization should react to a specific attack sequence - for example, Wicked Panda. If we treat a red-teaming exercise as a science experiment with a hypothesis that defines how we think the organization will react to a Wicked Panda attack, we might learn something new. If that's true, we could expand this kind of thinking to all sorts of traditional security tasks like container security, CI/CD pipeline security, security monitoring, incident response and so forth. You might say, "I'm already doing those things, and I don't need any of your help, Rick." But what I'm suggesting is a subtle shift away from rudimentary tests of the system with things we already know about and towards a more advanced scientific method designed to uncover the things we don't already know. 
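As a concrete illustration of that shift, here is a hypothetical sketch of a red-team exercise run as a chaos experiment: a falsifiable hypothesis about detection, one scoped technique on one host, and a check of the resulting telemetry. The simulate_technique and query_alerts helpers, the host name and the five-minute window are assumptions standing in for an organization's own red-team tooling and SIEM; they are not a real API.

```python
# Hypothetical security chaos experiment sketch -- helpers are placeholders, not a real API.
import time

HYPOTHESIS = ("If a simulated credential-dumping technique runs on one lab host, "
              "the SOC sees an alert within five minutes.")

def simulate_technique(host: str, technique_id: str) -> float:
    """Stand-in for a scoped, pre-approved attack simulation of a single technique."""
    print(f"running {technique_id} on {host}")
    return time.time()

def query_alerts(host: str, since: float) -> list[dict]:
    """Stand-in for querying the SIEM for alerts on that host since the start time."""
    return []  # placeholder: replace with a real detection query

def run_security_experiment() -> None:
    print(HYPOTHESIS)
    started = simulate_technique(host="lab-host-03", technique_id="T1003")  # hypothetical scope
    deadline = started + 5 * 60

    while time.time() < deadline:
        if query_alerts(host="lab-host-03", since=started):
            print("hypothesis holds: detection fired -- confidence in the control increased")
            return
        time.sleep(30)  # poll the stand-in SIEM every 30 seconds

    print("hypothesis disproved: no alert within five minutes -- a previously unknown detection gap")

if __name__ == "__main__":
    run_security_experiment()
```

Whether the hypothesis holds or fails, the exercise produces evidence about how the organization actually reacts, which is the scientific-method shift described above.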

Rick Howard: With all that said, the buccaneering spirit of chaos engineering is not for everybody. It's another tactic that we might use to reduce the probability of material impact due to a cyber event. It's another arrow in our quiver to build our resilience program alongside the other arrows of crisis planning, incident response, backups and encryption. The concept is probably a bridge too far for most small- to medium-sized organizations who struggle to find resources just to keep the lights on. But for big Silicon Valley companies that deliver services around the world - the Netflixes, the Googles and the LinkedIns, for example - and for most Fortune 500 companies, chaos engineering is something to consider. Indeed, many of these companies are probably already on this path. 

Rick Howard: And that's a wrap. Next week, we're going to tackle the tricky subject of practical cyberthreat intelligence and how to incorporate open-source intelligence into the first-principle infosec machine. You don't want to miss that. The CyberWire's "CSO Perspectives" is edited by John Petrik and executive produced by Peter Kilpe. Our theme song is by Blue Dot Sessions, remixed by the insanely talented Elliott Peltzman, who also does the show's mixing, sound design and original score. And I'm Rick Howard. Thanks for listening.