CSO Perspectives (Pro) 6.1.20
Ep 9 | 6.1.20

Resilience: a first principle of cybersecurity.


Rick Howard: In 2011, as Netflix moved their support infrastructure from on-prem to the cloud, the Netflix engineers built the now-famous application module called Chaos Monkey. First, how great is that name? I love network engineers. Second, here is what their website says about it. Quote, "Chaos Monkey is a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact," end quote. Let me say that another way. Netflix routinely runs an app that randomly destroys pieces of their customer-facing infrastructure on purpose so that their network architects understand resilience engineering down deep in their core. 

Rick Howard: Now, there are some network defenders and IT professionals who would categorize what Netflix does as impressive, aspirational even. But I believe that the bulk of us would categorize what Netflix does as stark raving bonkers. We're not going to bring down our customer-facing infrastructure for a test. It's hard enough to keep the thing up and running without destroying it ourselves. We would be wrong, of course, but that's the current thinking in our community. Netflix has embraced resiliency in its IT infrastructure. The bulk of the rest of us just wave our hands at it. 

Rick Howard: My name is Rick Howard. You are listening to "CSO Perspectives," my podcast about the ideas, strategies and technologies that senior security executives wrestle with on a daily basis. This is the fourth show in a planned series that discusses the development of a general purpose cybersecurity strategy using the concept of first principles. I explained what first principles are during the first show and made an argument for what the very first principle should be. I discussed zero trust in the second and intrusion kill chains in the third. For this show, we are talking about resilience.

Rick Howard: With all of these concepts, we are building a metaphorical wall, brick by brick, for a cybersecurity infosec program based on first principles. The foundation of that wall, the ultimate and atomic first principle, is this. Reduce the probability of material impact to my organization due to a cyber event. That's it. Nothing else matters. This simple statement is the pillar on which we can build an entire infosec program. Zero trust and intrusion kill chains are key bricks on that wall. They are both necessary. But together, they are not sufficient. Even if you have been wildly successful implementing these two strategies, that achievement does not guarantee that your cyber-adversaries will not cause you material damage. No defense is perfect. Just talk to the French about their Maginot Line defensive failure during World War II or, for my fellow nerds out there, cringe at how Jon Snow's defensive plan to defend Castle Winterfell disintegrates from the assault of the White Walkers in "Game Of Thrones." With robust zero trust and kill chain strategies in place, the infosec team can greatly reduce the probability of material impact due to a cyber event. But there is one more lever to pull that can reduce that probability even further. It's called resilience.

Rick Howard: As a concept, the American Society for Industrial Security - or ASIS for short - coined the phrase as early as 2009. But they were really describing what turned out to be business continuity. I'll cover the difference between the two in a bit. The World Economic Forum formalized resilience back in 2012. Quote, "the ability of systems and organizations to withstand cyber events," end quote. Since then, other thought leaders have refined it. U.S. President Obama even signed a presidential policy directive dictating resilience for the country's critical infrastructure back in 2013. But the definition I like best comes from two Stockholm University researchers in 2015. I'm going to butcher these names, but Janis Stirna and Jelena Zdravkovic. Ooh, that sounded pretty good. They define it this way. Quote, "the ability to continuously deliver the intended outcome despite adverse cyber events," end quote. And finally, the International Organization for Standardization, ISO, defined it as this in 2017. Quote, "the ability of an organization to absorb and adapt in a changing environment to enable it to deliver its objectives and to survive and prosper," end quote. In other words, assume that the bad guys will successfully negotiate the intrusion kill chain or find a chink in my zero-trust armor. Or just in general, assume that there will be a massive IT failure sometime in the future. Devise a strategy that will ensure that your organization's essential services will still function.

Rick Howard: My favorite real-world deployment example of resilience and possibly the world's greatest demonstration of nerd chutzpah is the chaos engineering project over at Netflix. They have a series of applications they call the Simian Army that, on purpose, collapses random pieces of their infrastructure to test for resilience. These apps have fabulous names like Chaos Monkey, Latency Monkey, Dr. Monkey, Security Monkey and, of course, Chaos Gorilla, just to name a few. In my typical world, disasters are things that might happen sometime in the future but probably never. At least, I hope they don't. I have plans written on paper that discuss what we might do if a disaster happens, but that's usually as far as it goes. In the Netflix world, planned disasters happen every day, and I still get to keep watching episodes of "The Witcher" uninterrupted as if nothing happened.
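
To make the idea concrete, here is a toy Python sketch of what a Chaos Monkey-style test checks. It is not Netflix's tool or its API; the names `Instance`, `ServicePool` and `chaos_round` are hypothetical, and the "infrastructure" is just an in-memory list. Each round disables one randomly chosen healthy instance and then asks whether the service can still answer a request.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One hypothetical production instance behind a load balancer."""
    name: str
    alive: bool = True


@dataclass
class ServicePool:
    """A hypothetical pool of redundant instances serving the same workload."""
    instances: list = field(default_factory=list)

    def healthy(self):
        """Return the instances that are still running."""
        return [i for i in self.instances if i.alive]

    def handle_request(self):
        """Serve from any healthy instance; fail only if none are left."""
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("outage: no healthy instances")
        return f"served by {candidates[0].name}"


def chaos_round(pool, rng):
    """Disable one randomly chosen healthy instance, then probe the service."""
    victim = rng.choice(pool.healthy())
    victim.alive = False
    try:
        pool.handle_request()
        survived = True
    except RuntimeError:
        survived = False
    return victim.name, survived


if __name__ == "__main__":
    rng = random.Random(2011)  # seeded only so that runs are repeatable
    pool = ServicePool([Instance(f"web-{n}") for n in range(3)])
    for _ in range(3):
        victim, survived = chaos_round(pool, rng)
        print(f"disabled {victim}: service {'up' if survived else 'DOWN'}")
```

With three redundant instances, the first two rounds leave the service up and only the third, which disables the last instance, causes an outage. The point of running such a test routinely is to find out how much redundancy you really have before an adversary or an accident finds out for you.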

Rick Howard: The other resiliency example I like to talk about is the Google Site Reliability Engineering teams, or SREs. Back in 2004, when Google was nothing more than a search engine, the Google leadership team made an extraordinary decision. Instead of creating a team of network engineers to manage the infrastructure the way that every other company on the planet did, they handed the responsibility to the software development team. Now, the domino effect you get when you hand a task like that to a bunch of programmers is that it fertilizes the seeds for an internet giant to emerge down the road. At the start, the SREs wrote programs to automate those jobs that a network engineer would traditionally do by manually logging into a console and typing commands. They invented DevOps a full six years before the industry even had a name for it. Over time, that monumental decision pioneered the idea of infrastructure as code. 

Rick Howard: SREs label manual tasks as toil, and they describe it as anything repetitive, tactical and devoid of any enduring value. Now, we all know the benefits of automating tasks, but the Google SREs have taken that idea to the nth degree. Now, they realize that it's not a panacea, but it is a force multiplier. Done correctly, it layers a blanket of consistency across the entire organization. And once built, the emerging platform can be easily extended. Google didn't just automate key tasks - they built an autonomous system that instantiates a framework for resiliency. In my personal experience, I can't remember the last time a Google product failed. But you know that internally, their systems are failing all over the place. The infrastructure is too big for that not to be true. The fact that I never noticed meets the very definition of resiliency. 

Rick Howard: Both the Netflix and Google examples are more in line with IT operations than security. They're more DevOps than DevSecOps. That's unfortunate, but the SREs of the world have set a great example for the security community - design and deploy the digital infrastructure in such a way that even if Fancy Bear penetrated the deployed defensive system, the impact to the organization would be minimal. Design it so that even if the Ragnar Locker ransomware takes over a segment of my network, my business can continue to provide service. That's resilience. 

Rick Howard: You may be saying to yourself that this resilience thing sure sounds like an older idea that's been around for a long time. It's called business continuity. So what's the difference? Well, it turns out that business continuity got its start in the 1970s, and that community is really large. And some are upset, thinking the newfangled marketing term resilience is just the latest buzzword in the industry that is getting all of the attention but that the two phrases are interchangeable. That's not quite true, but I don't want to go down that rabbit hole of internet debate here. For simplicity's sake, think of resilience as the strategy and business continuity as the set of tactics organizations use to achieve that strategy. 

Rick Howard: From my own perspective, though, the business continuity people have stayed mostly in the physical world, concentrating on keeping the business running in the face of natural disasters, power outages, executive deaths and infestations of White Walkers - you know, things like that. You don't traditionally see a lot of business continuity people advocating for a Netflix Chaos Monkey approach or a Google infrastructure as code approach.

Rick Howard: All of that is unimportant, though. It is arguing over semantics. The metaphorical infosec wall we are building based on first principles will reduce the probability of material impact to our organization due to a cyber event. The bricks that we are putting on that wall are, first, zero trust then the intrusion kill chains and now our resilience. If you don't like the word resilience, though, call it what you want. You can call it the Great Gazoo if you like as long as it continuously delivers the intended business outcome despite any adverse cyber events. 

Rick Howard: Some would say that resilience sounds similar to zero trust, too. What's the difference between those two? Well, zero trust is limiting access to organizational resources based on need to know. Resilience means that even though you have built-in limited access, you might have redundant workloads running in hybrid cloud environments that the IT team and the infosec team can switch to in case of an emergency. That might be subtle, but it's a major difference. 
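
As a rough illustration of the redundant-workload idea, the Python sketch below keeps the same service deployed in several places and routes to the first healthy one. Everything here is an assumption made for the example: the `Workload` class, the `pick_active` helper, the location names and the priority order are hypothetical, and a real deployment would rely on health checks and traffic management from its cloud or networking platform.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """A hypothetical deployment of the same service in one environment."""
    location: str
    healthy: bool = True


def pick_active(workloads):
    """Return the first healthy workload in priority order, or None."""
    for workload in workloads:
        if workload.healthy:
            return workload
    return None


if __name__ == "__main__":
    # Priority order is an assumption: on-prem first, then two cloud regions.
    workloads = [
        Workload("on-prem"),
        Workload("cloud-region-a"),
        Workload("cloud-region-b"),
    ]
    print(pick_active(workloads).location)

    # Simulate an emergency, e.g. ransomware taking over the on-prem segment.
    workloads[0].healthy = False
    print(pick_active(workloads).location)
```

Zero trust decides who may reach each of these workloads. The failover logic is what lets the business keep delivering service when one of them is lost.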

Rick Howard: Our resilience block that sits at the base of our metaphorical infosec wall is the strategy that will allow our organization to continue to function during a catastrophic cyber event like an OPM-level breach or an Edward Snowden-type insider threat event. It's one more lever to pull in our pursuit of reducing the probability of a material impact to our organization due to a cyber event. Look to the Netflixes and the Googles of the world to get inspired about how to do it. Team up with the business continuity teams, and bring them along for the ride. They have a lot of practical how-to knowledge that will be useful. If you can get all of this done, maybe your Castle Winterfell won't get overwhelmed by the hacker White Walkers.

Rick Howard: And that's a wrap. If you agree or disagree with anything I have said, hit me up on LinkedIn or Twitter and we can continue the conversation there. 

Rick Howard: The CyberWire's "CSO Perspectives" is edited by John Petrik and executive produced by Peter Kilpe; engineering, music design and original music all done by the insanely talented Elliot Peltzman. And I am Rick Howard. Thanks for listening.