Cybersecurity first principles: resilience.

By Rick Howard

Jun 1, 2020

CSO Perspectives is a weekly column and podcast where Rick Howard discusses the ideas, strategies and technologies that senior cybersecurity executives wrestle with on a daily basis.

Note: This is the fourth essay in a planned series that discusses the development of a general-purpose cybersecurity strategy for all network defender practitioners-- be they from the commercial sector, government enterprise, or academic institutions-- using the concept of first principles. The first essay explained what first principles are in general and what the very first principle should be for any infosec program. The second essay discussed zero trust. The third essay covered intrusion kill chains. This essay will cover resilience.

We are building a strategy wall, brick by brick, for an infosec program based on first principles. The foundation of that wall is the ultimate and atomic first principle:

Reduce the probability of material impact to my organization due to a cyber event.

That’s it. Nothing else matters. This simple statement is the foundation on which we can build an entire infosec program. The first two bricks we put on that foundation were zero trust and intrusion kill chains. They are both necessary, but together they are not sufficient. Even if you have been wildly successful with implementing these two strategies, that achievement does not guarantee that your cyber adversaries will not cause you material damage. No defense is perfect. Just talk to the French about the failure of their Maginot Line during World War II or, for my fellow nerds out there, cringe at how Jon Snow’s plan to defend Winterfell disintegrates under the assault of the White Walkers in Game of Thrones. With robust zero trust and kill chain strategies in place, the infosec team can greatly reduce the probability of material impact due to a cyber event, but there is one more lever to pull that can reduce that probability even further. It’s called resilience.

Resilience defined.

ASIS International coined the term as early as 2009, but they were really describing what turned out to be business continuity. I will cover the difference between the two further down. The World Economic Forum formalized resilience back in 2012 as:

“… the ability of systems and organizations to withstand cyber events …”

Since then, other thought leaders have refined it. US President Obama even signed a presidential policy directive dictating resilience for the country’s critical infrastructure back in 2013. But the definition I like best comes from two Stockholm University researchers in 2015. Janis Stirna and Jelena Zdravkovic define it this way:

“… the ability to continuously deliver the intended outcome despite adverse cyber events.”

And finally, the International Organization for Standardization (ISO) defined it this way in 2017:

“... the ability of an organization to absorb and adapt in a changing environment to enable it to deliver its objectives and to survive and prosper.”

In other words, assume that the bad guys will successfully negotiate the intrusion kill chain or find a chink in your zero trust armor. Or, just in general, assume that there will be a massive IT failure sometime in the future. Then devise a strategy that will ensure that your organization's essential services will still function.

Resilience: examples.

My favorite example of a practical implementation of resilience is what the people at Netflix call chaos engineering. In 2011, as they moved their support infrastructure from on-prem to the cloud, the Netflix engineers built their first module, called “Chaos Monkey.” This is what their website says about it:

“Chaos Monkey is a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact.”

Let me say that another way. Netflix routinely runs an app that randomly destroys pieces of their customer-facing infrastructure, on purpose, so that their network architects understand resilience engineering down deep in their core. In my typical world, disasters are things that might happen sometime in the future but probably never will. At least I hope that they don't. I have plans written on paper that discuss what we might do if a disaster happens, but that’s usually as far as it goes. In the Netflix world, planned disasters happen every day, and I still get to keep watching episodes of “The Witcher” uninterrupted as if nothing happened. Since they deployed the original Chaos Monkey module, the Netflix team has built an entire series of chaos tools designed to increase their confidence that they will not only survive a catastrophic event, but that their customers will not even notice that the Netflix infrastructure is going through one.

There are some network defenders and IT professionals who would categorize what Netflix does as impressive, aspirational even. But I believe that the bulk of us would categorize what Netflix does as stark raving bonkers. We’re not going to bring down our customer-facing infrastructure for a test. It’s hard enough to keep the thing up and running without destroying it ourselves. We would be wrong, but that’s the current thinking in our community. Netflix has embraced resiliency in its IT infrastructure. The bulk of the rest of us just wave our hands at it.
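For readers who want to see the idea rather than just read about it, here is a minimal sketch of a chaos experiment in Python. To be clear, this is not Netflix's tool or its API. The Instance and Service classes, the heal step that stands in for auto-scaling, and the run_experiment function are all hypothetical names invented for illustration, and the "infrastructure" is just objects in memory. The point it tries to show is the loop itself: kill something at random, check whether customers were still served, let the system repair itself, and repeat.

```python
import random


class Instance:
    """One hypothetical server behind a load balancer."""

    def __init__(self, name):
        self.name = name
        self.alive = True


class Service:
    """Toy service that stays up while enough instances are alive."""

    def __init__(self, instance_count, min_healthy):
        self.instances = [Instance(f"node-{i}") for i in range(instance_count)]
        self.min_healthy = min_healthy

    def healthy(self):
        return [inst for inst in self.instances if inst.alive]

    def is_serving(self):
        return len(self.healthy()) >= self.min_healthy

    def heal(self):
        # Stand-in for auto-scaling: swap every dead instance for a fresh one.
        self.instances = [
            inst if inst.alive else Instance(inst.name) for inst in self.instances
        ]


def chaos_round(service, rng):
    """Kill one random healthy instance and report whether service continued."""
    victim = rng.choice(service.healthy())
    victim.alive = False
    survived = service.is_serving()
    service.heal()
    return victim.name, survived


def run_experiment(instance_count=4, min_healthy=2, rounds=10, seed=0):
    """Run several chaos rounds against a freshly built toy service."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    service = Service(instance_count, min_healthy)
    return [chaos_round(service, rng) for _ in range(rounds)]


if __name__ == "__main__":
    for name, survived in run_experiment():
        print(f"killed {name}: service {'survived' if survived else 'WENT DOWN'}")
```

In this sketch, a service with four instances that needs only two survives every round, while a service built with a single instance fails every round. That contrast is the lesson a real chaos program is meant to teach: you find out about the single point of failure on your own schedule instead of the adversary's.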

The other resiliency example I like to talk about is Google's Site Reliability Engineering teams, or SREs. Back in 2004, when Google was nothing more than a search engine, the Google leadership team made an extraordinary decision. Instead of creating a team of network engineers to manage the infrastructure the way every other company on the planet did, they handed the responsibility to the software development team. What you get when you hand a task like that to a bunch of programmers is the seed from which an Internet tech giant can emerge down the road. At the start, the SREs wrote programs to automate those jobs that a network engineer would traditionally do by manually logging into a console and typing commands. In effect, they were practicing DevOps a full six years before the industry even had a name for it. Over time, that monumental decision pioneered the idea of infrastructure as code.

SREs label manual tasks as “toil,” and they describe toil as anything repetitive, tactical and devoid of any enduring value. We all know the benefits of automating tasks, but the Google SREs have taken it to the nth degree. They realize that it is not a panacea, but it is a force multiplier. Done correctly, it layers a blanket of consistency across the entire organization and, once built, the emerging platform can be easily extended. Google didn’t just automate key tasks. They built an autonomous system that instantiates a framework for resiliency. In my personal experience, I can’t remember the last time a Google product failed. But you know that internally, their systems are failing all over the place. The infrastructure is too big for that not to be true. The fact that I never noticed meets the very definition of resiliency.
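To give a feel for what replacing toil with code can look like, here is a small Python sketch of the declarative pattern that infrastructure as code tools commonly follow: describe the desired state, compare it with the actual state, and compute the changes. This is my own illustration, not Google's system. The function names plan_changes and apply_changes and the resource names in the example are assumptions made up for this sketch, and the "infrastructure" is a plain dictionary.

```python
def plan_changes(desired, actual):
    """Return the ordered actions needed to move `actual` to `desired`."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, config))
        elif actual[name] != config:
            actions.append(("update", name, config))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, None))
    return actions


def apply_changes(actual, actions):
    """Apply planned actions to a copy of `actual` and return the new state."""
    state = dict(actual)
    for verb, name, config in actions:
        if verb == "delete":
            state.pop(name)
        else:  # "create" and "update" both set the desired config
            state[name] = config
    return state


if __name__ == "__main__":
    # Hypothetical resources, invented for this example.
    desired = {"web": {"replicas": 3}, "dns": {"ttl": 60}}
    actual = {"web": {"replicas": 1}, "old-proxy": {}}

    plan = plan_changes(desired, actual)
    for action in plan:
        print(action)

    new_state = apply_changes(actual, plan)
    # Running the planner again finds nothing left to do.
    print(plan_changes(desired, new_state))
```

The design choice worth noticing is that the planner is idempotent. Once the actual state matches the desired state, running it again produces no actions, which is what lets a team run the same automation repeatedly and get the blanket of consistency described above.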

IT resilience and infosec resilience.

Both the Netflix and the Google examples are more aligned with IT operations than security: they’re more DevOps than DevSecOps. That’s unfortunate. But the SREs of the world have set a good example for the security community. Design and deploy the digital infrastructure in such a way that even if Fancy Bear penetrates the deployed defensive system, the impact to the organization will be minimal. Design it so that even if the RAGNARLOCKER ransomware takes over a segment of my network, my business can continue to provide service. That’s resilience.

Resilience distinguished from business continuity.

So what is the difference between resilience and business continuity? It turns out that business continuity got its start in the 1970s, and that community is quite large. Many of its members are upset. They see the newfangled marketing term “resilience” as just the latest industry buzzword getting all of the attention, and they argue that the two phrases are interchangeable. That’s not quite true, but I don’t want to go down the rabbit hole of that particular Internet debate here. For simplicity’s sake, think of resilience as the strategy, and business continuity as the set of tactics organizations use to achieve that strategy. From my perspective though, the business continuity people have stayed mostly in the physical world, concentrating on keeping the business running in the face of natural disasters, power outages, executive deaths, an infestation of White Walkers, etc. You don’t traditionally see a lot of business continuity people advocating for a Netflix Chaos Monkey approach or a Google infrastructure as code approach.

All of that is unimportant though. It is arguing over semantics. This metaphorical infosec wall we are building based on first principles will reduce the probability of a material impact to our organization due to a cyber event. The bricks that we’ll put on that wall are, first, zero trust, then the intrusion kill chains, and now resilience. If you don’t like the word “resilience,” though, call it what you want. You can call it the Great Gazoo if you like as long as it continuously delivers the intended business outcome despite any adverse cyber events.

Some would say that resilience sounds similar to zero trust. What’s the difference? Zero trust is limiting access to organizational resources based on need-to-know. Resilience means that even though you have built in limited access, you might have redundant workloads running in hybrid cloud environments that the IT team and the infosec team can switch to in case of an emergency. That might be subtle, but it’s a major difference.
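Here is one way to picture that difference in code: a minimal Python sketch of failing over between redundant workloads. Everything in it is an assumption made for illustration. The Workload class, the serve_with_failover function, and the two handlers (an on-prem workload that has been knocked out and a cloud replica that is still healthy) are hypothetical stand-ins, not a real product's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Workload:
    """A hypothetical deployment that can answer a request."""
    name: str
    handler: Callable[[str], str]


def serve_with_failover(request: str, workloads: List[Workload]) -> Tuple[str, str]:
    """Try each redundant workload in order and return the first success."""
    errors = []
    for workload in workloads:
        try:
            return workload.name, workload.handler(request)
        except Exception as exc:  # any failure triggers the next workload
            errors.append(f"{workload.name}: {exc}")
    raise RuntimeError("all workloads failed: " + "; ".join(errors))


def on_prem(request: str) -> str:
    # Simulates a network segment that ransomware has taken offline.
    raise ConnectionError("segment unavailable")


def cloud_replica(request: str) -> str:
    # Simulates the redundant copy running in a cloud environment.
    return f"handled {request}"


if __name__ == "__main__":
    workloads = [Workload("on-prem", on_prem), Workload("cloud", cloud_replica)]
    print(serve_with_failover("order-42", workloads))
```

Zero trust would decide who is allowed to call either workload in the first place. The failover logic is a separate concern: it only cares that the request still gets an answer when the first workload is gone.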

Our resilience block that sits at the base of our metaphorical infosec wall is the strategy that will allow our organization to continue to function during a catastrophic cyber event like an OPM level breach or an Edward Snowden type insider threat event. It’s one more lever to pull in our pursuit of reducing the probability of a material impact to our organization due to a cyber event. Look to the Netflixes and the Googles of the world to get inspired about how to do it. Team up with the business continuity teams and bring them along for the ride. They have a lot of practical how-to knowledge that will be useful. If you can get all of this done, maybe your Castle Winterfell won’t get overwhelmed by the hacker White Walkers.
