Metrics and risk: All models are wrong, some are useful.

By Rick Howard

Apr 27, 2020

CSO Perspectives is a weekly column and podcast where Rick Howard discusses the ideas, strategies and technologies that senior cybersecurity executives wrestle with on a daily basis.

I love metrics; always have. I have been a fan of collecting metrics since I started in the IT business back in the internet dinosaur days. I didn’t have some compelling scientific reason to collect them. I just felt a vague unease about blindly running my IT systems without some indicators to tell me whether they were healthy or not. As it turns out, this wasn’t an exact science when I started in the biz, and even today, it’s still a bit of a mystery.

An old war story: failure of an Exchange server.

I remember when I first discovered the Windows operating system program called “System Monitor” sometime in the late 1990s. I was responsible for an Exchange server farm on an Army post; about 15 servers that provided email services to all personnel. The “System Monitor” program was so powerful that you could slice and dice thousands of different variables that had something to do with how the Exchange mail system worked. Unfortunately, Microsoft was fairly silent about which variables were important and which were not. One day, in my zeal for metrics collection, I turned all of the variables on and started to piece together what I thought was valuable information about the Exchange system.

And then I executed the typical Howard move. I got distracted. Something came up and I walked away from the server farm without turning off all of the collection. “System Monitor” wrote all of that data to the local Exchange server’s hard drive. Sometime in the wee hours of the next morning, the data center admins called to inform me that the Exchange server system had completely failed. Apparently, my System Monitor efforts had filled up the Exchange server’s hard drive and crashed the entire system. It took us 10 days to get it up and running again. Army posts are kind of like small towns; everyone knows everyone. For those 10 days, I was the most hated man on the base. Nobody could get any work done.

Much later, long after I was no longer too embarrassed to show my face in public, I discovered a new fact about email. People hate it. After it was all said and done and blame had been properly distributed, colonels and privates alike came up to me privately to thank me for those 10 days of bliss. They all said they got more things done during those 10 days without the “Evil” email system running than they had the entire rest of the year.

I’m going to call that a win. When you think about it, I was kind of a hero, at least for those 10 days. I probably should have gotten a medal for that or something. At least that’s how I plan to remember it.

Some useful cybersecurity metrics.

When freelance security writer Mary Pratt wrote an essay for CSO magazine not too long ago claiming that she knew exactly which cybersecurity metrics mattered and which ones did not, I was intrigued. Perhaps the science had improved enough to know precisely what to collect and monitor. And reading through the essay, Mary did pick some good metrics to pay attention to:

Results of simulated phishing attacks.
Mean time to recover from a cyber attack.
Mean time to detect a cyber attack.
Penetration testing successes.
Vulnerability management patching status.
Enterprise security audits against a standard framework.

These are all good things to track. I would add another one. A previous boss of mine tasked me to track how many people I was using to respond to cyber incidents. He told me that if that number was going up year over year, that was the wrong direction. He said that I should be automating my processes as much as possible in order to reduce the number of people needed, that the correct solution was not to throw more bodies at the problem. And he was right.

But even with those words of wisdom, counting these kinds of things is not the goal. They are a means to a goal. As I said, they’re indicators. But what are they indicators of? In a tactical sense, they are indicators of the health and efficiency of the system. But, as Mary pointed out in her piece by highlighting discussions with various company executives, the tactical stuff is for the CISOs, not for the company leadership team and definitely not for the board members. Why would a board member care what the mean time to recovery is? How would they even know what “good” is when they saw it? At best they would recognize that improving that time each quarter is forward progress, but how would they judge when it was good enough?

And besides, this kind of thing is not their world. Do you know what is in their world? Risk.

Risk: Heat maps versus useful models.

Senior executives juggle risk all the time. It’s kind of their job description, and they do it daily across an enormous collection of disciplines like personnel management, supply chain, product management, marketing, and investment opportunities, to name just a few. But sometime in the early days of internet security, say the late 1990s, the network defender community decided that cybersecurity risk was too hard to convey to senior leadership. We decided that it was much easier to use fear, uncertainty, and doubt (or FUD) to scare the hell out of decision makers in order to get the budget for our pet security projects.

I admit it. I did this myself in the early days. I used these charts called “heat maps” where I plotted all the cyber bad things that were likely to happen to the company on the X-axis and how impactful they would be if they did happen along the Y-axis. The really bad things would float high and to the right of the chart. The more benign things would float low and to the left. And since I knew my way around a spreadsheet, I would color-code the entries. The high and to the right stuff would be red, the middle stuff would be yellow, and the benign stuff would be green. The heat map looked as if cyber risks were warming up in an oven from left to right and from bottom to top. Once complete, I would walk this chart into a board meeting, point to the highest point on the heat map, say scary things about what the highest point meant, and then ask for a gazillion dollars to fund my pet security project. Sometimes that worked for me, and sometimes not.

My batting average for success notwithstanding, the problems with that approach are twofold. First, experts have produced reams of research showing that heat maps are just bad science. There are not just one or two papers on the subject. There are tons of them, and the reasons cited for this bad science conclusion are plentiful. In general though, heat maps are not precise enough. They are based on qualitative assessments using some version of a high-medium-low score. And one problem with this qualitative assessment is that even if I define precisely what a high score is, the person seeing the data will have their own opinion of what “high” means regardless of what I tell them. When they see the score, they will think it means something different from what I intended the score to mean.

Second, heat maps never give the decision-maker a chance to say whether or not the risk is acceptable. Indeed, they highlight scary things, but give no measure as to how risky those scary things will be to the business. I know that’s hard for network defenders to swallow. We normally assume that any cyber risk is bad, and needs to be eliminated. But that just isn’t true. Company leaders assess all kinds of risk while running a business. They are continually weighing the pros and cons of various actions. Cyber risk is not different from all of these other kinds of risks. Cyber risk is just another risk that leadership needs to weigh.

Measuring the probability of risk.

Over time, I have come to realize that in order to convey any risk, but especially cyber risk, you must know three elements. The first is probability. What is the probability that a cyber event will happen? This is a number between 0% and 100%; not a high-medium-low estimate. It needs to be precise. The second component is materiality. If a cyber event happens, will it be material to the business? Not everything in cyber land will be material. If the company’s website is defaced, I will just re-image the server and relaunch it. But if my proprietary code library is stolen by my competitors, that would be a significant emotional event for the company. That would be material. The last risk component is time. It can’t be that there is a high probability for a material cyber event sometime in the future. Of course that’s true. If you wait long enough, something is bound to happen. The probability for that is almost 100%. But if you time-bound the question to say three years or five years, or whatever makes sense for your organization, that probability will in all likelihood be much lower.
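To see why the time bound matters so much, here is a minimal Python sketch. It rests on a simplifying assumption that is not part of the column: every year carries the same, independent probability of a material event. The function name and the 7% annual figure are hypothetical and chosen only for illustration.

```python
def probability_within(annual_probability: float, years: int) -> float:
    """Chance of at least one material event within `years`.

    Assumes (for illustration only) that each year is independent and
    carries the same annual probability.
    """
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    # P(at least one event) = 1 - P(no event in any single year)
    return 1.0 - (1.0 - annual_probability) ** years


if __name__ == "__main__":
    # 0.07 is a made-up annual probability, not a measured one.
    for horizon in (3, 5, 100):
        print(f"{horizon:>3} years: {probability_within(0.07, horizon):.1%}")
```

Under that assumption, a 7% annual chance works out to roughly 20% over three years but more than 99% over a century, which is the "wait long enough and something is bound to happen" effect in numbers.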

The question we need to answer for the board, then, is this: what is the probability that a cyber event will materially impact the company in the next three years? Answer that question and then board members can decide if they’re comfortable with that level of risk or if they need to invest in people, process, or technology to reduce it.

So how do you do that? How do you measure risk with any kind of precision so that you can assign a probability to it? That’s where the metrics come in. Let’s start with the list that Mary presented in her essay. They’re as good as any to begin with. By themselves, though, none give us the answer. They are tactical indicators of the system’s health. But if you presented these numbers to your InfoSec team:

Phishing Exercise: This quarter 5% of employees clicked the link compared to 7% last quarter.
Mean Time to Recover from a Cyber Attack: four days; down from two weeks the same time last year.
Mean Time to Detect a Cyber Attack: 100 days; down from 382 days the same time last year.
Penetration Tests: This quarter the Pen Test Contractor was able to steal the CEO’s credentials.
Vulnerability management: System is 80% patched vs 37% patched the same time last year.
NIST Framework: Level 2 on most NIST Framework elements vs Level 1 on most NIST Framework Elements the same time last year.
Incident Response Team: Five people on the team; five people on the team the same time last year.

...that InfoSec team would clearly be mostly pleased with the improvement of their internal security program. The theft of the CEO’s credentials would cause some anxiety. They would want to fix that issue. But could they answer this three-part risk question?

What is the probability that a cyber event will materially impact the company in the next three years?

If you gathered the entire InfoSec team into a conference room, had them review the current set of metrics, and told them to estimate that probability so that they were 95% confident in their answer, could they do it? Of course they could. Using these metrics as a baseline is a pretty simple model to estimate your first probability, but your team could do it. As George Box, the famous British statistician, said, “All models are wrong, some are useful.” This is a simple but useful model.
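One hypothetical way to turn the conference-room exercise into a single answer is sketched below in Python. It assumes each team member writes down a low and a high bound that they are 95% confident brackets the true probability, and it pools those ranges by taking the median of the lows and the median of the highs. The pooling rule, the function name, and the sample numbers are all assumptions made for illustration; other pooling rules are equally defensible.

```python
from statistics import median


def pool_estimates(ranges: list[tuple[float, float]]) -> tuple[float, float]:
    """Combine per-person (low, high) probability ranges into one range.

    Uses the median of the lows and the median of the highs so that a
    single extreme opinion does not dominate the result.
    """
    if not ranges:
        raise ValueError("need at least one estimate")
    for low, high in ranges:
        if not 0.0 <= low <= high <= 1.0:
            raise ValueError(f"invalid range: ({low}, {high})")
    lows = [low for low, _ in ranges]
    highs = [high for _, high in ranges]
    return median(lows), median(highs)


if __name__ == "__main__":
    # Made-up answers from three team members.
    team = [(0.10, 0.30), (0.15, 0.25), (0.05, 0.40)]
    low, high = pool_estimates(team)
    print(f"Pooled estimate: {low:.0%} to {high:.0%}")
```

The point of the sketch is only that a room full of range estimates can be reduced to one number the board can react to; it is still a simple model.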

You can absolutely go down the rabbit hole building more complex models using cost projections, Monte Carlo simulations, latency curves and other things that have something to do with math. Below is a list of reference books and papers that will help you do that. But the bottom line is that with this first step, you now have something that you can take to the board; you now have a simple but precise estimate of the risk that your organization might be materially hacked in the near future.
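For a taste of what the more complex end of that rabbit hole looks like, here is a minimal Monte Carlo sketch in Python. Everything in it is an assumption made for illustration: the function name, the choice of a lognormal distribution for loss size, and the idea of defining "material" as total losses reaching a dollar threshold. None of it comes from a specific book or framework.

```python
import math
import random


def simulate_material_risk(
    annual_event_probability: float,
    median_loss: float,
    loss_sigma: float,
    materiality_threshold: float,
    years: int = 3,
    trials: int = 20_000,
    seed: int = 42,
) -> float:
    """Fraction of simulated futures whose total loss over `years`
    reaches the materiality threshold."""
    rng = random.Random(seed)      # fixed seed keeps runs repeatable
    mu = math.log(median_loss)     # the median of a lognormal is exp(mu)
    material_trials = 0
    for _ in range(trials):
        total_loss = 0.0
        for _ in range(years):
            # Assumed: at most one event per year, years are independent.
            if rng.random() < annual_event_probability:
                total_loss += rng.lognormvariate(mu, loss_sigma)
        if total_loss >= materiality_threshold:
            material_trials += 1
    return material_trials / trials


if __name__ == "__main__":
    # All inputs below are invented numbers, not benchmarks.
    risk = simulate_material_risk(
        annual_event_probability=0.10,
        median_loss=250_000.0,
        loss_sigma=1.2,
        materiality_threshold=1_000_000.0,
    )
    print(f"Estimated probability of a material event in 3 years: {risk:.1%}")
```

The output is only as good as the inputs, which is where the tactical metrics come back in: they are the evidence the team uses to argue about what the event probability and loss size should be.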

Your risk culture may vary.

The key thing about risk, and what you might do about it, is absolutely tied to the culture of the organization that’s considering the question. Let’s say that your InfoSec team said there was a 20% chance of being materially impacted by a cyber event in the next three years. Some reading this might say that was unacceptable, and that we need to reduce that number by quite a bit. Others would say that 20% compared to other risks the business is dealing with seems reasonable. They’d be willing to eat that risk and deal with the consequences if something happened later. Both could be correct answers depending on the leadership’s risk appetite.

In my younger days, I used to think that if I failed to convince company leadership to fund my pet security project, it was because they were too dumb to understand the intricacies of cybersecurity. They were typical non-techies and they just didn’t get it. In hindsight, that was terribly naive. The most likely reason that I failed was probably that I did not do a good job convincing them of the risk. The second most likely reason was that even if they believed me, they considered the risk to be acceptable. It took me a long time to understand that. It took me a long time to understand the significance of the fact that company leaders deal with this kind of thing all the time; that cyber risk is just another risk in the hundreds that executives have to consider as they shepherd their companies toward success. Security executives can help them do that by evaluating and conveying the risks to those company leaders in a way that they can understand. The first step is building your first model with the security metrics that you have already collected. Over time, enhance that model with better metrics and better math. Just don’t forget to turn off the data collection so that you don’t overrun your email system.

I still think I should have received a medal for that.