Bayes Rule: A different way to think about cybersecurity risk.
Rick Howard: Hey, everybody. Rick here. In the last show I talked about superforecasters from Dr. Tetlock's book of the same name. He makes the case - and I agree with him - that it's possible to forecast answers to highly complex questions - queries that seemingly no one could possibly answer because there are no prior data or history of occurrence - with enough accuracy to make meaningful decisions in the real world. Specifically, I believe we can use his superforecasting techniques to estimate the probability of material impact to our own organizations due to a cyberattack. From Tetlock's book, I call one of those techniques the Fermi outside-in back-of-the-envelope calculation. Now, that's a long phrase that means basically informed guessing. As you recall, Dr. Enrico Fermi was famous for his ability to estimate the results of real-world experiments by breaking the problem down into smaller and smaller, more manageable problems, which he could reasonably guess the answer to. His calculations weren't precise, but they were close enough to be in the same ballpark as the real answer.
Rick Howard: I made the case in the last show that for most cybersecurity resource decisions about where to spend money, where to allocate people and where to refine the business process, security practitioners don't need to have high-precision answers. Fermi estimates are probably good enough. And it's not an either-or decision. In those unusual cases where you absolutely need the precise answers, you can still do the work to find them. Superforecasting techniques in general - and specifically Fermi outside-in back-of-the-envelope calculations - are two legs to the cybersecurity risk forecasting stool.
Rick Howard: The third leg is something called Bayes' rule. And it's the mathematical foundation that proves that superforecasting techniques and Fermi estimates work. The great news, though, is that CSOs like me don't have to perform higher-order math to make it work for us. This is what superforecasters do. For cybersecurity, we can use basic statistics in the general case and expert opinion from our internal staff to get an initial estimate. We can then modify the forecast based on how well our organizations do in adhering to our set of cybersecurity first principles. Before I can show you how to do that, though, let me first explain Bayes' theorem.
Rick Howard: My name is Rick Howard, and I'm broadcasting from the CyberWire's Secret Sanctum Sanctorum Studios, located underwater somewhere along the Patapsco River near Baltimore Harbor, Md., in the good ol' U.S. of A. And you're listening to "CSO Perspectives," my podcast about the ideas, strategies and technologies that senior security executives wrestle with on a daily basis.
Rick Howard: If you recall from previous shows that I have done about cyber risk, I used to think that in order to solve the cybersecurity risk forecasting problem, I was going to have to know how to perform some higher-order math, like implementing Bayes' algorithm and running Monte Carlo simulations with Markov chains. But after reading Tetlock's superforecasting book, I realized that we might have to do that sometimes, but for most cases, we just have to understand how the Bayes algorithm works and use some Fermi best guesses to get a ballpark number. So let's take a look at Bayes' algorithm.
Rick Howard: The Bayesian interpretation of probabilities comes from Thomas Bayes, who penned the original thesis back in the 1740s. But what is not commonly known is that nobody would have heard about the idea if it weren't for his best friend, Richard Price. Price, no slouch in the science department himself, found Bayes' unpublished manuscript in a drawer after Bayes died, realized its importance, spent two years fixing it up and sent it to the Royal Society of London for publication in 1763. In the manuscript, Bayes - with Price - describes a thought experiment to illustrate his hypothesis.
Rick Howard: Bayes asks the reader to imagine a billiard table and two people - the guesser and the assistant. The guesser turns her back, and the assistant rolls the cue ball onto the table and lets it settle somewhere. The guesser's job is to forecast where the cue ball is located on the flat surface. She has a piece of paper, a pencil and draws a rectangle to represent the level platform. The assistant then rolls a second ball and tells the guesser if it's settled to the right or to the left of the cue ball. The guesser makes an initial estimate on the paper as to which side of the table the cue ball resides. The assistant then rolls a third ball and tells the guesser which side of the cue ball it landed. Based on that information, the guesser adjusts her initial estimate. The more balls the assistant rolls, the more precise the guesser gets with her forecast. She will never know exactly where the cue ball is, but she can get fairly close. This, in essence, is Bayes' thesis. We can have an initial estimate of the answer no matter how broad that might be - somewhere on the billiard table - and gradually collect new evidence - right or left of the cue ball - that allows us to adjust the initial estimate to get closer to the real answer.
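The billiard table experiment is easy to simulate. What follows is a minimal sketch in Python, not anything from Bayes' manuscript. It assumes the table is reduced to a single left-to-right axis running from 0.0 to 1.0 and that every new ball lands uniformly at random along it. The function name `locate_cue_ball`, the grid of candidate positions and the random seed are all illustrative choices.

```python
import random


def locate_cue_ball(true_position, rolls, grid_size=101, seed=0):
    """Estimate a cue ball's left-to-right position from left/right reports."""
    rng = random.Random(seed)
    # Candidate positions from 0.0 (left cushion) to 1.0 (right cushion).
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Initial estimate: every candidate position is equally believable.
    belief = [1.0 / grid_size] * grid_size

    for _ in range(rolls):
        # The assistant rolls a ball and reports whether it stopped to the left.
        landed_left = rng.random() < true_position
        # Chance of that report if the cue ball sat at each candidate x:
        # a uniformly rolled ball lands left of x with probability x.
        likelihood = [x if landed_left else 1.0 - x for x in grid]
        unnormalized = [b * l for b, l in zip(belief, likelihood)]
        total = sum(unnormalized)
        # The adjusted estimate becomes the starting point for the next roll.
        belief = [u / total for u in unnormalized]

    # Summarize the belief as a single best guess (its mean).
    estimate = sum(x * b for x, b in zip(grid, belief))
    return estimate, belief


if __name__ == "__main__":
    for rolls in (0, 5, 50, 500):
        guess, _ = locate_cue_ball(true_position=0.3, rolls=rolls)
        print(f"after {rolls:3d} rolls the guesser says {guess:.3f}")
```

With no rolls the guess sits in the middle of the table. As the rolls accumulate it drifts toward the true position without ever pinning it down exactly, which is the behavior the thought experiment describes.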
Rick Howard: According to Sharon McGrayne - author of the 2011 book "The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy" - quote, "by updating our initial beliefs with objective new information, we get a new and improved belief." She says that Bayes' rule is a measure of belief and that we can learn "even from missing and inadequate data, from approximations and from ignorance," end quote. Even though Bayes was a mathematician, he didn't work out the actual probabilistic formula called Bayes' rule that is used today. That didn't come until Pierre Laplace - the French mathematician, astronomer and physicist who was best known for his investigations into the stability of the solar system and his discovery of the central limit theorem - arrived at the same idea independently of Bayes in 1774 and spent the next 40 years working out the math. Today we attribute Bayes' theorem to Thomas Bayes because of scientific convention. He was the first to come up with the idea. But in reality, according to McGrayne, we should call it the Bayes-Price-Laplace algorithm. Without Price and Laplace, Bayes' theorem would never have seen the light of day.
Rick Howard: Modern Bayesian scientists use words like the prior to represent the initial estimate - the cue ball by itself on the table - the likelihood to represent the probability of the new information we are receiving - where is the cue ball in relation to the second ball? - and the posterior to represent the new estimate after we combine the prior and the likelihood in Bayes' theorem. According to McGrayne, quote, "each time the system is recalculated, the posterior becomes the prior of the new iteration," end-quote. That is an elegant idea, but the scientific community completely rejected the thesis after the Royal Society published Bayes' manuscript. You have to remember that at the time, science was moving away from religious dogma as a way to describe the world. These new scientist-statisticians, called the frequentists, were basing everything on observable facts. They had to count things, like the number of cards in a deck, before they would feel comfortable predicting the odds of an ace showing up on the flop. The idea that you could brand Bayes' fuzzy estimates as science without observable facts was anathema, and leading statisticians attacked it at every opportunity for the next 150 years. To them, modern science required both objectivity and past knowledge. According to Hubbard and Seiersen in their book "How to Measure Anything in Cybersecurity Risk," Gottfried Achenwall introduced the word statistics in 1749, derived from the Latin word statisticum, meaning pertaining to the state. Statistics was literally the quantitative study of the state. Back in 2011, McGrayne gave a Google talk about her book. Here's what she says about the frequentist view.
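For reference, the prior, likelihood and posterior fit together in the standard statement of Bayes' theorem. The symbols are not from the episode: H stands for the hypothesis (where the cue ball is) and E for the new evidence (the latest left-or-right report).

```latex
\underbrace{P(H \mid E)}_{\text{posterior}}
  = \frac{\overbrace{P(E \mid H)}^{\text{likelihood}} \;\, \overbrace{P(H)}^{\text{prior}}}{P(E)},
\qquad
P(E) = \sum_{i} P(E \mid H_i)\, P(H_i)
```

The denominator only rescales the result so the posterior probabilities over all candidate hypotheses add up to 1. On the next round of evidence, the posterior on the left takes the place of the prior on the right.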
(SOUNDBITE OF ARCHIVED RECORDING)
Sharon McGrayne: Because for them, modern science requires both objectivity and precise answers. And Bayes, of course, calls for a measure of belief and approximations. And the frequentists called that, quote, "subjectivity run amok, ignorance coined into science." By the 1920s, they were saying that Bayes smacked of astrology, of alchemy.
Rick Howard: But the real world has problems where the data is scant. Leaders worry about potential events that have never happened but are possible, like a future ransomware attack. Bayesian philosophy was a way to estimate answers that were useful in the real world, and outsiders to the statistical community began experimenting with the method to attain real results. Amazingly, after 280 years, Bayes' rule still meets with friction in some circles of the scientific community. There still seems to be an attitude that it's an either/or choice - either you're a frequentist or a Bayesian. That's a shame because, like Euclid's first-principle math rules, the Bayesian rule is true because it works. In the 2012 movie "Lincoln," starring Academy Award winner Daniel Day-Lewis as the 16th president, in one poignant scene late at night, Lincoln is explaining to two young signal officers, but mostly to himself, why slavery is bad, that the notion of it being bad is self-evident.
(SOUNDBITE OF FILM, "LINCOLN")
Daniel Day-Lewis: (As Abraham Lincoln) You're an engineer. You must know Euclid's axioms and common notions.
Unidentified Actor: (As character) I must have in school, but...
Daniel Day-Lewis: (As Abraham Lincoln) I never had much schooling. But I read Euclid in an old book I borrowed. Little enough found its way in here, but once learned, it stayed there. Euclid's first common notion is this - things which are equal to the same thing are equal to each other. That's a rule of mathematical reasoning. It's true because it works - has done and always will do. In his book, Euclid says this is self-evident. You see, there it is. Even in that 2,000-year-old book of mechanical law, it is a self-evident truth that things which are equal to the same thing are equal to each other.
Rick Howard: Now, I'm a firm believer in using the tool that's fitted to the task. If the frequentist tools fit, use those. If the Bayesian tools are a better choice, use those. The great thing about Bayes' rule is that you can use both. At this point, the Bayesian math tool set has so many examples of solving real-world problems, it seems ludicrous to argue against it. And it's clear today in our cybersecurity evolution that the frequentist tool set has not helped in predicting cybersecurity risk. What I'm advocating here is that it's time for the security community to try a new set of tools.
Rick Howard: McGrayne's book, "The Theory That Would Not Die," is a delightful history of the theory's evolution from creation to modern day - its successes and failures and blood feuds between mathematicians over the years. I highly recommend it if this subject intrigues you, and it should. If you're looking for a Reader's Digest version, though, I've linked to her Google talk in the show notes. In the book, she outlines over 20 success stories, tales where scientists used Bayes' rule to solve complex problems over the years. I've summarized them in the essay that accompanies this podcast. But out of all of them, my favorite story is how Alan Turing used Bayes' rule to crack the German code machine Enigma in World War II. Turing is my favorite all-time computer science hero. In his short and tragic life, he accomplished so many things. In the 1930s, he mathematically proved that computers were possible with something called the Turing machine some 10 years before we were actually able to make them. Today, every single computer you use - from your smartphone to your laptop to your workloads in the cloud - is a Turing machine. In 1950, he published the first test for artificial intelligence, the Turing test, that leaders in the field are still debating today. And during World War II, his six years of work at Bletchley Park breaking German encrypted messages, according to some historians, probably saved 20 million lives and shortened the war by four years. And he used Bayes' hypothesis to do it.
Rick Howard: There were many versions of the Enigma machine before, during and after the war. But in general, the encryption machinery consisted of four mechanical parts. The first part, the keyboard - coders would type the unencrypted message one letter at a time on something that looked like a typewriter. When they pressed the unencrypted letter, the transformed encrypted letter would light up. The coder would write that letter down in an encrypted message for later transmission by Morse code over radio.
Rick Howard: The second part, the plugboard - using 26 sockets, one socket for each letter in the alphabet, coders would use steckers - spelled S-T-E-C-K-E-R-S - to plug one letter into another one - say, Q to R. This had the effect of swapping the values. If the coder typed Q on the keyboard, R would go through the system.
Rick Howard: The third part, the rotors - each rotor, a ring with a unique arrangement of 26 letters, had a starting position that coders changed on a regular basis. In a three-rotor system, the machine came with five different rotors to choose from. Each rotor performs a simple substitution cipher right to left, one after the other. For example, the pin corresponding to the letter R might be wired to the contact for letter T. When the coder pressed a key on the keyboard, the right rotor would rotate forward one letter. This ensured that even if the coder typed the same letter twice, the encrypted letters would be different. Once the right rotor clicked over 26 times, the middle rotor would click to the next letter. Once the middle rotor clicked 26 times, the left rotor would click to the next letter. The result was more than 17,000 different combinations before the system repeated itself.
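The "more than 17,000" figure falls out of the stepping rule just described. Here is a small Python sketch that models only that rule: three rotors turning like an odometer, with the right rotor moving on every key press. The function names `step` and `cycle_length` are made up for illustration, and the model deliberately ignores the wiring inside each rotor.

```python
def step(positions):
    """Advance (left, middle, right) rotor positions by one key press."""
    left, middle, right = positions
    right = (right + 1) % 26          # the right rotor moves on every key press
    if right == 0:                    # it has clicked over 26 times
        middle = (middle + 1) % 26
        if middle == 0:               # the middle rotor has clicked over 26 times
            left = (left + 1) % 26
    return (left, middle, right)


def cycle_length(start=(0, 0, 0)):
    """Count key presses until the rotors return to their starting position."""
    positions = step(start)
    presses = 1
    while positions != start:
        positions = step(positions)
        presses += 1
    return presses


if __name__ == "__main__":
    print(cycle_length())  # 26 * 26 * 26 = 17576
```

Because the rotor positions differ on every press within that cycle, the same plaintext letter takes a different path through the rotors each time it is typed.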
Rick Howard: The last part, the reflector - once the signal passed through the plugboard and through the three rotors right to left, it passed through the reflector that redirected the signal back through the rotors - this time left to right - and then back through the plugboard and, finally, back to the keyboard to light up the encrypted letter. All in all, each individual unencrypted letter went through eight transformations - plugboard, three rotors right to left, three rotors left to right, plugboard. With this system, the number of ways the Germans could scramble a message was nearly 159 quintillion. That's 159 followed by 18 zeros.
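The 159 quintillion figure can be reproduced with some counting. The Python sketch below multiplies the ways to pick and order three rotors from five, the rotor starting positions and the plugboard pairings. One input is an assumption that the episode does not state: it uses 10 plugboard cables, the number usually cited for this figure. The constant and function names are illustrative.

```python
from math import factorial

LETTERS = 26
ROTORS_AVAILABLE = 5
ROTORS_USED = 3
PLUG_CABLES = 10  # assumption: not stated in the episode


def rotor_orders():
    """Ways to choose three of the five rotors and place them in order."""
    return factorial(ROTORS_AVAILABLE) // factorial(ROTORS_AVAILABLE - ROTORS_USED)


def rotor_positions():
    """Starting positions for three rotors with 26 letters each."""
    return LETTERS ** ROTORS_USED


def plugboard_settings(cables=PLUG_CABLES):
    """Ways to connect `cables` pairs of letters on a 26-socket plugboard."""
    unplugged = LETTERS - 2 * cables
    return factorial(LETTERS) // (
        factorial(unplugged) * factorial(cables) * 2 ** cables
    )


def total_settings():
    return rotor_orders() * rotor_positions() * plugboard_settings()


if __name__ == "__main__":
    print(f"{rotor_orders():,} rotor orders")
    print(f"{rotor_positions():,} rotor positions")
    print(f"{plugboard_settings():,} plugboard settings")
    print(f"{total_settings():,} settings in total")
```

Nearly all of the size comes from the plugboard. The rotors alone contribute only about a million combinations.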
Rick Howard: You're listening to the movie soundtrack to the 2014 movie "The Imitation Game" about how Turing and his colleagues broke Enigma. It's one of my favorite movies of the last decade. The title is taken from his paper, where he described the Turing test for artificial intelligence. And I've been telling people for years that a scene in this movie is the best layman's description for AI that I have ever come across. Here's Benedict Cumberbatch playing Turing.
(SOUNDBITE OF FILM, "THE IMITATION GAME")
Rory Kinnear: (As Robert Nock) Could machines ever think as human beings do?
Benedict Cumberbatch: (As Alan Turing) Most people say no.
Rory Kinnear: (As Robert Nock) You're not most people.
Benedict Cumberbatch: (As Alan Turing) Well, the problem is you're asking a stupid question.
Rory Kinnear: (As Robert Nock) I am?
Benedict Cumberbatch: (As Alan Turing) Of course, machines can't think as people do. A machine is different from a person, hence they think differently. The interesting question is just because something thinks differently from you, does that mean it's not thinking? You know, we allow for humans to have such divergences from one another. You like strawberries. I hate ice skating. You cry at sad films. I am allergic to pollen. What is the point of different tastes, different preferences, if not to say that our brains work differently, that we think differently? And if we could say that about one another, then why can't we say the same thing for brains built of copper and wire, steel?
Rory Kinnear: (As Robert Nock) And that's - this big paper you wrote - what's it called?
Benedict Cumberbatch: (As Alan Turing) The imitation game.
Rory Kinnear: (As Robert Nock) Right, that's what it's about.
Benedict Cumberbatch: (As Alan Turing) Would you like to play?
Rory Kinnear: (As Robert Nock) Play?
Benedict Cumberbatch: (As Alan Turing) It's a game, a test of sorts, for determining whether something is a machine or a human being.
Rory Kinnear: (As Robert Nock) How do I play?
Benedict Cumberbatch: (As Alan Turing) Well, there's a judge and a subject. The judge asks questions and, depending on the subject's answers, determines who he is talking with or what he is talking with. And all you have to do is ask me a question.
Rick Howard: But back to Enigma - according to McGrayne, Turing, with the help of mathematician Gordon Welchman and engineer Harold "Doc" Keen, designed a high-speed electromechanical machine for testing every possible wheel arrangement in Enigma. Turing called the machine the Bombe. His radical Bayesian design tested hunches - 15-letter tidbits suspected of being in the original message. Because it was faster to toss out possibilities than to find the one that fit, Turing's Bombe simultaneously tested for wheel combinations that could not work. He also invented the manual Bayes system called Banburismus, named after Banbury, a nearby town where Bletchley Park got supplies. This system let him guess a stretch of letters in an Enigma message, hedge his bets, measure his belief in their validity by using Bayesian methods to assess their probabilities and add more clues as they arrived. This system could identify the settings for two of Enigma's three wheels and reduce the number of wheel settings to be tested on the Bombes from 336 to as few as 18.
Rick Howard: Now, breaking Enigma codes was time-sensitive. The Germans changed their Enigma settings - plugboard and rotor configurations - routinely, most times daily, but sometimes every eight hours. Turing needed a way to measure his priors - his hunches from Banburismus. He invented the ban - spelled B-A-N, short for Banburismus - which, according to Irving John "Jack" Good, one of Turing's closest associates at Bletchley, measured the smallest weight of evidence perceptible to the intuition. The way that McGrayne describes it is one ban represented odds of 10-to-1 in favor of a guess. But Turing normally dealt with much smaller quantities - decibans and even centibans. When the accumulated evidence added up to odds of 50-to-1, cryptanalysts were almost certain that their 15-letter tidbits were correct. According to McGrayne, each ban made a hypothesis 10 times more likely. Remember, Turing was trying to find ways to discard hunches quickly, not find the exact answer. When he got to 50-to-1, he could stop the process.
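The arithmetic behind the ban is a base-10 logarithm of the odds, which is why each ban makes a hypothesis 10 times more likely and why clues can be added instead of multiplied. The Python sketch below shows that bookkeeping. It is not a reconstruction of the Bletchley Park procedure: the function names, the sample clue strengths and the choice to start from even odds are all assumptions for illustration.

```python
import math


def odds_to_decibans(odds):
    """Convert odds in favor (10 means 10-to-1) into decibans."""
    return 10 * math.log10(odds)


def decibans_to_odds(decibans):
    """Convert a deciban score back into odds in favor."""
    return 10 ** (decibans / 10)


def accumulate(clue_odds, threshold_odds=50):
    """Add up evidence clue by clue, starting from even odds.

    Returns the running total in decibans and how many clues were used
    before the total reached the threshold (or all of them if it never did).
    """
    threshold = odds_to_decibans(threshold_odds)
    total = 0.0
    used = 0
    for odds in clue_odds:
        total += odds_to_decibans(odds)
        used += 1
        if total >= threshold:
            break
    return total, used


if __name__ == "__main__":
    print(f"10-to-1 is {odds_to_decibans(10):.1f} decibans (1 ban)")
    print(f"50-to-1 is {odds_to_decibans(50):.1f} decibans")
    # Hypothetical clues, each shifting the odds by a modest factor.
    total, used = accumulate([2, 3, 1.5, 4, 2, 5])
    print(f"stopped after {used} clues at {decibans_to_odds(total):.0f}-to-1")
```

Working in decibans turns a chain of multiplications into a running sum, which is far easier to keep by hand under time pressure.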
Rick Howard: Now, if you think Turing's bans sound eerily similar to Claude Shannon's bits, you'd be right. Shannon published his groundbreaking paper, "A Mathematical Theory of Communication," in 1948. And according to the science site HRF, he defines the smallest units of information that cannot be divided any further. These units are called bits, which stand for binary digits. Strings of bits can be used to encode any message. Digital coding is based around bits and has just two values - zero or one. Shannon introduced the idea of information entropy, that, according to Jane Stewart Adams in a fabulous essay called "The Ban and the Bit: Alan Turing, Claude Shannon, and the Entropy Measure" - link in the show notes - information wasn't contained in the bits themselves, but how disordered they were when they arrived. According to James Gleick, author of "The Information: A History, a Theory, a Flood," quote, "a Shannon bit was a fulcrum around which the world began to turn. The bit now joined the inch, the pound, the quart, and the minute as a determinate quantity - a fundamental unit of measure. But measuring what? A unit of measuring information, Shannon wrote, as though there were such a thing measurable and quantifiable as information," end quote.
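The link between the two units can be made concrete. A ban is a base-10 logarithm and a bit is a base-2 logarithm, so one ban is worth log2(10), a little over three bits. The Python sketch below shows that conversion along with a basic Shannon entropy calculation over the symbols in a message. The function name `entropy_bits` and the sample strings are illustrative only.

```python
import math
from collections import Counter

# One ban (a factor of 10) expressed in bits (factors of 2).
BITS_PER_BAN = math.log2(10)


def entropy_bits(message):
    """Average information per symbol, in bits, from symbol frequencies."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    print(f"1 ban = {BITS_PER_BAN:.3f} bits")
    for sample in ("AAAAAAAA", "HTHTHTHT", "ABCDEFGH"):
        print(f"{sample}: {entropy_bits(sample):.2f} bits per symbol")
```

A message that repeats one symbol carries no information per symbol. The more evenly the symbols are spread, the higher the entropy, which is the "disorder" that Adams' essay refers to.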
Rick Howard: According to Good, Turing's colleague, Turing independently invented bans in 1941, seven years before the Shannon paper. The interesting thing is that Turing actually spent several days with Shannon in the United States in 1943. The intriguing question is, did these two men talk about bans and bits when they met? In other words, did Turing give Shannon the idea? Shannon emphatically says no, and I believe him. Turing, at the time, was still working under Britain's secrecy act. Only a handful of allies actually knew the full story about what was going on at Bletchley Park. Turing was one of them, but he never talked about Enigma outside of those circles, even when he was arrested and threatened with prison later. It's a weird coincidence, though, and makes you wonder.
Rick Howard: At the height of the war, Bletchley Park was a codebreaking factory, with as many as 200 Bombes running at any given time, supported by some 9,000 people. Turing and all of the codebreakers at Bletchley Park made it possible for Allied leaders to see Hitler's orders, most times before the German commanders in the field saw them. Turing's tragedy resulted from two facts - he was gay, and the British were implacable about the need to keep their codebreaking capabilities secret. Many Bletchley Park workers went to their graves without anybody in their families knowing the significance of what they were doing during the war. After the British Prime Minister, Winston Churchill, gave the order to destroy most of the Bombes except for a handful, to keep the secrets safe, he used the remaining Bombes and their successors like the Colossus to spy on the Russians after the war, and he didn't want anybody to know that he could do it.
Rick Howard: Codebreaking was so secret that, after the war, nobody outside the small and cleared codebreaking community knew who Turing was or what he accomplished, or even that the Bayes rule was a good method to use in cryptanalysis. And then, according to McGrayne, paranoia captured the West's imagination. The Soviets detonated their first atomic bomb, China became a communist country, and we found spies everywhere - Alger Hiss, Klaus Fuchs and Julius and Ethel Rosenberg. Senator Joseph McCarthy in the States accused prominent U.S. citizens of being communist. Two openly gay English spies, Guy Burgess and Donald Maclean, escaped to the USSR. American intelligence warned the Brits about another homosexual spy, Anthony Blunt. Leaders on both sides of the pond were worried about an international homosexual spy ring. The Americans banned gays from entering the country, and the Brits started arresting homosexuals in droves. And that's what happened to Turing. He got arrested for being gay. And since nobody knew who he was and his work was so secret, no government official stepped up to vouch for him or to protect him.
Rick Howard: According to McGrayne, the world lionized the Manhattan Project physicists who engineered the atomic and hydrogen bombs, Nazi war criminals went free, and the United States recruited German rocket experts. The main one, Wernher von Braun, the brain behind the V-2 rockets that terrorized London during the war, made several space documentaries with Walt Disney, of all people, in the late 1950s. But Turing was found guilty. Less than a decade after England fought a war against Nazis who had conducted medical experiments on their prisoners, an English judge forced Turing to choose between prison and chemical castration. Turing chose castration, a barbaric practice of estrogen injections, with no scientific merit, designed to curb his sexual preference. He grew breasts, and the drugs messed with his mind. On June 7, 1954 - two years after he was arrested - he committed suicide at the age of 41.
Rick Howard: I first learned about Turing in the early 2000s after I read Neal Stephenson's novel "Cryptonomicon." Over the years since, I have kept picking up pieces of Turing's story. The stark tragedy of this is really hard to take, even for me, and I've reread the story many times. For me, it's like going through the grieving process. One of our greatest minds, one of our most brilliant mathematicians, and one who almost single-handedly saved 20 million lives, was cut down in his prime at the age of 41 - 41 - alone, with no friends or colleagues, with nobody seeing who he really was at a time when it mattered most. I just want to raise my fist to the skies in rage. And the mind boggles just thinking about the could-have-beens. What would Turing have done with artificial intelligence if left to himself after the war? What computers would he have helped build? What would he and Shannon have done together to advance information theory? And what progress could he have made in Bayes' Theorem?
Rick Howard: In the movie "The Imitation Game," the actress Keira Knightley, playing Turing's former fiancee Joan Clarke, comes to Turing near the end of his life to give him comfort about the significance of his work.
(SOUNDBITE OF FILM, "THE IMITATION GAME")
Benedict Cumberbatch: (As Alan Turing) You got what you wanted, didn't you? Work, husband, normal life.
Keira Knightley: (As Joan Clarke) No one normal could have done that. You know, this morning, I was on a train that went through a city that wouldn't exist if it wasn't for you. I bought a ticket from a man who would likely be dead if it wasn't for you. I read up on my work, a whole field of scientific inquiry that only exists because of you. Now, if you wish you could have been normal, I can promise you I do not. The world is an infinitely better place precisely because you weren't.
Benedict Cumberbatch: (As Alan Turing) Do you really think that?
Keira Knightley: (As Joan Clarke) I think that sometimes it is the people who no one imagines anything of who do the things that no one can imagine.
Rick Howard: We'll be right back with a wrap up of Bayes' rule.
Rick Howard: As I said, the Bayes rule is the third leg to our risk forecasting stool alongside some superforecasting techniques and Fermi estimates. The idea that you can make a broad estimate about your risk with nothing more than an educated guess - the initial cue ball - and then refine that guess over time with new information as it becomes available - rolling many more balls on the table - is genius. You don't have to have years and years of actuarial table data before you can calculate the risk. You don't have to count all the things first. And it's not just a good idea either. It's supported by 250 years of math evolution, from Thomas Bayes to Pierre Laplace to Alan Turing to Bill Gates. For years, I've been trying to get my head around how to calculate cyber-risk for my own organization with enough precision to make decisions with. With superforecasting, Fermi estimates and now the Bayes rule, I'm convinced that the path ahead is clear.
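To make the shape of that process concrete, here is a minimal Python sketch of refining a risk estimate in the odds form of Bayes' rule. Every number in it is invented for illustration: the 20 percent starting guess and the likelihood ratios attached to each piece of evidence are placeholders, not figures from this show or from any data set. The function names are likewise hypothetical.

```python
def to_odds(probability):
    """Convert a probability (0 to 1, exclusive) into odds in favor."""
    return probability / (1.0 - probability)


def to_probability(odds):
    """Convert odds in favor back into a probability."""
    return odds / (1.0 + odds)


def update(prior_probability, likelihood_ratios):
    """Refine a prior one piece of evidence at a time.

    Each likelihood ratio says how much more likely the evidence is if a
    material incident is coming than if it is not. Ratios below 1 lower
    the estimate and ratios above 1 raise it.
    """
    odds = to_odds(prior_probability)
    for ratio in likelihood_ratios:
        odds *= ratio  # the posterior odds become the next prior odds
    return to_probability(odds)


if __name__ == "__main__":
    prior = 0.20  # hypothetical educated guess: the first cue ball
    evidence = {
        "strong adherence to first principles (made-up ratio)": 0.5,
        "recent failed audit (made-up ratio)": 1.8,
    }
    estimate = prior
    for label, ratio in evidence.items():
        estimate = update(estimate, [ratio])
        print(f"after {label}: {estimate:.1%}")
```

The point of the sketch is the loop, not the numbers: start with a rough prior, fold in each new piece of evidence, and carry the result forward as the next starting estimate.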
Rick Howard: And that's a wrap. And as always, if you agree or disagree with anything I've said, hit me up on LinkedIn or Twitter, and we can continue the conversation there. Or if you prefer email, drop a line to firstname.lastname@example.org - that's C-S-O-P, the @ sign, thecyberwire - all one word - dot com. And if you have any questions you would like us to answer here at "CSO Perspectives" send a note to the same email address, and we will try to address them in the show. For next week's show, I'm going to walk us through an example of how to calculate our first prior using some Fermi estimates to forecast cyber-risk. You don't want to miss that.
(SOUNDBITE OF TV SHOW, "BATMAN")
William Dozier: (As Narrator) Same bat-time, same bat-channel.
Rick Howard: The CyberWire's "CSO Perspectives" is edited by John Petrik and executive produced by Peter Kilpe. Our theme song is by Blue Dot Sessions, remixed by the insanely talented Elliott Peltzman, who also does the show's mixing, sound design and original score. And I'm Rick Howard. Thanks for listening.