Bayes Rule: A different way to think about cybersecurity risk.
In the last essay, I talked about superforecasting, from Dr. Tetlock’s book of the same name. He makes the case, and I agree with him, that it’s possible to forecast answers to highly complex questions, queries that seemingly no one could possibly answer because there are no prior data or history of occurrence, with enough accuracy to make meaningful decisions in the real world. Specifically, I believe we can use his superforecasting techniques to estimate the probability of material impact to our own organizations due to a cyber attack.
From Tetlock’s book, I call one of those techniques the Fermi-outside-in-back-of-the-envelope calculation. That’s a long phrase that means basically informed guessing. As you recall, Dr. Enrico Fermi was famous for his ability to estimate the results of real world experiments by breaking the problem down into smaller and smaller more manageable problems which he could reasonably guess the answer to. His calculations weren’t precise but they were close enough to be in the same ballpark as the real answer. I made the case in the last essay that for most cybersecurity resource decisions about where to spend money, where to allocate people, and where to refine business processes, security practitioners don’t need to have high precision answers. Fermi estimates are probably good enough. And, it’s not an either-or decision. In those unusual cases where you absolutely need the precise answers, you can still do the work to find them.
Superforecasting techniques in general and specifically Fermi-outside-in-back-of-the-envelope calculations are two legs to the cybersecurity risk forecasting stool. The third leg is something called the Bayes Rule, and it's the mathematical foundation that proves that superforecasting techniques and Fermi estimates work. The great news though, is that CISOs like me don’t have to perform higher order math to make it work for us. We just have to understand the concept and apply it to our day-to-day risk assessments. We can use basic statistics in the general case and expert opinion from our internal staff to get an initial estimate. We can then modify the forecast based on how well our organizations do in adhering to our set of cybersecurity first principles. Before I can show how to do that though, let me talk first about Bayes' Theorem.
The Bayesian interpretation of probabilities comes from Thomas Bayes, who penned the original thesis back in the 1740s. But what is not commonly known is that nobody would have heard about the idea if it weren’t for his best friend, Richard Price. Price, no slouch in the science department himself, found Bayes’s unpublished manuscript in a drawer after Bayes died, realized its importance, spent two years fixing it up, and sent it to the Royal Society of London for publication in 1763.
In the manuscript, Bayes (with Price) describes a thought experiment to illustrate his hypothesis. Bayes asks the reader to imagine a billiard table and two people, the guesser and the assistant. The guesser turns her back and the assistant rolls the cue ball onto the table and lets it settle somewhere. The guesser’s job is to forecast where the cue ball is located on the flat surface. She has a piece of paper and a pencil, and draws a rectangle to represent the table. The assistant then rolls a second ball and tells the guesser if it settled to the right or to the left of the cue ball. The guesser makes an initial estimate on the paper as to which side of the table the cue ball resides on. The assistant then rolls a third ball and tells the guesser which side of the cue ball it landed on. Based on that information, the guesser adjusts her initial estimate. The more balls the assistant rolls, the more precise the guesser gets with her forecast. She will never know exactly where the cue ball is, but she can get fairly close.
This, in essence, is Bayes's thesis. We can have an initial estimate of the answer no matter how broad that might be (somewhere on the billiard table) and gradually collect new evidence, right or left of the cue ball, that allows us to adjust that initial estimate to get closer to the real answer.
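For readers who like to see an idea run, here is a minimal Python sketch of the billiard table experiment. It is an illustration under stated assumptions, not Bayes's own calculation: the table is reduced to one dimension (0.0 is the left edge, 1.0 is the right edge), the guesser only considers 101 evenly spaced candidate positions, and the function name `billiard_posterior` and its parameters are made up for this example.

```python
import random


def billiard_posterior(cue_position, n_rolls, grid_size=101, seed=0):
    """Estimate the cue ball's position from left/right reports alone."""
    rng = random.Random(seed)
    # Candidate positions along the table, from 0.0 (left edge) to 1.0 (right edge).
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Initial estimate: the cue ball could be anywhere, so every spot is equally likely.
    belief = [1.0 / grid_size] * grid_size

    for _ in range(n_rolls):
        # The assistant rolls a ball and reports only which side it landed on.
        landed_left = rng.random() < cue_position
        # A ball lands left of a cue ball sitting at x with probability x.
        weights = [x if landed_left else 1.0 - x for x in grid]
        belief = [b * w for b, w in zip(belief, weights)]
        total = sum(belief)
        belief = [b / total for b in belief]  # renormalize so the beliefs sum to 1

    estimate = sum(x * b for x, b in zip(grid, belief))
    return estimate, belief


# Hypothetical run: the cue ball really sits 30% of the way across the table.
for rolls in (0, 5, 50, 500):
    estimate, _ = billiard_posterior(cue_position=0.3, n_rolls=rolls)
    print(f"{rolls:>3} rolls -> estimated position {estimate:.2f}")
```

With no rolls, the estimate sits in the middle of the table. As the reports accumulate, it should drift toward the true position, which is the behavior the thought experiment describes.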
According to Sharon McGrayne, author of the 2011 book, “The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy,” “By updating our initial beliefs with objective new information, we get a new and improved belief.” She says that “Bayes is a measure of belief. And it says that we can learn even from missing and inadequate data, from approximations, and from ignorance.”
Even though Bayes was a mathematician, he didn’t work out the actual probabilistic formula, called Bayes Rule, that is used today. That didn’t come until Pierre Simon Laplace, the French mathematician, astronomer, and physicist, who was best known for his investigations into the stability of the solar system and his discovery of the Central Limit Theorem, identified independently, in 1774, the same notion that Bayes did and spent the next forty years working out the math. Today, we attribute Bayes Theorem to Thomas Bayes because of scientific convention (he was the first to come up with the idea). But, in reality, according to McGrayne, we should call it the Bayes-Price-Laplace algorithm. Without Price and Laplace, Bayes Theorem would never have seen the light of day.
Modern Bayesian scientists use words like, “The Prior”, to represent the initial estimate (the cue ball by itself on the table), “The Likelihood,” to represent the probability of the new information we are receiving (where is the cue ball in relation to the second ball), and “The Posterior,” to represent the new estimate after we combine “The Prior” and “The Likelihood” in Bayes Theorem. According to McGrayne, “Each time the system is recalculated, the posterior becomes the prior of the new iteration.”
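The prior, likelihood, and posterior vocabulary can be sketched in a few lines of Python. This is a hedged illustration, not a risk model: the function name `bayes_update` is hypothetical, and the 10% starting belief and the 0.8 and 0.2 likelihoods are invented numbers chosen only to show the mechanics. The posterior is the likelihood times the prior, divided by the overall probability of seeing the evidence.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the posterior probability of a hypothesis after one piece of evidence."""
    numerator = p_evidence_if_true * prior
    # Total probability of seeing this evidence, whether or not the hypothesis holds.
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


belief = 0.10  # hypothetical prior: a 10% initial belief in the hypothesis
for observation in range(1, 4):
    # Assumed likelihoods: the evidence shows up 80% of the time when the
    # hypothesis is true and 20% of the time when it is false.
    belief = bayes_update(belief, p_evidence_if_true=0.8, p_evidence_if_false=0.2)
    print(f"after observation {observation}: {belief:.3f}")
# prints 0.308, then 0.640, then 0.877
```

Each pass through the loop feeds the previous posterior back in as the new prior, which is exactly the recalculation McGrayne describes.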
That is an elegant idea but the scientific community completely rejected the thesis after the Royal Society published Bayes’ manuscript. You have to remember that at the time, science was moving away from religious dogma as a way to describe the world. These new scientist-statisticians, called the frequentists, were basing everything on observable facts. They had to count things like the number of cards in a deck before they would feel comfortable predicting the odds of an Ace showing up on the flop. The idea that you could brand Bayes’ fuzzy estimates as science without observable facts was anathema, and leading statisticians attacked it at every opportunity for the next 150 years.
To them, modern science required both objectivity and past knowledge. According to Hubbard and Seiersen in their book, “How to Measure Anything in Cybersecurity Risk,” Gottfried Achenwall introduced the word “statistics” in 1749, derived from the Latin word “statisticum,” meaning ‘pertaining to the state.’ Statistics was literally the quantitative study of the state. According to McGrayne, in the frequentist view, Bayesian philosophy requires a measure of “... belief and approximations. It is subjectivity run amok, ignorance coined into science.”
But the real world has problems where the data are scant. Leaders worry about potential events that have never happened but are possible (like a future ransomware attack). Bayesian philosophy was a way to estimate answers that were useful in the real world and outsiders to the statistical community began experimenting with the method to attain real results.
Amazingly, after 280 years, Bayes Rule still meets with friction in the scientific community. There still seems to be an either-or attitude in some circles: either you're a frequentist or a Bayesian. That’s a shame because, like Euclid’s first-principle math rules, Bayes Rule is true because it works. I'm a firm believer in using the tool that’s fitted to the task. If the frequentist tools fit, use those. If the Bayesian tools are a better choice, use those. The great thing about Bayes Rule is that you can use both. At this point, the Bayesian math tool set has so many examples of solving real world problems that it seems ludicrous to argue against it. And it’s clear at this point in the cybersecurity evolution that frequentist tool sets have not helped in predicting cybersecurity risk. What I'm advocating here is that it's time for the security community to try a new set of tools.
Alan Turing and Bayes.
I mentioned McGrayne’s book, “The Theory That Would Not Die.” It’s a delightful history of the theory’s evolution from creation to modern day, its successes and failures, and the blood feuds between mathematicians over the years. I highly recommend it if this subject intrigues you, and it should. The author gave a Google Talk about the book back in 2011 if you’re looking for a Reader’s Digest version (the YouTube link is in the references section). In the book, she outlines over 20 success stories, tales where scientists used Bayes Rule to solve complex problems over the years. I’ve summarized them at the end of this essay. But my favorite story is how Alan Turing used the Bayes Rule to crack the German code machine, Enigma, in WWII.
Turing is my all-time favorite computer science hero. In his short and tragic life, he accomplished so many things. In the 1930s, he mathematically proved that computers were possible (with the Turing Machine) some ten years before we were actually able to make them. Today, every single computer you use, from your smartphone to your laptop to your workloads in the cloud, is a Turing Machine. In 1950, he published the first test for artificial intelligence (the Turing Test) that leaders in the field are still debating today. And during WWII, his six years of work at Bletchley Park breaking German encrypted messages, according to some historians, probably saved 20 million lives and shortened the war by four years. And he used the Bayes hypothesis to do it.
There were many versions of the Enigma machine before, during, and after the war, but in general, the encryption machinery consisted of four mechanical parts:
Keyboard: Coders would type the unencrypted message, one letter at a time, on something that looked like a typewriter. When they pressed the unencrypted letter on the keyboard, the transformed encrypted letter would light up. The coder would write that letter down in an encrypted message for later transmission by Morse code over the radio.
Plugboard: Using 26 sockets, one socket for each letter in the alphabet, coders would use “Steckers” to plug one letter into another one, say Q to R. This had the effect of swapping the values. If the coder typed Q on the keyboard, R would go through the system.
Rotors: Each rotor, a ring with a unique arrangement of 26 letters, had a starting position that coders changed on a regular basis. In a three rotor system, the machine came with five different rotors to choose from. Each rotor performed a simple substitution cipher. For example, the pin corresponding to the letter R might be wired to the contact for letter T. When the coder pressed a key on the keyboard, the right rotor would move forward one letter. This ensured that even if the coder typed the same letter twice, the encrypted letters would be different. Once the right rotor clicked over 26 times, the middle rotor would click to the next letter. Once the middle rotor clicked 26 times, the left rotor would click to the next letter. The result was more than 17,000 different combinations before the system repeated itself.
Reflector: Once the signal passed through the plugboard and through the three rotors, it passed through the reflector that redirected the signal back through the rotors, this time left to right, and then back through the plugboard, and finally, back to the keyboard to light up the encrypted letter.
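The rotor stepping described above behaves like a car odometer, and a short Python sketch can confirm the "more than 17,000" figure. This is a simplified model built only from the description in this essay: the function names are hypothetical, positions are numbered 0 to 25 instead of lettered, and the historical machines also had a stepping quirk on the middle rotor that this sketch leaves out.

```python
ALPHABET_SIZE = 26


def step(rotors):
    """Advance the (left, middle, right) rotor positions by one key press."""
    left, middle, right = rotors
    right = (right + 1) % ALPHABET_SIZE
    if right == 0:  # the right rotor just completed a full turn
        middle = (middle + 1) % ALPHABET_SIZE
        if middle == 0:  # the middle rotor just completed a full turn
            left = (left + 1) % ALPHABET_SIZE
    return (left, middle, right)


def period(start=(0, 0, 0)):
    """Count the key presses needed before the rotor positions repeat."""
    state = step(start)
    presses = 1
    while state != start:
        state = step(state)
        presses += 1
    return presses


print(period())  # 17576, which is 26 * 26 * 26
```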
All in all, each individual unencrypted letter went through eight transformations: plugboard - three rotors right to left - three rotors left to right - plugboard. With this system, the number of ways the Germans could scramble a message was nearly 159 quintillion; that’s 159 followed by 18 zeroes.
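The figure of roughly 159 quintillion can be rebuilt with a little arithmetic. The sketch below is one common way to arrive at it, and it rests on an assumption that this essay does not state: that operators used ten plugboard cables, leaving six letters unplugged. The function name `plugboard_pairings` is hypothetical.

```python
from math import factorial, perm


def plugboard_pairings(cables, letters=26):
    """Number of ways to connect `cables` pairs of letters on the plugboard."""
    unplugged = letters - 2 * cables
    # Divide out the orderings that describe the same wiring: the unplugged
    # letters, the order of the cables, and the two ends of each cable.
    return factorial(letters) // (
        factorial(unplugged) * factorial(cables) * 2 ** cables
    )


rotor_orders = perm(5, 3)                  # pick and order 3 of the 5 rotors: 60
rotor_positions = 26 ** 3                  # starting letter for each rotor: 17,576
plugboard = plugboard_pairings(cables=10)  # assumed ten cables: 150,738,274,937,250

total = rotor_orders * rotor_positions * plugboard
print(f"{total:,}")  # 158,962,555,217,826,360,000
```

The result has 21 digits, which is where "159 quintillion" comes from. Nearly all of that size comes from the plugboard, not the rotors.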
According to McGrayne, Turing, with the help of mathematician Gordon Welchman and engineer Harold “Doc” Keen, designed a “high-speed electromechanical machine for testing every possible wheel arrangement in an Enigma.” Turing called the machine: The Bombe. His radical Bayesian design “tested hunches, 15-letter tidbits suspected of being in the original message. Because it was faster to toss out possibilities than to find one that fit, Turing’s bombe simultaneously tested for wheel combinations that could not produce the hunch.” He also invented the manual Bayes system called Banburismus that “let him guess a stretch of letters in an Enigma message, hedge his bets, measure his belief in their validity by using Bayesian methods to assess their probabilities, and add more clues as they arrived.” This system could “identify the settings for 2 of Enigma’s 3 wheels and reduce the number of wheel settings to be tested on the bombes from 336 to as few as 18.”
Breaking Enigma codes was time sensitive. The Germans changed their Enigma settings (plugboard and rotor configurations) routinely, most times daily but sometimes every eight hours. Turing needed a way to measure his priors, his hunches from Banburismus. He invented the “ban” (short for Banburismus) which, according to Irving John (Jack) Good (one of Turing’s closest associates at Bletchley), “measured the smallest weight of evidence perceptible to the intuition.” The way that McGrayne describes it, “One ban represented odds of 10 to 1 in favor of a guess, but Turing normally dealt with much smaller quantities, decibans and even centibans.” When the accumulated bans reached odds of 50 to 1, cryptanalysts were almost certain that their 15-letter tidbits were correct. According to McGrayne, “Each ban made a hypothesis 10 times more likely.” Remember, Turing was trying to find ways to discard hunches quickly, not find the exact answer. When he got to 50-1, he could stop the process.
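The ban is easier to appreciate with numbers. If one ban means odds of 10 to 1, then the ban is a base-10 logarithm of the odds, and a deciban is a tenth of that. The Python sketch below works from that reading of the quotes above. The function names and the three clue scores are hypothetical, and it treats the clues as independent, which is what allows their scores to be added.

```python
import math


def bans(odds):
    """Weight of evidence in bans: the base-10 logarithm of the odds in favor."""
    return math.log10(odds)


def decibans(odds):
    """The same weight of evidence expressed in tenths of a ban."""
    return 10.0 * bans(odds)


def odds_from_decibans(scores):
    """Independent clues multiply the odds, so their deciban scores simply add."""
    return 10 ** (sum(scores) / 10.0)


print(f"10 to 1 odds = {bans(10):.1f} ban")           # 1.0 ban
print(f"50 to 1 odds = {decibans(50):.1f} decibans")  # 17.0 decibans

clues = [5, 5, 7]  # hypothetical scores for three separate clues
print(f"combined odds are about {odds_from_decibans(clues):.0f} to 1")  # about 50 to 1
```

Adding small scores by hand is much faster than multiplying odds, and a running total that crosses about 17 decibans corresponds to the 50 to 1 threshold mentioned above.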
If you think Turing’s “bans” sound eerily similar to Claude Shannon’s “bits,” you’d be right. Shannon published his groundbreaking paper, “A Mathematical Theory of Communication,” in 1948 and, according to the science site HRF, he “defines the smallest units of information that cannot be divided any further. These units are called bits, which stand for binary digits. Strings of bits can be used to encode any message. Digital coding is based around bits and has just two values: 0 or 1.” Shannon introduced the idea of information entropy: that, according to Jane Stewart Adams in a fabulous essay called “The Ban and the Bit: Alan Turing, Claude Shannon, and the Entropy Measure,” information wasn’t contained in the bits themselves but in how disordered they were when they arrived.
According to James Gleick, author of “The Information: A History, a Theory, a Flood,” a Shannon bit “was a fulcrum around which the world began to turn …The bit now joined the inch, the pound, the quart, and the minute as a determinate quantity — a fundamental unit of measure. But measuring what? ‘A unit for measuring information,’ Shannon wrote, as though there were such a thing, measurable and quantifiable, as information.”
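To make the bit a little more concrete, here is a small Python sketch of Shannon's entropy measure, along with the conversion between bans and bits (both are logarithms, just in different bases). The function name `entropy_bits` and the sample messages are made up for illustration, and the sketch treats each character as an independent symbol, which real messages are not.

```python
import math
from collections import Counter


def entropy_bits(message):
    """Average information per symbol, in bits, from the symbol frequencies."""
    counts = Counter(message)
    total = len(message)
    # Shannon entropy: the sum of p * log2(1 / p) over each distinct symbol.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# One ban is a factor of ten; one bit is a factor of two.
BITS_PER_BAN = math.log2(10)

print(entropy_bits("AAAA"))   # 0.0 (perfectly predictable, no information)
print(entropy_bits("ABAB"))   # 1.0 (one bit per symbol)
print(entropy_bits("ABCD"))   # 2.0 (two bits per symbol)
print(f"1 ban is about {BITS_PER_BAN:.2f} bits")  # 3.32
```

The more evenly spread and unpredictable the symbols are, the higher the entropy, which matches the idea that the information lies in the disorder and not in the symbols themselves.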
According to Good, Turing independently invented bans in 1941, seven years before the Shannon paper. The interesting thing is that Turing actually spent several days with Shannon in the United States in 1943. The intriguing question is, did these two men talk about bans and bits when they met? In other words, did Turing give Shannon the idea? Shannon emphatically says no and I believe him. Turing was still working under Britain’s Official Secrets Act. Only a handful of Allies actually knew what was going on at Bletchley Park at the time. Turing was one of them but he never talked about Enigma outside of those circles, even when he was arrested and threatened with prison later. It’s a weird coincidence though and makes you wonder.
At the height of the war, Bletchley Park was a code breaking factory with as many as 200 bombes running at any given time, supported by some 9,000 people. Turing, and all the codebreakers at Bletchley Park, made it possible for Allied leaders to see Hitler’s orders most times before the German commanders in the field saw them. Turing’s tragedy resulted from two facts: he was gay, and the British were implacable about the need to keep their code breaking capabilities secret. Many Bletchley Park workers went to their graves without anybody in their families knowing the significance of what they did during the war. Afterward, the British Prime Minister, Winston Churchill, gave the order to destroy all the bombes except for a handful to keep the secret safe. He used the remaining bombes and their successors, like the Colossus, to spy on the Russians after the war and he didn’t want anybody to know that he could do it.
Codebreaking was so secret that after the war, nobody outside the small and cleared codebreaker community knew who Turing was, or what he accomplished, or even that the Bayes Rule was a good method to use in cryptanalysis. And then, according to McGrayne, paranoia captured the West’s imagination. The Soviets detonated their first atomic bomb. China became a communist country. We found spies everywhere: Alger Hiss, Klaus Fuchs, and Julius and Ethel Rosenberg. Senator Joseph McCarthy accused prominent U.S. citizens of being communists. Two openly gay English spies, Guy Burgess and Donald Maclean, escaped to the USSR. American intelligence warned the Brits about another homosexual spy: Anthony Blunt. Leaders on both sides of the pond were worried about an international homosexual spy ring. The Americans banned gays from entering the country and the Brits started arresting homosexuals in droves.
And that’s what happened to Turing. He got arrested for being gay and since nobody knew who he was, and his work was so secret, no government official stepped up to vouch for him or to protect him. According to McGrayne, “As the world lionized the Manhattan Project physicists who engineered the atomic and hydrogen bombs, as Nazi war criminals went free, and as the United States recruited German rocket experts, Turing was found guilty. Less than a decade after England fought a war against Nazis who had conducted medical experiments on their prisoners, an English judge forced Turing to choose between prison and chemical castration.” Turing chose castration: a series of estrogen injections designed (with no scientific credence) to curb his sexual preference. He grew breasts and the drugs messed with his mind. On June 7th, 1954, two years after he was arrested, he committed suicide at the age of 41.
I first learned about Turing in the early 2000s after I read Neal Stephenson’s novel, “Cryptonomicon.” Over the years since, I kept picking up pieces of Turing’s story. The stark tragedy of this is hard to take, even for me, and I’ve re-read this story many times. For me, it’s like going through the grieving process. One of our greatest minds, one of our most brilliant mathematicians, and one who almost single-handedly saved 20 million lives, was cut down in his prime at the age of 41, alone, with no friends or colleagues, with nobody seeing who he really was at a time when it mattered most. I just want to raise my fist to the skies and rage. And the mind boggles just thinking about the could-have-beens. What would Turing have done with artificial intelligence if left to himself after the war? What computers would he have helped build? What would he and Shannon have done together to advance information theory? What progress could we have made in Bayes Theorem?
Consider Bayes rule for cybersecurity risk forecasting.
As I said, the Bayes Rule is the third leg of our risk forecasting stool alongside some superforecasting techniques and Fermi estimates. The idea that you can make a broad estimate about your risk with nothing more than an educated guess (the initial cue ball) and then refine that guess over time with new information as it becomes available (rolling many more balls on the table) is genius. You don’t have to have years and years of actuarial table data before you can calculate the risk. You don’t have to count all the things first. And it’s not just a good idea either. It’s supported by 250 years of math evolution from Thomas Bayes to Pierre Simon Laplace, to Alan Turing, and to Bill Gates.
For years, I’ve been trying to get my head around how to calculate cyber risk for my organization with enough precision to make decisions with. With superforecasting, Fermi estimates and the Bayes Rule, the path ahead is clear. In the next essay, I will demonstrate how to do it. I'm going to go through an example of how to calculate our first prior using some Fermi estimates to estimate cyber risk. Stay tuned.
Bayes Success Stories (Summarized from Sharon McGrayne’s book: “The Theory That Would Not Die.”)
- Mathematician Joseph Louis François Bertrand reformed Bayes' thesis for artillery trajectory tables that had to consider a host of uncertainties: enemy location; air density; wind direction; cannon types, and projectile variations. For the next 60 years, between the 1880s and the Second World War, French and Russian artillery officers fired their weapons according to Bertrand’s textbook.
- French General Jean Baptiste Eugène Estienne created Bayesian tables to efficiently test scarce ammunition during WWI.
- In the United States between 1911 and 1920, all but eight states passed laws protecting workers against occupational injuries and illness. Albert Wurts Whitney, a specialist in insurance mathematics from Berkeley, used Bayes to set the first casualty, fire, and workers’ compensation insurance premiums where no actuarial data existed yet.
- Edward C. Molina, an AT&T engineer, used Bayes to redesign the collapsing Bell Telephone system in 1907.
- Harold Jeffreys, between the 1930s and 1940s, used Bayes to forecast earthquake epicenters by combining inaccurate seismic readings (like Bayes’ billiard table).
- French mathematician and physicist Henri Poincaré used Bayes’ thesis to refute some quackery statistics from Alphonse Bertillon, a police criminologist, in the case of French Jew Alfred Dreyfus, who was falsely accused and imprisoned in 1894 for being a German spy. [Note from my editor John Petrik: Bertillon was quacky, and he was mainstream enough in the 1930s and 1940s that Batman used to consult the Gotham PD's "Bertillon Files."]
- In 1925, Egon Pearson published an exploration of Bayesian methods using priors for a series of whimsical experiments calculating: the fraction of London taxi cabs with LX license plates, men smoking pipes on Euston Road, horse-drawn vehicles on Gower Street, chestnut colts born to bay mares, and hounds with fawn-spotted coats.
- In 1936, Lowell J. Reed, a medical researcher at Johns Hopkins University, used Bayes to determine the X-ray dosages that would kill cancerous tumors but leave patients unharmed when no precise exposure records existed.
- Alan Turing developed Bayesian methods to break Enigma during WWII.
- During WWII, mathematician Bernard Osgood Koopman of Columbia University used Bayes to find German U-boats.
- In the 1950s, US Department of Defense researchers used Bayes to predict the reliability of the new intercontinental ballistic missiles.
- In 1958, Albert Madansky, working for the RAND Corporation, used Bayes to predict that the likelihood of a “conspicuous” atomic bomb accident was rising and that it was in the military’s interest to make its nuclear arsenal safer. The impact was that General Curtis LeMay and President John Kennedy ordered significant upgrades to nuclear arsenal safety procedures, including two-man control.
- In 1962, Jerome Cornfield used Bayes to identify the most critical risk factors for cardiovascular disease, work that resulted in a 60% drop in fatalities (621,000 fatalities) between 1960 and 1996.
- In 1964, the U.S. surgeon general concluded that “cigarette smoking is causally related to lung cancer in men,” citing the Bayesian studies of Jerome Cornfield.
- In 1964, Frederick Mosteller of Harvard University and David Wallace of the University of Chicago used Bayes to prove that James Madison, not Alexander Hamilton, wrote the 12 disputed papers among the 85 Federalist Papers.
- In the late 1960s, Coast Guard rescue coordinator Joseph Discenza used Bayes to find lost ships.
- In 1974, Norman Carl Rasmussen used Bayes to predict that the probability of core damage at commercial nuclear power plants was higher than expected, but that the consequences would not always be catastrophic, presciently two months before the Three Mile Island accident.
- In 1983, a U.S. Air Force contractor (Teledyne Energy Systems) used Bayes to analyze the risk of a Challenger space shuttle accident and predicted the probability of a rocket booster failure at 1 in 35. On January 28, 1986, during the shuttle’s twenty-fifth launch, the Challenger exploded, killing all seven crew members aboard.
- In 1996, Bill Gates, cofounder of Microsoft, announced that Microsoft’s competitive advantage lay in its expertise in Bayesian networks.
- Susan Holmes and Daphne Koller, both of Stanford University, use Bayes to crack the genetic code on amino acids.
References
“Alan Turing: The Codebreaker Who Saved ‘Millions of Lives,’” by Prof Jack Copeland, BBC News, 19 June 2012.
“A List of Properties of Bayes-Turing Factors,” by I. J. Good, NSA FOIA Case #58820, 9 March 2011.
“A Mathematical Theory of Communication,” by C. E. Shannon, Reprinted with corrections from The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948.
“An Essay towards solving a Problem in the Doctrine of Chances,” by Thomas Bayes, letter communicated by Richard Price to John Canton, 1 January 1763.
“Book Review: Superforecasting,” by Scott Alexander, Slate Star Codex, 4 February 2016.
“Claude Shannon’s Information Theory Explained,” by HRF, 18 March 2017.
“Computer Pioneers - Irving John Good,” by J.A.N. Lee, Computer.org, 2012.
“Computing Machinery and Intelligence: The Imitation Game,” by A. M. Turing, Mind, 1950.
“Cryptonomicon,” by Neal Stephenson, Published by Avon, May 1999.
“Enigma,” by Cyber.org, 17 June 2020.
“Fermi Estimations,” by Bryan Braun, 4 December 2011.
“Fermi Problems: Estimation,” by TheProblemSite.com, 2022.
“How Did the Enigma Machine Work?” by Alex Hern, The Guardian, 14 November 2014.
“How Superforecasting Can Help Improve Cyber-Security Risk Assessment,” by Sean Michael Kerner, eWeek, 6 March 2019.
“How to Measure Anything in Cybersecurity Risk,” by Douglas W. Hubbard and Richard Seiersen, Published by Wiley, 25 April 2016.
“How to predict the future better than anyone else,” by Ana Swanson, 4 January 2016.
“‘Mindware’ and ‘Superforecasting’,” by Leonard Mlodinow, 15 October 2015.
“On Computable Numbers, with an Application to the Entscheidungsproblem,” by Alan Turing, Proceedings of the London Mathematical Society, 1937.
“Pierre-Simon, Marquis de Laplace,” Encyclopædia Britannica, 2022.
“Probability and the Weighing of Evidence,” by I. J. Good, Published by Charles Griffin, 1950.
“Richard Price,” by David McNaughton, Stanford Encyclopedia of Philosophy, 2019.
“Superforecasting: The Art and Science of Prediction,” by Philip E. Tetlock and Dan Gardner, 29 September 2015, Crown.
“Superforecasting: Summary and Review,” by HowDo, 16 June 2021.
“The Art of Approximation in Science and Engineering,” by Sanjoy Mahajan, Electrical Engineering and Computer Science, MIT OpenCourseWare, 2022.
“The Ban and the Bit: Alan Turing, Claude Shannon, and the Entropy Measure,” by Jane Stewart Adams, thejunglejane, 14 June 2014.
“The Enigma Machine: Its Construction, Operation and Complexity,” by Graham Ellsbury, 1998.
“The Fermi Rule: Better Be Approximately Right than Precisely Wrong,” by Nagesh Belludi, Right Attitudes, 28 August 2017.
“The Foundations of Decision Analysis Revisited,” by Ronald Howard, Chapter 3, 060520 V10.
“The Information: A History, a Theory, a Flood,” by James Gleick, Published by Knopf Doubleday, 1 March 2011.
“The Theory That Would Not Die,” by Sharon Bertsch McGrayne, Talks at Google, 23 August 2011.
“The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy,” by Sharon Bertsch McGrayne, Published by Yale University Press, 14 May 2011.
“Thomas Bayes and Bayesian probabilities,” by Richard Gregory, recorded June 2006, published by Web of Stories, 30 March 2018.