chaos engineering (noun)

Transcript

Rick Howard: The word is: chaos engineering.

Rick Howard: Spelled: chaos as in planned disorder, and engineering as in applied science.

Rick Howard: Definition: The resilience discipline of controlled stress test experimentation in continuous integration/continuous delivery environments, CI/CD environments, to uncover systemic weaknesses.

Rick Howard: Example sentence: Chaos engineering is the scientific-method process of building a hypothesis around expected software behavior, designing small-footprint or tiny-blast-radius experiments that vary steady-state behavior like bandwidth and CPU use, and running those experiments in production systems to learn about unknown system weaknesses.

Rick Howard: Origin and context: Chaos engineering began in 2008 with a couple of delivery failures at Netflix. The company was transitioning from a DVD mailing company to a streaming company. The Netflix leadership team very publicly announced its commitment to adopt AWS cloud services and abandon its own data centers. This was a big idea, since Amazon had just rolled out the service two years before, and it wasn't what anybody would claim as mature yet. The Netflix precipitating event was a database failure that prevented the company from delivering DVDs to its customers for three days. Then, later that year, around Christmas, AWS suffered a major outage that prevented Netflix customers from using the new streaming service.

Rick Howard: In response, Netflix engineers developed their first chaos engineering product in 2010, called Chaos Monkey, that helped them counter the vanishing instance problem caused by the AWS outage. With that success, Netflix began building its own chaos engineering team and wondered if it could scale. If the team could fix the small-scale vanishing instance problem, could they do the same at the vanishing region scale? In fairness, Netflix wasn't the only company thinking along those lines. In 2006, Google's site reliability engineers, SREs, established their own disaster recovery testing, or DiRT, program to intentionally insert failures into their internal systems to discover unknown risks. But their cool name for it, DiRT, wasn't as hip as the Netflix name, Chaos Monkey, and it didn't catch on. The idea was similar, though.
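
The vanishing instance idea can be illustrated with a small simulation. This is only a sketch, not Netflix's actual Chaos Monkey code: the `InstancePool` class, the instance names, and the `chaos_monkey` function are hypothetical stand-ins for a group of redundant servers and a tool that terminates one of them at random. The question the experiment asks is whether the service still answers after an instance disappears.

```python
import random


class InstancePool:
    """Hypothetical in-memory stand-in for a group of redundant service instances."""

    def __init__(self, names):
        self.healthy = set(names)

    def terminate(self, name):
        """Remove an instance from service, as if it had vanished."""
        self.healthy.discard(name)

    def serve(self):
        """Return the name of an instance that can handle a request, or None."""
        if not self.healthy:
            return None
        return sorted(self.healthy)[0]


def chaos_monkey(pool, rng):
    """Terminate one healthy instance chosen at random and return its name."""
    if not pool.healthy:
        return None
    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    return victim


if __name__ == "__main__":
    pool = InstancePool(["web-1", "web-2", "web-3"])
    victim = chaos_monkey(pool, random.Random())
    print(f"terminated {victim}; request served by {pool.serve()}")
```

If `serve()` ever returns `None` after a single termination, the pool had no redundancy to begin with, which is the kind of weakness this style of experiment is meant to surface.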

Rick Howard: By 2011, Netflix began adding new failure modules that provided a more complete suite of resilience features. Those modules eventually became known as the Netflix Simian Army and included colorful names like Latency Monkey, Conformity Monkey, and Doctor Monkey, just to name three. There are many more. Netflix shared their source code for Chaos Monkey on GitHub in 2012, and by 2013, other organizations, like Capital One, Google, Slack, Microsoft, and LinkedIn, started playing with the idea. To understand why chaos engineering is required by these global service providers, you must first accept the fact that we no longer live in a linear digital world. When the internet emerged as a useful business tool around the 1990s, things were pretty simple. We didn't think so at the time, but compared to today, that world was kindergarten.

Rick Howard: If you changed one thing in that world, you pretty much knew what was going to happen. But today's IT environments are systems of systems. We're in PhD land here. They're complicated, and most of us have no idea how they actually work and what the real dependencies are between all the software modules deployed on all of our data islands. According to Rosenthal, Jones, and Aschbacher in their book, Chaos Engineering: System Resiliency in Practice, “A change to input of a linear system produces a corresponding change to the output of the system. Nonlinear systems have output that varies wildly based on changes to the constituent parts.”

Rick Howard: It's like that old chestnut that when a butterfly flaps its wings in China, you might end up with a hurricane in the Gulf of Mexico. When the hard drive of a system running a nonessential monitoring app in an AWS region in North America fails but somehow causes a system-wide failure, this is what I'm talking about. These systems are complicated, and humans can't possibly understand all the permutations in their heads. Software engineers think they know, and DevOps and SRE teams write linear regression tests for things they assume to be true. But those teams don't learn anything new by doing so. They test properties of the system that are already known, like previously corrected defects and boundary conditions of the main features of a product. Rosenthal, Jones, and Aschbacher say that these kinds of linear regression tests require that the engineer writing the test knows specific properties about the system that they're looking for.

Rick Howard: Chaos engineering, in contrast, is the pursuit of the unknown. Chaos experiments don't replace the linear regression tests. They are trying to solve a different problem by uncovering unknown and, as yet, undiscovered design faults. Chaos engineering is built on the scientific method. DevOps teams develop a hypothesis around steady-state behavior and run experiments in production to see if the hypothesis holds. If they discover a difference in steady state between the control group and the experimental group on production systems, then they have learned something new. If not, they have gained more confidence in their hypothesis. They use techniques to minimize the blast radius on the production system and monitor the entire experiment carefully to ensure no catastrophic effect, but they have to be on the production system to do it.
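
The experiment loop described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not a production tool: the latency model, the function names, and every threshold (`blast_radius`, `tolerance_ms`, `abort_ms`) are hypothetical. It routes a small fraction of requests into an experimental group, injects extra latency there, aborts if anything looks catastrophic, and then compares steady state between the control and experimental groups.

```python
import random
import statistics


def request_latency_ms(rng, injected_ms):
    """Simulated latency of one request; the service model is hypothetical."""
    baseline = max(rng.gauss(100.0, 10.0), 0.0)
    return baseline + injected_ms


def run_experiment(injected_ms, total_requests=10_000, blast_radius=0.01,
                   tolerance_ms=20.0, abort_ms=1_000.0, seed=42):
    """Test the hypothesis that mean latency stays within tolerance under the fault."""
    rng = random.Random(seed)
    control, experimental = [], []
    for _ in range(total_requests):
        # Blast radius: only a small share of traffic receives the fault.
        in_experiment = rng.random() < blast_radius
        latency = request_latency_ms(rng, injected_ms if in_experiment else 0.0)
        # Safety check: stop immediately if the fault looks catastrophic.
        if in_experiment and latency > abort_ms:
            return {"aborted": True, "hypothesis_holds": False,
                    "control": len(control), "experimental": len(experimental)}
        (experimental if in_experiment else control).append(latency)
    if not control or not experimental:
        raise ValueError("both groups need at least one request")
    difference = statistics.mean(experimental) - statistics.mean(control)
    return {
        "aborted": False,
        "hypothesis_holds": abs(difference) <= tolerance_ms,
        "difference_ms": difference,
        "control": len(control),
        "experimental": len(experimental),
    }


if __name__ == "__main__":
    result = run_experiment(injected_ms=50.0)
    print(result)
```

With 50 ms of injected latency and a 20 ms tolerance, the hypothesis should be refuted, which is the "learned something new" outcome. Running it with `injected_ms=0.0` should leave the hypothesis standing, which only adds confidence.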

Rick Howard: Nerd Reference: Dr. Richard Feynman, the brilliant Nobel Prize-winning teacher, communicator, and scientist, gave a lecture at Cornell University in 1964, where in one minute he gave the clearest and simplest definition of the scientific method.

Dr. Richard Feynman: I'm going to discuss how we would look for a new law. In general, we look for a new law by the following process. First, we guess it. Then we com... no, don't laugh, that's really true. Then we compute the consequences of the guess to see what, if this is right, if this law that we guessed is right, we see what it would imply. And then we compare those computation results to nature, or, we say, compare to experiment or experience, compare it directly with observation to see if it works. If it disagrees with experiment, it's wrong. And that simple statement is the key to science. It doesn't make a difference how beautiful your guess is, it doesn't make a difference how smart you are, who made the guess, or what his name is. If it disagrees with experiment, it's wrong. That's all there is to it.

Rick Howard: Word Notes is written and edited by me, Rick Howard, and the mix, sound design, and original music have all been crafted by the ridiculously talented Elliott Peltzman. We're privileged that N2K and podcasts like Word Notes are part of the daily intelligence routine of many of the most influential leaders and operators in the public and private sector, as well as the critical security teams supporting the Fortune 500 and many of the world's preeminent intelligence and law enforcement agencies. N2K strategic workforce intelligence optimizes the value of your biggest investment: people. We make you smarter about your team while making your team smarter. Learn more at N2K.com, and thanks for listening.

HOST(S):

Rick Howard is the CSO of N2K and the Chief Analyst and Senior Fellow at N2K Cyber, formerly CyberWire. His past lives include CSO at Palo Alto Networks, CISO at TASC, the GM at Verisign/iDefense, the Counterpane SOC Director, and the Commander of the Army's Computer Emergency Response Team (CERT). Rick served 25 years in the Army, taught computer science at West Point, edited two books, and just published his own book, "Cybersecurity First Principles: A Reboot of Strategy and Tactics." He is regularly joined at the N2K Cyber Hash Table by a collection of industry experts.

Schedule: Tuesdays

Creator: CyberWire, Inc.