CSO Perspectives (Pro) 9.30.24
Ep 121 | 9.30.24

Security remediation automation.

Transcript

Rick Howard: Hey, everybody, welcome back to season 15 of the "CSO Perspectives" podcast. This is episode three, where we turn the microphone over to some of our regulars who visit us here at the N2K/CyberWire "Hash Table". You all know that I have a stable of friends and colleagues who graciously come on the show to provide some clarity about the issues we are trying to understand. That's the official reason we have them on the show. In truth though, I bring them on to hip-check me back into reality when I go on some of my crazier rants. We've been doing it that way for almost four years now. And it occurred to me that these regular visitors to the "Hash Table" were some of the smartest and most well-respected thought leaders in the business. And in a podcast called "CSO Perspectives," wouldn't it be interesting and thought-provoking to turn the mic over to them for an entire show? We might call it "Other CSO Perspectives." So that's what we did. Over the break, the interns have been helping these "Hash Table" contributors get their thoughts together for an entire episode of this podcast. So, hold on to your butts.

Unidentified Speaker: Hold on to your butts.

 

Rick Howard: This is going to be fun. [ Music ] My name is Rick Howard and I'm broadcasting from the N2K CyberWire's secret Sanctum Sanctorum Studios located underwater somewhere along the Patapsco River near Baltimore Harbor, Maryland in the good old US of A. And you're listening to "CSO Perspectives," my podcast about the ideas, strategies, and technologies that senior security executives wrestle with on a daily basis. [ Music ] Rick Doten and I have been friends forever. And he is a man of many talents, bartender, yoga instructor, boxer, YouTube host, rock climber, and foodie. In his past life, he judged the annual national pie championships, where amateur, professional, and commercial pie makers compete in their categories for the best pies in the country. How great is that? And he is also a world-class cybersecurity mind. He's so smart that we've had him on the show 14 times. In his current gig, he is the VP of Information Security at Centene, ranked 22nd in the Fortune 500 list this year. He advises a boatload of cybersecurity startups and knows practically everybody who is anybody in the InfoSec profession. He's a big deal. For this show, he's taking on the Cybersecurity First Principle Strategy of Automation, to specifically talk about security remediation automation. And at the end of the show, when he gets done, I'll come on and ask him a few questions about it. Here's Rick Doten. [ Music ]

 

Rick Doten: Thanks for that great introduction, Rick. Wow, that's a trip down memory lane. Hey, everybody, my name's Rick Doten, and I'm so happy to be talking about this topic today. And that's because today's infrastructures are so complex and dynamic that if we're still trying to rely on humans for configuration updates and patches and vulnerability remediations, then we're never going to get ahead. Remediation is a journey, not a destination. And in cloud workloads, there's never just hundreds of vulnerabilities or configuration changes. There are thousands or tens of thousands. We need to be able to automate this to scale remediation without impacting the organization, and that will take a combination of solid governance, process, and supporting technology. Notice I'm specifically leaving the people out, because those are who we're supporting. And essentially, governance is of the people, by the people, right? There are several new tools coming out that support automated remediation workflows, some that leverage AI to determine what to fix and how to fix it, and others that will automatically remediate. But those can only be effective, without potentially negative impact, when there's a process in place that has automated QA and testing gates at scale. And yes, I understand that I'm advocating for this topic in the wake of CrowdStrike, but I'll talk about that event later. Our main challenge is that we have no shortage of tools that find problems, whether vulnerability scanning, posture management, application code scanning, asset inventories, attack surface management, all those things. We have lots of things that tell us we need to fix things. But the problem is the security team doesn't usually fix things. It's the IT department that fixes things. And frankly, IT often resents security teams for continuously tossing reports of what they need to fix over the fence. That resentment is also due to giving them more work to parse the report and figure out what they need to do. Or these reports don't align with the change control process IT uses. So as a result, IT gets overwhelmed. These reports usually have little context about the organizational risk, often leveraging only the scanning tool's severity ratings. So a severe finding in a publicly facing system where the vulnerability has a known exploit is very different than a severe finding in a development system in a test lab deep inside the organization. Some vulnerability tools can help by creating tickets for them, but, you know, I know one case where tickets were just put into the security team's ServiceNow instance, while the IT team used a different platform. So the IT team was forced to create accounts in that new system, or script out accounts to pull what they needed to do, again adding more work into the remediation process. Because on the IT side, once you're given the findings, that's just the start. They still need to research what the impact is, prioritize, determine which team is responsible, assign a ticket in their platform, which may vary depending on whether, for instance, it's a code change or an infrastructure change, and then find or create the remediation. This step may involve finding or downloading a patch, or researching a code or some other configuration change. Then they need to determine if there are dependencies that this fix might break. And then finally they actually test the fix and push it out at scale.
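To make that prioritization point concrete, here is a minimal sketch in Python of blending a scanner's raw severity with business context like exposure and exploitability. The field names and weighting factors are invented for illustration; this is not any particular product's scoring model.

```python
# Minimal sketch: weight a scanner's severity with business context.
from dataclasses import dataclass

@dataclass
class Finding:
    asset: str
    scanner_severity: float   # e.g. a CVSS-style base score, 0-10
    internet_facing: bool
    known_exploit: bool
    environment: str          # "production", "development", "test"

def contextual_priority(f: Finding) -> float:
    """Blend the raw scanner severity with organizational context."""
    score = f.scanner_severity
    if f.internet_facing:
        score *= 1.5          # exposed to the outside world
    if f.known_exploit:
        score *= 1.5          # weaponized vulnerabilities jump the queue
    if f.environment != "production":
        score *= 0.4          # isolated lab/dev systems carry less business risk
    return round(min(score, 10.0), 1)

# Example: the same "severe" finding lands very differently.
edge = Finding("vpn-gateway", 9.0, True, True, "production")
lab  = Finding("test-lab-vm", 9.0, False, False, "development")
print(contextual_priority(edge))  # 10.0 -> fix now
print(contextual_priority(lab))   # 3.6  -> schedule with routine work
```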
There are new remediation workflow tools that support prioritization, normalization, and deduplication of findings to route them to the appropriate team, and even create tickets to assign to specific people. You can do all that today with SOAR tools, Security Orchestration, Automation, and Response tools, but only if there's a process and workflow to support that automation and you've already implemented it. And while these are all great and bring tremendous improvement, they only get us so far. Another step within this remediation workflow process is the risk assessment to decide whether to accept or mitigate the risk. Not much can be transferred unless there is a compensating control on another platform. So unfortunately, many of these risks are accepted because they can't be remediated without impacting the systems, whether due to compatibility requirements on the applications, resource requirements, performance or latency limits, or stability reasons. And often the problem is just a basic lack of resources. There aren't enough people to fix all the things. So they only focus on the most important remediations and, hopefully, the process was effective enough to prioritize the right ones in the first place. Twenty years ago, I ran ethical hacking teams and we did both internal and external network testing. But after a while, customers just opted out of the internal testing because, as they told me, they didn't want to know about the internal findings; if it was documented, they would be on the hook to eventually fix it. So they just accepted the risk based on it being inside the firewall. When we were testing just a handful or a couple dozen externally-facing systems, that would only produce a dozen or so findings. But against hundreds or thousands of workstations, servers, and network devices, that would generate thousands of findings that they didn't have the resources or the time to fix. And at that time, there wasn't a mature process to prioritize. We all just relied on the severity levels from the tools, like I mentioned before. We also then had to argue with the IT department when we were reviewing the findings, to convince them these were actually real issues to fix, because many of the IT shops were used to their security teams using scanning tools to produce a whole bunch of false positives. So doing vulnerability testing or pen testing by hand is very different than using just automated tools and letting them produce the report. This is because there's a human in the loop to verify these are real findings. Back then we had a hierarchy of testing. At the bottom were those automated vulnerability scanning tools that automatically generated reports. The next level was vulnerability testing, which used automated tools, but then the findings were manually verified by a human, and evidence was collected that they existed. And then the top level was penetration testing, or ethical hacking, where, after the vulnerability testing and human verification, the tester would then exploit the findings to see how far they could get. Pen testing is similar to red teaming, but red teaming is where you are mimicking a specific adversary's capabilities against a specific target. So, to convince the IT department a finding wasn't a false positive, we would include screenshots of success as evidence. But even then, for some, that wasn't enough, because they would actually insist that it couldn't be true because they had already patched that system.
Or they literally accused us of photoshopping the picture to make them look bad. So in those cases, we would have to drop a file on the system that read, "We were here." And that was always fun. After a few minutes of protest from the IT team that that FTP vulnerability wasn't possible to exploit externally, they'd have to look in the directory and see our drop file that proved them wrong. We also would run into policy roadblocks, or loopholes, really. I remember we did an external pen test of a state government network. I won't say which, even though it was 20 years ago. After we gave them a report, they created a remediation plan, which we reviewed and approved. Then the next year, as per their policy, we tested them again. But we found the exact same findings. It was like nothing changed. When we asked about this in the out brief, they indicated that their policy was to create a remediation plan, not actually fix things. So they just did what was required because they didn't have the resources to actually fix it. I was in shock. Now I'm not surprised by anything. So in subsequent years, when scoping pen tests, I would always ask, is it your intent to actually fix these or just create a remediation plan? That was a lesson learned. I was always amazed how many people answered that they just had to test it and create the plan, not actually fix anything. I got to experience this firsthand 10 years ago when I was the CISO of a mid-sized multinational company. We had a tiny security team and a small IT team of, like, less than eight people. Most of our process was ad hoc, and I spent most of my time convincing them they needed to fix things, but I could see they only had one or two people with the expertise to do it on the different platforms. And some of our findings took some remediation research to come up with a fix and then script how to fix it. And they didn't have the expertise to do any of that. So I had one of my guys do it. They'd get the details, create the script, package it, and run it for them; otherwise, we'd never get anything fixed. Boy, I wish SOAR had been invented back then. So how do we automate? I'm fortunate enough to talk to more vendors and startups than most people, so I get to see new categories of tools developed and how different startups approach different problems. First, as I mentioned, there are the workflow support tools. These will crowdsource findings from the scanners, whether it be the network, the applications, the posture management tools, bug bounty groups, DLP data, etc., etc., deduplicate and group them together, determine who's responsible for remediation and what workflow process they use, develop a priority based on business impact, not the CVE severity rating, and then create a ticket in the appropriate ticketing system. Some tools also will batch them so as not to send all 100 findings at once, but queue them up to push only, like, 20 a week to throttle if needed. Others will include steps for a person to take, and still others will read the tickets to find similar issues to highlight that this has been fixed elsewhere before and this is how it was done. And because these tools have visibility into both sides, the ticketing system and the testing system, if they see a finding again, they can just increment the open ticket where it was already seen, instead of creating a new ticket.
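As a rough illustration of the deduplicate, increment, and throttle behavior just described, here is a minimal Python sketch. The fingerprint convention and field names are assumptions made for the example, not a vendor's actual data model.

```python
# Minimal sketch: reconcile new findings against open tickets, then throttle
# how many brand-new tickets each team receives per week.
from collections import defaultdict

WEEKLY_TICKET_BUDGET = 20  # push roughly 20 new tickets a week, not all 100 at once

def reconcile(findings, open_tickets):
    """Split findings into existing tickets to increment vs. new items to queue.

    Both arguments are dicts keyed by a stable fingerprint, such as
    (rule_id, asset): an assumed convention, not a product schema.
    """
    increments, new_items = [], []
    for fingerprint, finding in findings.items():
        if fingerprint in open_tickets:
            increments.append(open_tickets[fingerprint])  # bump the existing ticket
        else:
            new_items.append(finding)
    return increments, new_items

def batch_by_team(new_findings, budget=WEEKLY_TICKET_BUDGET):
    """Group queued findings by owning team and cap this week's batch."""
    queues = defaultdict(list)
    for finding in new_findings:
        queues[finding["owner_team"]].append(finding)
    return {team: items[:budget] for team, items in queues.items()}
```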
That interface can then track the remediation status based on what teams, platforms, or business units are being reviewed, sending reminders to them and measuring the teams based on SLAs. These give a central picture of the process. The next level of tooling is the remediation automation tools. Now, I'm not talking about SOAR, where the humans actually script the work based on previous experience, but ones that generate the automation themselves through AI, be it specific code updates, configuration scripts, or maybe compensating controls if a patch isn't out yet. They can be configured to create a ticket with a button to fix the problem, or just fix it and create a ticket saying the problem was remediated, and then have a button to back it out if there was a problem with it. This is where teams start to get nervous, though. They're always very scared of automated updates to their systems, because if it goes bad, it can go really bad, just like the CrowdStrike incident in July of 2024. Before talking about that incident, let me tell you a story from almost 25 years ago. When I was a pen tester, we used to invite our customers in while we did the testing, because back then our competitors working in the large accounting firms weren't transparent about their process. So it was a benefit for customers of the small firm I worked for to actually come in and see everything that we did on the network. But I'd warn them, it was pretty boring, because real testing is not like it is in the movies. Movie hacking is where you're frantically typing on a keyboard like you're fighting a video game boss battle. Real testing is about collecting, verifying, and entering all the target information; you configure the script or the tool, and then you run it. And then you sit and you wait. There's a lot of sitting around and waiting for things to happen or break and then restarting them. So, this one time, I had a customer security director sitting with me while I scanned his external network. After about five minutes, a scan failed, which, as I said, is not unusual, so I set it up again and restarted. But then I got a destination error. He told me, "Ah, our network shunned you." But because he said the word shunned, I knew that was a specific term from the Cisco intrusion prevention tool NetRanger, which was our first attempt at automated remediation years ago, like back in 2000, 2001. I asked him, "Hey, is it okay if I shut down this network?" And at first, he was confused. He's like, "How are you going to do that?" And I said, "You just wait and see." He said, "Sure, I can just tell my team to reset this. It's a pen test, they know this could happen." Now, in tests I did later, I would have the customer put that in writing before attempting it. But this time I just took his word for it. For you techies listening, I used the famous Netcat tool to set the source address of my Nmap scan to be his company's upstream gateway router. Afterwards, the IPS saw the scan originating from that default gateway router, tagged it as malicious, because why would the gateway router be scanning us, and shunned it. After about two minutes, my client's phone rang. It was his team saying their internet went down. He told them what happened, and they removed the block in about five minutes to restore service. I'll always remember that experience, and I've distrusted IPS ever since.
Another history lesson, which I'll be quick with because I won't go into details, was about the failed security company Norse from about 10 years ago. What they did is they put an IPS in front of customers' firewalls with the intent to automatically block bad websites based on their threat intel data. They were the ones with the Pew Pew World chart that showed attacks coming from different parts of the globe. But that's a podcast story for another time. If you know, you know. See the notes below if you want to take a deep dive into the Norse Pew Pew charts. Getting back to CrowdStrike, while many blame them and others blame Microsoft, I look at it as a failure of our internal change control process. Yes, CrowdStrike should not have published an update that blue-screened its customers' devices. No argument here. But CrowdStrike's customers have some blame too. Why are we deploying software updates, or even threat signature updates, without testing? Those of us old enough will remember that 20 to 25 years ago, we would burn in new patches for three to five days before pushing them out widely. Then, around 2002, Microsoft started their Trustworthy Computing initiative. They started threat modeling software, created Patch Tuesday, and in turn released more stable patches. So over the years, we got comfortable with just accepting patches and updates from vendors, especially the malware signature updates. Eventually, most teams dropped that burn-in step from the process. And while CrowdStrike wasn't the first or only time boxes were bricked in the last 15 years due to an update, it certainly was the largest. I wish that most had just let the update go on a small number of devices first to burn in, even if it's just for an hour, before opening up to everything. Now, I understand that 99% of organizations don't have the change control maturity to set that up, but at least do it for the most critical systems. I'm sure we can wait an hour before potentially impacting our critical business systems. That test-first perspective is what I'm talking about with automated remediation, not blindly fixing things because we can; have that QA step in the process. And it's easier to do today because there are automated tools coming out that use AI agents to do regression testing, compare current and post-remediation performance, and publish that evidence when they create the ticket with the code or script to remediate. Twenty years ago, we just couldn't spin up a device, run a script, and then tear it down. We had to go buy or find hardware, find a place to put it, plug it in, set it up, install the software. Now, we can spin up systems continuously throughout the day in our cloud workloads. Another thing that has changed is the number and source of vulnerability information. Back in the day, we just had vulnerability scanners. Just one thing that was run ad hoc or on some monthly schedule that produced a long report of things to fix. And today, we have added posture management tools looking at anything running on compute or storage in the cloud, as well as at identities, APIs, and temporary workloads. This is generating exponentially more findings to fix. The modern continuous integration, continuous deployment, or CI/CD, pipeline links together the design, code, test, deploy, run, maintenance, and sundowning process of applications seamlessly and in real time. The environment is so dynamic that there are more temporary workloads running that might only be up for a few minutes or hours.
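Here is a minimal Python sketch of that burn-in idea, a staged rollout that holds the wide deployment until a small canary group has run the update for an hour. The deploy and healthy callables are hypothetical placeholders supplied by the caller, standing in for whatever endpoint-management tooling is actually in use.

```python
# Minimal sketch: canary a vendor update on a few devices before going wide.
import time

def staged_rollout(update, devices, deploy, healthy,
                   canary_count=25, burn_in_minutes=60):
    """Push `update` to a small canary group, wait, verify, then go wide."""
    canary, rest = devices[:canary_count], devices[canary_count:]

    for device in canary:
        deploy(device, update)            # hypothetical deployment call

    time.sleep(burn_in_minutes * 60)      # the one-hour burn-in window

    failures = [d for d in canary if not healthy(d)]  # hypothetical health check
    if failures:
        raise RuntimeError(f"Halting rollout: {len(failures)} canaries unhealthy")

    for device in rest:                   # canaries look good, go wide
        deploy(device, update)
```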
And by the time the human gets to report that something might be wrong, that system's already gone. Today, developers create infrastructure and applications that use service accounts connecting to multiple data sources via a galaxy of APIs. Everything is more collaborative than the applications of old. Back then, developers originally just built single applications, then evolved to separating the client and the server, then to adding web interfaces and middleware to the applications, and then eventually to distributed server architectures that appeared from the outside like a whole mesh of cloud chaos. So how do we make sure we comprehensively assess all the things and all the places? Well, there are now layers of tools that leverage even more service accounts to access APIs into everything from code repositories, servers and data stores, identity platforms, and network and security devices. You can now crowdsource all that information to get visibility into what's going on in each platform. But each of these tools creates findings to remediate, and there's a lot of overlap in what they're seeing, since you may have four different tools looking at the same management platform, like Terraform, creating duplicate findings, or the same finding on every temporary container that's spun up. The responsibility to fix all these now crosses over into many teams who might have different remediation workflows and different ticketing systems. So now the better remediation tools have to be smart enough to know that a code finding may go to something like Jira and an infrastructure finding may go to something like ServiceNow. This reinforces the need for these workflow management tools. And I know we hate to add another layer of technology, especially if we aren't taking something away, but we need to bring all these workflows together. So this brings me back to my original premise. We have to automate. We have to have intelligence to help route, deduplicate, prioritize, group, develop remediation scripts, test, create a ticket, assign it, and feed into that configuration change control process. Because for some of these systems, there might be SLAs for remediation. Others may have quality gates that need to be met and management approvals. And others might even have to route to steering committees before anything can be approved. So we don't want to change that, but we do want to automate as much as we can to speed it up. We have DevOps and DevSecOps; now let's create RemSecOps, remediation security operations. I'm not trying to coin a new term. I'm just describing the automation of a process, just like the previous terms have done. There are tools that can help us. These tools must align with your process and departmental workflows, and we need to be comfortable letting systems automatically fix things for us. Not blindly, we still need to create a ticket and get approvals as needed, but organizing the research and creating fixes can be automated to help humans be more effective. This leaves them more time to continue to work on the governance and process, refining the collection sources and data quality, and focusing on the more nuanced troubleshooting that machines and AI aren't good at. [ Music ]
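As a small illustration of the routing idea in Rick's close, here is a minimal Python sketch that sends code findings to a Jira-style tracker and infrastructure findings to a ServiceNow-style system. The client objects and field names are hypothetical wrappers, not the real Jira or ServiceNow SDKs.

```python
# Minimal sketch: route a normalized finding to the right ticketing system
# based on its type. The clients passed in are assumed, illustrative wrappers.
def route_finding(finding, jira_client, snow_client):
    """Send code findings to a Jira project, everything else to ServiceNow."""
    if finding["type"] in ("code", "dependency", "sast", "dast"):
        return jira_client.create_issue(
            project=finding["repo_team"],
            summary=finding["title"],
            description=finding["details"],
        )
    # infrastructure, configuration, identity, and everything else
    return snow_client.create_incident(
        assignment_group=finding["owner_team"],
        short_description=finding["title"],
        description=finding["details"],
    )
```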

 

Rick Howard: So, Rick, that was fantastic. How was your first experience of being a podcast host? I know you do your own YouTube stuff, but I think this was your first audio podcast, right?

 

Rick Doten: Yeah, it's different to read. I mean, when I do a lot of my videos, I may script it out and I have a little teleprompter, but to do it blindly into a microphone is kind of different. So I appreciate-- also it was helpful that you gave the tip of shortening the sentences and the edges. So like fewer words on the screen and you're kind of rapidly scrolling up as you're reading. But no, that was a good tip, but I enjoyed it. And it was interesting that I didn't start making mistakes until later. You can see how that, you know, you don't have the stamina for it, you start like slipping up 20 minutes in.

 

Rick Howard: That is very true. I do that myself as you get-- you know, it's hard to focus into a microphone for, you know, a good bit of time. And I did notice that about yours, as I do with mine, you know, you start to slur your words and try to go faster because you're tired of doing this. Yeah, that's all great. And the other thing that Rick mentioned was, I found out early when I started my podcast that it's hard to read an essay when you write it in formalities, you know, with sentences and periods and things like that. Most people don't talk like that. So it's easier to break up a sentence into chunks on the way you speak. And so, I passed those recommendations on to Rick and I'm glad you found it useful, my friend.

 

Rick Doten: Yeah, I was actually skeptical at first because I had done it, you know, like on a teleprompter with, you know, just putting my document on there and reading it, but I found it did work better. And that's probably how the real teleprompters use like shorter, you know, like three or four words on a screen.

 

Rick Howard: Well, I was so pleased that you picked this topic, security remediation automation. You know, I fundamentally believe that if the absolute first principle of any security organization is to buy down risk, then automation is the most logical follow-on strategy. But I love that you added the adjectives, security remediation automation, because, like you said in the piece, we all have lots of automation available to us to find things, you know, we got commercial tools, we got programs we write ourselves, but we have very little automation designed to fix the things we find. Well, you could say that this is a wing of the DevSecOps movement that nobody ever goes down. When the DevSecOps movement started some 10 years ago now, I boldly predicted that all of us would occupy that wing. And I was completely wrong.

 

Rick Doten: So DevSecOps was the automation of security telemetry. And like our XDR discussion we had on the podcast last month, you know, how do I see what's really going on? But it's only half of the story, right? And that's why it's been a big topic for me over the last few years. I'm kind of seeing this and I'm talking to vendors who are focused on this, like, you're right. And even, you know, personally, you know, seeing all my peers, like, struggling with, you know, getting IT to fix things for them because they don't have enough information. So, yeah, we-- and that's why probably the quote that I say almost all the time now is, we have no shortage of things finding problems, we have a challenge with getting them fixed.

 

Rick Howard: Yeah, exactly right. And you make the point in the piece that we have lots of these things to do, right? And, you know, you talk to a lot of security startups, and so you get a feeling for what they're trying to produce, you know, for next-generation tools and things. When I talk to those companies, and I don't talk to nearly as many as you do, but when I hear their great big idea is that they can find, you know, a new kind of thing that nobody has ever found before, my reaction to that is, I don't need that. I don't have time to remediate all the things I've already discovered. I have this infinite list of things that I'll never get around to fixing. So should we be asking vendors to supply that "fix it" button too? As in, we found this thing on your network, if you push this button, we'll make it go away.

 

Rick Doten: Yeah, I agree. And I think that that's the thing that we kind of-- and now that I think about it, how I got onto this topic was over the last four years. First, it was the posture management tools, and then it was the data, you know, discovery posture management tools, and then the identity. And, you know, we have all these service accounts, posture management tools. So it was like all these things finding lots of things. And the thing I would push on them was, are you linking into the right remediation workflows? Because if you just say, yeah, I'll just throw it over the fence and have people fix it, then that's not helpful. And that's what kind of helped develop that. And yes, you know, that's kind of every time I talk to a vendor I'm like, great, how are you helping me fix it?

 

Rick Howard: Well, Rick, before we go, do you have any last words on this thing? What's the big takeaway after doing your first audio podcast?

 

Rick Doten: I guess it may inspire me to-- like, it's easy to do, so maybe I'll have to start, you know, doing my own, you know, put something on LinkedIn and figure out how to do it. I always said, I don't have my own podcast, but I'm a guest on lots of podcasts, and I attribute it to you. I have the microphone, so people invite me because I sound good. I may not have anything interesting to say, though. But, yeah, and I appreciate that. I really thank you for giving me the opportunity to be able to talk about this topic, which I don't think is, you know, talked about enough. [ Music ]

 

Rick Howard: And that's a wrap. I want to thank my longtime friend Rick Doten, the Centene VP of Information Security, for taking over the hosting duties for this episode and giving us the next step in the evolution of the cybersecurity first principle strategy of automation that he calls security remediation automation. "CSO Perspectives" is brought to you by N2K CyberWire, where you can find us at thecyberwire.com. And for this episode, I've added some helpful links in the show notes to help you do a deeper dive if that strikes your fancy. And don't forget to check out our book, "Cybersecurity First Principles, A Reboot of Strategy and Tactics" that we published in 2023. Automation is a key concept that runs all through that book. And, by the way, we'd love to know what you think of our show. Please share a rating and review in your podcast app, but if that's too hard, you can fill out the survey in the show notes or send an email to csop@n2k.com. We're privileged that N2K CyberWire is part of the daily routine of the most influential leaders and operators in the public and private sector, from the Fortune 500 to many of the world's preeminent intelligence and law enforcement agencies. N2K makes it easy for companies to optimize your biggest investment, your people. We make you smarter about your teams while making your team smarter. Learn how at n2k.com. One last thing, here at N2K, we have a wonderful team of talented people doing insanely great things to make me and this show sound good. I think it's only appropriate that you know who they are.

 

Liz Stokes: I'm Liz Stokes. I'm N2K CyberWire's Associate Producer.

 

Tré Hester: I'm Tré Hester, Audio Editor and Sound Engineer.

 

Elliott Peltzman: I'm Elliott Peltzman, Executive Director of Sound and Vision.

 

Jennifer Eiben: I'm Jennifer Eiben, Executive Producer.

 

Brandon Karpf: I'm Brandon Karpf, Executive Editor.

 

Simone Petrella: I'm Simone Petrella, the President of N2K.

 

Peter Kilpe: I'm Peter Kilpe, the CEO and Publisher at N2K.

 

Rick Howard: And I'm Rick Howard. Thanks for your support, everybody.

 

Multiple Speakers: And thanks for listening. [ Music ]