Security remediation automation.
Sep 30, 2024

CSO Perspectives is a weekly column and podcast where Rick Howard discusses the ideas, strategies and technologies that senior cybersecurity executives wrestle with on a daily basis.



Rick Howard, N2K CyberWire’s Chief Analyst and Senior Fellow, turns over hosting responsibilities to Rick Doten, the VP of Information Security at Centene and one of the original contributors to the N2K CyberWire Hash Table. Doten makes the case for reinvigorating the automation first-principle cybersecurity strategy, focusing specifically on remediation automation.

Today’s infrastructures are so complex and dynamic that if we are still relying on humans for configuration updates, patches, and vulnerability remediations, we will never get ahead. Remediation is a journey, not a destination. In cloud workloads, there are never just hundreds of vulnerabilities or configuration changes; there are thousands or tens of thousands. We need automation to scale remediation without impacting the organization.

This will take a combination of solid governance, process, and supporting technology. I’m deliberately leaving people out, because people are who we are supporting, and governance is essentially of the people, by the people.

There are several new tools coming out that support automating remediation workflows, and some leverage AI to determine what to fix, how to fix it, and then remediate automatically. But those tools can only be effective, without potential negative impact, when there is a process in place with automated QA and testing gates at scale. And yes, I understand I’m advocating this in the wake of CrowdStrike; I will talk about that event later.

Many automatic alerts, manual interventions.

Our main challenge is that we have no shortage of tools that find problems: vulnerability scanning, posture management, application code scanning, asset inventories, attack surface management, and the like. We have lots of things to tell us we need to fix things. But the problem is that the security team doesn’t usually fix things. The IT department fixes things. And frankly, IT often resents security teams for continually tossing reports of what they need to fix over the fence. That resentment also comes from the extra work of parsing the report to figure out what they actually need to do, or from reports that don’t align with the change control process IT uses. The result is that IT gets overwhelmed. Moreover, these reports usually have little context about organizational risk, only the scanning tool’s built-in severity ratings. A severe finding on a publicly facing system, where the vulnerability has a known exploit, is different from a severe finding on a development system in a test lab deep inside the organization.
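
To make that prioritization point concrete, here is a minimal sketch, entirely my own illustration rather than any product’s logic, of how a finding’s rank changes once exposure, exploit intelligence, and asset context are folded in alongside the scanner’s severity score. The field names and weights are hypothetical.

```python
# Hypothetical illustration: re-scoring scanner findings with business context,
# so a "severe" finding on an internal test box ranks below an exploitable
# finding on an internet-facing system. Field names and weights are made up.

def contextual_priority(finding: dict) -> float:
    """Combine the tool's severity with exposure, exploit, and asset context."""
    severity = finding["tool_severity"]                   # 0-10, as reported by the scanner
    exposure = 1.5 if finding["internet_facing"] else 0.5
    exploit = 2.0 if finding["known_exploit"] else 1.0
    asset = {"production": 1.5, "staging": 1.0, "test_lab": 0.4}[finding["environment"]]
    return severity * exposure * exploit * asset

findings = [
    {"id": "F-1", "tool_severity": 9.0, "internet_facing": True,
     "known_exploit": True, "environment": "production"},
    {"id": "F-2", "tool_severity": 9.0, "internet_facing": False,
     "known_exploit": False, "environment": "test_lab"},
]

for f in sorted(findings, key=contextual_priority, reverse=True):
    print(f["id"], round(contextual_priority(f), 1))
# F-1 scores far higher than F-2, even though the scanner rated both "severe".
```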

Some vulnerability tools help by creating tickets for findings, but I know one case where the tickets were put in the security team’s ServiceNow instance while the IT team used JIRA. So the IT team was told to get accounts and go there to retrieve their tasks. Or, more likely, they just created scripts to pull tickets from one system and insert them into the other. Why are we adding more work to the remediation process?
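
For what it’s worth, the glue script in that story is the kind of thing sketched below: a rough illustration that pulls findings from a ServiceNow table and recreates them as Jira tasks over their standard REST APIs. The instance URLs, the custom table name, the project key, and the credential environment variables are all made-up placeholders.

```python
# Sketch of the "pull from one system, insert into the other" glue script.
# The instance URLs, table name, project key, and credentials are hypothetical.
import os
import requests

SNOW = "https://example.service-now.com"
JIRA = "https://example.atlassian.net"

def fetch_security_findings():
    """Pull open findings from the security team's ServiceNow instance."""
    resp = requests.get(
        f"{SNOW}/api/now/table/u_security_finding",        # hypothetical custom table
        params={"sysparm_query": "state=open", "sysparm_limit": 100},
        auth=(os.environ["SNOW_USER"], os.environ["SNOW_PASS"]),
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    return resp.json()["result"]

def create_jira_issue(finding):
    """Recreate the finding as a task in the IT team's Jira project."""
    resp = requests.post(
        f"{JIRA}/rest/api/2/issue",
        json={"fields": {
            "project": {"key": "IT"},                       # hypothetical project key
            "summary": finding["short_description"],
            "description": finding.get("description", ""),
            "issuetype": {"name": "Task"},
        }},
        auth=(os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"]),
    )
    resp.raise_for_status()
    return resp.json()["key"]

if __name__ == "__main__":
    for finding in fetch_security_findings():
        print("Created", create_jira_issue(finding))
```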

Because on the IT side, once they are given the findings, they need to research the impact, prioritize, determine which team is responsible, and assign the ticket in their platform (which might vary depending on whether the problem is a code change or an infrastructure change), then find or create a remediation. That step may involve finding and downloading a patch, or researching a code or configuration change, then determining whether there are dependencies the fix might break. Finally, they need to test the fix and push it out at scale.

There are new remediation workflow tools that support prioritization, normalization, and de-duplication of findings in order to route them to the appropriate team, and even create tickets assigned to specific people. You can do a lot of that today with SOAR (Security Orchestration, Automation, and Response) tools if there is a process and workflow to support that automation. And while that is great, and brings tremendous improvement, it only gets us so far.

Everything is ad hoc.

Another step within this remediation workflow is the risk assessment that decides whether to accept or mitigate the risk (not much can be transferred, unless there is a compensating control on another platform). Unfortunately, many risks are accepted because teams can’t remediate without impacting the systems, whether due to application compatibility requirements, resource requirements, performance or latency limits, or stability concerns. But often the problem is a basic lack of resources. There aren’t enough people to fix all the things, so they only focus on the most important remediations. Hopefully the process was effective at prioritizing the right ones in the first place.

Twenty years ago, when I ran ethical hacking teams, we used to do both external and internal network testing. After a while, customers would opt out of the internal network testing. Many customers told me they didn’t want to know about internal findings, because if a finding was documented, they would be on the hook to eventually fix it. So they accepted the risk based on being “inside the firewall.” Testing a handful or a couple dozen externally facing systems would produce a dozen or so findings. But testing hundreds or thousands of workstations, servers, and network devices would generate thousands of findings that they didn’t have the resources or time to fix. And at that time, there wasn’t a mature process to prioritize; we just relied on the severity levels from the tools.

We also had to argue with the IT department when we reviewed the findings to convince them these were real issues to fix. Many IT shops were used to their security teams’ scanning tools producing numerous false positives. Doing vulnerability testing or pen testing by hand is very different from using automated tools to scan, because there is a human in the loop to verify that the findings are real. We had a hierarchy of testing: at the bottom was automated vulnerability scanning with auto-generated reports. Next was vulnerability testing, which used automated tools, but findings were manually verified by a human and evidence was collected. Then came penetration testing (or ethical hacking), where, after a vulnerability test, the human would exploit any findings to see how far they could get. Pen testing is similar to red teaming, but red teaming is where you mimic a specific adversary’s capabilities against a specific target.

So, to convince the IT team a finding wasn’t a false positive, we would include screenshots of success as evidence. But even then, for some, that wasn’t enough: they insisted it couldn’t be true because they had patched that system, or they accused us of photoshopping the picture to make them look bad. In those cases we would have to drop a file on the system that read “we were here.” It was always fun when, after a few minutes of protesting that the FTP vulnerability wasn’t possible to exploit externally, they looked at the directory and saw the drop file that proved them wrong.

We would also run into policy roadblocks, or loopholes. I remember we did an external pen test for a state government network (I won’t say which, even after 20 years). After we gave them the report, they created a remediation plan, which they let us review and approve. The next year, per their policy, we tested them again and found the exact same findings; it was as if nothing had changed. When asked about this in the outbrief, they indicated that their policy was to create a remediation plan after the pen test. There just wasn’t any requirement to actually fix things. So they did only what was required, because they didn’t have the resources to do the fixing.

That was a shock.

In subsequent years, when scoping a pen test, I always asked, “Is your intent to fix these findings, or just to create a remediation plan to satisfy your policy?” I was amazed at how many answered that they just needed to test, not fix, the findings.

I got to experience this firsthand 10 years ago when I was CISO at a midsized multinational company. We had a tiny security team and a small IT team of fewer than eight people. Most of our process was ad hoc. I found I spent most of my time convincing them they needed to fix things, while seeing they had only one or two people with the expertise to do it on the different platforms. Some of our findings took remediation research to come up with a fix and then scripting to apply it, and they didn’t have the expertise to do that at all. So I had one of my security guys get the details and create the script package for them to run. Otherwise, we would never have gotten anything fixed. I wish SOAR had been invented back then.

How do we automate this mess?

So how do we automate? I’m fortunate enough to talk to more vendors and startups than most people, so I get to see new categories of tools develop and how different startups approach the problem. First, as I mentioned, there are workflow support tools. These will aggregate findings from the scanners (network, application, container posture management, bug bounty, DLP, data, identities, etc.), de-duplicate and group them, determine who is responsible for remediation and what workflow process they use, develop priority based on business impact (not platform or CVE severity rating), and create a ticket in the appropriate ticketing system. Some can also batch the findings, so rather than sending all 100 at once, they queue them and push only 20 a week if throttling is needed. Others include remediation steps for a person to take. And others will read existing tickets to find similar issues and highlight that this has been fixed elsewhere before, and how it was done. Because they have visibility into both sides, the ticketing system and the testing source, if a finding is seen again they can simply update the open ticket where it was first seen instead of creating a new one. Their interfaces can track remediation status by team, platform, or business area, send out reminders, and measure teams against SLAs. These give a central picture of the process.
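
Here is a stripped-down sketch of that core workflow logic, my own illustration rather than any vendor’s implementation: de-duplicate findings by fingerprint, update the existing ticket when a finding recurs, route by finding type to the right ticketing system, and throttle how many new tickets go out per run. The fingerprint fields, routing table, and batch size are stand-ins.

```python
# Illustrative sketch only: de-duplicate, route, and throttle findings.
# The fingerprint fields, routing table, and batch size are stand-ins.

ROUTING = {"code": "jira", "infrastructure": "servicenow"}  # finding type -> ticket system
BATCH_SIZE = 20                                             # new tickets allowed per run

open_tickets: dict[str, dict] = {}                          # fingerprint -> ticket record

def fingerprint(finding: dict) -> str:
    """The same rule on the same asset, from any scanner, collapses to one ticket."""
    return f'{finding["rule_id"]}:{finding["asset"]}'

def process(findings: list[dict]) -> None:
    new_this_run = 0
    for f in findings:
        fp = fingerprint(f)
        if fp in open_tickets:
            open_tickets[fp]["seen_count"] += 1             # recurrence: update, don't re-ticket
            continue
        if new_this_run >= BATCH_SIZE:
            continue                                        # throttled; queue for the next run
        open_tickets[fp] = {
            "system": ROUTING[f["type"]],
            "priority": f["business_priority"],
            "seen_count": 1,
        }
        new_this_run += 1
        print(f'ticket created in {open_tickets[fp]["system"]} for {fp}')

process([
    {"rule_id": "rule-123", "asset": "web-01", "type": "infrastructure",
     "business_priority": "high"},
    {"rule_id": "rule-123", "asset": "web-01", "type": "infrastructure",
     "business_priority": "high"},                          # duplicate -> increments the ticket
])
```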

The next level of tooling is remediation automation. I’m not talking about SOAR, where humans script a workflow based on previous experience, but tools that automatically develop the remediation steps themselves, be it code updates, configuration scripts, or maybe compensating controls if a patch isn’t out yet. They can be configured to create a ticket with a button to apply the fix, or to apply the fix first and then create a ticket saying the problem was remediated, with a button to back out the fix if there was a problem.
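
That “fix, ticket, and back out if needed” pattern boils down to a small control loop. The sketch below is a hedged illustration only; the apply_fix, health_check, rollback, and create_ticket hooks are hypothetical stand-ins, not any real product’s API.

```python
# Sketch of the auto-remediate-with-rollback pattern. The apply_fix, health_check,
# rollback, and create_ticket callables are hypothetical hooks, not a vendor API.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("remediation")

def remediate(finding_id: str, apply_fix, health_check, rollback, create_ticket) -> bool:
    snapshot = apply_fix()                      # returns whatever is needed to undo the change
    if health_check():
        create_ticket(finding_id, status="remediated",
                      note="Auto-fixed; rollback artifact attached.", artifact=snapshot)
        log.info("%s remediated", finding_id)
        return True
    rollback(snapshot)                          # back the change out automatically
    create_ticket(finding_id, status="needs_human",
                  note="Fix failed the post-check and was rolled back.", artifact=snapshot)
    log.warning("%s rolled back", finding_id)
    return False

# Toy usage with stub hooks:
remediate("F-42",
          apply_fix=lambda: {"old_config": "tls1.0 enabled"},
          health_check=lambda: True,
          rollback=lambda snapshot: None,
          create_ticket=lambda fid, **kw: print("ticket:", fid, kw))
```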

This is where teams start to get nervous. We are very scared of automated updates to our systems, because an update could go bad, or really bad, like the CrowdStrike incident in July 2024.

War stories.

Before talking about that incident, let me tell you a story from almost 25 years ago when I was a pen tester.  We used to invite our customers in while we did testing.  Back then, pen test teams working for large accounting firms weren’t transparent about their process. I did warn them it was pretty boring, because real testing wasn’t like movie hacking where you are frantically typing on a keyboard like you are fighting a video-game boss battle.  It’s about collecting, verifying, and entering a target range, configuring the script or tool for how you want to scan, and running it.  Then you sit and wait.  There was a lot of sitting around early on in pen testing.  

So, this one time, I had a customer security director sitting with me while I scanned his external network. After about five minutes, the scan failed, which wasn’t unusual, so I set it up again and restarted. I then got a destination error. He told me, “Ah, our network shunned you.” Because he said the word “shunned,” I knew that was a specific term from the Cisco intrusion prevention tool NetRanger, which was one of the industry’s first attempts at auto-remediation. I asked him if it was okay if I shut down his network. He was confused at first about how I would do that, but said, “Sure, I can tell my team to reset; this is a pen test and they know this could happen.” In later tests I would have the customer put that permission in writing before attempting it, but this time I just took his word for it.

For you techies listening, I used the famous Netcat tool to set the source address of my Nmap scan to be his company’s upstream gateway router. The IPS then saw the scan originating from the default gateway router, tagged it as malicious (because why would the gateway router be scanning us?), and shunned it. After about two minutes, my client’s phone rang; it was his team saying their internet had gone down. He told them what happened, and they removed the block in about five minutes to restore service. I always remember that experience, and I have distrusted IPS ever since.

Another history lesson I won’t go into is about the failed security company Norse from 10 years ago, which put IPS in front of their customers’ firewalls with the intent to automatically block bad sites based on their threat intel data. They were the ones with the pew-pew world map that showed “attacks” from different parts of the globe. But that is a podcast story for another time. If you know, you know.

Let’s talk about CrowdStrike.

Getting back to CrowdStrike: while many blamed them and others blamed Microsoft, I look at it as a failure of our internal change control processes. Yes, CrowdStrike shouldn’t have published an update that blue-screened its customers’ devices; no argument there. But CrowdStrike customers have some blame too. Why are we deploying software updates without testing?

Those of us old enough will remember that 20-25 years ago, we would burn in new patches for three to five days before pushing them out widely. Then in 2002, Microsoft started its Trustworthy Computing initiative; it began threat modeling its software, created Patch Tuesday, and in turn released more stable patches. Over the years, we got comfortable just accepting patches and updates from vendors, especially malware signature updates, and eventually most teams dropped that burn-in step from their process. And while CrowdStrike wasn’t the first or only time boxes were bricked by an update in the last 15 years, it was certainly the largest.

I wish most organizations had let the update hit a small number of devices first for burn-in testing, even if just for an hour, before opening it up to all. I understand that 99% of organizations don’t have the change control maturity to set that up, but at least do it for the most critical systems. Surely we can wait a bit before potentially impacting our critical business systems.
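
That burn-in gate doesn’t require sophisticated tooling. Something like the following sketch expresses the idea: update a small canary group, let it soak, check health, and only then open the update to everyone. The deploy_to and healthy functions are hypothetical stand-ins for whatever endpoint management you already have.

```python
# Sketch of a burn-in gate for vendor updates. deploy_to() and healthy() are
# hypothetical stand-ins for whatever endpoint-management tooling is in place.
import time

def staged_rollout(update_id: str, canary_hosts: list[str], all_hosts: list[str],
                   deploy_to, healthy, soak_minutes: int = 60) -> bool:
    deploy_to(update_id, canary_hosts)                 # small, non-critical group first
    time.sleep(soak_minutes * 60)                      # let the update burn in
    failures = [h for h in canary_hosts if not healthy(h)]
    if failures:
        print(f"Holding {update_id}: unhealthy canaries: {failures}")
        return False                                   # never reaches the critical systems
    deploy_to(update_id, all_hosts)                    # only now open it up to everyone
    return True

# Toy usage with stub hooks (soak time set to zero for the example):
ok = staged_rollout("sensor-update-123", ["canary-01"], ["canary-01", "host-02"],
                    deploy_to=lambda uid, hosts: print("deploying", uid, "to", hosts),
                    healthy=lambda host: True,
                    soak_minutes=0)
print("opened to all hosts:", ok)
```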

A Cambrian explosion of temporary software.

That “test first” perspective is what I’m talking about with automated remediation: not blindly fixing things because we can, but having that QA step in the process. And it is easier to do today. There are automated tools coming out that use AI agents to do regression testing, compare pre- and post-remediation performance, and publish that evidence when they create the ticket with the code and script to remediate. Twenty years ago we couldn’t just spin up a device, run a script, and then tear it down. We had to buy or find hardware, find a place to put it, plug it in, set it up, install software, and so on. Now we spin up systems continuously throughout the day in our cloud workloads.
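
Here is a rough sketch of that spin-up, test, tear-down loop, driving the Docker CLI from Python; the image name, the fix script path, and the test command are placeholders, and a real pipeline would attach the captured output to the ticket as evidence.

```python
# Sketch of "spin it up, apply the fix, run the regression tests, tear it down."
# The image name, fix script path, and test command are placeholders.
import subprocess

def run(*cmd: str) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)

def test_fix_in_throwaway_env(image: str = "myapp:candidate") -> bool:
    container = run("docker", "run", "-d", image).stdout.strip()   # ephemeral test instance
    try:
        fix = run("docker", "exec", container, "/opt/remediation/apply_fix.sh")
        tests = run("docker", "exec", container, "pytest", "-q", "/opt/app/tests")
        evidence = fix.stdout + tests.stdout                       # attach this to the ticket
        print(evidence)
        return fix.returncode == 0 and tests.returncode == 0
    finally:
        run("docker", "rm", "-f", container)                       # tear it back down

if __name__ == "__main__":
    print("fix verified:", test_fix_in_throwaway_env())
```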

Another thing that has changed is the number of sources of vulnerability information. Back in the day, we had vulnerability scanners: just one thing, run ad hoc or on some monthly schedule, that produced a long report of things to fix. Today, we have added posture management tools looking at anything running compute or storage in the cloud, as well as at identities, APIs, and temporary workloads in containers. This generates exponentially more things to fix.

The modern continuous integration/continuous deployment (CI/CD) pipeline links together the design, coding, testing, deployment, runtime, maintenance, and sunsetting of an application seamlessly and in real time. The environment is so dynamic that many temporary workloads might only be up for minutes or hours. By the time a human gets a report that something might be wrong, that system is already long gone.

Today, developers create infrastructure and applications that use service accounts connecting to multiple data sources via a galaxy of APIs. Everything is more collaborative than the applications of old. Back then, developers built single applications, then separated client and server, then added web interfaces and middleware, and then created distributed service-oriented architectures that appear from the outside like a mesh of cloud chaos.

Automation as a First Principle Strategy: RemSecOps.

So how do we make sure we comprehensively assess all the things in all the places? Well, there are now layers of tools that leverage even more service accounts to access APIs into everything from code repositories, servers, and data stores to identity platforms and network and security devices. You can now aggregate that information to get visibility into what is going on in each platform. And each of these tools creates findings to remediate. There is a lot of overlap in what they see, since you might have four different tools looking at the same management platform, like Terraform, creating duplicate findings, or the same finding on every temporary container that is spun up. The responsibility to fix all of these now crosses over into many teams, who might have different remediation workflows and ticketing systems. So the better tools have to be smart enough to know that code findings go to something like JIRA and infrastructure findings to something like ServiceNow. This reinforces the need for these workflow management tools. I know we hate to add another layer of technology, especially if we aren’t taking something away, but we need it to bring all these workflows together.

This brings me back to my original premise: we have to automate, and we have to have intelligence to help route, de-duplicate, prioritize, and group findings; develop remediation scripts; test; create and assign tickets; and feed it all into the configuration change control process. For some systems there might be SLAs for remediation; for others, different quality gates and management approvals; and others might even have to route to a steering committee for approval. You don’t want to change that, but we do want to automate as much as we can to speed it up. We have DevOps and DevSecOps; now we need RemSecOps, “remediation security operations.” I’m not trying to coin a new term, just to describe the automation of a process, as those earlier terms did.

There are tools that can help us. Those tools must align with your process and departmental workflows, and we need to get comfortable letting systems automatically fix things for us. Not blindly: they will still create a ticket and get approvals as needed, but organizing the research and creating the fixes can all be automated to help humans be more effective. This leaves them time to keep working on governance and process, refining collection sources and data quality, and focusing on the more nuanced troubleshooting that machines aren’t good at.

Thank you so much for listening to my rant about automated remediation. I think this is the future, but we need to go into it smartly and deliberately.