Research Saturday 11.23.24
Ep 355 | 11.23.24

Exposing AI's Achilles heel.

Transcript

Dave Bittner: Hello everyone, and welcome to the CyberWire's "Research Saturday." I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down the threats and vulnerabilities, solving some of the hard problems, and protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us. [ Music ]

Ami Luttwak: So, the Wiz research team focuses on finding critical vulnerabilities in cloud environments, and recently we've focused a lot on AI research.

Dave Bittner: That's Ami Luttwak, Co-Founder and CTO of Wiz. Today, we're discussing their research, "Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments." [ Music ]

Ami Luttwak: Yes, we've published several research efforts where we found vulnerabilities in huge AI services, the AI services that provide AI capabilities to most organizations in the world, like Hugging Face, Replicate, and SAP. So, we started thinking, "Okay, what could be an attack surface on the entire AI industry?" And when we started thinking about it, we got to the software stack of NVIDIA. Right? Because we all know that NVIDIA is an amazing company. They have the GPUs that everyone uses for AI. But a little-known fact is that there is also a pretty considerable software stack that comes together with the GPUs, and that software stack is used by almost anyone using AI. So, we thought that if we could find a vulnerability there, it could affect the entire AI industry. That's how we started looking into the NVIDIA Container Toolkit.

Dave Bittner: Well, we're talking today about CVE-2024-0132, which affects NVIDIA's Container Toolkit. Can you walk us through exactly what is involved here with this vulnerability?

Ami Luttwak: Yes. So, what is the NVIDIA Container Toolkit? It's basically a piece of software for anyone who wants to use a GPU and share it across multiple users, and that happens a lot because GPUs are expensive. You add GPU support to your container, so the container itself can access the GPU and leverage its resources. So, this Container Toolkit is used by almost anyone who builds an AI application on top of GPUs when the application is containerized. Now, the vulnerability that we found allows the container image to escape from the container and basically take over the entire node. That means that if the container image comes from a source that is not controlled by the service provider, this container image can escape and read any secret, any file, and even execute code on the host that runs the GPU itself.
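To make the workflow concrete, here is a rough sketch of how the toolkit is typically wired into Docker. The commands follow the standard NVIDIA Container Toolkit integration; the image tag is illustrative, and this assumes Docker, an NVIDIA driver, and the toolkit are already installed:

```
# Register the toolkit as a Docker runtime hook
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run a container with GPU access; at start-up the toolkit injects the
# driver libraries and GPU device nodes into the container's namespace
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

That injection step, where host resources are mapped into the container at start-up, is exactly the trust boundary the vulnerability sits on.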

Dave Bittner: Well, how could the attacker escape the container and then gain control of the host system?

Ami Luttwak: Oh, okay. So, in theory this shouldn't be possible, right? If I run a container that has no capabilities, no permissions, how can it be that this container can escape and take over the entire server? What we found is a vulnerability within the NVIDIA toolkit: if we craft a very specific container image that uses very specific features within the NVIDIA Container Toolkit, the toolkit mistakenly maps the entire file system of the server into my container, which is untrusted. That means we can read any file on the underlying server because of this vulnerability, and we showed that once you have read access to any file on the server, you can actually run a privileged container that takes over the entire server. So, this vulnerability, which allowed us to map the entire server file system into our container, of course also allows a full takeover if you want.

Dave Bittner: And is this specific to GPU-enabled containers? I mean, are they more susceptible to this type of attack?

Ami Luttwak: So, this is obviously wider than AI; it's actually any usage of GPUs. It could be for gaming. So, this basically affects almost anyone using NVIDIA GPUs with containers. The reason it's tied to GPUs is just that this is the software stack used there, right? You usually wouldn't find this library when you don't have GPUs, because it's a library that enables GPU integration. It's not actually a bug in the GPU; it's a bug in the software stack that is used by most GPU users.

Dave Bittner: Well, what about multi-tenant environments, Kubernetes clusters, those sorts of things?

Ami Luttwak: So, I think in multi-tenant environments the risk is much, much higher, and it becomes a crucial risk. The exact use case that we started this research for was environments where you are multi-tenant and you allow others to run their own container images, right? In that scenario, a malicious container image can escape the isolation and potentially access the images from other users. So basically, in a multi-tenant environment, there is a huge risk that this container escape vulnerability allows the attacker to get access to anyone using the AI service. And this is why we on the Wiz research team always recommend: when you build applications, remember that containers can be escaped. Do not trust the container as a way to isolate your tenants. Even if you build a multi-tenant service, do not rely on containers; always add another, stronger virtualization layer. And this is a good example of why that's so crucial: we found the vulnerability, and if you didn't build the right isolation, your service is at risk right now.

Dave Bittner: Now, my understanding is NVIDIA recently released a patch for this vulnerability. How should organizations prioritize their patching?

Ami Luttwak: Yes. So, we worked very closely with NVIDIA, and they responded very fast. They closed the vulnerability within a few weeks from the time we disclosed it to them, and the patch was released. This vulnerability affects anyone using GPUs. However, if we look at what is really crucial to fix, it's most urgent in environments where you allow an untrusted image to run, right? Because if you trust the image, and you know it's not coming from an untrusted source, the attacker's ability to leverage this vulnerability is highly limited. However, if you have environments where researchers download untrusted images, or multi-tenant environments that run images from users, those environments are at high risk right now, and that's what we recommend prioritizing and fixing today.
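That triage logic can be sketched in a few lines. This is a minimal illustration, not an official tool: the fixed version here is taken as 1.16.2 per NVIDIA's advisory for CVE-2024-0132, but you should confirm it against the advisory for your platform before relying on it.

```python
def parse_version(v: str) -> tuple:
    """Turn a version string like '1.16.1' into (1, 16, 1) for comparison."""
    return tuple(int(part) for part in v.split("."))

# Assumed fixed release for CVE-2024-0132; verify against NVIDIA's advisory.
FIXED_VERSION = "1.16.2"

def patch_priority(installed: str, runs_untrusted_images: bool) -> str:
    """Rough priority label for a host running the NVIDIA Container Toolkit."""
    if parse_version(installed) >= parse_version(FIXED_VERSION):
        return "patched"
    # Unpatched hosts that run untrusted images are the urgent case:
    # multi-tenant services or researchers pulling third-party images.
    return "urgent" if runs_untrusted_images else "schedule"

print(patch_priority("1.16.1", True))   # urgent
print(patch_priority("1.16.2", False))  # patched
```

The point of the `runs_untrusted_images` flag is exactly the prioritization described above: trusted-image-only hosts still need the patch, but they can be scheduled rather than treated as emergencies.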

Dave Bittner: Well, what about the various attack vectors that are possible here? Are there particular attack vectors that folks should be aware of?

Ami Luttwak: Basically, container escape is just the first step of an attack, right? Once you escape the container, you can steal all of the secrets, you can get access to any AI model on the server, you can start running code in other environments. So, the container escape on its own is just the beginning of the attack; you can think of it as the initial access into the environment. In a classic attack, this would just be the first step, and every step from there depends on the specific use case and architecture. However, what's important to understand is that many companies do run untrusted AI models, right? We've talked about it in other research we've done: researchers download AI models without any way to verify them. So there's this assumption, "Someone is running an untrusted AI model, but we thought it's fine to run AI models in containers; nothing is going to happen to me." That assumption is not true. [ Music ]

Dave Bittner: We'll be right back. [ Music ] What are some of the other isolation barriers that people should be using here? Are we talking about things like virtualization?

Ami Luttwak: Exactly. So basically, when we design for isolation, especially for multi-tenant services, containers are not a trusted barrier. Virtual machines are considered a trusted barrier, because if you look at recent years, how many container escape vulnerabilities have we found? How many vulnerabilities in the Linux kernel have we found? A non-negligible number. However, in virtualization environments that is very, very rare, right? And that's why, as a security practitioner, when I review an architecture, a virtual machine is the best way to isolate. Now, there are tools today like gVisor, a tool you can run that limits the ability of a workload to go outside a specific set of approved capabilities, which reduces the risk significantly. gVisor is not as secure as running a full virtual machine, but it's an example of a tool that provides great isolation capabilities without changing your entire architecture.
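In Kubernetes, this kind of sandboxing can be opted into per workload via a RuntimeClass. The fragment below is an illustrative sketch; it assumes gVisor's `runsc` runtime is installed on the nodes and registered as a handler with containerd, and the image name is a placeholder:

```yaml
# RuntimeClass that routes pods to the gVisor (runsc) handler.
# Assumes runsc is installed and configured in containerd on the nodes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-inference
spec:
  runtimeClassName: gvisor   # extra isolation beyond a plain container
  containers:
  - name: model
    image: registry.example.com/tenant-model:latest  # illustrative image
```

Workloads that run trusted, first-party images can stay on the default runtime; only the pods that run tenant-supplied or downloaded images need to pay gVisor's overhead.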

Dave Bittner: What about organizations that might allow, let's say, third-party AI models or third-party container images to be running on their GPU infrastructure? Do you have any advice for them?

Ami Luttwak: Yes. So, I think that happens a lot, right? It happens both for AI service providers and for anyone that has a GPU and allows anyone in the company to run code on it. The implications here are, first of all, that you have to patch, right? That's number one, just patch the vulnerability. But the wider implication is that we need to look at AI models and container images that come from third parties just like we look at the applications we download. When you get an e-mail from someone with an application attached, you know and I know that we would not start running the application we got from that e-mail, right? Because we all know that can be malicious. So why do we trust a container image or an AI model from an untrusted source? We should be a bit more careful, because this is code that we are running, and we need to remember that this is a new attack surface for attackers. Downloading applications from e-mails used to be a great attack surface, and I hope no one is clicking on e-mails and actually running applications from them anymore. This is going to be a new attack vector, right? Everyone talks about AI, so they just run everything that has the name AI on it; any AI model gets run. No, we have to remember this is a security risk, a new attack vector. Anything we run is either fully isolated on a separate VM and so on, or we have actual processes in the company to verify what is being run as an AI model and where. If you get an untrusted AI model, okay, you can only run it in a highly isolated environment, right? If we don't have those kinds of guardrails, then we expose ourselves to a lot of risk.

Dave Bittner: You mentioned that NVIDIA was a really helpful partner here in this disclosure. Can you walk us through what that process is like? I mean, for folks who've never been through it, what goes into responsible disclosure with a big organization like NVIDIA?

Ami Luttwak: Great. So, first of all, how do we engage with them? Every company that has a product has a security program for how to report vulnerabilities to it; usually there is an incident response e-mail that is published. So, we approach the vendor, and there is a protocol that you have to follow, right? When we report the vulnerability, we do not provide anyone outside of the vendor information about it until it's fully patched. The entire discussion is highly sensitive and secretive, between us and the vendor. During that discussion, we try to provide the vendor a full disclosure report with all of the information that we found. And during the research, we usually try not to touch actual customer data of that vendor, so they don't get into any issues with their customers. So, what we try to do as researchers is find the problem, provide the vendor a full report, and then basically wait: once we send the e-mail, we just wait until the vendor replies to us. In NVIDIA's case, they worked really fast. They provided us responses almost within a day, and they worked until they fixed the vulnerability. As I said, this was fully patched within two or three weeks. Now, during that time we communicate with the vendor: if they have any questions, or there's anything we found that they don't know how to replicate, we help them reproduce it. And the goal, again, is to make sure that the vendor has all of the information needed to fix the vulnerability. Because when we find a vulnerability and report it, think about it, it's like a weapon, right? Until someone actually patches it, it's very, very secret, and we cannot disclose or talk about it, even with our friends, partners, and customers. We cannot talk about it with anyone, because I have a weapon now, and until the vendor actually finishes the fix effort, we have to remain silent about it.
Now, once the vendor has a patch, our role as researchers is to explain the vulnerability to the world and why it's important to patch it. Something that's important to understand is that although we talk about it, we do not disclose at the beginning how the exploit actually works, right? We hold that back because we want to give the good guys time before any bad guy can leverage the vulnerability. So, although NVIDIA patched the vulnerability, since we didn't disclose exactly how to exploit it, we're giving the good people time to fix it before anyone can actually exploit it.

Dave Bittner: And do you know if this is being actively exploited? Do you have any methods to be able to track that?

Ami Luttwak: So, there is no way to know that for sure, right? We have some ways, because we are also a security company and we are connected to millions of workloads, so we are actually monitoring the environments that we see for any potential exploitation. We haven't seen exploitation of this vulnerability in the wild yet, but that doesn't mean it won't happen soon. And of course, our view is limited, because we see only cloud environments; there are huge numbers of GPUs deployed in on-premise environments, outside of the cloud providers. So our view is very limited, and NVIDIA wouldn't see it either, because this actually happens on a local GPU, right? So, no one can tell you for sure whether this is already being exploited. I do think, again, that this is not a vulnerability that is easily exploited, because you need the ability to build an image and then you need to get that image published and run. So, it takes time until this kind of vulnerability can be leveraged by an attacker.

Dave Bittner: Well, for those organizations who are running AI models in containers, what are some of the best practices they should follow to help mitigate these risks?

Ami Luttwak: That's a great question. You know, there's so much buzz about AI security, and many times people talk about, "Oh, how AI is going to take over the world," or how attackers are leveraging AI to take over my company. But the real risk right now for AI usage is the AI infrastructure you use, right? I mean, if you look at this vulnerability, where does it come from? It comes from the AI infrastructure that you have in the company. Everyone that's starting to use AI has dozens or hundreds of tools that are used for AI, and those tools are bringing real risk right now. So, if I think about the best practices coming out of this vulnerability: number one, you need to know what AI tools are being used in your company by the AI researchers. And again, I want to endorse AI usage, but I need to be able to say, "I have visibility into all of the AI environments and all of the AI tooling across my company," right? The second step is, as we saw here, AI models are great, but they're also kind of risky, so you need to define AI governance processes. Basically: which projects are using AI? Which models are they using? What's the source of each model? Where are you testing AI models; is it running in an isolated test environment? All of those are definitions that each company needs to make. I call this AI governance: it's composed of AI discovery, the ability to define AI testing, and all of those processes are important to define right now. And every company that has an AI team and a security team should have them start working together to define these kinds of practices. It's always better to do it early than to do it later. [ Music ]

Dave Bittner: Our thanks to Ami Luttwak from Wiz for joining us. The research is titled, "Wiz Research Finds Critical NVIDIA AI Vulnerability Affecting Containers Using NVIDIA GPUs, Including Over 35% of Cloud Environments." We'll have a link in the Show Notes. We'd love to know what you think of this podcast. Your feedback ensures we deliver the insights that keep you a step ahead in the rapidly changing world of cybersecurity. If you like our show, please share a rating and review in your favorite podcast app. Please also fill out the survey in the Show Notes or send an e-mail to cyberwire@n2k.com. We're privileged that N2K CyberWire is part of the daily routine of the most influential leaders and operators in the public and private sector, from the Fortune 500 to many of the world's preeminent intelligence and law enforcement agencies. N2K makes it easy for companies to optimize your biggest investment: your people. We make you smarter about your teams while making your teams smarter. Learn how at n2k.com. This episode was produced by Liz Stokes. We're mixed by Elliott Peltzman and Tre Hester. Our Executive Producer is Jennifer Eiben. Our Executive Editor is Brandon Karpf. Simone Petrella is our President. Peter Kilpe is our Publisher. And I'm Dave Bittner. Thanks for listening. We'll see you back here next time. [ Music ]