The Four Horsemen of Agentic Risk

Transcript

Sailesh Mishra: It can have hidden attacks, which when they enter into your system and get executed, all your data might get exfiltrated. Be careful into what tools you use and make sure you scan them through security software. [ Music ]

David Moulton: I'm David Moulton, and this is "Threat Vector." Today, I'm speaking with Sailesh Mishra about the security risks that come with a new class of AI tools, ones that don't just answer questions but take action. We're talking about autonomous agents that have persistent memory, access to your credentials, and the ability to send messages, execute commands, and interact with the web on your behalf. Sailesh has spent years working at the frontier of AI security, from Uber's Advanced Technologies Group to building and acquiring AI companies, and now at Palo Alto Networks. He has a clear-eyed view of what we're building, what we're risking, and what needs to change. Sailesh, welcome to "Threat Vector." Really glad to be able to talk to you, especially after reading your recent article.

Sailesh Mishra: Thank you so much, David. It's a real pleasure having this discussion with you.

David Moulton: Before we get into the meat of today's conversation, I want you to talk to me a little bit about that path from Uber's Advanced Technology Group to building AI security companies. Well, I know that you've had this front row seat to the evolution of AI, right? From scaling the human in the loop data pipelines and then growing AI security companies. What has that journey taught you about the relationship between AI capability and risk? I think you were starting to go there and I'm curious where you compare those two things.

Sailesh Mishra: I guess earlier, especially with AVs, we used to have a lot of discussions, and a lot of papers were published around the ethics of AV. We used to have different models and different diagrams that would show you like if an AV is actually going on two different tracks, what should it do. That was kind of an open question that was still being grappled on. But when you think of AVs, when you think of self-driving cars, risk was built into the capability. And that's what I was alluding to. It's about navigating tightly parked roads is actually a capability, but the undertone there is don't hit a car that's on the roadside, right? Your capability is to be able to predict what an object that you've actually seen from a distance, you should be able to predict the next second movement of that particular object. But if you're unable to do it, yes, it becomes a risk because you might end up colliding with that object. But that becomes a capability definition and not pure play risk definition. That's how these models were being built, these capabilities were being built. Now what we have actually seen is something different. We have started building models and agents that are super, super capable. They can write code, they can write emails, they can probably control your home alarm systems, but are they really safe and secure? So suddenly what we are now seeing is in this new world of AI agents, we are actually starting to see capabilities and risks being delineated a little bit. And that's where the difference starts to be.

David Moulton: So is there a specific moment that you've seen a gap between what the systems could do and what it should do that's become a real-world problem?

Sailesh Mishra: Yes. A child carrying, let's say, a small, its own bicycle versus a shopping trolley. If I don't identify the objects correctly, and if I'm not able to predict what's happening in the next immediate second or the next immediate timeframe, then my capability of maneuvering that is going to get lost. And that's where what the system should do in that point of confusion is probably stop, right, and let the world around it move to a certain level. What the system ends up doing is slightly different sometimes. Which is why testing environments were so, you know, frequently used with these autonomous vehicles back then.

David Moulton: So you co-founded SydeLabs, which was acquired by Protect AI. That's a startup that went from zero to acquisition. Can you talk about some of the things that come to mind when you think about the shift that's occurred in the last couple of years.

Sailesh Mishra: Yes. It's not changed a bit, it's changed a lot. We were talking about generations and answers, and essentially, we were talking about a machine speaking words. Now it's literally a machine moving and doing things. So the threat landscape has moved on from, you know, just saying something wrong, saying something ill about someone else, and now we are actually talking about executable code, distributing malware, ransomware. We are actually discussing, you know, wiping out entire email inboxes, wiping out your entire systems because of a prompt that you've actually given. So the threat landscape has changed a lot. We were not thinking about these when we actually started off two years back. I believe agents were not even such a sort of frequented conversation even one year back. And now we are seeing a completely different world out here.

David Moulton: Let's set the stage a little bit. When you look at a new class of autonomous agents, the kind that have access to your files, your credentials, your messaging apps, and then they have persistent memory across sessions, how do you characterize the attack surface compared to what came before?

Sailesh Mishra: The way I'd like to think about this is when we are actually talking about autonomous agents, we are essentially, if I can visualize it, the way I see it is systems have actually come closer to each other now. Why? Because there are different protocols and there are different mechanisms how a large language model today with its brainpower is able to access multiple different things on your endpoint, which is your laptop or your mobile phone. And at the same time, it also has access to the web. It can browse the internet and fetch you any kind of information. It has access to your workspace in the sense, maybe your Google accounts. It has access to your calendars. It can just like schedule different meetings on your behalf. When all of these capabilities are kind of combined, what you end up seeing is what we saw in a recent example where it's kind of ironic that it was a chief trust officer maybe at Meta who actually came forward -- and I'm forgetting the exact name and the designation. But they came forward and said that I installed OpenClaw on my system. It essentially just wiped out my entire email inbox and then it basically said sorry and said that it won't do it again. But that's the kind of, you know, sort of attack surface that we're talking about. Earlier, it would be a response generated in a chatbot. And when I'm saying "earlier," I'm actually talking about earlier generative AI, which is probably just a year or two years back. Before that, we were thinking about networks and firewalls. We used to think about that you have a digital asset that you want to protect. Let's build a wall around it, right? And then eventually, let's start to punch through those walls and see where the weak points are. That's how security mindset was actually, you know, prevailing. But now, it's completely different. You suddenly realize that the wall is not good enough and you need to probably take a completely different mindset towards securing these agents.

David Moulton: So you said that you need to take a different mindset. Is this because the challenge here is that that agent itself is the asset and the attack vector at the same time?

Sailesh Mishra: Yes, that's a very interesting way to frame that. So the agent is the asset. You are kind of protecting the agent from, say, a bad actor trying to manipulate it from the outside. You're also trying to make sure that the agent doesn't go ahead and do anything else on its own, which it's not supposed to do. A clear example can be if -- I mean, many people have used cloud code now. So we know that it actually retries and escalates to achieve the task for you. In this process, there is a good possibility that the agent actually drifts from its exact scope of work and ends up maybe creating a certain file which it's not supposed to, run some shell commands that it's not supposed to do, you know, download a particular library from the open internet which is publicly available to everyone, which might be malicious because it may not know that it's malicious. So yes, while we are thinking of agent as the asset that we need to protect from bad actors -- so this is the wall concept -- but we also need to think about the agent, that it can itself do a lot of wrong. Almost like, you know, a curious child. It doesn't know that it's doing wrong. It's just an obedient person that's helping you out. But that may not be the case when these agents are actually implemented in an enterprise ecosystem.

David Moulton: Right. It becomes the autonomous insider.

Sailesh Mishra: Yes, it does.

David Moulton: So there's this term out there, the "lethal trifecta," I think Simon Willison said it.

Sailesh Mishra: Yes.

David Moulton: The lethal trifecta for AI was access to private data, exposure to untrusted content, and the ability to communicate externally. You've written about how persistent memory adds a fourth dimension. Walk me through what that actually enables for an attacker.

Sailesh Mishra: So when Simon Willison talked about the lethal trifecta, you should think of it as a point-in-time attack surface. If an agent has access to confidential data, which could be, let's say, your payroll data, has the ability to send an email to anybody because it has access to multiple different tools at its disposal, and it has the ability to, say, fetch new HR policies and governance rules that have probably come out in a particular country that, you know, is of our consideration for this thought experiment, in that situation, if something goes wrong, it goes wrong right then. You're basically talking about a stateless attack. With persistent memory, there is a good possibility that this attack becomes stateful. It can exist in its memory. An agent's memory essentially is going to consist of the previous interactions that it has seen. It is also going to consist of a lot of different content that it has actually fetched from the internet maybe. So with all of that context stored in its memory -- and I'm loosely using the word context "memory" here, but humor me. With all that context stored in its memory, it's actually going to be open to an attack later in time. So an attacker can build a logic bomb, a delayed prompt injection attack, where what it can do is instead of -- so let's say you build a firewall, right? You build a firewall or a guardrail, and you basically think, okay, an attacker is trying to manipulate my agent. I'm not letting it do it right now. But what the attacker with a persistent memory scenario can do is it will send benign pieces of information into the agent, which our current guardrails cannot detect because they have nothing harmful in them. And later point in time, based on a certain trigger, because all of this is stored in the memory, these attacks can actually start getting assembled and trigger at a certain condition. And when this happens, you have absolutely no idea when these attacks came into our system, when are they getting executed. I can basically create a trigger saying, whenever you search for my salary, assemble these pieces and trigger the attack, right? And exfiltrate X amount of data. That's what persistent memory will enable attackers to do.

David Moulton: So as you're talking about this, it's reminding me of a podcast I've been listening to called "The Spy Who," and it digs into these historical fictionalized concepts of espionage or spying. And, you know, in one of the recent episodes, you put an X on a certain post on a certain intersection, and that allowed the spy and the handler to know that they needed to meet. But, you know, the little X on the post was really hard to detect and there's nothing wrong with it. It's just, you know, a little scratch on the post. And yet that started that communication. And it was easy for those who knew to look for that, to detect it, and impossible for anyone else to realize that was a significant thing. And it seems to me like what you're describing is at that incredibly difficult to detect. You called it benign in the case of the podcast, you know, it was a meaningless scratch on a post. How's a defender supposed to detect something like that when, you know, the ingestion, the moment that you could understand it, and the execution are separated by a variable amount of time, and they don't seem to have any relationship to one another?

Sailesh Mishra: Yeah, it's a very difficult problem to solve, isn't it? We are now talking about time-shifted attacks, which we haven't honestly, you know, prepared ourselves for in that sense. But the interesting piece is that it can be stopped. It can be stopped because we have the ability today to monitor agent calls. Every single tool call that the agent makes, we can monitor those and find out what's happening as a result of it. Yes, it might still look like it's a reactive measure, but at least you will have observability into every single action that the agent is doing. That will help you a lot. Because it actually gives you an understanding of -- by the way, it only gives you an understanding of what the agent is going to do and whether it's wrong or not if you have scoped it properly and you have defined the identity of that agent by clearly structuring the scope of the work that the agent is supposed to do. If you've basically left it loose, then the agent can actually be free in its own lanes and do whatever it wants. And your runtime guardrails will not be able to detect that malignant behavior. Which also brings me to a very interesting point, right? We are kind of attaching identity to a piece of software now in a way that has never been done before. So this is kind of going to be a mix of, say, identifying a particular agent, understanding why it exists, then scoping the boundary for that particular agent, saying, what is the level of access we have provided this agent? Suppose something goes wrong, what can this agent actually do? Like what kind of damage can it cause? And then we are talking about something like a contextual integrity where we try to understand, is this agent doing the right thing at this point in time? Or is it actually going beyond the design scope and the design identity variables? So when you combine these three, they are -- it's an interesting perspective on how agents really operate and what that security problem really looks like. [ Music ]

David Moulton: Indirect prompt injection is one of the scarier attack patterns in agentic systems. An agent browses the web, ingests a search result, and that result can contain hidden instructions, maybe they're malicious. How realistic is that as an attack path in the wild, and how hard is it to defend against that?

Sailesh Mishra: It is realistic. It's happening right now. We've already seen examples of that happening. In fact, when we are building our red team solutions, we kind of use this as a simulation as well. So we kind of build a malicious website, we kind of build a website that the agent has to access, and we embed some malicious instructions in its, let's say, the DOM or the payload, and we just check if the agent is actually acting on it. So indirect prompt injection is related to the access to external data from Simon Willison's lethal trifecta. The reason why it becomes a major problem is because most models do not correctly classify trusted versus untrusted input. And the moment you start doing that properly, the problem kind of dilutes down a little bit. It becomes a little manageable. But having said that, there are ways to trick the agent's brain, which is the large language model which is making this decision, into confusing whether a web page content is actually trusted input or untrusted input. That's the reason why indirect prompt injection is really sort of a lethal attack vector. But the way we again solve this is the moment you essentially understand what the boundary is and you scope what it has access to. And in a way, while that information is entering into, you know, the peripheral space of the agent and before it actually takes action on that information, you're able to spotlight, identify, and flag if there are malicious instructions hidden in there.

David Moulton: So if I'm a developer building an agentic system today, what's the single most impactful thing I can do to reduce that risk?

Sailesh Mishra: Correctly scope the work that the agent is supposed to do and build observability into every single action that the agent does. Now, an agent is an interesting digital entity, David. It has a probabilistic brain, but the actions it ends up taking is actually more or less deterministic. When you think of it that way, it kind of makes the problem slightly more actionable. It's slightly more easier to manage. Because when an agent is executing, it's executing by writing code, right? By making API calls, by invoking certain servers. And we have dealt with those problems before, right? Why is it doing that is actually the probabilistic aspect of it. After it's done all the processing from the tool calls and the API calls and everything else, the kind of response it generates, it's the probabilistic aspect of it. But it is a mix of probabilistic and deterministic behaviors that kind of, if you look at it from that lens, it kind of becomes a little more manageable. And that's what a developer should look at in the beginning. I've seen SOUL.md files and SKILL.md files and even MCP server definitions, which are like sent an email, right? Or just like calculate the sum of all numbers presented. If you write extremely vague descriptions, this probabilistic brain of an agent can actually start, you know, misusing that. And that's what you need to restrict as much as possible. In fact, it's not about you building a vague SKILL.md file maybe, but when you look at a particular file that you want to access, when you look at a particular MCP server that you want to access, think through about the semantic consistency that's actually, you know, inlaid in that entire server's descriptions. If you find things are really sort of nebulous, avoid that. There are thousands of servers and thousands of skills that will help you do what you do. But please note that the scope definitions are really, really important. Start there, and then we'll get you to the next levels.

David Moulton: So I may be getting this wrong, but you talked about this and it reminded me of a show called "Silicon Valley." And in, gosh, it was season six, I think, where the agent deleted the entire code base, you know, to get rid of, I think it was to get rid of all the bugs or the errors. And as, you know, a command, that's actually correct. It's just not the intent. And it's funny to think that, you know, a show that was kind of poking fun at things, you know, got some of this right a couple of years ago. But I think all of us were, you know, looking at that as way off in the future. And all of a sudden, we're there. I know that you've mapped agentic vulnerabilities against the OWASP top 10 for agentic applications. Walk me through the ones that keep you up at night, like the categories that you think most organizations are genuinely unprepared for.

Sailesh Mishra: Identity is one, because identity abuse is definitely going to lead to a lot of tool misuse and goal manipulation -- vulnerabilities that attackers can exploit. Memory poisoning is another. We don't have a very good understanding of how memory is being structured today. OpenAI might do it differently. Anthropic might do it differently for Claude. If an enterprise is building their own applications and their own agents, they might structure memory in a very different way. So there is, I believe, not a very robust protocol into how to architect memory in that way. There are several different ways how people are doing it today. And that's something which can become a problem in the future once we start depending on memory more and more. The future of autonomous agents is going to have persistent memory. It's bound to happen. Because otherwise they will not be as helpful as we've seen. The kind of OpenClaw, you know, phenomenon that we saw -- and that's what the blog was about -- it's going to happen. Because people got so excited about it, and we've seen the kind of social media mentions, the kind of, you know, GitHub starts that started, you know, raking in. It's definitely going to become the future, and we just have to be prepared for how we understand the different aspects of, you know, how this completely autonomous system is actually operating.

David Moulton: Well, and you've mentioned it, and I should have mentioned at the beginning of the show, we'll have a link to the blog that you put out, "The Moltbook Case and How We Need to Think About Agent Security." We'll have that in the Show Notes for everyone. That's actually what sparked our conversation. Like once I saw that article, that brought us together. Let's go back to that for just a second.

Sailesh Mishra: Sure.

David Moulton: You know, at the time Moltbot or Moltbook came out, OpenClaw, now, what was it that you saw that made you go, I have to write about this this instant? Because it's a fairly in-depth piece of writing, but, you know, thankfully, you've put it together in a way that somebody without the technical depth that you have, like me, can read through it and get excited and maybe even understand, you know, some of the problems that we're facing right now.

Sailesh Mishra: I started by looking at the social media mentions first. That's where it caught my attention. I was kind of blown away by the use cases that a lot of people had actually applied OpenCloud to. But the interesting bit is like we're calling it OpenCloud today. It was like by the time I started the blog and finished the blog, the names were changed three times. So it was CloudBot, Moltbot, and OpenCloud, and then came Moltbook. So that month was busy. The interesting piece there was I never actually, while I was reading through the use cases, I never actually had the confidence to install it in my, you know, governed laptop that we have. Because I was always concerned that if it wipes off anything -- and I'm not great at managing my files and my desktop is always cluttered, because I'm working through multiple different things at the same time -- what if I lose all that information? And I read a few threat models that existed, and all of them went into the super technical aspects of it. And then I thought, like, this angle of persistent memory is what -- it's not being discussed as much as it should be. So why are we not, you know, sort of like talking about this a little bit? And that's when I kind of came up with that concept. The interesting piece is memory is such an interesting subject here. Because by the time Moltbook happened, and by the time I finished the second article on Moltbook, we were already trying to poison the memory of an agent that existed on Moltbook, right? And we actually were able to do it first. Secondly, we were able to propagate that poison memory across agents that talked in a particular forum on the Moltbook social media. So the depth of problems that this can cause is amazing, right? And you have to think about this from a blast radius standpoint. If one agent is able to transfer its poisoned instructions to another agent, and this kind of continues to propagate, the kind of ripples it can have in an enterprise ecosystem is really, really damaging. And that's where the concept of like, how do you even think about the agent security problem in a multi-agent ecosystem came up.

David Moulton: You know, something that I've been grappling with is this idea of humans have implicit trust towards one another. It's what makes societies work. And then we run into computer systems that behave more like humans. You know, if I were to ask you to repeat every answer that you gave me today or describe what you had for lunch today and then in an hour I ask you about lunch again and you said exactly the same thing, it would be weird to me, right? I'd find it unusual that you perfectly repeated your exact terms. And yet that's what we expect of computer systems, right? Very much a deterministic answer. And when they are generative and they give you a different answer, even if it's technically the same information, it's even arranged slightly different, it bugs us, right? Like we're like, we don't know how to trust it because we don't know what to expect, even though that's what, you know, it's moving towards or it's behaving more like humans. And then you talk about the idea of deploying Zero Trust as the most important thing. And I don't know how we get to a point where the humans in the loop can trust the non-deterministic answers or behaviors of systems without Zero Trust, without having that inherently we verified this and we know everything is good. So I don't envy our security professionals that are sitting down and really trying to figure out what that next generation of Zero Trust deployment looks like in a world where we're moving extraordinarily fast and we've got this idea of persistent memory. Which, now that we've had this conversation, seems like it's a much bigger risk than I had ever considered. Because you've got the ability to time shift and, you know, your systems can learn things that they shouldn't and hold on to that. You can poison across agents. You've really set me up here early in Texas with some concerns as we record here. If you're a CISO that's listening and you're about to authorize maybe your first autonomous agent deployment, what's the one conversation you think that they absolutely need to be having before they go live?

Sailesh Mishra: Two questions. Have we assessed what it can do when it's put under pressure? And when I say "pressure," I'm talking about the attacker pressure. And second is, can we monitor everything the agent does? It's not to say that don't deploy agents. We'll never say that. That's never going to be your position because then basically it's all futile. But if you're able to assess the agent's defensibility, or at least like get a good measure of what its actions will be when it's actually put under attack, and if you're able to monitor every single action that the agent does, then you're already almost there, you know, in your journey to secure get agents.

David Moulton: Sailesh, this has been a really clarifying conversation for me. I hope that our listeners are able to go out and read your blog posts. I'll have those linked in the Show Notes. And to really sit down and think about the implications of where we're at in this move, this fast-moving AI agent-driven world, and some of the new risks, I guess -- these feel new. That you're clarifying through these conversations, you know, as written and here on "Threat Vector." And I really appreciate you coming in and talking to me about them, because, you know, I end up learning so much when I interview guests like yourself on topics that I'm not expert on yet and hadn't given as much thought as you have. A lot of learnings for me today on "Threat Vector."

Sailesh Mishra: And this has been fun, David. Thank you so much for having me on the show. But let me just like maybe close on one note. Nobody is an expert in this. We're all learning. Like I said, it's been just two years and we're already moving from, you know, text-based responses to actions and memory. We're essentially talking a lot of things right now which we did not a couple of weeks back maybe. Which is a great thing, you know, that's why it's fun. So thank you so much for having me, David. [ Music ]

David Moulton: Well, that's it for today. If you've liked what you heard, please subscribe wherever you listen and leave us a review on Apple Podcast or Spotify. Your feedback and reviews really do help me understand what you want to hear about. If you want to reach out to me about the show, email me at threatvector @paloaltonetworks.com. I want to thank our executive producer, Michael Heller, our content and production teams, which include Kenny Miller, Joe Benincourt, and Virginia Tran. Mix and original music by Elliott Peltzman. We'll be back next week. Until then, stay secure, stay vigilant. Goodbye for now. [ Music ]

HOST(S):

Meet David Moulton, the voice for Threat Vector, the Palo Alto Networks podcast dedicated to sharing knowledge, know-how, and groundbreaking research to safeguard our digital world.

Moulton, leads Thought Leadership for Palo Alto Networks, draws on a rich background of experience, including roles in design, strategy, marketing, and sales, to connect with experts from across the globe.

Schedule: Biweekly, Thursdays

Credits: Executive Producer is Michael Heller, Show production by Kenne Miller, Joe Bettencourt, Virginia Tran and David Moulton. Editing and audio engineering by Elliott Peltzman.

Creator: Palo Alto Unit 42