A dark side to LLMs.

Transcript

Dave Bittner: Hello, everyone, and welcome to the CyberWire's "Research Saturday." I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down the threats and vulnerabilities. Solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us.

Sahar Abdelnabi: We all have seen videos after ChatGPT and we all have seen how people are starting to get really interested and hacked by the new technology. And that's motivated our colleagues to actually inspire them to think that there might be an issue with this integration.

Dave Bittner: That's Sahar Abdelnabi CISPA Helmholtz Center for Information Security. The research we're discussing today is titled on "A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models."

Sahar Abdelnabi: There might be some new security vulnerabilities that we are really not noticing when we put these large language models in other applications and rely on their output. And also rely on the input that they might digest in real time from other untrusted or unverified sources.

Dave Bittner: Well, let's go through the research together here. Can you take us through exactly how you got your start and let's go through the findings.

Sahar Abdelnabi: Yeah, sure. So currently, the main way people have been interacting with ChatGPT before the plug-ins, and before being chat, and so on. Is that you go to ChatGPT, you enter a question or anything that you would like to ask for, and then ChatGPT answers. That was the main way of communication. There is a clear input. There is a clear output. And with that, there were still some risks because there were some people that could circumvent the filtering and maybe generate some harmful output or malicious output. And there were also some risks that people rely on the information from ChatGPT as trusted or factual, when most cases- in some cases, it's not.

Sahar Abdelnabi: However, it was a clear scenario. There was a clear input and clear output. Now, when we integrate LLMs, or large language models, with other applications. The line between the instructions that are directly given by the users and the other instructions that might be maliciously injected somewhere else can get really blurry.

Sahar Abdelnabi: So I might ask Bing Chat, for example, a question, and to answer my question, Bing Chat can go and search online for some information,

or some sources, or websites, or whatever. However, someone out there might plant some hidden instructions for Bing Chat and these instructions will be digested by the model and will affect how the model can communicate with me later on. So there is some hidden layer of communication of instructions that me, as a user, might have not been aware of, and therefore, there is a clear violation of security boundary. You know, it could happen and could open up a lot of new attack actors.

Dave Bittner: Hm. Well, explain to me how you all went about testing this.

Sahar Abdelnabi: Yeah, so when we actually tested these attacks, at the time, Bing Chat was not yet available, at least for us here in Germany. I'm not sure if it was released earlier maybe in other countries like the case with art [assumed spelling] nowadays. But when we actually wrote the paper, we didn't have ChatGPT APIs. We didn't have Bing Chat and we really had limited sort of not state-of-the-art models. So what we did- actually, that ironically that was only less than two months ago. It's nearly one month over, like five or six weeks. We had access to the latest GPT-3 model, which is the Davinci model, and we simulated the tools, like the plug-ins, actually that we all are seeing now.

Sahar Abdelnabi: So we simulated plug-ins or tools like a personal assistant that can read your e-mails and maybe draft or sends e-mails, which again, we now see- oh, I'm sorry, integrated into applications like e-mails. We simulated also a tool that when you ask it a question, it go to Wikipedia, and maybe find some relevant Wikipedia articles, and read them, and answer your question, and so on. And then, because we didn't have really access to the current tools that are actually available nowadays.

Sahar Abdelnabi: So we did this, like it had- we had some instructions in the input to the model, like in the Wikipedia article, for example, that the model would be reading during the search. Or in the e-mail that the model would be receiving as a personal assistant agent, or so on. And the instructions are hidden or embedded in these input to the model to simulate the case when the applic- when the LLM is integrated in other applications.

Sahar Abdelnabi: And then the user is asking the chat or the simulated chat bot that we have been also using Davinci for we experimented, for some reason, with Albert Einstein. So the user might ask the chat bot about information about Albert Einstein and then the chat bot will go and read the Wikipedia page, which we prepared for it. We simulated that it's the Wikipedia page, but we had some instructions in there. And then you, unexpectedly, you might find the model speaking in a pirate accent because we told it to do so. Or you might find the model asking you for personal information because, again, we told it to do so in the Wikipedia page that we have prepared.

Sahar Abdelnabi: Later, now we have access to ChatGPT APIs and we also have

access to Bing Chat, and we duplicated a lot of these attacks with Bing Chat, as well. So in that case, we created a local HTML file, for example, that contains these hidden instructions. And you might have seen that Bing Chat or the Edge browser has this sidebar feature. So there is a sidebar feature.

Sahar Abdelnabi: You can like if you are browsing a certain website, you might open the sidebar, and then open Bing Chat in the sidebar, and start to speak to it, like tell it, for example, "Summarize the current website for me." And in that case, it sees the context of the current page or the current website that you are actually using or reading. And any instructions hidden in this page that might be hidden by any attacker would actually affect the model.

Sahar Abdelnabi: So now, after the paper has released with like six weeks, we can say that we can also duplicate most of these attacks very, very effectively and even much more successful than we have imagined using the initial Davinci model using the recent GPT-4 that is integrated into Bing Chat.

Dave Bittner: This reminds me of- I suppose in some ways, it reminds me of, you know, people using things like search engine optimization to try to rise to the top of Google results. But then also, we hear of people doing kind of SEO poisoning, you know, where they want malicious things to rise to the top. I mean, this strikes me as being along the same lines as that sort of thing. Is that an accurate perception on my part?

Sahar Abdelnabi: That's actually very, very accurate observation because that's also one of the things that we image how these attacks might be disseminated, right? So some people might use strategies like exactly SEO poisoning in order to get their website retrieved by search engines. And if they are retrieved, then the LLM running the search engine would also be poisoned or ingested by these prompts that are hidden in their websites.

Dave Bittner: So what do you recommend, then? I mean, you've demonstrated this capability. Do you have any suggestions for how we might go forward?

Sahar Abdelnabi: For whom, exactly? So for users or for --

Dave Bittner: Well, let's do them one at a time. Why don't we start with the users?

Sahar Abdelnabi: I think at the moment, at least my recommendation would be to really be sure to not use the models if you need 100% reliable and factual output. Yeah, you can ask Bing Chat, "Tell me some recipes for today," which
is fine because there is really no huge consequences that would come out of that question. But if you really want to look for very reliable answers, I wouldn't recommend to use LLMs for this.

Sahar Abdelnabi: And I would definitely recommend to verify not only if the output is factual because this is a huge part of the whole thing, but also to verify the links that maybe Bing Chat might suggest to you. Because so, for example, as part of the answer, Bing Chat could tell you, "Find more information here," or whatever. But these links might be malicious because the prompts might actually tell the model, instruct the model to suggest, for example, harmful URLs.

Dave Bittner: What about for developers, for folks who are out there and are eager to use these APIs? Are there warnings for them, as well?

Sahar Abdelnabi: I would say yes. At the moment, it's really not so clear what the consequences of these models are. And I think there is a lot of harm that could be done by the current race of really the whole community to integrate LLMs in everything. And I think we really need to stop and ask ourselves if we are ready for the whole safety considerations, at the moment, a lot.

Dave Bittner: Are there any things that you and your colleagues are going to work on next? Has this work led you to more interesting or additional interesting avenues to explore?

Sahar Abdelnabi: Of course. I mean, as I said, this has been done only less than six weeks ago and we actually came up with this whole paper in only one week. So we wrote in a whole- in one week, and we did all experiments in just one week, and it was crazy. It was the fastest thing I ever seen come together, actually. And since then, we really- we were not able to catch a break, honestly, because there are- every day, there are new models released out there. There are new opportunities for attacks.

Sahar Abdelnabi: And honestly, things that we- when we wrote the paper, we thought they are a bit futuristic, like models that can read your e-mails, send automatic e-mails. And these e-mails are [inaudible 00:12:10] poisoned and all this. It kind of seemed like very futuristic things and somehow a bit of sci-fi. But now, we have all of these things. It's- I thought that this would be a little bit longer- a longer way when we have all these models, but they are actually ready, at the moment. And yes, and then we have been working on actually testing the whole ideas on the models that are more recent, such as Bing, and ChatGPT, GPT-4, and so on. It's actually surprisingly the attacks work so much better when we have better models.

Our thanks to Sahar Abdelnabi from CISPA Helmholtz Center for Information Security. The research is titled, "A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models." We'll have a link in the show notes.

The CyberWire "Research Saturday" podcast is a production of N2K Networks proudly produced in Maryland out of the startup studios of DataTribe. Where they're co-building the next generation of cybersecurity teams and technologies. This episode was produced by Liz Irvin, and senior producer, Jennifer Eiben. Our mixer is Elliott Peltzman. Our executive editor is Peter Kilpe and I'm Dave Bittner. Thanks for listening.

HOST(S):

Dave Bittner is a security podcast host and one of the founders at CyberWire. He's a creator, producer, videographer, actor, experimenter, and entrepreneur. He's had a long career in the worlds of television, journalism and media production, and is one of the pioneers of non-linear editing and digital storytelling.

Schedule: Saturdays

Creator: CyberWire, Inc.