The BlueHat Podcast 8.7.24
Ep 34 | 8.7.24

Navigating AI Safety and Security Challenges with Yonatan Zunger

Transcript

Nic Fillingham: Since 2005, BlueHat has been where the security research community and Microsoft come together as peers.

Wendy Zenone: To debate and discuss, share and challenge, celebrate and learn.

Nic Fillingham: On The BlueHat Podcast, join me, Nic Fillingham.

Wendy Zenone: And me, Wendy Zenone, for conversations with researchers, responders, and industry leaders, both inside and outside of Microsoft.

Nic Fillingham: Working to secure the planet's technology and create a safer world for all.

Wendy Zenone: And now, on with The BlueHat Podcast. [ Music ] Welcome to The BlueHat Podcast. Today we have Yonatan Zunger, and we are thrilled to have you here. Yonatan, would you introduce yourself? Tell us who you are, what do you do?

Yonatan Zunger: Well, hi. Thank you so much for having me on the show. So my name is Yonatan Zunger. I am currently CVP of AI Safety & Security at Microsoft, as well as Deputy CSO for AI. And you know, my job is to try to think of all of the things that could possibly go wrong involving AI, and figure out how we're going to try to prevent them from happening. I think that's sort of the short version of it. You know, I came to this from a career originally as a theoretical physicist. Went over, moved over into CS sort of full-time back in the early zeroes, where I started out building heavy infra. You know, I built a lot of the core part of search at Google, a lot of planet-scale storage, things like that. And then in 2011, I became CTO of Social. This was just at the time that Googe Plus was about to launch. This was also the time that GDPR was being drafted, and you know, within three weeks of taking that job, it suddenly became very clear that the hard part of this job wasn't going to be software infrastructure, it was going to be people's safety. It was security, privacy, abuse, harassment, policy, all of these things. And I discovered that I genuinely loved that. I fell - I fell in love with the field of trying to really solve these problems, and that's been, I would say, one of my biggest foci professionally ever since. And so now I'm really excited, I'm getting to work on one of the craziest, hardest problems, like even by the standards of a pretty strange career, one of the strangest and hardest things I've ever worked on, and yeah, that's what I'm doing now.

Wendy Zenone: I love it. I love all the nuances and the - the human side of what you're doing. If you could let the audience know, for some that are still learning about the AI field, what is generative AI?

Yonatan Zunger: Well, yeah. That's a really good question, because you know, we've had AI of various sorts for a very long time, and generative AI has also existed for a long time, but it only became a really big deal in AI a little more than, like, a year and a half ago or so. And so, the way to think about it, there's sort of the traditional kind of AI. We're referring to it nowadays as predictive AI. What - what does your world look like in the world of this traditional AI? Typically, if I want to use a model for something, I'm going to build a model. Right? So the model user and the model builder is the same person, and you know, you take a bunch of examples, you train a model. What are these models generally good at? They're good at looking at a really large field of data and making a prediction, or a classification, or a recommendation, or something like that. You know, they're good at looking at these very, very large spaces and analyzing, and of course the problems you're dealing with now, because you're both model builder and model user, is you now really have to worry about, is my model biased? Did I pick the right training data? Does this thing have really weird nuanced failure modes? And then you have to think about all of the safety aspects of your integrated system, right? Am I using it wisely, et cetera, et cetera. Generative AI is a bit of a different world. You know, at the very deep technical level, you know, it's the same basic approach as, you know, we have for neural networks and all of these structures, but in practice it's often better to just think about it as a completely different technology for practical basis. Right, the idea of generative AI, sort of -- you know, at the very technical level, of course, what you're doing is you're predicting character sequences, token sequences, images, things like that. In practice, the way I would think about it is, you've got a model, and first thing to realize, in most cases you've got a generic model. Right? It's - it's a model where one person trains it, and you're going to use the same model for a huge range of applications. So the model trainer and the model user are now two completely different people. And what are these things good at? Basically there are two things that generative AI is good at. One of them is, it's good at summarizing or analyzing a piece of human type content, so natural language, or an image, or something like that. Right? It's very good at saying, like, you know, here's a paragraph of text, give me a summary. Extract the key ideas. Something like that. And the other thing it's really good at is roleplaying a character. And this is, like, the foundation of most of what we do with generative AI is basically a lot of creative use in roleplaying. Right, so you tell it, you are a customer service agent for, you know, Wombat Co., and you've been asked - you're about to be asked a question by a customer, and you know how to, like, search through the following databases of information, et cetera. Or you say, you're a programmer, you're a Python programmer, and you've been asked for your advice on this piece of code, and like, you need to write a function to do something. You're a security expert, and you need to help analyze this forensic, this set of forensic logs, something like that. So this sort of creative use of roleplaying is one of those fundamental engines to it. So I guess the way I say it is, what is generative AI really? At the innermost loop, it is a combination analysis and roleplaying action, which you can then build up to build all sorts of cool things out of.

Nic Fillingham: Yonatan, this might be too large a question, but what I wanted to ask was, it almost sounded like you described the entire breadth of AI. We - we were talking about just generative AI, but what - so what's beyond that, in terms of, so you talked about roleplaying, and you talked about sort of the ability to synthesize or summarize data. I'm obviously paraphrasing heavily. What else does AI do that's not generative AI? Again, probably a very large question, but how do we sort of think about these different roles and functions that AI can take?

Yonatan Zunger: Well, that's the predictive AI I was talking about. The generative AI.

Nic Fillingham: Okay.

Yonatan Zunger: Does the analysis and the roleplaying. The predictive AI is the stuff that does the classification, and the recommendation, and the analysis of that sort. I can actually give a really good analogy from the human brain. If you think about how the vision system works. Right, so the human vision system is a stack where the very top - the very first input to the stack is the retinal neurons. Right, so you have the direct things that are measuring brightness, or color, or something like that. And you have several stages in the stack, which then go from pixels, quote on quote pixels, to very small curves, to large curves and shapes, to a two-dimensional shape recognition, to three-dimensional object recommendation, and so on. Now, this is very similar to how predictive AI works. Right, you have a model that is just scanning a tremendously large range of things and pulling a small number of features out of it. Once you've pulled out those first things though, your next layers of the stack is saying, oh, I see a three-dimensional shape. Wait a moment, I recognize that shape. That's Wendy's face. It's, then I think it's like starting to go off and, like, identify what's happening around you, the sort of things you can articulate in words, like holy crap, there's a tiger and it's about to jump on me. That kind of higher-level processing is, in a lot of ways, more similar to what generative AI is doing. It's - it's narrative. It's literally - it's - you can think of it most usefully as processing that happens either directly in words at the highest level, or in sort of almost word-like concepts at the level below it. And the reason I bring up this analogy is because it highlights the way in which the two kinds of AI actually complement each other. Right, these are not replacements at all. What happens is that this predictive AI, you can really think of it as AI that specializes in looking at very large fields of data, and where the model tends to be very specific to the problem being solved, right? So, you know, the vision centers of your brain, if you tried to plug them into your ears, they would not work correctly. That's not what they're there for. Whereas these higher-level abstractions, the generative AI is really good for dealing with the higher-level abstractions of the things that you can narrativize, the things that you can turn into words. So in a - in a good, healthy environment, what you're generally doing is that you're using these predictive AIs to scan very large fields of data and reduce, you know, a mass of pixels into a statement of, oh, here's a picture of somebody's face. And then you're taking that information and you're adding it to the generative layer, to the narrative layer, the one that speaks in words and so on, and it then starts to assemble these, reason about them, talk about them, have these very, you know, generic kinds of conversation about them. So that's sort of the difference. That's the procedure.

Nic Fillingham: That is - that is a wonderful analogy. Thank you so much. I do want a quick pause. As an Australian, I noticed that you used wombat as your - as your example. Why - why wombats? Is there a story there?

Yonatan Zunger: Why not wombats, [laughter]?

Nic Fillingham: I love it. Why not wombats? That - we can put that on a sticker.

Wendy Zenone: [Laughter].

Yonatan Zunger: [Laughter].

Nic Fillingham: So, I want to sort of, like, help people sort of still continue to wrap their head around generative versus predictive and other forms of AI. Can you give us some examples of sort of positive uses of this, or - or to juxtapose against negative use cases here? How do we sort of think about the good and the bad, and I'm using air quotes, of this technology?

Yonatan Zunger: Well, you know, the good and the bad is very much in how you use it, right? So I mean, what - what are some examples of good uses? I mean, there - there are so many of them, honestly. You know, I'll just give some random ones that pop into my head. Dynamic temperature control for factories and data centers. I remember that - that was an example that came up, you know, a decade or more ago, but it turns out that you could have a system that stares at all of the temperature sensors across the building and controls whether to open windows or not, and how to run fans, and so on, and you can make a building spectacularly more energy-efficient by doing that. Self-driving cars, when they're not designed by maniacs, this is a technology that can save an awful lot of lives. I mean, well, social media's always kind of a complicated mixed bag, but if you think about this idea of helping people meet other people. Right, the - the actual driving purpose of this, and I think we often forget, given how many problems have emerged in social media, we tend to forget just how much good this has actually done in people's lives. You know, how many - how many people have, like, formed and maintained, like, their friendships, their jobs, their entire professions, sometimes. Their ro - romantic relationships. There's so many things that people have formed through this, and if you think about this, this is really about, a lot of this is the use of algorithms to try to help you find, who are the people you might want to actually be with? Who are the might - the people you might want to be around? With generative AI, you know, it's still a very new technology, so I think we haven't yet seen the killer app of generative AI, I think. You know, we're in a stage where, I think the single most important piece of office software of the 2030s, the category has not been invented yet. We're really at that new stage. The - the thing that is going to be the equivalent of what the spreadsheet was for the personal computer, or what direct messaging was for mobile phones, it doesn't even - we haven't even invented it. We don't even know what that thing is yet. Right now we're sort of seeing these very early examples with generative AI. I think we're finding that it's really good as an interlocutor and brainstorming partner, for example. I think there's a lot of very interesting potential there. It's also something that you would combine with a lot of more traditional techniques. Like for example, one of the classical challenges, things that you really can't do with AI today, is that you can't, sorry, with predictive AI, is it's not really good at understanding language. You know, understanding natural language is actually a very, very difficult problems. It's what we used to call an AI complete problem. It turns out, even - even pronoun resolution, that is, knowing what a pronoun refers to in a sentence, is AI complete, in a sense that it actually requires a full model of the world and a theory of mind in order to do. There's sort of the classic example. I think this example might be due to Steven Pinker, I can't remember for sure, but here - here's a sample dialogue for you. Woman, I am leading you. Man, who is he? Now, I'll bet you probably had no trouble understanding those two sentences and that dialogue, and you could probably tell me exactly who the he in that second sentence refers to.

Wendy Zenone: Yeah. Uh huh.

Yonatan Zunger: Now, explain to me who that he was without a complete theory of mind of both of the people and of how what the people are thinking that the other people are thinking, and so on. Of two characters that I've literally identified as just man and woman, already you have to solve that complicated a problem. So one of the traditional challenges that we've had in all sorts of data science is that understanding human language is really, really, really hard, and what's - one of the genuinely stunning things about the recent revolution in generative AI, the one that's happened in the past year and half or so, has been that we finally have software that's capable of just looking at a piece of human text and actually understanding it, and extracting information from it. So if I ask it to resolve that and to then transform that into a structured form, I can. Which means that I can potentially apply this sort of analysis at scale to large amounts of human data, and interact with human information in entirely novel ways, and of course interact with people directly. So I think there's tremendous possibility for some really wonderful things to happen here. Another suggestion that I've heard a lot talked about is, personalized education as a service. Imagine a situation, I want to learn about X, and this thing will help put together a syllabus, it will do, like, all of the research it needs in order to find all of the right information, figure out how best to teach it, and then it can teach me interactively, right? Because it's not just going to create a PDF or PowerPoint presentation, it can actually go back and forth, and work with me, and teach me all of the things I need to know. Imagine what this could do for the world, give everyone access to a teacher. So there's tremendous possibility for good. And of course, there's tremendous possibility for bad, because there is no single technology humans can come up with that can't be horrifyingly misused.

Wendy Zenone: [Laughter].

Yonatan Zunger: Just to take a simple example, we were just talking about education. That's entirely wonderful until what the person wants to learn is how to weaponize anthrax, or how to kill people, how to encourage a genocide. I mean, there's so many things that people might want to learn that are really horrible. And, you know, the point where we start to really run into deep nuances, and we sort of have to ask ourselves, well, these problems already exist in the world. How do we prevent them? Right, there are people in this world who do know how to weaponize anthrax, but I can promise you that if you went to one of them and asked them, hey, would you teach me how to do that? They would say no. They would probably not do that.

Wendy Zenone: [Laughter].

Yonatan Zunger: But their judgment about when and how they're going to do that is an interesting nuance. It's something that we need to figure out a way to capture, and express, and formalize. There's a lot of other very simple ways you can - you can misuse AI in basically any way you can imagine misusing any technology. One of my - favorite example might be the wrong for this, but one of my classic examples of a misuse is, there has been a whole business of using artificial intelligence to help make sentencing recommendations in criminal law. And ProPublica had an expose of this back in 2016, which I think is really worth reading. This works as badly as you might imagine. So for example, if you look at these companies, they - they were very careful not to take race as an input signal. Right, because that would be horrible. You shouldn't use that. But they did take income, and sort of your quantized address, like the neighborhood you lived in, and if you know anything at all about how American politics works, your income and the neighborhood you live in is a really good proxy for race in most of this country. And then you end up with sort of a proxy signal problem. The theory behind it was, well, we're going to predict who is most likely to commit another crime, and we will - we will recommend harsher sentences for people more likely to commit further crimes. There are a couple of obvious problems with this one. First of all, the sentence you give someone does affect their probability of committing a crime in the future, right? If you make it - make sure that someone can't get any sort of noncriminal job in the future, they're probably going to be criminals. But the even deeper one, the really - the big problem that really kills this is, when someone commits a crime, it's not like a giant lightbulb goes off over their head saying attention, this person has committed a crime. You can't actually measure the variable you care about, so they picked a proxy variable. They measured whether someone was charged with a crime. And the thing is, the difference between committing a crime, and being arrested for a crime, and being charged for a crime, that is not a uniform translation matrix. If you ask the question, who is more likely to be charged with a crime? The answer is, the black person. You know, they - what they basically built was a system to predict race. They picked a system that was modeled exactly to capture, measure the nature of institutional racism in the United States, and then implement that as sentencing guidelines. And this is a system that proceeded to go off and destroy a bunch of lives. I think this is a really good example of how not to use AI, of like the really, really dangerous, ill-conceived decisions. And in this one, you know, the obvious thing that people didn't think about was the basic question of, what happens when this thing make a mistake? Like, this sort of like the basic question you need to ask with any piece of software that you're building and any machine you're building. What happens if something goes wrong? And in this case, something very bad happened, especially because they set it up in a way where basically its recommendations were almost automatically accepted. So you have to really architect your systems around the possibility of failure, and especially for things like AI, where the system is inherently nondeterministic, where all of its categorizations or predictions or outputs are always going to be probabilistic. You have to be very, very careful and make sure that your system is robust. Your system, the integrated system including all of the people who are using it, and the people who will interact with it, are robust against the system being wrong.

Nic Fillingham: Wow. Gosh, so many directions we could go with you.

Wendy Zenone: Yeah, [laughter].

Nic Fillingham: My first question is, jumping off from that example. It's too simplistic to ask, where did they go wrong? I - I think I want to ask more about sort of ethics. When you design a system that has these - potentials for these kind of significant outcomes, do you hardcode in a bunch of ethical rules, or do you give the system the ability to monitor the outcomes, to then sort of adjust those sort of ethical guidelines that it's functioning on? Or do you still need, in 2024, human beings with their own ethical guide to be able to monitor and control it, or something else. How - how do ethics, as a - as a guiding force, but then also a component in an AI system, play into this?

Yonatan Zunger: Yeah, that's a really wonderful question, and I think there is no single answer to it. The - I think the correct answer to that question is very much dependent on the exact system that you're building. The way I would frame the approach to this, you know, this is, I think one of just the most basic lessons that I always try to teach people. There are two parts to engineering. Product engineering is the study of how your systems will work, and safety engineering is the study of how your systems will fail. You can't do just one or the other. And I think one of the great curses of modern computer science, the way that the field is working, which we desperately, actively, urgently need to fix, is that these are treated as two separate things, rather than part of the same discipline. You know, if you go talk to civil engineers, you will, you will see a very different story. Civil engineers are safety engineers who occasionally build bridges. It's a - it's a very different culture, and I think a much healthier one. What does a safety engineering culture mean when you are working with AI or something? Well, in fact it - it turns out AI and social media, I think, are very similar. Also search, gaming, like, all of - any software that really intimately involves humans and AI, you - you get very similar problems, and you need a similar approach. So what do you do? Very, very first thing, from the moment you even start to conceive of, hey, I've got a crazy idea. What if we did, dot dot dot. At the same time that you're thinking about what it could do, you're also thinking about, what might go wrong in this situation? And I have a whole set of things that I try to teach people about how to think of ways that things that can go wrong, and we're actually working with my team right now on writing up, like, training materials and creating things to help people learn how to do this. But the very, very first thing you're doing, is you're coming up with a list of things that could fail, a list of ways in which this thing could go badly. There's sort of your basic approach to this, by the way, that's a - a three-pass approach. First you go system-first. You look at each component of the system and ask, what happens if this thing fails? That might mean, what happened if it make - if it makes an error? What happens if it gets a malformed input? What happens if it gets an actively malicious input? What happens if it gets an unexpected input? Just all of the ways in which some component could go wrong. Your second way of looking at it is attacker-first. What if someone is trying to misuse this system? What if someone is trying to use the system, let's not even say maliciously, but for a purpose other than the one you intended? What might they be trying to accomplish? How might they use your system in order to accomplish that? And your third pass is the target first pass. That's where you're looking at it as, who are the people who might be affected by this system? What aspects of their lives might cause them to be affected by this system differently from other people? And what are - are there particular vulnerabilities in their lives that might be around? And so, you know, one of the things we're working on here, also, is checklists of sort of ideas to help people think of different possibilities here. I'll also say, this is the place where I always say that, this is the place where diversity, equity, and inclusion makes such a big difference in your ability to actually correctly do your job. Because the one thing I can promise you is that you cannot think of what every possible attacker or affected person might be experiencing in their lives. They are very different from you, because you are one person. Just, there are a lot of people in the world, and they're very different from each other. They have very different lived experiences. And having a broad team, a team with a really wide range of lived experiences, and a team that's empowered to speak up about those things, is critical to actually being able to do this analysis correctly. So your very first step, the very beginning of all of this, is you think about what could go wrong. Now, you've done that. You've got your list of threats. You bounce this off, like, a bunch of people. By the way, this is not a one-off process. This is the process you're going to be continually doing every single day of your life, from the day you first conceive of the project, until the day it gets shut down for the last time. You're thinking about what can go wrong, and then you're thinking, well, okay, for each one of these things, I need to have a plan. And your plan might involve mitigation, like preventing it from going wrong or making it less serious, and there's always going to be some aspect of it that you can't mitigate, right? Like, there are - there are problems in this world where like, I look at this and say, oh, I'm going to change the design of my system so that this thing is impossible. That's wonderful. When you can do that, that's your best choice. And by the way, this is also why it's very important to do this sort of analysis from Day 1, because often you can make a small change in the design of your product, just in the basic shape of it, that eliminates whole swaths of potential problems, while leaving the core product function that you care about intact. And that's often a really easy thing to do in your early design phase, and is almost impossible to do after you've built your entire system. So don't wait for that. I - I have seen projects, like, get like, you know, two weeks from launch, and then someone points out a basic problem with this, and surprise, you have to go all the way back to architecture. See you again in six months. Don't do that. That's awful. It's a terrible experience for everyone.

Wendy Zenone: [Laughter].

Yonatan Zunger: So, how do you - how do you sort of do this next step? Sorry, I'm - I'm going off into the complete spiel of, like, how you do safety engineering but.

Wendy Zenone: Love it. Do it.

Nic Fillingham: Please, please, this is wonderful.

Wendy Zenone: [Laughter], yeah.

Yonatan Zunger: So what do you do next? The next thing you do, is for each one of these threat scenarios, you walk through the way that the threat scenario actually happens. You know, you walk through the exact, what are the sequence of events that have to happen for this to go wrong? And the reason you do that is you start to highlight possible intervention points. Where are - where are things that you could do that would prevent that step from happening or would change the outcome of that step? Once you've done that for each of your threat scenarios, then you compare those intervention points across all the threat scenarios, and then what you'll often discover is there's a few intervention points that actually help you with a lot of different threats. And that's the point where we can start thinking about mitigations. How might you change your system, harden it, make it more robust to make those things less likely? You keep sort of doing this in a loop until you now have a hardened system, but - and every stage of this you've got sort of a residual threat. Right? You have events that could punch through all of those defenses and still happen. And so your last stage is always, when this happens, not if this happens, when this happens, what are you going to do about it? Right, and that's the, how will you know that something has happened? How will your respond to it? You know, for example, with a lot of user-facing software, this is the point where you start really thinking a lot about the user experience, by the way. You cannot treat UX as being distinct from any other aspect of your system. One example of this, well let's talk about abuse on social media. Right? So, turns out there's a lot of harassment and abuse on social media. That is, like, one of the primary things people use social media for.

Wendy Zenone: Sadly, [laughter].

Yonatan Zunger: Now, you are, so you've got various things that try to prevent it, but that's going to get through. It's going to happen a lot. So now you say, okay, let's say I'm running a social network and someone can create posts, and they can get comments on it, and the comments can be really terrible in various ways. Now, the objective of the user when they encounter one of these comments, like, they are going to be very upset right now. First thing, they want to get rid of this thing, and they want to make sure that this person goes away and never comes back. That is their objective. Now, you've actually got a bit of attention now, because the goal of the system operator is not just to get that detection, but to get enough information to figure out, like, did this violate policies? Does this - is this thing a signal of a broader problem? Right? Is this thing - is the user who made this comment a serious problem that we need to be kicking off the service? Or conversely, is this, like, something entirely personal between these two people that have nothing to do with this? Also, I mean, you know, abuse reporting is often actually done as an attack vector. People will mass abuse flag people they don't like, not because those people are being abusive, but just as a way to try to get them kicked off the service. So in fact, false reporting is a big issue. So the system operators really want to collect as much information and context as they possibly can about an abuse incident, so they can make a good decision. But these two goals are in tension, right? So, because - in fact, one really useful way to think about this is you think about emotional activation curves. They tend to spike very rapidly. You're looking at a timescale of between 500 and 2000 milliseconds, typically, to see an emotional activation curve rise. They decay. They - people calm much, much, much more slowly. That is a timescale of, typically for a small escalation, like, the - the - the [overtalking].

Nic Fillingham: Months, [laughter].

Yonatan Zunger: You know, minutes, actually. Minutes. Not months.

Nic Fillingham: Out of bounds.

Yonatan Zunger: But - but, if the event keeps happening, you keep moving up. Right? Imagine sort of a curve where you - you can either, every time a bad incident you add an exponentially rising curve, and whenever any - anything isn't happening, it then decays with a very long time constant. So you can keep going up, up, up, up, up. So your user has seen this, like, upsetting comment. They get their first spike. If there is a big red button they can hit to make that thing go away, then they can go right back into decay mode almost instantly. If there isn't, then for every - every time they look back at that comment, you're going to be getting another spike, and it's just going to keep going up, up, up, up, up. So it's actually really important that the user needs to be able to dismiss that thing on a timescale of seconds. What you really want is the time from them experiencing it to the time they're done with our problem to be 5 seconds or less, I would say, is sort of a good, rough rule of thumb. That means that the report abuse button, if report abuse makes you go through this whole, like, abuse reporting flow, where you have to now declare which category of abuse is it, and et cetera, et cetera, et cetera, you're getting good data for your team, but you're actually not achieving the core user need of getting into a safe state quickly. So the correct design of this kind of system becomes very, very subtle and nuanced. And this is actually sort of the core, now going back to where we started. You have a threat scenario of, you know, users experiencing abuse on the platform. You need to think of intervention, detection, response in a way that solves the user's problem first. And then separately, the question of how do you now get the signals that let you do a more detailed analysis? Right, because now you - what you've seen is okay, this user flagged this comment as being problematic. Most likely that's all of the information you've got. If you now want to look for broader patterns, you now have to actually think like a data scientist. You have to think, how - how do I analyze the situation to figure out, is there a larger pattern I need to care about? And then there's all sorts of things you can do in order to do this. So for example, here's one simple rule, let's say that you have one user that there's a real pattern that every time they interact with someone that they don't have a preexisting relationship with, the probability that that person is going to report their comment as abusive is unusually high. That's a really good sign that you are dealing with an asshole.

Wendy Zenone: [Laughter].

Yonatan Zunger: That's a real.

Nic Fillingham: [Laughter].

Wendy Zenone: [Laugther].

Yonatan Zunger: [Laughter], and - you know, there's - there's a lot of things like this. Actually one of the most important rules in abuse detection is, sometimes, you know, you're - you're looking at reports of things, and let's say that you're dealing with, well, if I'm dealing with comments on someone else's post, the -- the post owner should just have the right to remove anything they want to remove, period. But if I'm looking at posts, like sort of top-level posts, or things in the general forum, the criteria for removal is probably going to be a product-level criterion. Now, one thing that we have learned is that bad actors in social media are really, really good at figuring out the exact line of what they can get away with, and skirting really, really close to it. So they will always figure out some way to be just working around the rules so that each individual post never quite gets removed. A really important rule when you're doing abuse detection is if you see - if you don't remove something, nonetheless log that this thing was, like, close to the edge. Because one pattern you will notice is that hey, none of this user's stuff got removed, but wow do they have a lot of stuff close to the edge, and that is one of your red flags for an account. That account, you kick off the system. Right? So it's - it's this kind of thinking. And so now, okay, let's - let's go from specific back to the general. How did you - how do you approach this? What you were doing is, you had a threat scenario. In fact, you have quite a few threat scenarios tied to each other. You have various intervention points. You have intervention at the point that someone is making a comment. When they're seeing a comment, who do you introduce to each other, who do you, like, what do you let them see. You have a whole response pattern and so on. You keep adjusting this thing to try to reduce the level of threat, until overall, you look at your overall, here's my plan for the thing, and you decide, okay, this plan is reasonable. I think overall this thing is safe to launch. And you know, you're - you're doing some really interesting tradeoffs here because, for example, like, if you have to have humans reviewing your abuse queues, which you absolutely have to have, because computers are not yet at the state where they can do this automatically. In fact, humans are rarely at the stage where they can do this automatically. There's a whole side conversation there. That means you're, okay, so you actually have to do this expensive thing. You need people monitoring the system continuously and maintaining it, and how many people you need, well, that increases your cost, and if you have a failure that requires human intervention happening an awful lot, you've got a big problem. Maybe you need to rearchitect your system to make that failure happen less often. That's actually how you go about making the engineering tradeoff of how much do I need to mitigate this particular threat? What you're saying is, here's the residual cost of actually managing all of the failure modes after I go through this. Is that cost reasonable? If it's not, better go back to - better keep tweaking. And if you look at this whole integrated story I'm telling you, this is actually best understood as an alternative to the traditional risk management idea of likelihood and impact. Right? So you know, if you - if you started out in the world of risk management, you're used to taking each risk, each threat scenario, it's the same kind of thing here, assigning it a likelihood and a severity. And typically, sort of the product of those two is how important the risk is, and you go from there. And that's actually a terrible way to approach this if you're an engineer. Because that multiplication is really - it - it what it's designed for is for insurers. It's designed for someone who needs to manage a large portfolio of risks and sort of manage an overall risk budget. It's great for that. If you're trying to manage specific risks, it is terrible, because you're dealing with either things with very high - in fact, you can't even say likelihood, you have to talk about frequency. Right? A friend and colleague of mine, Andy Stow, he - he put it really nicely when we were at Google. He said, if something happens to 1 in a million people once a year, here at Google we'd call that 6 times a day.

Wendy Zenone: [Laughter].

Yonatan Zunger: And we choose our - and he - he had done the math on - on that one for a particular service. And so what this means is, you can't even be talking about rare likelihoods. In this case, you're talking about things that are happening continuously, and they're a continuous cost. Or alternatively, you've got failure modes that are incredibly rare, and whose impact is really, really high. And the product of a very small number and a very large number is not a medium number. It is statistical noise. There is no way to plan for any of this. If you actually try to design your safety plan by doing likelihood and impact, you will just end up in a complete madhouse. This is why I call it the when, not if method. Don't ask if this going to happen, it's gonna happen. For each one of these threats, what's you're plan? That's the question you want to know. So going all the way back to your original question, how do you deal with ethics and artificial intelligence? I think the real answer is, you deal with it by looking at what are the threats in your system, what are the things that can go wrong, and having a plan for each of them. And the nature of the correct plan, whether that's putting explicit rules in your system, or having humans checking various things, and so on, that's always very specific to the system you're building. You really need a solution that is designed and customized to your problem space, and you need to continually be observing, monitoring what's happening in product, updating your model of the threats, updating your plan for response, so that you're actually dealing with the things that matter. Sorry, that is my entirely not short answer.

Wendy Zenone: It was a great answer. I'm looking at your sign behind you. It looks like Smokey the Bear, but it's not Smokey the Bear. It is a.

Yonatan Zunger: It is Roki the Racoon. Roki the AI Safety Racoon. Only you can prevent AI apocalypses.

Wendy Zenone: [Laughter].

Yonatan Zunger: This is the log of our AI Red Team, which I absolutely [inaudible 00:34:24].

Wendy Zenone: I love that. And - and that kind of ties into my next question. It's like, you know, fight fire with fire. Are - you see all these products. Every product you're using, it's like, hey, you know, tap into AI. AI can help you. We can help you write this. We can help you do this. But are - on the back end, are we using - we as in, you know, humanity, using the AI to help secure AI? Is that, like, you know, fight fire with fire kind of thing, or protect with protection, you know, of the AI?

Yonatan Zunger: We are. We are, but we have just barely begun to scratch the surface of possibility here. There are many different ways in which we do this. Let me start with one of the simplest. When we actually think about how do we secure AI systems, right? And there's a whole, you know, we could spend an entire hour talking about just how you actually practically secure them. One of the key mechanisms, one of the most powerful mechanisms, is something called metacognition. And so to actually understand this one, let's go back to what we said earlier on, that generative AI is good at roleplaying, and it's good at summarizing things. A single pass through a generative AI system, that's basically what it does. It roleplays a character. There are no guarantees at this stage that it will be correct, that it will be achieving what you're trying to do, anything. It's like, it's really - this thing is dreaming, and that's okay. You know, what - what people refer to as the hallucination problem, which is actually a much more complex problem. I think that's a very poor name for it. What this really is, is if you take a single pass through the system and you're expecting the output to be grounded in some factual basis, yeah, no, that's not going to happen. What do you do? One of the really powerful things you can do is you can ask AI to roleplay an editor of various sorts. Right, so let - let's say that you're trying to do - let's give a concrete example. I am trying to build a chatbot that is going to be a customer-facing chatbot that answers questions about my products. And I have some kind of, you know, large website full of documentation about my system, but because people are bad with information architecture, it's really hard to actually find the answers you need in this website, especially if you don't know the exact question you have to. This is a great job for generative AI. So what does the generative AI do? Going to get a question from the user, and it's basically going to follow sort of a fixed kind of plan. The first plan is, it needs to look at this question, first of all figure out, is this even a question it knows how to deal with or answer. Right? If I am asking, you know, if - if I am asking Microsoft's customer service AI about how - how to make good pancakes, it should tell you that it has no idea, that is really not its job. Just bounce that off. Then it says, okay, well, in order to answer this question, what's the stuff I'm going to look up? It - it needs to come up with some search queries. So here it's roleplaying a customer service agent, right, who's the subject matter expert. It's saying, what are the right search queries to do - find the answer to this?

Wendy Zenone: To.

Yonatan Zunger: Comes up with a list. Now we're actually going to execute searches. This is not an AI step, this is the point where you just run searches. And you tell it to look at the results, and maybe you have some stage where you're sort of judging which results you want to grab. You want to summarize each one of those results. Again, one of the things that AI is good at. And you're going to pull all of those things together and you make an answer. And once you've got this nice answer, what you can also do now is a metacognitive step, a step where you tell it, look over this as an editor. Make sure that every statement in the output is actually factually grounded in one of the source pages, and attach a footnote to every single statement. Attach a footnote with a link, and if you can't - rough - if you can't footnote a statement correctly, take that statement out. That editing pass, that's actually how you eliminate fabrications from all of this. Now, there's all sorts of other ways you can do this, and so all I could talk about, at tremendous length. This concept of metacognition is really powerful. And part of the reason it's so powerful is because of roleplaying. Right? Because this is - this is just one of these magical things about generative AI. Generative AI was trained on human data, and it has cultural assumptions baked into it. So let's say I tell it, you re a compliance officer, or you are a - forget the compliance. This is, like, a very, like, you know, beige sort of use case. You are a responsible adult who's really cared about the safety of their community. Right? You tell it a story like that, then you tell it, look over the following thing and tell me, is this going to be a problem? What's really amazing is, I tell it, like, this sort - I give it, like, one or two sentences describing the character that's playing, and all of these assumptions that come with that character description, we'd actually kind of encoded into it. Right? If you tell it it's playing, you know, a rabbi or a, I don't know, a compliance officer, something like that, if you tell it to play this character, all sorts of assumptions intrinsically come in, because it was trained on all of that. It knows what these characters are. And so you can sort of adjust that, train that, tweak that so that you don't have to specify 5000 rules. You don't have to explicitly specify its ethical code. Rather, you give it a character that you describe well enough that it has an ethical code, and then you tell it to apply that to the outputs and analyze that way. And this is a technique that actually is proving very effective.

Nic Fillingham: The characters that it's playing are going to play by the rules that you assume. So for example, there you say, you're a compliance officer. How do you know that the compliance officer that it is going to play is not some fictional, villain version that it got from a TV show?

Wendy Zenone: [Laughter].

Yonatan Zunger: That - thank you for asking that question. I - I did not.

Nic Fillingham: [Laughter].

Yonatan Zunger: That was the exactly right question. And so this goes to one of the most important things we've learned. It turns out it's really easy to build generative AI software, and is really hard to test generative AI software. It's very easy to write a system that works great in the one or two cases that happen to pop into my mind, and when you give the real outputs, it turns out they do not do what you expected at all. And so the answer is, how - how do you make sure it does it, is testing. Though, in fact, I think one of the most important things you can really be doing is usually creating for each of your systems, first of all a bunch of test cases of just its ordinary function. Right? Give it a bunch of inputs that will look like real inputs it's going to get. And by the way, get other people to help you write those inputs, and get AIs to help you brainstorm further ways in which the input could look, because I can promise you that real users are infinitely weirder than anything you can come up with. And you manually, like, figure out what you expect to happen in each of these cases. Right? You - you take - you run this thing through the output, make sure the outputs look right. Not only that, you could even use another AI to get - and give it a rubric to judge, and to sort of do - do a first pass classification of, does this look more or less like what I expected? And then you can actually check the outliers by hand. Right, there's - building a set of - and then of course you can also build a set of [inaudible 00:41:04]  pieces for each of these possible harms. Right? What happens if someone puts in this following malicious input? Does it catch it correctly? And this is, in fact, exactly how we do testing, right? The - the testing frameworks that we build are all based on exactly this principle. We have a bunch of test cases, we have a rubric that is run by an AI - by yet another AI. Is then - what you do is you feed the test case into the AI - or into the system you're testing. You look at its output, you have another AI following this rubric, looking at the outputs and judging it, and then you have a human look at the overall outputs of all of that and actually sanity check, because tuning that rubric is just as hard as tuning the original. So you sort of have to keep planning, and you have to keep refining the rubric. And what's funny is, again, this is very similar to a pre-AI problem. In fact, let me give you a social media example, because that's actually what this is for. It turns out, writing these policies for things like harassment, and hate speech, and so on is tremendously difficult. Articulating what constitutes hate speech, like good luck with this. This is a massively difficult problem, and in particular, what you have to do is you have to write a policy that's going to be run by human analysts. Right? At the end of the day, you've got literal people sitting in front of terminals, reviewing items to see, do they match policy or do they not? And you need to - and you can measure the correctness of this policy in a lot of ways by looking at things like inter-rater agreement. Right, send a random subset of all of the items to multiple people. Do they get the same answer reliably or not? The answer, by the way, is they don't most of the time. It's very hard to write a policy that will cause them to reliably agree. When you're writing these policies, one of the other things you can discover is that what you wrote isn't what you intended. So I'll give one of my favorite examples. This is one that got into the press, which is why we can, like, easily talk about this. This - this one happened to Facebook. And they put a rule where, you know, encouraging violence, demeaning, condescending, et cetera, et cetera, against people base don protected categories was not permitted. Right? So you weren't allowed to call for, you know, violence based on race, or based on gender, or something like. Great. And now what happens if someone combines two attributes? Well, the answer was, if you have - if you have a statement that's calling for violence based on a combination of attributes, where all of the attributes are protected, then that is also forbidden. Okay, I just said something incorrect. What did I say wrong?

Wendy Zenone: Oh man.

Yonatan Zunger: I said it quickly, and you probably didn't catch it.

Nic Fillingham: I didn't catch it. Oh my gosh.

Wendy Zenone: I didn't catch it, [laughter].

Yonatan Zunger: And they didn't catch it either. Because I said all, when I should have said any.

Nic Fillingham: Oh, right.

Wendy Zenone: Ah.

Yonatan Zunger: As a result, they wrote a policy where - and what's funny is, their internal training material, which leaked, which is how we know this whole story, ended up following what was written in the policy. Men are trash. Canonical violating statement. Kill all the black children. Canonical non-violating statement. Because black is race, that is a protected category, but children is not a protected category, and I said all, not any, and so therefore, saying go kill all the black children was considered a classic non-violating statement because of basically a typo in the original rules.

Wendy Zenone: Oh man.

Nic Fillingham: Wow.

Yonatan Zunger: I can tell you, like, we had the same things happen at Google. We had - we had mistakes like this happen there. Everywhere - this - this state can happen everywhere. It's really easy. I mean, none of you, like, it's really easy to miss this thing. How do you prevent a mistake like that? Unit tests. When you're writing a policy, and this is not a job for engineers to write. This is, like, policy people. When policy people are writing policies, have them write out a list of examples which should be violative and non-violative, and give them, like, work with them. Like, you go back and forth and give them, okay, here's a harder example. Here's another example that's hard in a different way, and so on. And you build up this list, and every time you change your policy, you update that test list. Same thing with AI. If you're trying to implement a policy, like a metacognitive filter, if it's a rubric to evaluate the outputs of tests, something like that, give it a list of test cases, pro and con. And that way also if you ever have to do something like, you know, update your model version or something like that, you can retest and make sure the system is still doing what you think it's doing. Because otherwise, yeah, it can go really, really badly.

Nic Fillingham: Yonatan, we - we - first of all, we need to get you back on the podcast for a part two, three, four, five.

Wendy Zenone: Right, [laughter].

Nic Fillingham: For - into infinity. We are coming up on time here, and I wondered if this is a little - a good segue for us to talk about the role of, just very briefly, security researchers. So, The BlueHat Podcast, BlueHat Conference, this is a part of the security researcher community. You talked about - a lot of this was about sort of product engineering and safety engineering, which is obviously sort of on the internal side of the development of systems and products. What of that role, that unit testing, or the flipside of unit testing, can be taken and should be taken by the researcher community? Or how should they start to just sort of think a little bit differently about this space?

Yonatan Zunger: Well, I think there is so much opportunity for the research community to be involved in this. This is just - this is potentially a real golden era. And one thing I'll point out, is back in the early 2010s, when we were creating what we today call privacy engineering, which is quite a misnomer as a discipline, basically the people who are working on exactly these safety problems for social media, when that was the big new problem, you know, that discipline didn't really exist. And who are the people that we were hiring for it? Well, it was people who were good at thinking about how things will go wrong. SREs turned out to be really good at it. Lawyers, journalists, all sorts of people from all sorts of background, all sorts of walks of life. The common skill that really made people shine in the space was the ability to look at a system and think about what might go wrong. Right, it's doing that very first initial step that's often the hardest thing for people to do. And security researchers are wonderfully suited for exactly this kind of thing. And the biggest difference between safety research and security research is you just zoom out and look at a bigger scope of problems. Right? You know, my - my rule, always, for my teams is, like, they ask, like, well, is this kind of risk in-scope? And the answer is, well, does it involve your system? Is it a risk? Congratulations, it's in-scope, [laughter].

Nic Fillingham: [Laughter]

Wendy Zenone: [Laughter]

Yonatan Zunger: And I think with security research, we often get a little narrow, and we say, oh, well, you know, this is about access control issues, so that's security, but that's just about a human misusing the system in a way we didn't expect. That's - that's a product problem. Stop saying it. All of the problems are your problem. And now you ask, well, what can a security researcher do? You know, what - what can a safety researcher do? And the answer is, it's the stuff that you have been doing all of this time. If you are internal, obviously, you know, be a part of this, like, whole design process and so on. If you're external to a place, do safety. It's the same approaches that you take to doing security research. Probe systems, look for issues, look for vulnerabilities. Think about responsible disclosure. Same kind of approaches. Same - all - all of the muscles you have built for your security work over the decades, those same muscles apply here perfectly well. Do the exact same kind of thing. And you know, when - when you were dealing with the disclosure aspects, you know, some - sometimes it's very, very similar. You know, you find a system that it turns out you can make it misbehave in a way that people didn't think you could. Treat that like a security vulnerability. Disclose it responsibly, publish the results, et cetera, et cetera. Same thing you do. I think you're a little more likely to come across outright bad actors in the world, where it's like, you have discovered a problem in the system, and they say yes, we know. That's intentional. That doesn't happen quite as often in security world. I think one of the things that you'll encounter more and more in the safety world is really a misalignment of incentives kind of problems, where - you know, if you - often you'll have, you know, a maker of a system, and the user of a system, and people affected by a system, and you have, like, all of these different parties, and sometimes you'll have pairs of them whose incentives do not align, and those misaligned incentive moments, those are the places where the biggest problems often show up. Sometimes you'll have multiple groups of users whose incentives don't align with each other, right? Most social media problems are not because the company running the social network is evil, it's because one set of users is a problem for another set of users. Which is not, by the way, even saying that one set of users is bad actors. Culture clash is a great engine for that kind of thing too. So search for these places where something can go wrong, probe those things, do that research, and force the - no less important than all that is, find ways to fix problems. Right? Come up with techniques for mitigation. We are in such an open greenfield space in the world of generative AI. You can go out, discover a new problem, and figure out a way to mitigate this problem, to make a whole class of problems go away. Like, you know, I mean - this is like security research decades ago. It is like, you know, these very, very early days, where everything you're doing is really novel. So security researchers, please get involved, work actively, probe this, and just broaden the scope of what you think about from, you know, traditional security to safety in the broadest possible sense of the word.

Wendy Zenone: Just one comment is just that how important, I think, the security of AI is, because I know that I speak with people that take everything that comes out of AI literal. Like, it is - well, if ChatGPT says it, then it is, you know? [Laughter].

Yonatan Zunger: I remember back in the 90s, people were very worried, like in the late 90s, that like - oh my god, if something wrong shows up in a search result, Google said it. It must be true.

Wendy Zenone: Right, [laughter].

Yonatan Zunger: We're having the same thing. I can promise you that the output of an AI is no more guaranteed to be true than the output of a search engine, or for that matter, the output of a human.

Wendy Zenone: [Laughter].

Yonatan Zunger: And we, again, we - there's a whole hour of conversation we can have about problems of, like, overreliance, fabrication, the different specific things that can be going wrong. We can talk for hours and hours about all of this.

Wendy Zenone: And we will, [laughter].

Nic Fillingham: And I hope we do, Yonatan. Thank you so much. We are definitely going to have you back on another episode of The BlueHat Podcast. Just before we let you go, is there one go-do you would like our audience to do? Do you want them to read something? Do you want them to go watch something? What should everyone do to take the next step in, you know, securing AI?

Yonatan Zunger: You know, I wish that we had already published some - a book for the public about how to do all of this stuff, that I can tell you, go read this thing. But if I were to give people one go-do, it's go back to these projects - these products that you work with every day, and do that threat modeling exercise. Do that exercise on every - on every sort of thing you encounter. Think about ways things can go wrong. Get yourself into that mindset. Practice thinking about how things might fail. And with that, I think you will be in a spectacularly better place to really address the real problems that face us in the world.

Nic Fillingham: That's a wonderful end. Yonatan, thank you so much for being on The BlueHat Podcast. I - I look forward to our next episode. This has been fantastic. Thanks for your time.

Wendy Zenone: Thank you.

Yonatan Zunger: It is a real pleasure. [ Music ]

Wendy Zenone: Thank you for joining us for The BlueHat Podcast.

Nic Fillingham: If you have feedback, topic requests, or questions about this episode.

Wendy Zenone: Please email us at bluehat@microsoft.com, or message us on Twitter at msftbluehat.

Nic Fillingham: Be sure to subscribe for more conversations and insights from security researchers and responders across the industry.

Wendy Zenone: By visiting bluehatpodcast.com, or wherever you get your favorite podcasts. [ Music ]