Judging a Bug by Its Title
Nic Fillingham: Hello, and welcome to Security Unlocked, a new podcast from Microsoft where we unlock insights from the latest in news and research from across Microsoft Security engineering and operations teams. I'm Nic Fillingham-
Natalia Godyla: And I'm Natalia Godyla. In each episode we'll discuss the latest stories from Microsoft Security, deep dive into the newest threat, intel, research and data science-
Nic Fillingham: And profile some of the fascinating people working on artificial intelligence in Microsoft Security.
Natalia Godyla: And now let's unlock the pod.
Natalia Godyla: Hello, Nic. How's it going?
Nic Fillingham: Hello, Natalia. Welcome back. Well, I guess welcome back to Boston to you. But welcome to Episode 16. I'm confused because I saw you in person last week for the first time. Well, technically it was the first time for you, 'cause you didn't remember our first time. It was the second time for me. But it was-
Natalia Godyla: I feel like I just need to justify myself a little bit there. It was a 10 second exchange, so I feel like it's fair that I, I was new to Microsoft. There was a lot coming at me, so, uh-
Nic Fillingham: Uh, I'm not very memorable, too, so that's the other, that's the other part, which is fine. But yeah. You were, you were here in Seattle. We both did COVID tests because we filmed... Can I say? You, you tell us. What did we do? It's a secret. It is announced? What's the deal?
Natalia Godyla: All right. Well, it, it's sort of a secret, but everyone who's listening to our podcast gets to be in the know. So in, in March you and I will be launching a new series, and it's a, a video series in which we talk to industry experts. But really we're, we're hanging with the industry experts. So they get to tell us a ton of really cool things about [Sec Ups 00:01:42] and AppSec while we all play games together. So lots of puzzling. Really, we're just, we're just getting paid to do puzzles with people cooler than us.
Nic Fillingham: Speaking of hanging out with cool people, on the podcast today we have Mayana Pereira whose name you may have heard from a few episodes ago Scott Christiansen was on talking about the work that he does. And he had partnered Mayana to build and launch a, uh, machine learning model that looked at the titles of bugs across Microsoft's various code repositories, and using machine learning determined whether those bugs were actually security related or not, and if they were, what the correct severity rating should be.
Nic Fillingham: So this episode we thought we'd experiment with the format. And instead of having two guests, instead of having a, a deep dive upfront and then a, a profile on someone in the back off, we thought we would just have one guest. We'd give them a little bit extra time, uh, about 30 minutes and allow them to sort of really unpack the particular problem or, or challenge that they're working on. So, yeah. We, we hope you like this experiment.
Natalia Godyla: And as always, we are open to feedback on the new format, so tweet us, uh, @msftsecurity or send us an email firstname.lastname@example.org. Let us know what you wanna hear more of, whether you like hearing just one guest. We are super open. And with that, on with the pod?
Nic Fillingham: On with the pod.
Nic Fillingham: Welcome to the Security Unlocked podcast. Mayana Pereira, thanks for joining us.
Mayana Pereira: Thank you for having me. I'm so happy to be here today, and I'm very excited to share some of the things that I have done in the intersection of [ML 00:03:27] and security.
Nic Fillingham: Wonderful. Well, listeners of the podcast will have heard your name back in Episode 13 when we talked to Scott Christiansen, and he talked about, um, a fascinating project about looking for or, uh, utilizing machine learning to classify bugs based simply on, on their title, and we'll get to that in a minute. But could you please introduce you- yourself to our audience. Tell us about your title, but sort of what does that look like in terms of day-to-day and, and, and the work that you do for Microsoft?
Mayana Pereira: I'm a data scientist at Microsoft. I've been, I have been working at Microsoft for two years and a half now. And I've always worked inside Microsoft with machine learning applied to security, trust, safety, and I also do some work in the data privacy world. And this area of ML applications to the security world has always been my passion, so before Microsoft I was also working with ML applied to cyber security more in the malware world, but still security. And since I joined Microsoft, I've been working on data science projects that kinda look like this project that we're gonna, um, talk today about. So those are machine learning applications to interesting problems where we can either increase the trust and the security Microsoft products, or the safety for the customer. You know, you would develop m- machine learning models with that in mind.
Mayana Pereira: And my day-to-day work includes trying to understand which are those interesting programs across the company, talk to my amazing colleagues such as Scott. And I have a, I have been so blessed with an amazing great team around me. And thinking about these problems, gathering data, and then getting, you know, heads down and training models, and testing new machine learning techniques that have never been used for a specific applications, and trying to understand how well or if they will work for those applications, or if they're gonna get us to better performance, or better accuracy precision and those, those metrics that we tend to use in data science works. And when we feel like, oh, this is an interesting project and I think it is interesting enough to share with the community, we write a paper, we write a blog, we go to a conference such as RSA and we present it to the community, and we get to share the work and the findings with colleagues internal to Microsoft, but also external. So this is kinda what I do on a day-to-day basis.
Mayana Pereira: Right now my team is the data science team inside Microsoft that is called AI For Good, so the AI for Good has this for good in a sense of we want to, to guarantee safety, not only for Microsoft customers, but for the community in general. So one of my line of work is thinking about how can I collaborate with NGOs that are also thinking about the security or, and the safety of kids, for example. And this is another thing that I have been doing as part of this AI for Good effort inside Microsoft.
Natalia Godyla: Before we dive into the bug report classification project, can you just share a couple of the projects that your team works for AI for Good? I think it would be really interesting for the audience to hear that.
Mayana Pereira: Oh, absolutely. So we have various pillars inside the AI for Good team. There is AI for Health, AI for Humanitarian Action, AI for Earth. We have also been collaborating in an effort for having a platform with a library for data privacy. It is a library where we have, uh, various tools to apply the data and get us an output, data with strong privacy guarantees. So guaranteeing privacy for whoever was, had their information in a specific dataset or contributed with their own information to a specific research and et cetera. So this is another thing that our team is currently doing.
Mayana Pereira: And we have various partners inside and outside of Microsoft. Like I mentioned, we do a lot of work in NGOs. So you can think like project like AI for Earth several NGOs that are taking care of endangered species and other satellite images for understanding problems with the first station and et cetera. And then Humanitarian Action, I have worked with NGOs that are developing tools to combat child sexual abuse and exploration. AI for Health has so many interesting projects, and it is a big variety of projects.
Mayana Pereira: So this is what the AI for Good team does. We are, I think right now we're over 15 data scientists. All of us are doing this work that it is a- applied research. Somehow it is work that we need to sit down with, with our customers or partners, and really understand where the problem is. It's usually some, some problems that required us to dig a little deeper and come up with some novel or creative solution for that. So this is basically the overall, the AI for Good team.
Nic Fillingham: Let's get back in the way back machine to I think it was April of 2020, which feels like 700 years ago.
Mayana Pereira: (laughs)
Nic Fillingham: But you and Scott (laughs) published a blog. Scott talked about on Episode 13 called securing
Nic Fillingham: The s- the software development lifecycle with machine learning, and the thing that I think both Natalia and I picked up on when Scott was talking about this, is it sounded first-, firstly it sounded like a exceptionally complex premise, and I don't mean to diminish, but I think Natalia and I were both "oh wow you built a model that sort of went through repro steps and passed all the logs inside of security bugs in order to better classify them but that's not what this does", this is about literally looking at the words that are the title of the security bug, and then building a model to try and determine whether it was truly security or something else, is that right?
Mayana Pereira: That's exactly it. This was such an interesting project. When I started collaborating with Scott, and some other engineers in the team. I was a little skeptical about using only titles, to make prediction about whether a bug has, is security related or not. And, it seems. Now that I have trained several models and passed it and later retrained to- to get more of a variety of data in our model. I have learned that people are really good at describing what is going on in a bug, in the title, it feels like they really summarize it somehow so it's- it's doing a good job because, yes, that's exactly what we're doing, we are using bug titles only from several sources across Microsoft, and then we use that to understand which bugs are security related or not, and how we can have an overall view of everything that is happening, you know in various teams across different products. And, that has given a lot of visibilities to some unknown problems and some visibility to some things that we were not seeing before, because now you can scan, millions of bugs in a few seconds. Just reading titles, you have a model that does it really fast. And, I think it is a game changer in that sense, in the visibility and how do you see everything that is happening in that bug world.
Natalia Godyla: So what drove that decision? Why are we relying only on the titles, why can't we use the- the full bug reports?
Mayana Pereira: There are so many reasons for that. I think, the first reason was the fact that the full bug report, sometimes, has sensitive information. And we were a little bit scared about pulling all that sensitive information which could include passwords, could include, you know, maybe things that should not be available to anyone, and include that in a- in a VM to train a model, or, in a data science pipeline. And, having to be extremely careful also about not having our model learning passwords, not having that. So that was one of the big, I think incentives off, let's try titles only, and see if it works. If it doesn't work then we can move on and see how we can overcome the problem of the sensitive information. And it did work, when we saw that we had a lot of signal in bug titles only, we decided to really invest in that and get really good models by u- utilizing bug titles only.
Nic Fillingham: I'm going to read from the blog just for a second here, because some of the numbers here, uh, are pretty staggering, so, again this was written 2020, uh, in April, so there's obviously, probably updated numbers since then but it said that Microsoft 47,000 developers generate nearly 30,000 bugs a month, which is amazing that's coming across over 100 Azure DevOps and GitHub repositories. And then you had it you, you actually have a count here saying since 2001 Microsoft has collected 13 million work items and bugs which I just thinks amazing. So, do you want to speak to, sort of, the volume of inputs and, sort of, signals here in to building that model and maybe some of the challenges, and then a follow on question is, is this model, still active today, is this- is this work still ongoing, has it been incorporated into a product or another, another process?
Nic Fillingham: Do you want to start with, with numbers or.
Mayana Pereira: Yes, I think that from my data scientist point of view, having such large numbers is absolutely fantastic because it gives us a historical data set, very rich so we can understand how data has evolved over time. And also, if this- the security terminology has changed the law, or how long will this model last, in a sense. And it was interesting to see that you can have different tools, different products, different things coming up, but the security problems, at least for, I would say for the past six, seven years, when it comes to terminology, because what I was analyzing was the terminology of the security problems. My model was a natural language processing model. It was pretty consistent, so that was really interesting to see from that perspective we have. And by having so much data, you know, this amazing volume. It helped us to build better classifiers for sure. So this is my- my data scientist side saying, amazing. I love it so much data.
Nic Fillingham: What's the status of this project on this model now.? Is it- is it still going? Has it been embedded into another- another product, uh, or process?
Mayana Pereira: Yes, it's still active. It's still being used. So, right now, this product. This, not the product- the product, but the model is mainly used by the customer security interest team in [Sila 00:16:16], so they use the model in order to understand the security state of Microsoft products in general, and, uh, different products and looking at specific products as well, are using the model to get the- the bugs statistics and security bugs statistics for all these different products across Microsoft. And there are plans on integrating the- this specific model or a variation of the model into other security lifecycle pipelines, but this is a decision that is more on CST customer Security Trust side and I have, um, only followed it, but I don't have specific details for that right now. But, I have seen a lot of good interesting results coming out of that model, good insights and security engineers using the results of the model to identify potential problems, and fix those problems much faster.
Natalia Godyla: So, taking a step back and just thinking about the journey that your team has gone on to get the model to the state that it's in today. Uh, in the blog you listed a number of questions to figure out what would be the right data to train the model. So the questions were, is there enough data? How good is the data? Are there data usage restrictions? And, can data be generated in a lab?
Natalia Godyla: So can you talk us through how you answered these questions like, as a- as a data scientist you were thrilled that there was a ton of data out there, but what was enough data? How did you define how good the data was? Or, whether it was good enough.
Mayana Pereira: Great. So, those were questions that I asked myself before even knowing what the project was about, and the answer to is there enough data? It seemed very clear from the beginning that, yes, we had enough data, but those were questions that I brought up on the blog, not only for myself but for anyone else that was interested in replicating those experiments in their company or maybe university or s- anywhere any- any data scientist that is interested to train your own model for classification, which questions should be asked? Once you start a project like this. So the, is there enough data for me? Was clear from the beginning, we had several products so we had a variety of data sources. I think that when you reach, the number of millions of samples of data. I think that speaks for itself. It is a high volume. So I felt, we did have enough data.
Mayana Pereira: And, when it came to data quality. That was a more complex question. We had data in our hands, bugs. We wanted to be able to train a model that could different- differentiate from security bugs and non security bugs, you know. And, for that, Usually what we do with machine learning, is we have data, that data has labels, so you have data that represents security bugs, data that represents non security bugs. And then we use that to train the model. And those labels were not so great. So we needed to understand how the not so great labels was going to impact our model, you know, we're going to train a model with labels that were not so great. So
Mayana Pereira: That was gonna happen. So that was one of the questions that we asked ourselves. And I did a study on that, on understanding what is the impact of these noisy labels and the training data set. And how is it gonna impact the classification results that we get once using this, this training data? So this was one of the questions that I asked and we, I did several experiments, adding noise. I did that myself, I, I added noise on purpose to the data set to see what was the limits of this noise resilience. You know, when you have noisy labels in training, we published it in a, in an academic conference in 2019, and we understood that it was okay to have noisy labels. So security bugs that were actually labeled as not security and not security bugs labeled as security. There was a limit to that.
Mayana Pereira: We kinda understood the limitations of the model. And then we started investigating our own data to see, is our own data within those limits. If yes, then we can use this data confidentially to train our models. If no, then we'll have to have some processes for correcting labels and understanding these data set a little bit better. What can we use and what can we not use to train the models. So what we found out is that, we didn't have noisy labels in the data set. And we had to make a few corrections in our labels, but it was much less work because we understood exactly what needed to be done, and not correct every single data sample or every single label in a, an enormous data set of millions of entries. So that was something that really helped.
Mayana Pereira: And then the other question, um, that we asked is, can we generate data in the lab? So we could sometimes force a specific security issue and generate some, some box that had that security description into titles. And why did we include that in the list of questions? Because a lot of bugs that we have in our database are generated by automated tools. So when you have a new tool being included in your ecosystem, how is your model going to recognize the bugs that are coming from this new tool? So does our, ma- automatically generated box. And we could wait for the tool to be used, and then after a while we gathered the data that the tool provided us and including a retraining set. But we can also do that in the lab ecosystem, generate data and then incorporate in a training set. So this is where this comes from.
Nic Fillingham: I wanted to ask possibly a, a very rudimentary question, uh, especially to those that are, you know, very familiar with machine learning. When you have a data set, there's words, there is text in which you're trying to generate labels for that text. Does the text itself help the process of creating labels? So for example, if I've got a bug and the name of that bug is the word security is in the, the actual bug name. Am I jump-starting, am I, am I skipping some steps to be able to generate good labels for that data? Because I already have the word I'm looking for. Like I, I think my question here is, was it helpful to generate your labels because you were looking at text in the actual title of the bug and trying to ascertain whether something was security or not?
Mayana Pereira: So the labels were never generated by us or by me, the data scientists. The labels were coming from the engineering systems where we collected the data from. So we were relying on what- whatever happened in the, in the engineering team, engineering group and relying that they did, uh, a good job of manually labeling the bugs as security or not security. But that's not always the case, and that doesn't mean that the, the engineers are not good or are bad, but sometimes they have their own ways of identifying it in their systems. And not necessarily, it is the same database that we had access to. So sometimes the data is completely unlabeled, the data that comes to us, and sometimes there are mistakes. Sometimes you have, um, specific engineer that doesn't have a lot of security background. The person sees a, a problem, describes the problem, but doesn't necessarily attribute the problem as a security problem. Well, that can happen as well.
Mayana Pereira: So that is where the labels came from. The interesting thing about the terminology is that, out of the millions and millions of security bugs that I did review, like manually reviewed, because I kinda wanted to understand what was going on in the data. I would say that for sure, less than 1%, even less than that, had the word security in it. So it is a very specific terminology when you see that. So people tend to be very literal in what the problem is, but not what the problem will generate. In a sense of they will, they will use things like Cross-site Scripting or passwords in clear, but not necessarily, there's a security pr- there's a security problem. But just what the issue is, so it is more of getting them all to understand that security lingual and what is that vocabulary that constitutes security problems. So that's wh- that's why it is a little bit hard to generate a list of words and see if it matches. If a specific title matches to this list of words, then it's security.
Mayana Pereira: It was a little bit hard to do that way. And sometimes you have in the title, a few different words that in a specific order, it is a security problem. In another order, it is not. And then, I don't have that example here with me, but I, I could see some of those examples in the data. For example, I think the Cross-site Scripting is a good example. Sometimes you have site and cross in another place in the title. It has nothing to do with Cross-site Scripting. Both those two words are there. The model can actually understand the order and how close they are in the bug title, and make the decision if it is security or not security. So that's why the model is quite easier to distinguish than if we had to use rules to do that.
Natalia Godyla: I have literally so many questions.
Nic Fillingham: [laughs].
Natalia Godyla: I'm gonna start with, uh, how did you teach at the lingo? So what did you feed the model so that it started to pick up on different types of attacks like Cross-site Scripting?
Mayana Pereira: Perfect. The training algorithm will do that for me. So basically what I need to guarantee is that we're using the correct technique to do that. So the technique will, the machine learning technique will basically identify from this data set. So I have a big data set of titles. And each title will have a label which is security or non-security related to it. Once we feed the training algorithm with all this text and their associated labels, the training algorithm will, will start understanding that, some words are associated with security, some words are associated with non-security. And then the algorithm will, itself will learn those patterns. And then we're gonna train this algorithm. So in the future, we'll just give the algorithm a new title and say, "Hey, you've learned all these different words, because I gave you this data set from the past. Now tell me if this new ti- if this new title that someone just came up with is a security problem or a, a non-security problem." And the algorithm will, based on all of these examples that it has seen before, will make a decision if it is security or non-security.
Natalia Godyla: Awesome. That makes sense. So nothing was provided beforehand, it was all a process of leveraging the labels.
Mayana Pereira: Yes.
Natalia Godyla: Also then thinking about just the dataset that you received, you were working with how many different business groups to get this data? I mean, it, it must've been from several different product teams, right?
Mayana Pereira: Right. So I had the huge advantage of having an amazing team that is a data center team that is just focused on doing that. So their business is go around the company, gather data and have everything harmonized in a database. So basically, what I had to do is work with this specific team that had already done this amazing job, going across the company, collecting data and doing this hard work of harvesting data and harmonizing data. And they had it with them. So it is a team that does that inside Microsoft. Collects the data, gets everything together. They have their databases updated several times a day, um, collecting
Mayana Pereira: ... Data from across the company, so it is a lot of work, yeah.
Natalia Godyla: So do different teams treat bug reports differently, meaning is there any standardization that you had to do or anything that you wanted to implement within the bug reports in order to get better data?
Mayana Pereira: Yes. Teams across the company will report bugs differently using different systems. Sometimes it's Azure DevOps, sometimes it can be GitHub. And as I mentioned, there is a, there was a lot of work done in the data harmonization side before I touched the data. So there was a lot of things done to get the data in, in shape. This was something that, fortunately, several amazing engineers did before I touched the data. Basically, what I had to do once I touched it, was I just applied the data as is to the model and the data was very well treated before I touched it.
Nic Fillingham: Wow. So many questions. I did wanna ask about measuring the success of this technique. Were you able to apply a metric, a score to the ... And I'm, I, I don't even know what it would be. Perhaps it would be the time to address a security bug pre and post this work. So, did this measurably decrease the amount of time for prioritized security bugs to be, to be addressed?
Mayana Pereira: Oh, definitely. Yes, it did. So not only it helped in that sense, but it helped in understanding how some teams were not identifying specific classes of bugs as security. Because we would see this inconsistency with the labels that they were including in their own databases. These labels would come to this big database that is harmonized and then we would apply the model on top of these data and see that specific teams were treating their, some data points as non-security and should have been security. Or sometimes they were treating as security, but not with the correct severity. So it would, should have been a critical bug and they were actually treating it as a moderate bug. So, that, I think, not only the, the timing issue was really important, but now you have a visibility of behavior and patterns across the company that the model gives us.
Nic Fillingham: That's amazing. And so, so if I'm an engineer at Microsoft right now and I'm in my, my DevOps environment and I'm logging a bug and I use the words cross- cross scripting somewhere in the bug, what's the timing with which I get the feedback from your model that says, "Hey, your prioritization's wrong," or, "Hey, this has been classified incorrectly"? Are we at the point now where this model is actually sort of integrated into the DevOps cycle or is that still coming further down the, the, the path?
Mayana Pereira: So you have, the main customer is Customer Security and Trust team inside Microsoft. They are the ones using it. But as soon as they start seeing problems in the data or specific patterns and problems in specific teams' datasets, they will go to that team and then have this, they have a campaign where they go to different teams and, and talk to them. And some teams, they do have access to the datasets after they are classified by our model. Right now, there's, they don't have the instant response, but that's, that's definitely coming.
Nic Fillingham: So, Mayana, how is Customer Security and Trust, your organization, utilizing the outputs of this model when a, when a, when a bug gets flagged as being incorrectly classified, you know, is there a threshold, and then sort of what happens when you, when you get those flags?
Mayana Pereira: So the engineering team, the security engineering team in Customer Security and Trust, they will use the model to understand the overall state of security of Microsoft products, you know, like the products across the company, our products, basically. And they will have an understanding of how fast those bugs are being mitigated. They'll have an understanding of the volume of bugs, and security bugs in this case, and they can follow this bugs in, in a, in a timely manner. You know, as soon as the bug comes to the CST system, they bug gets flagged as either security or not security. Once it's flagged as security, there, there is a second model that will classify the severity of the bug and the CST will track these bugs and understand how fast the teams are closing those bugs and how well they're dealing with the security bugs.
Natalia Godyla: So as someone who works in the AI for Good group within Microsoft, what is your personal passion? What would you like to apply AI to if it, if it's not this project or, uh, maybe not a project within Microsoft, what is, what is something you want to tackle in your life?
Mayana Pereira: Oh, love the question. I think my big passion right now is developing machine learning models for eradication of child sexual abuse medias in, across different platforms. So you can think about platform online from search engines to data sharing platforms, social media, anything that you can have the user uploading content. You can have problems in that area. And anything where you have using visualizing content. You want to protect that customer, that user, from that as well. But most importantly, protect the victims from those crimes and I think that has been, um, something that I have been dedicating s- some time now. I was fortunate to work with an NGO, um, recently in that se- in that area, in that specific area. Um, developed a few models for them. She would attacked those kind of medias. And these would be my AI for Good passion for now. The other thing that I am really passionate about is privacy, data privacy. I feel like we have so much data out there and there's so much of our information out there and I feel like the great things that we get from having data and having machine learning we should not, not have those great things because of privacy compromises.
Mayana Pereira: So how can we guarantee that no one's gonna have their privacy compromised? And at the same time, we're gonna have all these amazing systems working. You know, how can we learn from data without learning from specific individuals or without learning anything private from a specific person, but still learn from a population, still learn from data. That is another big passion of mine that I have been fortunate enough to work in such kind of initiatives inside Microsoft. I absolutely love it. When, when I think about guaranteeing privacy of our customers or our partners or anyone, I think that is also a big thing for me. And that, that falls under the AI for Good umbrella as well since that there's so much, you know, personal information in some of these AI for Good projects.
Natalia Godyla: Thank you, Mayana, for joining us on the show today.
Nic Fillingham: We'd love to have you back especially, uh, folks, uh, on your team to talk more about some of those AI for Good projects. Just, finally, where can we go to follow your work? Do you have a blog, do you have Twitter, do you have LinkedIn, do you have GitHub? Where should, where should folks go to find you on the interwebs?
Mayana Pereira: LinkedIn is where I usually post my latest works, and links, and interesting things that are happening in the security, safety, privacy world. I love to, you know, share on LinkedIn. So m- I'm Mayana Pereira on LinkedIn and if anyone finds me there, feel free to connect. I love to connect with people on LinkedIn and just chat and meet new people networking.
Natalia Godyla: Awesome. Thank you.
Mayana Pereira: Thank you. I had so much fun. It was such a huge pleasure to talk to you guys.
Natalia Godyla: Well, we had a great time unlocking insights into security from research to artificial intelligence. Keep an eye out for our next episode.
Nic Fillingham: And don't forget to Tweet us at MSFTSecurity or email us at email@example.com with topics you'd like to hear on a future episode. Until then, stay safe.
Natalia Godyla: Stay secure.