Tackling Identity Threats With AI

Transcript

Nic: Hello and welcome to Security Unlocked, a new podcast from Microsoft where we unlock insights from the latest in news and research from across Microsoft Security engineering and operations teams. I'm Nic Fillingham.

Natalia: And I'm Natalia Godyla. In each episode, we'll discuss the latest stories from Microsoft Security, deep dive into the newest threat intel, research, and data science.

Nic: And profile some of the fascinating people working on artificial intelligence in Microsoft Security. If you enjoy the podcast, have a request for a topic you'd like covered or have some feedback on how we can make the podcast better-

Natalia: Please contact us at securityunlocked@microsoft.com or via Microsoft Security on Twitter. We'd love to hear from you.

Nic: Hello, Natalia. Welcome to episode eight of Security Unlocked. How are you?

Natalia: I'm doing great. We're right about at Christmas. I am feeling it in my onesy right now.

Nic: You're feeling Christmas in your onesy? Is it a Christmas onesy?

Natalia: No. I feel like onesys just highlight the Christmas spirit. I mean, you're in PJs all weekend.

Nic: We've been in work from home for seven years now. We're all in perpetual onesy land.

Natalia: Well, I mean, I try to put in effort. I don't know about you.

Nic: I don't put any effort. I wonder if we should issue a subscriber challenge. I wonder if we could hit 1000 subscribers. We might make a Security Unlocked onesy. I wonder what other swag we could do? What would be good for a Security Unlocked podcast?

Natalia: All right. I mean, I guess I'm a little biased but the security blanket is clever. The ones that Microsoft gives away.

Nic: I don't think I have one of those.

Natalia: It's a blanket with security images on it.

Nic: Images of security in it? Just images of very strong passwords. Images of two factor authentication. What about a horse blanket? Like a blanket you put over your horse?

Natalia: What does that have to do with security?

Nic: Under the saddle. I'm just following the blanket thread, that's all. I'm just thinking different types of blankets. In two episodes we've already talked about the bratty pigs. I wonder if we could turn the bratty pigs into our mascot and on the security blanket there could be like an animated picture of the bratty pigs running away with a padlock and key or something.

Natalia: Have I not, and excuse the pun, unlocked the new technology in blankets and animated pictures? Is that possible on blankets now?

Nic: Did I say animated? I meant illustrated, I'm sorry. Oh wow, I bet you there's some brand new piece of printing technology that's over in like Japan or South Korea that we haven't got over here yet where they've got animation on their blankets, that would be good. What about one of those automatic cat feeders for when you go away on holiday and it dumps a little bit of dry food into their bowl every 12 hours? And then we just put Security Unlocked on the side of it.

Natalia: As long as it has our logo on, it fits.

Nic: You know what? Also, this is our last episode for 2020.

Natalia: How'd you feel about it?

Nic: About this episode or about the year of 2020?

Natalia: Well, the year 2020 is probably too much to unpack. What about our podcast adventure in 2020?

Nic: Yeah, I've enjoyed it greatly. I listened to the first couple of episodes just the other day. And while they were great, I certainly heard an evolution in just eight episodes from that humble first back in October. So yeah, I've definitely enjoyed the trip. I'm very much looking forward to 2021. What about you?

Natalia: I feel like our guests are making me smarter. With each new episode, I've got a few more terms under my belt. Terms I'd heard before but never got that clarity from experts on what the definition is, especially as they're moving around. We see that with a lot of the machine learning and AI terms. Like neural networks: when we're talking to experts, they have different lenses on what that should mean.

Nic: The other thing that I found fascinating is everyone that you and I have reached out to internally, Natalia, and said, "Hey, do you want to be a part of this podcast?" Everyone said yes. Everyone has said, "Yeah, I'd love to share my story of how I got into security. I'd love to share my story of how I got to Microsoft." I love that we've spoken to such an incredible variety of people that have come to security and to Microsoft from just... I mean, everyone has a completely different story and everyone's been so willing to tell it. So I'm just very, very happy that we've been able to meet these great people and have these conversations.

Natalia: Yes. And even in their diversity, I've been happy to see that there are really positive themes across the folks that want to be in security and that are in the security space. They're all so passionate about what they do and really believe in the mission, which is just great to see. And like you said, there's just an awesome community. The fact that they want to go out and have these conversations and are always open to receiving questions from you guys. So please keep them coming. Our experts are just as hungry as we are to hear not just feedback but questions on the topics that we discuss.

Nic: So on today's episode, we chat with Maria Puertas Calvo. Fantastic conversation, very excited to have Maria on the podcast. I'm not sure if many folks picked up but a lot of the experts we've spoken to so far have been more on the endpoint detection side of the house. We've talked to folks over in the Defender team and those who sort of look at the email pipeline. Maria and her team focus on identities, so protecting identities and protecting our identity platforms. And so she's going to talk about how AI and ML are used to protect identity. And then after Maria, we talk to...

Natalia: Geoff McDonald. So he is a member of the Microsoft Defender for Endpoint research team. And he's joined us on a previous episode to talk about unmasking malicious threats with AMSI and ML. And today, he's chatting with us about his career in cybersecurity, which started with game hacking. So making changes in the game to get more skills, get new characters and he's got some amusing stories as to how far he took that. But it's also a theme we're seeing across a few of our guests that game hacking seems to be a gateway to cybersecurity.

Nic: Yeah, hopefully the statute of limitations on game hacking has well and truly expired on the various games that Geoff mentions in his interview. I hope we're not getting him in trouble. Enjoy the pod, and we'll see you all in 2021.

Nic: Maria Puertas Calvo, thank you so much for joining us. Welcome to the Security Unlocked podcast.

Maria Puertas Calvo: Hi, thank you for having me.

Nic: If you could tell us about your role at Microsoft and what your day to day looks like in the team you're in. The mission and sort of scope of that work, that'd be great.

Maria Puertas Calvo: Yeah, absolutely. So I am a principal data science manager in identity security and protection. So I lead a team of five data scientists that work within a big engineering team. And our big mission is to protect all of Microsoft's users from account compromise and other things like abuse and fraud. As a data science team, we just analyze and look through all the huge amount of data that we get from all our customer logs and everything. And then we use that to build automated statistical-based models or machine learning models or heuristic-based models that are trying to detect those bad actions in our ecosystem. So compromise attacks or malicious bots that are trying to do bad things in our identity systems.

Natalia: And Maria, we understand that your team also recently authored a blog on enhanced AI for account compromise prevention. So can you talk a little bit about what that blog entails, how we're applying AI to start solving some of these problems?

Maria Puertas Calvo: Yeah, we're actually really excited about this work. It just went into production recently and it has really enhanced what we call the bread and butter of really what we do, which is trying to prevent compromise from happening in the ecosystem. Basically, we have been using artificial intelligence to build detections for a pretty long time. And everything that we do, we try to start with whatever is the low-hanging fruit. We do offline detections, which are basically using the data after authentications or attacks already occurred and then detect those bad attacks and then we will inform the customer or make the customer reset their password or do some type of remediation.

Maria Puertas Calvo: But being able to put AI at the time of authentication means meeting that end goal: we're not just detecting when a user has been compromised and remediating it, we're actually able to prevent the compromise from happening in the first place. So this new blog talks about this new system that we've built. We already had real time compromise detection but it wasn't using the same level of artificial intelligence.

Natalia: So is it correct to say then that what we had been doing in the past is identifying a known attack, a known threat, and then producing detections based on that information and now we're trying to preempt it? So with this even more intelligent AI, we're trying to identify the threat as it's happening, is that correct?

Maria Puertas Calvo: Yeah, that's correct. So we did already have real time prevention but most of our artificial intelligence focus used to be in the after the fact. Now we have been able to move this artificial intelligence focus also to the real time prevention. And what we have achieved with this has really improved the accuracy and the precision of this detection itself. Which means now we're able to say that the sign ins that we say are risky, they're way more likely to actually be bad than before. Before we would have more noise and more false positives and then we would also have some other bad activities that would go undetected.

Maria Puertas Calvo: With this new artificial intelligence system, we have really increased the precision. Which means, now if a customer says, "Oh, I want to block every single medium risk login that comes my way that is trying to access my tenant." Now, fewer of their real users are going to get blocked and more actual attackers are going to get blocked. So we've really improved the system by using this new AI.

Natalia: What's changed that's increasing the precision?

Maria Puertas Calvo: Yeah, so we actually published another blog with the previous system which was mostly using a set of rules based on user behavior analytics. So the main detection before was just using a few features of the sign in itself and comparing them to the user history. So if you're coming from a new IP address, if you're coming from a new location, if you're coming from a new device, there was like a deterministic formula. We were just using a formula to calculate a score which was the probability of how unfamiliar that sign in was. Now we're taking way more inputs into account. So we're using... It depends on which protocol you're using.

Maria Puertas Calvo: It has more intelligence about the network, it has some intelligence about what's going on. For example, if you're coming from an IP address that has a lot of other traffic that AD is seeing, it has also information about what AD is seeing from that IP address. Does it have a lot of failed logins or is it doing something weird? And then instead of us manually setting a mathematical formula or rules in order to build that detection, what we do is we train an algorithm with what is called labeled data. So labeled data is just a set of authentications, some are good and some are bad and they're labeled as such. So we use that labeled data to tell the algorithm, "Hey, use this to learn," right? That's how machine learning works.

Maria Puertas Calvo: So the algorithm trains and then it's able to use that data to decide in real time if the authentication is good or bad.
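
Illustrative sketch (not part of the recording): the idea of training on labeled authentications instead of hand-writing a formula can be shown with a tiny logistic regression in plain Python. The three features, the sample rows, and the function names `train_logistic` and `risk_score` are all assumptions made up for this example; the production system Maria describes is not shown anywhere in the episode and is certainly far richer than this.

```python
import math


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit a logistic regression with plain stochastic gradient descent.

    rows:   list of feature vectors (lists of floats)
    labels: 0 for a good sign-in, 1 for a bad one
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "bad"
            err = p - y                      # gradient of the log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def risk_score(weights, bias, x):
    """Probability-like score that a sign-in with features x is bad."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical features per sign-in:
# [ip_is_new, device_is_new, share_of_failed_logins_seen_from_this_ip]
rows = [
    [0, 0, 0.0], [0, 0, 0.1], [1, 0, 0.0], [0, 1, 0.1],  # labeled good
    [1, 1, 0.9], [1, 1, 0.7], [1, 0, 0.8], [0, 1, 0.9],  # labeled bad
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(rows, labels)
print(risk_score(weights, bias, [1, 1, 0.9]))  # familiar attack pattern
print(risk_score(weights, bias, [0, 0, 0.0]))  # familiar user pattern
```

Once trained, `risk_score` can be called on each new sign-in as it arrives, which is the "decide in real time" step described above.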

Nic: Yeah, thank you. And then where, if any, do human analysts or humans in specialty roles, if it's data science or analytics, when do they come in to either verify the results or help with labeling new sets of data? So you've got your known goods, you've got your known bads and I assume you end up with a bunch of unknowns or difficult to classify one way or the other. Is that a role for a human analyst or human data scientists to come in and create those new labels?

Maria Puertas Calvo: Yeah, even though getting all these labels is extremely important, that is not really what... The data scientist is not there just classifying things as this is good, this is bad, just to get labels to feed to the algorithm, right? What the data scientist does that is very crucial is to build the features and then train this machine learning model. So that is the part that is actually really important. And I always really try to have everybody in my team really understand and become a great domain expert on two things. One is the data that they have to work with. It is not enough to just get the logs as they come from the system, attach the label to them and then feed them to some out of the box classifier to get your results.

Maria Puertas Calvo: That is not going to work really well because those logs by themselves don't really have a lot of meaning. The data scientist needs to really understand each of the data points that are in our logs. Sometimes those values are not coded in there to be features for machine learning. They're just added there by engineers to do things like debugging or showing logs to the user. So the role of the data scientist is really to convert those data points into features that are meaningful for the algorithm to learn to distinguish between the attack and the good. And that is the second thing that the data scientist needs to be really good at. The data scientist needs to have a very good intuition of what is good and how that looks in the logs versus what is bad and how that looks in the logs.

Maria Puertas Calvo: With that knowledge, basically knowledge of what the data in the logs means and knowledge of what attack versus good looks like in that data, then comes the feature engineering role. You transform those logs into other data points that are calculations from those logs and that are going to have a meaning for the algorithm to learn if something is good or an attack. So I can give an example of this, since it's very abstract. For example, when I see an authentication in Azure AD logs, maybe one of the columns that I'd want to know is the IP address, right? Every single communication over the internet comes from some client IP address which will be the IP address that's assigned to the device that you are on at the time that you're doing an authentication.

Maria Puertas Calvo: There are billions, if not trillions of IP addresses out there. And each one is just some kind of number that is assigned to you or to your device and it doesn't really have any meaning on its own. It's just like if you have a phone number, is that a good or a bad phone number? I don't know, that's just not going to help me. But if I can actually go and say, "Okay, this is an IP address but is this an IP address that Nic used yesterday or two days ago? How often have I seen Nic on this IP address? What was the last time I saw Nic on this IP address?" If you can just play with those logs to transform them into this more meaningful data, it's really going to help the model understand and make those decisions, right?

Maria Puertas Calvo: And then you also end up with fewer things to make decisions on, right? Because if I just had that one IP address to train the model, maybe my model will become really good at understanding which IP addresses are good and bad but only among the ones that we have used to train that model. But then when a new one comes in, the model doesn't know anything about that IP address, right? But if we instead change that into saying, "Okay, this is a known IP address versus an unknown IP address," And then now, instead of having trillions of IP addresses, we just have a value that says, Is it known or unknown. Then for every single new log in that comes in, we're going to be able to know if it's known or unknown.

Maria Puertas Calvo: We don't really need to have seen that IP address before, we just need to compare it to the user history and then make that determination of whether this is known or unknown and that ends up being much more valuable for the model.
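
Illustrative sketch (not part of the recording): the feature engineering step Maria describes, replacing a raw IP address with "known or unknown for this user" plus how often and how recently it was seen, might look roughly like this. The `SignIn` record, the `ip_features` function, and the feature names are assumptions for the example; real sign-in logs carry many more fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SignIn:
    user: str
    ip: str
    time: datetime


def ip_features(event, history):
    """Describe event.ip relative to the same user's earlier sign-ins.

    The raw address is never handed to the model; only these derived
    values are, so an address nobody has seen before still gets a
    meaningful feature vector.
    """
    same_ip = [
        h for h in history
        if h.user == event.user and h.ip == event.ip and h.time < event.time
    ]
    if not same_ip:
        return {"ip_known": 0, "times_seen": 0, "days_since_last_seen": None}
    last_seen = max(h.time for h in same_ip)
    return {
        "ip_known": 1,
        "times_seen": len(same_ip),
        "days_since_last_seen": (event.time - last_seen).days,
    }


now = datetime(2020, 12, 1, 9, 0)
history = [
    SignIn("nic", "203.0.113.7", now - timedelta(days=2)),
    SignIn("nic", "203.0.113.7", now - timedelta(days=1)),
]

familiar = ip_features(SignIn("nic", "203.0.113.7", now), history)
unfamiliar = ip_features(SignIn("nic", "198.51.100.9", now), history)
print(familiar)
print(unfamiliar)
```

The addresses used here come from the ranges reserved for documentation, so they do not belong to any real host.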

Natalia: So just mapping out the journey you've talked about. So we've gone from heuristics signature based detections to user analytics and now we're in a space where we're actively using AI but continuously optimizing what we're delivering to our customers. So what's next after this new release of enhanced AI? What is your team working on?

Maria Puertas Calvo: So lots of things but one thing that I am really interested in that we're working on is making sure that we're leveraging all the intelligence that Microsoft has. So for example, we built a system to evaluate in real time the likelihood that a sign in is coming from an attacker. But all of that is just using the data that identity processes, like Azure Active Directory sign ins and what's happening in the Azure Active Directory infrastructure. But there's so much more that we can leverage from what is happening across the ecosystem, right? Like the user who signs into Azure Active Directory is probably also coming in from a Windows machine that probably has Microsoft Defender ATP installed on it. That is also collecting signal and it's understanding what's happening to the endpoint.

Maria Puertas Calvo: And at the same time, when the sign in happens then the sign in doesn't happen just to go to Azure AD, right? Azure AD is just the door of entry to everything: Azure, Office, you name it. Third party applications that are protected by things like Microsoft Cloud App Security. And all of the security features that exist across Microsoft are building detections and collecting data and really understanding in that realm, what are the security threats and what's happening to that user? So there is a journey, right? Of that sign in. It's not just what's happening in Azure AD but it's everything that's happening in the device. What's happening in the cloud and in the applications that are being accessed after.

Maria Puertas Calvo: So we're really trying to make sure that we are leveraging all that intelligence to enhance everything that we detect, right? And that way, the Microsoft customer will really benefit from being a part of the big ecosystem and having that increased intelligence should really improve the quality of our risk assessment and our compromise detections.

Nic: Maria, how much of this work that you talked about in the blog and the work that your team does is trying to mitigate the fact that some folks still don't have multi factor authentication? Is any of this a substitute for that?

Maria Puertas Calvo: We know from our own data studies that accounts that are protected by multi factor authentication, which means every time they log in, they need to have a second factor, those accounts are 99.9% less likely to end up compromised because even if their password falls in the hands of a bad actor or gets guessed or they get phished, that second factor is going to protect them and it's way more likely to stop the attack right there. So definitely, this is not supposed to be a substitute for multi factor authentication. Also, because of that, our alerts do not... They still will flag a user if the sign in was protected by multi factor authentication but the password was correct. Because even if there's multi factor authentication, we want to make sure that the user or the admin know that the password was compromised so they're able to reset it.

Maria Puertas Calvo: But the multi factor authentication is the tool that is going to prevent that attack. And you asked earlier about what's next in other future things and one thing that we're also really working on is, how do we move past just detecting these compromises with the password and using multi factor authentication as a mitigation of this risk, right? Like the way a lot of the systems are implemented today is if you log in and we think your log in is bad but then you do MFA, that is kind of like a reassuring thing that we committed a mistake, that was a false positive and that's a remediation event. But the more people move to more MFA and more passwordless, our team is starting to think more and more of what's the next step?

Maria Puertas Calvo: How are attackers going to move to attacking that multi factor authentication? It is true that multi factor authentication protects users 99.9% of the time today but as more people adopt it, attackers are going to try to now move to bypass our multi factor authentication. So there's many ways but the most popular multi factor or second factor that people have in their accounts is telephone based. So there's SMS or there's a phone call in which you just approve the sign in. There are phishing pages out there that are now doing what is called a real time man in the middle attack in which you put in your username and password, the attacker grabs it, puts it in the actual Azure AD site and then now you're being asked to put your SMS code in the screen. So the attacker has that same experience in their phishing site, you put in your code and the attacker grabs the code and puts it in the Azure AD sign in page and now the attacker has signed in with your second factor, right?

Maria Puertas Calvo: So two challenges that we're trying to tackle is, one, how do we detect that this is happening? How do we understand that when a user uses their second factor, that is not a mitigation of the risk? It's more and more possible with time that attackers are actually also stealing this second credential and using it, right? So we need to make more efforts in building those detections. And the second really big thing is, what then, right? Because if we actually think that the attacker is doing that, then what is the third thing that we ask you? Now you've given us a password, you've given us a second factor, if we actually think that this is bad, but it is not, what is the way for the user to prove that it's them, right?

Maria Puertas Calvo: So we need to move, and I think this is extremely interesting, we need to move from a world in which the password is the weak link and everything else is just considered good. Which today, it's very true. If you have a second factor, that is most likely going to be just fine but in the future, we need to adapt to future attacks in which this won't be the case. So we need to understand what is the order of security of the different credentials and what is the remediation story for attacks that are happening with these second factors.

Nic: I'd like to propose that third challenge, that third factor, should be a photograph of you holding today's newspaper doing the floss or some other sort of dance craze that's currently sweeping the nation.

Maria Puertas Calvo: Sure, we'll add it to the backlog.

Nic: I think that would just stamp out all identity theft and fraud. I think I've solved it.

Maria Puertas Calvo: You did. I think so.

Natalia: I think you'll be bringing back newspapers along with it.

Nic: Yes. Step one is to reinvigorate the print newspaper industry. That's the first step of my plan but we'll get there.

Natalia: So Maria, in your endeavors, how are you measuring success, for instance, of the new enhanced AI that your team has developed?

Maria Puertas Calvo: Yeah, so our team is extremely data driven and metric driven and everything we do, we're trying to improve on one metric, right? The overall team mission really is to reduce the amount of users who fall victim to account compromise or what we call unauthorized access. So we have a metric that we all review every single day, we have a huge dashboard that is everybody's homepage in which we see in the last three months, what percentage of our monthly active users fell victim to account compromise and our main goal is to drive that metric down. But that is really the goal of the whole team including the people who are trying to make users adopt MFA and conditional access and other types of security measures.

Maria Puertas Calvo: When we look into detection metrics and the ones like the AI detection metrics, we mostly play with those precision and recall metrics that are also explained in the blog. So precision is the percentage of all of the detected users or detected sign ins that you detected as bad that are actually bad, right? Out of everything that, let's say, you would block, how many of those were actually bad? So it really also tells you how much damage you're doing to your good customers. And the other one is recall and recall is out of all the bad activities that are out there, so let's say all the bad sign ins that happen in a day, how many of those does your system catch?

Maria Puertas Calvo: So it's a measure of how good you are at detecting those bad guys. And the goal is to always drive those two numbers up. You want to be really high precision and you want to be really high recall. So every time we have a new system and a new detection or whatever it is or we perform improvements in one of our detections, those are the two metrics that we use to compare the old and the new and see how much we've improved.
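
Illustrative sketch (not part of the recording): precision and recall as defined above can be computed from two parallel lists, one saying what the detection flagged and one saying what was actually bad according to the labels. The function name `precision_recall` and the eight sample sign-ins are made up for the example.

```python
def precision_recall(flagged, actually_bad):
    """Return (precision, recall) for parallel lists of booleans.

    precision: of everything we flagged, what fraction was really bad
    recall:    of everything really bad, what fraction did we flag
    """
    pairs = list(zip(flagged, actually_bad))
    true_pos = sum(1 for f, b in pairs if f and b)
    false_pos = sum(1 for f, b in pairs if f and not b)
    false_neg = sum(1 for f, b in pairs if not f and b)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Eight hypothetical sign-ins: what the detection flagged vs. the labels.
flagged      = [True, True, True,  False, False, False, True, False]
actually_bad = [True, True, False, True,  False, False, True, False]

precision, recall = precision_recall(flagged, actually_bad)
print(precision, recall)  # 0.75 0.75
```

In this sample, one good user was flagged (costing precision) and one attacker slipped through (costing recall), which are exactly the two kinds of mistake the team tries to drive down when comparing an old detection with a new one.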

Natalia: And how are we getting feedback on some of those measures? And what I mean by that is the first one you mentioned. So precision, when you're saying how many were actually bad and we need to figure out how many were the true positive? How do we know that? Are we getting customer feedback on that or is there a mechanism within the product that lets you know that it was truly a bad thing that was caught?

Maria Puertas Calvo: Yeah, so the same labeling mechanisms that I was talking about earlier, where we need those labels to be able to train our supervised machine learning models, we also need those labels in order to be able to evaluate the performance of those machine learning models. So knowing at least for a set of our data, how much is good and how much is bad and understanding what our systems are doing to detect the good and the bad. So one of the mechanisms is, as I was saying, the manual labeling that we have in place but the other one you mentioned is customer feedback, absolutely. Actually, one of the first things we did when we launched Identity Protection was to include feedback buttons in the product.

Maria Puertas Calvo: All of our detections actually go to an Azure Portal UX in the Identity Protection product and admins there can see all of the risky sign ins and all of the risky users and why they were detected as risky. Everything that my team is building gets to the customer through that product. And that's where the admin can click buttons like confirm safe or confirm compromised. Those are labels that are coming back to us. And users now also, there's a new feature in Identity Protection called My Sign-Ins. And users can go to My Sign-Ins and look at all the recent sign ins that they did and they can flag the ones that they think weren't them. So if they were compromised, they can tell us themselves, this was not me.

Maria Puertas Calvo: So that is another avenue for us to understand the quality of our detections. And then we're extremely customer obsessed as well. So it's not just the PMs in our team who have customer calls. The data scientists, many, many times get on calls with customers because the customers really want to understand what's the science behind all of these detections and they want to understand how it works. And the data science team also wants the feedback and to really understand what the customer thinks about the detection. If we're having false positives, why is that? It's really challenging too in the enterprise world because every tenant may have a different type of user base or different type of architecture, right?

Maria Puertas Calvo: We had a time that we were tracking... We always track what are the top 10 tenants that get flagged by the detections. For example, airlines used to be a big problem for us because they had so much travel that we had a lot of false positives, right? We were flagging a lot of these people because they're flying all over the world and signing in from all over the world. So it would trigger a lot of detections but there are other customers in which this is not the case at all. All of their users stay put and they're just only logging in from the corporate network because it's a very protected environment. So this quality of detections and this precision and recall can really vary customer by customer.

Maria Puertas Calvo: So that is another challenge that I think we need to focus more on in the future. How do we tune our detections in order to make them more granular depending on what the industry is or what type of setup the customer or the tenant has?

Nic: Changing subjects just a little bit and maybe this is the last question, Maria. I noticed on your Twitter profile, you refer to yourself as a guacamole eater. I wondered if you could expand upon that. There are very few words in your bio but there's a lot of thought gone into those last two words. Tell us about eating guacamole.

Maria Puertas Calvo: Well, what can I say? I just really love guacamole. I think I may have added that about a year ago, I was pregnant with my twins who were born five months ago and when you're pregnant with twins they make you eat a lot of calories, about 3000 calories a day. So one of the foods that I was eating the most was guacamole because it's highly nutritious and it has a lot of calories. I went on a quest to finding the best recipe for guacamole and-

Nic: Okay, walk us through your best guacamole recipe. What's in it?

Maria Puertas Calvo: Absolutely. So the best guacamole recipe has obviously avocado and then it has a little bit of very finely chopped white onion, half jalapeno, cilantro and lime and salt. That's it.

Nic: No tomatoes?

Maria Puertas Calvo: No tomatoes. The tomatoes only add water to the guacamole, they don't add any flavor.

Nic: What about then a sun dried tomato? No liquid, just the flavor? Is that an acceptable compromise?

Maria Puertas Calvo: Absolutely not. No tomatoes in guacamole. The best way to make it is, you first mash the jalapeno chili with the cilantro and the onion almost to make a paste and then you mix in the avocado and then you finally drizzle it with some lime and salt.

Nic: Hang on. Did you say garlic or no garlic?

Maria Puertas Calvo: No garlic, onion.

Nic: No garlic, I see. So the onion is the substitute for I guess that's a savoriness? I don't know how you classify... What's garlic? Is it Umami? I don't know the flavor profile but no garlic? Wow, I'm making guacamole when I'm at my house.

Natalia: Well, you heard it here first guys. Maria's famous guacamole recipe.

Nic: I think we'll have to publish this on Twitter as a little Easter egg for this episode. It'll be Maria's definitive guacamole recipe.

Maria Puertas Calvo: Now the secret is out.

Nic: Well, Maria, thank you so much for your time. This has been a fantastic chat I think. I have a feeling we're going to want to talk to you again on the podcast. I think we'd love to hear a bit more about your personal story and I think we'd also love to learn more about some of the AI techniques that you talked to us about but thank you so much for your time.

Maria Puertas Calvo: Yeah, of course, this was a pleasure. I had a great time and I'll come back anytime you want me. Thank you.

Natalia: And now let's meet an expert from the Microsoft Security Team to learn more about the diverse backgrounds and experiences of humans creating AI and tech at Microsoft. Today, we're joined by Geoff McDonald, who joined us on a previous episode, Unmasking Malicious Scripts with Machine Learning, to talk to us about the Antimalware Scan Interface, or AMSI. Thank you for joining us again on the show, Geoff.

Geoff McDonald: Yeah. Thank you very much. I really enjoyed being here last time and excited to be here again.

Natalia: Great. Well, why don't we start by just giving a quick refresher to our audience? Can you share what your role and day to day function is at Microsoft?

Geoff McDonald: I lead a team of machine learning researchers and we build our machine learning defenses for Microsoft Defender antivirus product. So we built lightweight machine learning models which go into the antivirus product itself which run on your device with low memory and lower CPU costs for inference. We also deploy a lot of machine learning models into our cloud protection platform where we have clusters of servers in each region around the world. So that when you're scanning a file or behavior on your device, it sends metadata about the encounter up to our cloud protection in real time to the closest cluster to you. And then we do real time running of all of our machine learning models in the cloud to come back with a decision about whether we should stop the behavior or attack on your device.

Geoff McDonald: So we're a small team of probably about five of us. We're a mix of threat researchers and machine learning and data science experts. And we work together to design new protection scenarios in order to protect our customers using machine learning.

Nic: Geoff, when you go to a security conference, some kind of industry get together, do you describe yourself as a machine learning engineer? What do you use when you're talking to other security professionals in your field? Is machine learning... Is it sort of an established subcategory or is it still sort of too nascent?

Geoff McDonald: Yeah. I used to call myself maybe a threat researcher or a security researcher when I would present at conferences and when I would introduce myself. But I'd say nowadays, I'd be more comfortable introducing myself as a data scientist because that's my primary role now. Although I come from a very strong background in the security and security research aspect, I've really migrated to an area of work where really machine learning and data science is my primary tool.

Natalia: What's driven that change? What prompted you to go deeper into data science as a security professional?

Geoff McDonald: So when I first started at Microsoft, I was a security researcher. So I would do a reverse engineering of the malware itself. I would do heuristics, deep analysis of the attacks, and threat families and prepare defenses for them. So I think learning pretty early on while doing all the research in response to these attacks, it was very clear that the human analysis and defense against all these attacks was really not scalable to the scale that we needed. So it really had to be driven by automation and machine learning, in order to be able to provide a very significant protection level to our customers. So I think that really drove the natural solution where all these human resources, these manual analysis doesn't scale to where we need it to be and where we want our protection level to be.

Geoff McDonald: So it really encouraged finding the automation and machine learning solution. And I have previously had some experience with machine learning. At the time, it was kind of a natural fit where I began a lot of exploration of the machine learning application to protect it against these threats and then pivoted into that as my primary role eventually, as it was quite successful.

Natalia: So your unique set of skills, data science and security, is one that's definitely sought after in the security space. But considering the fact that we're still trying to fill just security jobs, it's definitely a challenge. So do you have any recommendations for companies that are looking for your set of skills and can't find a unicorn like yourself that has both? And if they're looking for multiple people, how should these teams interact so that they're leveraging both skills to protect companies?

Geoff McDonald: When we look to fill new positions on our team, we try to be really careful to try to be as inclusive as possible to a lot of different candidates. So when we're pushing our new data science positions where we're looking for the data science experience, like in the machine learning and data science application, you'll see in our job applications, we don't actually require cybersecurity experience for our job positions. We're really looking for someone who has a really great understanding of the data and good understanding of ML. And being able to have a strong coding background in order to be able to implement these pipelines and machine learning models and try out their experiments and ideas in ways that they can implement and take them end to end to deploying them.

Geoff McDonald: So really, for people that were looking to join our team, often, you don't actually necessarily have to have a background in cybersecurity for all of our positions. Sometimes we're looking for really strong data scientists who can pick up the basics of security and apply it in a very effective way. But we would also want our team have different sets of people who are more experienced in the security background to help drive some of the product and feature and industry and security trends for the team as well. Our team currently has quite a mix of backgrounds where there's some threat researchers and there's some pure data scientists who have come from related fields who actually haven't come from a cybersecurity background specifically.

Nic: I wonder if we can back it up. If we can go back in time and start with you, your story, how did you first get into security, get interested in security? Did it start in elementary school? Did it start in high school? Did it start in college? Did you go to college? Can we back up and learn about the young Geoff McDonald?

Geoff McDonald: I grew up in a small town near Calgary, Alberta, Canada. I guess it started with my family being a software developing family, I would say. Like my dad had his own software company and as a result, we were really lucky to have the opportunity to learn to code from a young age. So, we would see our dad coding, we knew that our dad coded so we're really interested in what he was doing and we wanted to be able to learn and participate.

Nic: When was that, Geoff? We're talking in the 80s, 90s?

Geoff McDonald: So that would be when I was probably around 10 years old when I started coding. And that would be I guess, 96 or so.

Nic: I'm trying to learn like was that on some cool, old Commodore 64 hardware or were we well and truly in the x86 era at that point?

Geoff McDonald: Yeah. I mean, an x86 I do believe. So it's just Visual Basic, which is a very simple coding language. The classic Visual Basic 6.0. We were really lucky to be able to learn to code at a pretty young age, which is awesome. And although my brother went more into... My older brother was about two years older, a big influence on me coding wise as well. He was really into making, you might say, malware. We both had our own computers, and we had often tried to break into each other's computers and do things. My brother created some very creative hacks, you can say. Like, one thing I remember is he burned a floppy disk, which would have an autorun on it, and the way that I'd protect my computer is a password protected login.

Geoff McDonald: But back in those days, I think it was Windows 98 at the time, it really wasn't a secure way of locking your computer where you have to type in your password. You could actually insert a diskette and it would run the autorun and you could just terminate the active process. So my brother created this diskette and program, which would automatically be able to bypass my security protocols and my computer, which I thought was pretty funny.

Nic: Is he still doing that today? Is he still red teaming you?

Geoff McDonald: No. Not red teaming me anymore, luckily.

Natalia: So what point were you like, "Well, all of these things that I've been doing actually apply to something I want to be doing for a career?"

Geoff McDonald: Yeah. So although I was in a really software development friendly household, my dad was really concerned about the future of software development. He was discouraging us from going into software development as a primary career path at the time. Going into university I was mostly considering between engineering and business. I ended up going into engineering because I really liked the mathematical aspect of my work and it is a mix of coding and math, which is kind of my two strong suits. So I went into an electrical engineering program. During my four years of electrical engineering is when I really changed from doing game hacking as my hobby to doing software development for reverse engineering tools. So as my hobby, I would create reverse engineering tools for others to use in order to reverse engineer applications. So I went to university in Calgary, Alberta there. And in Alberta, the primary industry of the province is oil and-

Nic: Is hockey.

Geoff McDonald: Good one. Yeah. So in Alberta, the primary industry in the sector is really oil and gas. There's a lot of oil and gas, pretty much all engineers when they graduate, the vast majority go into the oil and gas industry. So really, that's what I was thinking of that I'd probably be going into after I graduate. But either way, I continued the reverse engineering tool development, I did some security product kind of reverse engineering ideas as well. Approaching graduation, I was trying to figure out what to do with my life. I loved control systems, I loved software development, I loved the mathematical aspects and I want to do grad school. So then I looked at programs in security because my hobby of reverse engineering security, I didn't really take very seriously as a career.

Geoff McDonald: I didn't think it could be a career opportunity, especially being in Alberta, Canada where oil and gas is the primary sector, there's not much in the way of security industry work to be seen as far as I could tell at the time in the job postings and job boards. So I ended up going for a master's in control systems continuing electrical engineering work. So basically, it's more like signal processing work where you're doing analyzing signals doing fault detection, basically, mount vibration sensors to rotating machines was my research. And then from the vibration signal, you're trying to figure out if there's a fault inside the motor or the centrifuge or the turbine or whatever it's attached to.

Geoff McDonald: And in that field, there was a lot of machine learning in the research area. So that's where I got my first exposure to machine learning and I loved machine learning but that wasn't my primary research focus for my topic. And then approaching graduation, I started looking at jobs and I happened to get really lucky at the time that I graduated because there happened to be a job posting from Symantec in Calgary. And when looking at the requirements for the job posting, it had all of the reverse engineering tools and assembly knowledge and basically everything I was doing as a hobby, everything I had learned through game hacking and developing these reverse engineering tools. It was looking for experience in OllyDbg and assembly. I'm like, "Oh, my goodness. I have all those skills. I can't believe there's actually a job out there for me where I could do my hobby as a career." So I got really lucky with the timing of that job posting and so began my career in cybersecurity instead of oil and gas.

Nic: So you talked about adding sensors to, I guess, oil and gas related sort of instrumentation. And then there was some machine learning involved in there. Is that accurate? So can you expand upon that a little bit? I'd love to learn what that looked like.

Geoff McDonald: So basically, the safety of rotating machines is a big problem. There was an oil and gas facility actually in Alberta which has centrifuges which spins the... I'm sure I'm not using the right terminology, but it spins some liquid containing gas to try to separate the compounds from the water, I think. And they had one of these... Actually, the spindle of the centrifuge broke and then it caused an explosion in the building and some serious injuries. So it was really trying to improve the state of the art of the monitoring of the health of a machine from the mounted accelerometers to them.

Geoff McDonald: Two of the major approaches were machine learning, where you basically create a whole bunch of handcrafted features based on many different techniques and approaches and then you apply a neural network or SVM or something like that to classify how likely it is that the machine is going to have a failure or things like that. Now, I think at the time the machine learning was applied but it wasn't huge in the industry yet because machine learning in application to signals, that was, especially in convolutions, not as mature as it is now. The area I was working on was de-convolutions. A lot of machine learning models involve doing... At least a lot of machine learning models nowadays would approach that problem as a convolutional neural network. The approaches that I was working on were called de-convolution approaches.

Geoff McDonald: So I was able to get a lot of very in depth research into convolutions and what the underlying mean. And that has helped a lot with the latest model architectures where a lot of the state of the art machine learning models are based on convolutions.
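
To make the "handcrafted features" idea above concrete, here is a minimal Python sketch. It is not code from Geoff's research: the choice of features (RMS, crest factor, kurtosis), the synthetic signals, and the name `handcrafted_features` are all assumptions for illustration. The idea is only that a few summary numbers computed from an accelerometer trace change when periodic impacts appear, and those numbers are what a classifier such as an SVM or a neural network would then consume.

```python
import math


def handcrafted_features(signal):
    """Summarize a vibration trace with three illustrative statistics."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    variance = sum(c * c for c in centered) / n
    rms = math.sqrt(sum(s * s for s in signal) / n)
    peak = max(abs(s) for s in signal)
    # Kurtosis grows when the signal contains sharp, isolated spikes.
    kurtosis = (sum(c ** 4 for c in centered) / n) / (variance ** 2)
    return {"rms": rms, "crest_factor": peak / rms, "kurtosis": kurtosis}


# Hypothetical data: a smooth rotation signal, and the same signal
# with an impact added once every 32 samples.
healthy = [math.sin(2 * math.pi * i / 16) for i in range(256)]
faulty = [s + (3.0 if i % 32 == 0 else 0.0) for i, s in enumerate(healthy)]

healthy_features = handcrafted_features(healthy)
faulty_features = handcrafted_features(faulty)
print(healthy_features)
print(faulty_features)
```

In this toy setup the crest factor and kurtosis of the trace with impacts come out higher than those of the smooth trace, which is the kind of separation a downstream classifier relies on.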

Natalia: So what is a convolutional neural network? Can you define what that is?

Geoff McDonald: So convolution is basically where you're applying a filter across the signal. It could be an image or it could be a one dimensional signal. So in this case, it's a one dimensional signal where you have... Well, at least it's a one dimensional signal if you have a single accelerometer on a single axis for the machine. You think of it like the classic ECG where you have a heartbeat going up and down. It's kind of like that kind of signal you can imagine which is the acceleration signal. And then you basically learn to apply a filter to it in order to maximize something. What filter you apply can be learned in different ways. So in a convolutional neural network, you might be learning the weights of that filter, how that filter gets applied based on back propagation through whatever learning goal you're trying to solve.

Geoff McDonald: In a typical CNN model, you might be learning something like 1000 of these filters where you're adjusting the weights of all these filters through back propagation according to... To try to minimize your loss function. I guess in my research area, I was working to maximize, design a filter through de-convolution to maximize the detection of periodic spikes in the vibration signal. Meaning that something like an impact is happening every cycle of the rotor, for example.
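
The "applying a filter across the signal" step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Geoff's method: the signal is synthetic, the three filter weights are picked by hand rather than learned through back propagation, and the names `convolve1d`, `kernel`, and `threshold` are made up for the example. As in CNN layers, the filter is slid across the signal without being flipped.

```python
def convolve1d(signal, kernel):
    """Slide `kernel` across `signal` with no padding ('valid' mode).

    Each output value is the dot product of the kernel with one window
    of the signal, which is how convolutional layers apply a filter.
    """
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(n - k + 1)
    ]


# Hypothetical one-axis accelerometer trace: a small ripple,
# plus an impact once every 8 samples (one per rotor cycle).
signal = [0.1 if i % 2 else -0.1 for i in range(32)]
for i in range(4, 32, 8):
    signal[i] += 1.0

# Hand-picked filter that responds to a spike between two quiet samples.
# A CNN would learn weights like these instead of having them chosen.
kernel = [-0.5, 1.0, -0.5]

response = convolve1d(signal, kernel)
threshold = 0.5  # assumed cut-off for calling a response an impact
impacts = [i + 1 for i, r in enumerate(response) if r > threshold]
print(impacts)
```

The filter response stands out only at the samples where an impact was injected, so the spacing between detected positions recovers the period of the fault.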

Nic: Well, so convolution is a synonym for sort of complexity. So de-convolution, is that an oversimplification to say that it's about removing complexity and sort of filtering down into a simpler set? Is that accurate?

Geoff McDonald: I wouldn't say it's so similar to the English language version of it. It's a specific mathematical operator that we apply to a signal. So it's kind of like you're just filtering a signal. And de-convolution is sort of like de-filtering it. It's my best way to describe it.
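
The "de-filtering" description can be illustrated with the simplest case, where the filter is known in advance. This is only a sketch of that textbook case, written in Python with made-up names (`convolve_full`, `deconvolve`) and made-up numbers. The research Geoff describes is about designing the filter itself to bring out periodic spikes, which this example does not attempt.

```python
def convolve_full(signal, kernel):
    """Filter `signal` with `kernel`, keeping every overlapping position."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def deconvolve(observed, kernel):
    """Undo convolve_full for a known kernel whose first weight is non-zero.

    Works sample by sample: subtract the contribution of samples that
    were already recovered, then divide by the leading kernel weight.
    """
    n = len(observed) - len(kernel) + 1
    recovered = []
    for i in range(n):
        acc = observed[i]
        for j in range(1, len(kernel)):
            if i - j >= 0:
                acc -= kernel[j] * recovered[i - j]
        recovered.append(acc / kernel[0])
    return recovered


# Two sharp impacts, smeared by a hypothetical machine response.
impacts = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
smearing = [1.0, 0.5, 0.25]

observed = convolve_full(impacts, smearing)
recovered = deconvolve(observed, smearing)
print(observed)
print(recovered)
```

Filtering spreads each impact over several samples, and de-filtering with the same known weights returns the original sharp impacts.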

Nic: Oh, right. Okay, interesting. De-filtering it. Could you take a stab at just giving us your sort of simplest if possible definition of what a neural network is?

Geoff McDonald: Okay. A simplest stab of a neural network, okay.

Nic: And Geoff, there are very few people we've asked that question of, but you're one of them.

Geoff McDonald: Okay, cool. When you look at the state of the art, you'll actually find that neural networks themselves are not widely used for a lot of the problems. So when it comes to like a neural network itself, the best way I might describe it is that it's basically taking a bunch of different inputs and it's trying to predict something. It could be trying to predict the future stock price of Tesla, for example, if they're trying to predict whether Tesla's going to go up or down or they could be trying to predict it. Especially in our Microsoft defender case, we're trying to predict, "Based on these features, is this malicious or not?" Is our type of application.

Geoff McDonald: So it's going to mean taking a whole bunch of inputs like, "Hey, how old is this file in the world? How prevalent is this file in the world? What's its file size? And then what's the file name?" Well, maybe I'll say, "Who's the publisher of this file?" Well, it's going to take a whole bunch of inputs like that and try to create a reasoning... It's going to try to learn a reasoning from those inputs to whether it's malware or not as the final label. We do it through a technique called back propagation because we have, imagine, a million encounters where we have those input features. So then we use these known outputs and inputs in order to learn a decision logic to best learn how to translate those inputs to whether it's malware or not.

Geoff McDonald: So we do this through a lot of computers or sometimes GPUs as well in order to learn that relationship. And a neural network is able to learn nonlinear relationships and co-occurrences. So for example, it's able to learn a logic like: is it more than 10,000 file size? And is the publisher not Microsoft? And the age is less than seven days? Then we think it's 70% malicious. So it's able to learn sort of more complex logic like that, where it can create AND conditions and create more complex logic depending on how many layers you have in that neural network.
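
As a rough illustration of learning a rule like the one above from labelled examples, here is a minimal Python sketch of a single neuron trained with gradient descent, the one-layer case of back propagation. Everything in it is an assumption for illustration: the three yes/no inputs mirror the spoken example, the eight training rows are made up, and the names `encode`, `train`, and `predict` are hypothetical. It is not how Microsoft Defender's models are built.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(file_size, publisher, age_days):
    """Turn raw metadata into the three yes/no inputs from the example."""
    return (
        1.0 if file_size > 10_000 else 0.0,
        1.0 if publisher != "Microsoft" else 0.0,
        1.0 if age_days < 7 else 0.0,
    )


def predict(weights, bias, x):
    """Output of one neuron: a score between 0 and 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(samples, labels, epochs=5000, learning_rate=1.0):
    """Full-batch gradient descent on log loss for a single neuron."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y  # gradient of log loss w.r.t. z
            grad_b += error
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up training set: every combination of the three inputs,
# labelled malicious (1.0) only when all three conditions hold.
samples = [tuple(float(v) for v in bits) for bits in product((0, 1), repeat=3)]
labels = [1.0 if all(x) else 0.0 for x in samples]

weights, bias = train(samples, labels)
suspicious = predict(weights, bias, encode(250_000, "Unknown Corp", 2))
benign = predict(weights, bias, encode(250_000, "Microsoft", 2))
print(f"suspicious={suspicious:.2f} benign={benign:.2f}")
```

A single neuron is enough for one AND rule because that rule can be separated with a straight line. Rules that combine several such conditions are where the additional layers mentioned above come in.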

Natalia: Do you think there's a future for neural networks? It sounds like right now you see a specific set of use cases like image recognition but for other use cases it's been replaced. Do you think the cases you described right now like image recognition will eventually be replaced by other techniques other than neural networks?

Geoff McDonald: I think they'll always play a role or derivatives of them will play a role. And it's not to say that we don't use neural networks at all. Like in our cloud protection platform, you'll find tons of logistic regression single neuron models, you'll find GBM models, you'll find random forest models. And we've got our first deep learning models deployed. Some of our feature sets have a lot of rich information to them and are really applicable to the CNN, the convolutional neural network model architecture and for those, we will have a neural network at the end of the month. So it still definitely plays its specialty role but it's not necessarily what's driving the bulk of protection. And I think you'll probably find the same for most machine learning application scenarios around the industry. That neural network is not key to most problems and that it's not necessarily the right tool for most problems but it does still play a role and it definitely will continue to play a role or derivatives of it.

Nic: My brain's melting a bit.

Natalia: I want to ask for a definition of almost every other term but I'm trying to hold back a bit.

Nic: Yeah, I've been writing down like 50 words that Geoff has mentioned like, "Nope, I haven't heard that one before. Nope, that one's new." I think, Geoff, you've covered such a lot of fascinating stuff. I have a feeling that we may need to come back to you at other points in the future. If we sort of look ahead more in general to your role, your team, the techniques that you're sort of fascinated in? What's coming down the pike? What's in the future for you? Where are you excited? What are you focused on? What are you going to see in the next six, 12, 18, 24 months?

Geoff McDonald: One of the big problems that we have right now is adversaries. So what malware attackers do is that they build new versions of their malware then they check if it's detected by the biggest antivirus players. And then if it's detected by our AV engines, what they do is they keep building new versions of their malware until it's undetected. And then once it's undetected, they attack our customers with it and then repeat. So this has been the cat and mouse game that we've been in for years, for 10 years at least. Now, what really changed about six years ago is that we put most of our protection into our cloud protection platform. So if they actually want to check against our full protection, and especially our machine learning protection, they have to be internet connected so they can communicate with a real time cloud machine learning protection service.

Geoff McDonald: And what this means is if they want to test their malware against our defenses before they attack our customers, it means that they're going to be observable by us. So we can look at our cloud protection logs and we can see, "Hey, it looks like someone is testing out their attack against our cloud before they attack our customers." So it makes them observable by us because they can't do it in a disconnected environment. Originally, when we came out with cloud protection, it seems like the adversaries were still testing in offline environments. Now we've gotten to the point where so many of the advanced adversaries as well as commodity adversaries are actually pre-testing their attacks against our cloud defenses before they attack our customers. And this introduces a whole bunch of adversarial ML and defensive strategies that we're deploying in order to stay ahead of them and learn from their attacks even before they attack our customers.

Geoff McDonald: So we have a lot of machine learning and data science where we're really focused on preventing them from being able to effectively test with our cloud as a way to get an advantage when attacking customers. So that's one that we have a lot of work going into right now. A second thing that I really worry about for the future, this is like the really long term future, hopefully it won't be a problem for at least another decade or two or even hopefully longer. But having reinforcement learning, if we have some big breakthroughs, where we're able to use reinforcement learning in order to allow machine learning to learn new attacks by itself and carry out attacks fully automated by itself by rewarding it.

Geoff McDonald: Luckily, right now, our machine learning or reinforcement learning state of the art is not anywhere close to the technology that would be needed to be able to teach an AI agent to be able to learn new attacks automatically and carry them out effectively. At least nowhere close to the effectiveness of a human at this point. But if we get to the level of effectiveness where we can teach an AI to come up with and explore new attack techniques and learn brand new attack techniques and carry out the attacks automatically, it could change the computing world forever, I think. We might be almost going back to the point where we have to live on disconnected computers or extremely isolated computers somehow but it would be kind of like a worst case scenario where machine learning has allowed the attackers to get to the point where they can use AI to automate everything and learn new attack techniques, learn new exploits, and et cetera, entirely by itself which would be a humongous problem for defensiveness.

Geoff McDonald: And there's a lot of ongoing research in this right now but it's very much on the defensive side where, "Hey, we're going to use reinforcement learning to teach an attacker so that we can learn from defending against it automatically." That hypothesis is great but it's been created with the goal of trying to improve our defenses. But actually, it's also building the underlying methods needed in order to carry out attacks automatically by itself. And I think if we get to that point, it's a really big problem for security. It's going to revolutionize the way computer security works.

Nic: Well, hopefully, Geoff, you and your colleagues remain one or two steps ahead in that particular challenge?

Geoff McDonald: Yeah, we will.

Nic: I hope you share that goal. Geoff, what are you and your team doing to make sure that you stay ahead of your sort of adversarial counterparts that are looking to that future? What gives you hope that the security researchers, the machine learning engineers, the data scientists are, hopefully, multiple steps ahead of adversaries out there?

Geoff McDonald: I think our adversary situation is much better than it used to be back in the day. Back in the day, they'd be able to fully test our defenses without us even being able to see it. And now that we've forced them into the game of evading our cloud protection defenses, it allows us to observe them even before they attack our customers. So the defenses we have in place that we've already shipped as well as a lot of what we have planned is really going to be a real game changer in the way that we protect our customers, where we can actually protect them even before our customers are attacked. So we're in a much better defensive situation since we're able to observe them before they attack our customers nowadays.

Natalia: Thank you, Geoff, for joining us on today's show. As always, it was fantastic chatting with you and like Nic said, we definitely need to have you back on the show.

Geoff McDonald: Thank you very much. Really love being on here.

Natalia: Well, we had a great time unlocking insights into security from research to artificial intelligence. Keep an eye out for our next episode.

Nic: And don't forget to tweet us @MSFTsecurity or email us at securityunlocked@microsoft.com with topics you'd like to hear on a future episode. Until then, stay safe...

Natalia: Stay secure.

HOST(S):

Nic Fillingham likes to ask questions and find out how stuff works. For over 15 years Nic has worked at Microsoft on Xbox, Windows, developer tools, Microsoft 365 and Security. A transplant from Australia, Nic lives just outside of Seattle on a small farm with his family and too many guitars.

Natalia Godyla is an award-winning B2B product marketer and speaker, currently in the Security Product Marketing group at Microsoft. She specializes in cybersecurity marketing and has a Sec+ certification. Fun fact: Natalia is also a published poet and founder of Rebel Data.

Schedule: Wednesdays

Credits: Executive Producer is Bruce Bracken, Producer is Rob Petrillo, Production Manager is Max Solomon, and our Audio Engineer (and magician) is none other than The Great Rich Cerbini.

Creator: Microsoft