What the Fuzz?!

Transcript

Nic Fillingham: Hello, and welcome to "Security Unlocked," a new podcast from Microsoft where we unlock insights from the latest in news and research from across Microsoft security, engineering and operations teams. I'm Nic Fillingham.

Natalia Godyla: And I'm Natalia Godyla. In each episode, we'll discuss the latest stories from Microsoft security, deep dive into the newest threat intel research and data science.

Nic Fillingham: And profile some of the fascinating people working on artificial intelligence in Microsoft security.

Natalia Godyla: And now let's unlock the pod.

Natalia Godyla: Hello, everyone. Welcome to another episode of "Security Unlocked." Today we have Edir Garcia Lazo, an AI engineer at Microsoft, joining us to talk about a recent blog he authored called "Combing Through the Fuzz: Using Fuzzy Hashing and Deep Learning to Counter Malware Detection Evasion Techniques." Don't let that scare you - the big, heavy terms - deep learning, fuzzy hashing. One of the best parts of this episode is how well Edir describes these technical concepts. He has made it super accessible to those who, like myself, don't have data science or engineering background.

Nic Fillingham: Plus one Natalia, yeah, absolutely. I don't have a data science background. I'm not a software engineer by trade. In fact, I don't even know what I do. What do I do, Natalia?

Natalia Godyla: (Laughter) Mostly talk.

Nic Fillingham: Yeah, mostly talk and make jokes that get cut by the editor. But not today.

Natalia Godyla: (Laughter).

Nic Fillingham: I think you're actually going to hear some of my champagne humor. I love this conversation with Edir. We've talked about deep learning multiple times. We've talked about machine learning, you know, maybe in more than half of our episodes. And it was fascinating to really further understand this concept of deep learning and the relationship between sort of multilayered machine learning systems.

Nic Fillingham: Another thing I'll point out is - you'll probably hear in the very beginning of the interview - one of the first questions I asked Edir was, hey, I've heard this fuzz word before. And I didn't know a lot about this topic. And so I sort of asked upfront to Edir, is fuzzing or, you know, fuzz testing - is that the same or related to fuzzy hashes? And Edir was very gracious in his response.

Nic Fillingham: But I think we sort of got to the fact that they're not really related. They're essentially different things. Fuzz testing is about putting sort of arbitrary or random data into a sort of a computer program to see what might happen. You may even do it on an automated scale, sort of see if you can make the program crash and identify vulnerabilities, et cetera, et cetera. Fuzzy hashing, which I won't give away, is something very different, but a really cool technique that Edir is utilizing to help identify polymorphic malware as well as spam and other bad stuff out there in cyberspace. It's a fantastic interview. I love this episode.

Natalia Godyla: As do I. As I've joked to you, Nic, I will definitely be listening to this one again myself. So...

Nic Fillingham: We need the ratings. That's good.

Natalia Godyla: (Laughter) That's what I'm doing it for. All right. With that, on with the pod.

Nic Fillingham: On with the pod.

Nic Fillingham: Welcome to the "Security Unlocked" podcast, Edir Garcia Lazo. Edir, thank you very much for joining us. Thanks for your time.

Edir Garcia Lazo: Hello, Nic. Hello, Natalia. Thank you for having me.

Nic Fillingham: Now, first and foremost, how was my pronunciation on your name? Do I need to do that again?

Edir Garcia Lazo: No. That was perfect (laughter).

Nic Fillingham: All right, excellent. Thank you so much for joining us. Today on the podcast, we're going to talk about fuzzing and fuzzy hashing and deep learning and malware detection and evasion. I'm so excited for this. Edir, you're joining us, in part, because you authored a blog post from July 27, 2021, called "Combing Through the Fuzz: Using Fuzzy Hashing and Deep Learning to Counter Malware Detection Evasion Techniques." I loved your blog. So first of all, thanks for writing a fantastic blog. It was super informative. I learned a lot. I have two pages of questions, which - this is a podcast, so you can't see me holding my questions up to the camera.

Natalia Godyla: (Laughter).

Nic Fillingham: But I'm showing it to everyone else on the call. And I'm really excited to jump into this and learn more. You know, this is your first time on the podcast, so maybe before we jump into the blog, could you introduce yourself to the audience? Tell us a bit about what you do at Microsoft, what your team does. What does your day-to-day look like?

Edir Garcia Lazo: Yeah, sure. Absolutely I can. I can do that. Well, I am a data scientist, and I work for the Microsoft Defender cybersecurity artificial intelligence team. We call it, for short, cyber-SI. Within that team, there are, like, three sub-teams - like, sibling teams. And from there, I belong to the malware classification team. And these are all, like, fascinating and awesome, multidisciplinary and very diverse teams. And I mostly specialize on writing cloud machine learning models. So that means all of the intelligence that the Defender client will reach out for, I make a few of those models. And my team is in charge of, like, building and maintaining those models.

Edir Garcia Lazo: I guess as in what my day-to-day looks like, I think, like many people in our team, I basically catch math with bad guys (laughter). That's a joke. So...

Nic Fillingham: No, we love it. That's a common theme that we've had on the podcast here. You catch bad guys with math. That's awesome.

Edir Garcia Lazo: Exactly. So I'm just, like, making a twist of that.

(LAUGHTER)

Edir Garcia Lazo: So, no. I mean, in reality, my day-to-day - like, it looks like trying to familiarize myself with the threat landscape, either looking at threats myself or sometimes with the help of, like, either threat hunters, reverse engineers, security researchers. And, like, the point of this is just to turn these threats into, like, data or just basically look at data that we might already have on these threats. You know, also make sure that performance metrics are OK, looking OK for our customers, like, to try to keep low false positive rates, to try to keep low known false negatives. This is kind of, like, an interesting one because the minute you know that you have a false negative, it stops being a false negative, right? Like, so it's, like, unknown false negatives.

Edir Garcia Lazo: And also, obviously, like, prototyping, developing, writing the code for, like, machine learning models whose goal is to increase the protection for customers. You know, so I really am, like, down there in the trenches. Like, what I say - I call it - I make the joke also of, like, I'm the one debugging the tensor slices, which is, like, a deep learning thing. When you're trying to make it work for the first time, you know, you need to make sure that all of the linear algebra math, you know, like, maps. And I am the one doing that type of work (laughter).

Nic Fillingham: Edir, LinkedIn tells me that you've been at Microsoft for over 10 years.

Edir Garcia Lazo: Yes, that's right.

Nic Fillingham: And in the security space that entire time?

Edir Garcia Lazo: Yes, exactly - not necessarily Defender, but yes, in cybersecurity. Absolutely.

Nic Fillingham: What was your first role at Microsoft when you joined?

Edir Garcia Lazo: I was actually - I've jumped roles three times. Like, my very first role was actually a software development engineer in test. Like, back in the day, we used to have, like, this role that was SDET. So it's a similar role to, like, traditional development. But you were, like, just in charge of, like, writing test cases and trying to break things. So that was what I originally got hired for. And I did that for a couple of years, and then, I guess, eventually transitioned over to, like, just a regular software development engineer. And in the last couple of years, I have transitioned to, like, a data science position. But I have been, all of my career, in cybersecurity and most of my career at Microsoft, really.

Natalia Godyla: And, Edir, today, as Nic mentioned, we're talking about a blog that you authored called "Combing Through the Fuzz: Using Fuzzy Hashing and Deep Learning to Counter Malware Detection Evasion Techniques." Equally thrilled to be talking about this blog today. Let's start with the impetus for this initiative. So what prompted the team to look at fuzzy hashing and deep learning as techniques for detection as opposed to some of the more traditional methods, like signature-based detection?

Edir Garcia Lazo: Yeah, that's a good question. So I feel like the answer to that - it's mostly driven by polymorphic malware and adversaries. So it turns out that traditional methods are, like, relatively good to capture known malware. Like, it's just, as you mentioned, in the form of signatures, which are, like, these small definitions that are, like, shipped to your clients or, AKA, your actual Defender in your Windows machine. And if it's something that we've literally seen before, then it's just as easy as comparing the SHA-256, which is, like, a fingerprint of the malware. Things get tricky when it's stuff that we have not explicitly seen before. However, oftentimes, we have seen things similar to that. And this is where fuzzy hashing jumps in.

Edir Garcia Lazo: And why deep learning? Because deep learning has been excelling in, like, tasks like natural language processing. So we are just kind of, like, curious, from the research point of view, to understand how many of these techniques are applicable to, like, security data sets.

Nic Fillingham: So the first word you mentioned there that I want to jump in on is sort of polymorphic - polymorphic malware. So poly - multiple - morphic - change - multiple changing malware. Is that - am I on the right track?

Edir Garcia Lazo: Yes, absolutely. Yeah. Polymorphic malware really means that - again, because adversaries are trying to evade detection, basically what they will do is, like, they will sometimes recompile the code in, like, a different way or, like, change something very minimal in the actual malware payload. When they do that, what ends up happening is, like, this creates, like, a whole different SHA-256 because of some of the mathematical properties that I talk about on the blog post as well. So if you just change one character in the malicious payload or in the program as a whole, then you will get, like, a radically different SHA-256 - again, a different fingerprint. So that just effectively - you can no longer just, like, look up on a table and it's, like, oh, this is something we've seen before. And that's a problem, right?

Edir Garcia Lazo: So I think, like, a lot of commodity, like, script kiddie malware, that you just download off the web and try to send to someone - it's, for the most part, already known, and it should be blocked by us and by pretty much any other decent AV. But the tricky part is, like, again, when you are able to remix these in such a way that, like, it's just something that doesn't come up in, like, simple lookup tables.

Nic Fillingham: Got it. So a malware creator - I don't want to say the word engineer, but someone that's creating malware - they could architect their code in such a way that it compiles slightly different many, many, many times, thousands of times, millions of times. And so anything that's looking for a simple hash, anything sort of heuristic-based, is going to be fooled, even though it's actually doing the same thing. Because they've made some minor change - even just, like, a comment in the code - right? - something that may not even impact the way that the code is executing. Am I on the right track here? That's why this sort of, like, signature-based detection is sort of not able to keep up.

Edir Garcia Lazo: Yes, that's exactly right. This is one of the reasons why we complement it with machine learning. And yeah, it really could be things that don't really change the behavior of the malware itself. So it could be comments in the code that will generate a different fingerprint or even just, like, appending, like, a character. Oftentimes, you would see that they would do things like this. Like, even if you grab, like, an already existing malicious payload and then append, like, an extra character at the end of the - what we call, like, a PE file, which is a portable executable Windows file. Then when you hash that end result, it actually has a different SHA-256. So, like, just doing really simple things like that with just tools that already ship with your computer - things like PowerShell - then you can, in a way, come up with simple ways to, like, evade some antivirus.
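Editor's note: the effect described here, where appending a single character produces an unrelated SHA-256, can be reproduced with a short Python sketch. The byte strings below are harmless stand-ins invented for illustration, not a real PE file, and the helper names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_positions(a: str, b: str) -> int:
    """Count the positions at which two equal-length digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Stand-in bytes only; this is not a real executable.
original = b"MZ" + b"\x90" * 64 + b"pretend payload bytes"
modified = original + b"A"  # append one extra byte at the end

h1 = sha256_hex(original)
h2 = sha256_hex(modified)

print(h1)
print(h2)
print(differing_positions(h1, h2), "of", len(h1), "hex characters differ")
```

An exact-match lookup table keyed on `h1` would have no entry for `h2`, even though the two inputs differ by one byte.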

Natalia Godyla: I recall in the blog you mentioned that this creates circumstances in which most malware is seen only once. And to me, that really impressed the magnitude of the problem here, the fact that you simply can't scale to define all of that malware. You need a solution like this. So there was one other term that I really would love to hear a definition of. So fuzzy hashing - before we dive in, could you explain what fuzzy hashing is?

Nic Fillingham: Hey, can I jump in, actually?

Natalia Godyla: Yeah.

Nic Fillingham: Because I want to start with fuzzing. Because from previous blogs on the Microsoft Security blog, I had read about fuzzing. And to me - and, Edir, I'm probably wrong here, so, you know, I'm going to put you on the spot. But fuzzing - my understanding of fuzzing was to sort of utilize sort of random inputs, random sort of generation of potential sort of input, to test code and see, you know, can you break it by just dumping random input in? First of all, do you know what the definition of fuzzing - and am I on the right track there? And then does that help us in potentially understanding fuzzy hashing?

Edir Garcia Lazo: It is related. It is...

Nic Fillingham: OK.

Edir Garcia Lazo: ...Something like that. So there are, like, several things, like fuzzy string matching or like fuzzy logic. Really what it means is, like, that it's not going to be exact. Like, if you want to kind of, like, have - a similar synonym for the word fuzzy would be, like, inexact. So I'll give you an example of, like, fuzzy string matching. I think in traditional language processing, you could have Microsoft with a capital M and then microsoft without the capital M. So there are, like, really simple ways of, you know, like, hey, just, like, have this string be set to lowercase or to all-uppercase so then that way you can actually compare them, right? Because in the end, they do say the same thing. Both of the words are Microsoft.

Edir Garcia Lazo: But here's where it gets tricky. So like, what if you can't do that - right? - and you still want to be able to compare, like, how closely related, like, these words are? So this is when things like fuzzy string matching could do the trick, right? Because all of the characters are actually the same but the capital M. So you can kind of, like, estimate, like, the similarity between two things. And you can make those two strings, like, be a match, to a certain extent, as long as, for example, they have, like, a certain degree of likeness to each other. Then you can call them the same thing.
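Editor's note: a minimal sketch of the fuzzy string matching idea, using Python's standard `difflib.SequenceMatcher`. The 0.8 threshold and the function names are assumptions chosen for illustration; the transcript does not say which similarity measure or cutoff the team uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as a match when they are at least `threshold` similar."""
    return similarity(a, b) >= threshold


# An exact comparison fails because of the capital M...
print("Microsoft" == "microsoft")
# ...but the fuzzy comparison still scores the pair as close.
print(similarity("Microsoft", "microsoft"))
print(fuzzy_match("Microsoft", "microsoft"))
print(fuzzy_match("Microsoft", "Defender"))
```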

Edir Garcia Lazo: So this is where - that's string matching. But this is where then doing things like fuzzy hashes would jump in, right? Which - unlike a traditional cryptographic hash, what you're trying to do here is similar inputs. You want to calculate similar hashes for them. So that way, you can kind of, like, keep the relationship of similarity between different entities.

Nic Fillingham: Got it. Cool. So fuzzing is about sort of understanding the relationship between two things that are different but can be sort of considered similar, but inexact, like a synonym. So that's sort of the fuzzing part. So then fuzzy hashing or fuzzy hashes - give us the - let's go in that direction. What does that mean?

Edir Garcia Lazo: So first of all, let me clarify that there are several types of fuzzy hashes. Fuzzy hashes are - it's just the umbrella term. Under there, you can have, like, rolling hashes. You can have homology-based hashes. You can have locality-sensitive hashes. And there are all kinds of different ways to approach this problem. But again, as a concept, the end result should be something like - given similar entities, then you should get similar hashes out.

Edir Garcia Lazo: So I think - like, for example, in the homology type of fuzzy hash, really what it's doing is, like, comparing - it's traversing the entity over steps of windows, and then it's trying to see if it can spot, like, sequences and then calculate a hash - a rolling hash it's called - of, like, several pieces of the entity. In this particular case, it could be any type of document - right? - a document, a malicious document or a clean document, or a malicious PE or a clean PE. But in the end, it's just a blob of bytes that you are, like, traversing over and calculating, like, its different hashes.

Nic Fillingham: Before we sort of go forward, I just want to sort of come back a little bit to this definition of fuzzy hashes.

Edir Garcia Lazo: Yeah.

Nic Fillingham: Because you have a really interesting graphic in the blog post. I think it's Figure 3. And you know, you start with the two blocks of text...

Edir Garcia Lazo: Yes.

Nic Fillingham: ...Garden0 and garden1, and then showing how the SHA-256 is radically different for both of them, even though the only thing that's different is the first word. And then - but then using fuzzy hashing or a fuzzy hashing technique, you're able to generate two hashes that are very, very similar. They're still different, but they're extremely similar.

Edir Garcia Lazo: Yes.

Nic Fillingham: I've actually - I've sort of zoomed in on it here, and I can see that - I think it's just, like, one character in the hash at the very, very beginning, which may or may not correlate to that word at the beginning, in the plain text, that's different.

Nic Fillingham: So what you were able to do with this example here of the GoldMax variant is, you had a fuzzy hash for GoldMax that you'd already seen in the wild. And so when this new variant came out, this fuzzy hashing technique was applied against the new variant, and then the fuzzy hash that was generated was so similar to the previous one that the logic in the malware detection was able to say, these are the same thing. Is that right? Did I get that right?

Edir Garcia Lazo: Yes, absolutely. That's exactly right. And I really like - I made an effort in, like, trying to explain this with just, like, simple text. This is hard to convey on, like, what it would look like on a real malware portable executable. So what I do here in the graphic - I exemplify this with just regular text. I grab a snippet from a short story that I really like, and then I apply a traditional hash to it, a traditional cryptographic hash. And just by changing one word in the beginning, like just "before" for "after," then I calculate the MD5 of the file. MD5 is, like, one of the most commonly used cryptographic hashes, along with SHA-1 and SHA-256. I also mentioned that in the blog post. And then I circle in, like, red - in, like, red circles - like, how is it that, you know, like, they're completely, radically different, right? So that means that if this were to be malware samples, by just changing a little bit, then something - the output would be completely different. And that's opposed to, like, the fuzzy hashing - the fuzzy hashing here I exemplify with an also very common rolling fuzzy hash called, like, SSDEEP or CTPH, which is short for Context Triggered Piecewise Hash. And...

Nic Fillingham: I was going to say that. Yeah, yeah, yeah.

(LAUGHTER)

Nic Fillingham: That's false confidence. I have no idea. Sorry. Keep going.

Edir Garcia Lazo: Yeah, it's just the technical name of it. But it doesn't really matter. But, like, over here, the interesting part is, like, here you can actually see how this - you can also know, like, very intuitively how, as you are progressing through, like, the blob of bytes, that everything in the hash is exactly the same but the first part. Like, only the first character would be replaced, and the first one from P and the second one to zero, right? And that means that only the first part of the file actually is different. So on here...

Nic Fillingham: Oh, so the placement of the difference correlates to where the hash will actually change.

Edir Garcia Lazo: Yes.

Nic Fillingham: So in this instance, it was just the first word. Therefore, it's the first character of the hash.

Edir Garcia Lazo: Yes.

Nic Fillingham: But if it was the last word, it would be the last character. If it was somewhere in the middle, it would have, like, a linear relationship to where in the hash that change is.

Edir Garcia Lazo: Right. That's correct. For this type of hash...

Nic Fillingham: Oh, that's cool.

Natalia Godyla: That's kind of cool.

Edir Garcia Lazo: Yes, it is really cool. For this type of rolling hash, yes, you will see that. For other types of fuzzy hashes, like locality sensitive hashes, that thing would not apply. But it would still kind of, like, with high probability, collide into the same position, and you will have something that resembles, like, the other hash. But for this type of hash - this is why I chose this one to make this as an example. But, yeah, I think that is really cool. And there are, like, several other implementations - I also mention that in the blog post - like TLSH or SSDEEP, which is the one I used. And there is, like, several others that you can just go onto GitHub and find open-source implementations for. Here, like at Microsoft Defender, we have the ones that we use outside and some other ones that we have just implemented in house.
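Editor's note: the behavior discussed above, where a change near the start of a file only disturbs the start of a rolling fuzzy hash, can be illustrated with a toy context-triggered piecewise hash. This is a deliberately simplified sketch and is not the ssdeep algorithm: the window size, the trigger rule, the one-character-per-chunk encoding, and the sample sentence are all assumptions made for illustration.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def chunk_char(chunk: bytes) -> str:
    """Map one chunk of bytes to a single signature character."""
    return ALPHABET[hashlib.sha256(chunk).digest()[0] % len(ALPHABET)]


def toy_piecewise_hash(data: bytes, window: int = 4, modulus: int = 16) -> str:
    """Toy piecewise hash: cut a chunk wherever a rolling value hits a trigger."""
    signature = []
    chunk_start = 0
    for i in range(len(data)):
        # Rolling value: sum of the last `window` bytes only, so chunk
        # boundaries depend on local content, not on absolute position.
        rolling = sum(data[max(0, i - window + 1): i + 1])
        if rolling % modulus == modulus - 1:
            signature.append(chunk_char(data[chunk_start: i + 1]))
            chunk_start = i + 1
    if chunk_start < len(data):
        # Whatever is left after the last trigger becomes a final chunk.
        signature.append(chunk_char(data[chunk_start:]))
    return "".join(signature)


def common_suffix_len(a: str, b: str) -> int:
    """Length of the longest shared ending of two signatures."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# Made-up sample text; only the first word differs between the two inputs.
shared = b" the path forked again and again, and each branch led to another garden."
sig_a = toy_piecewise_hash(b"Before" + shared)
sig_b = toy_piecewise_hash(b"After" + shared)

print(sig_a)
print(sig_b)
print("shared trailing characters:", common_suffix_len(sig_a, sig_b))
```

Because the trigger looks only at the last few bytes, the chunk boundaries realign shortly after the edited word, and every chunk from that point on hashes to the same character in both signatures. Real implementations such as ssdeep also adapt the block size to the input length, which this sketch omits.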

Natalia Godyla: What about the magnitude of the change? Is there a limit to how much change the fuzzy hashing methodology can recognize? So in this case, for instance, you changed one word at the beginning, and then there is a related change in the fuzzy hashes, but if you changed 50%, is it harder, then, to see the similarity to the two? What is the true limit there?

Edir Garcia Lazo: Right. Yeah, that's also a very good point. So that's another thing you could do, right? I give the example of, like, just appending an extra character to an already malicious portable executable, but again, that's, like, script kiddie thing, right? If you have access to the source code, then nothing is stopping you from, like, actually renaming every variable and just recompiling the program, right? When you do that, then, yes, like, even the fuzzy hash would be radically different. So the behavior could be exactly the same, but if every time you were using Variable A, you all of a sudden have changed it to Variable B, then the actual blob of the binary sequence that represents that program would - could be, potentially, very different. So, yeah, I mean, again, this method is most definitely not perfect.

Nic Fillingham: Got it. All right, next word from the blog post that I would love for you to help me understand...

(LAUGHTER)

Nic Fillingham: ...Is I would love to know what is a multilayer perceptron? Because I think they were aligned with the Autobots in the war against the Decepticons when...

Edir Garcia Lazo: (Laughter).

Natalia Godyla: Yeah, then Joe Klein adopted it (laughter).

Nic Fillingham: Or was it the Dinobots? I can't remember which one it is. What is a perceptron? And then, I guess, what is a multilayer perceptron? And if you could answer in the form of a question, that would be great. No.

(LAUGHTER)

Edir Garcia Lazo: Yeah. So actually, perceptron is, like, the basic - the most basic prototype of, like, what a binary classifier neuron is. It dates, like, back to the '70s.

Nic Fillingham: Wow.

Edir Garcia Lazo: Yeah, yeah, yeah. It's actually really, really exciting research. And a multilayer perceptron is what its name says, right? It's just basically having arrays of those stacked on top of each other. Back in the day, they used to have - like, there's a lot of, you know, like, supervised learning algorithms. Think of, like, decision trees or random forest or, like, linear regressions. And this was one that, back in the day, was developed. And it represents - it functions - let me just say, it's modeled after, like, how a brain neuron really works, which is - it receives some input. And then inside it has, like, a thing called, like, a nonlinearity or an activation function. And depending on that, it would either, you know, like, trigger or not. So it would be, like, an on-or-off-depending kind of thing. So it turns out that when you stack those in very deep ways, then you can kind of, like, be able to learn, like, nonlinear relationships in data, which turns out that it's very - a very useful thing to do because you can learn, like, very complex problems using just these. So the research has been put out for a very long time, but for the longest time, these were not really used because of - we did not have enough computing capacity. And that changed, you know, I want to say, like, recently with advancements in, like, new hardware or, like, just being able to run these on, like, GPUs - right? - which are, like, graphical processing units, which it turns out it excels in calculating, like, this type of parallel computation, which is really just, like, a matrix that's been multiplied over and over again.
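Editor's note: a minimal sketch of the "on or off" neuron described above and of what stacking adds. The weights here are hand-picked for illustration rather than learned, a simple step function stands in for the activation, and none of this is the model from the blog post. XOR is used because it is a standard example of a nonlinear relationship that a single perceptron cannot represent but a two-layer stack can.

```python
def step(z: float) -> int:
    """Activation: the neuron fires (1) when its input is positive, else 0."""
    return 1 if z > 0 else 0


def perceptron(inputs, weights, bias) -> int:
    """One neuron: weighted sum of the inputs plus a bias, then the activation."""
    return step(sum(x * w for x, w in zip(inputs, weights)) + bias)


def layer(inputs, weight_rows, biases):
    """A layer is several perceptrons that all read the same inputs."""
    return [perceptron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def xor_mlp(x1: int, x2: int) -> int:
    """Two stacked layers with hand-picked weights that compute XOR."""
    # Hidden layer: the first neuron acts as OR, the second as AND.
    hidden = layer([x1, x2], [[1, 1], [1, 1]], [-0.5, -1.5])
    # Output neuron: fires when OR is on and AND is off.
    return perceptron(hidden, [1, -1], -0.5)


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_mlp(a, b))
```

In practice the weights are learned from labeled data, and each layer's weighted sums are computed as one matrix multiplication, which is the parallel workload that GPUs handle well.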

Natalia Godyla: Wow. OK, so I'm going to orient our audience for a little bit. So the technique that you were just describing was related to natural language processing for fuzzy hashing or at least that's the context for this blog, correct?

Edir Garcia Lazo: Yes, that is right. Like, the multilayer perceptron itself is the deep-learning part. I know there are, like, so many concepts.

(LAUGHTER)

Edir Garcia Lazo: No, I mean, there really are. But, like, the multilayer perceptron part, it really is, like, the most basic example of, like, deep learning. It's the most common and the most studied, I would guess, because it's, like, one of the essential ones. The natural language was, like, the creation of embeddings, which are - like, you're creating high-level representations, like, multidimensional representations of, like, usually words. Although, in this particular case, we're not using words, but we're using, like, the homologies of, like, the fuzzy hash.

Edir Garcia Lazo: So if you were to go - and, for example, let's just assume that we were talking about the ssdeep algorithm that we were just describing on. Like, that first character - either P or 0 - that is the difference. That would get mapped to, like, another dimension vector - let's just say a 32-integer vector. And when you do that, then you are able to calculate, like, relationships between those homologies, like meaning - like, this certain sequence would, like, define maliciousness - yes or no. And that is commonly used in, like, natural language processing. Natural language processing - think about it. Like, it starts with words because it's just natural language. And then you would do a process called tokenization. Once you do that - like, the most common one would be just by words. So you're, like, turning each word to a number. But then that number, you're also turning that number into a vector.
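Editor's note: the token-to-number-to-vector steps described above can be sketched as follows. The two hash strings are invented placeholders, the tokens are single characters (or overlapping character n-grams) by assumption, and the embedding table is filled with random numbers only to show the lookup; in a real model those vectors are learned during training. The dimension of 32 echoes the figure mentioned in the conversation.

```python
import random


def tokenize(fuzzy_hash: str, n: int = 1):
    """Split a hash string into overlapping tokens of n characters each."""
    return [fuzzy_hash[i:i + n] for i in range(len(fuzzy_hash) - n + 1)]


def build_vocab(token_lists):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def make_embedding_table(vocab_size: int, dim: int = 32, seed: int = 0):
    """One vector of `dim` numbers per token id (random here, learned in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(tokens, vocab, table):
    """Look up the vector for each token."""
    return [table[vocab[token]] for token in tokens]


# Made-up signatures that differ only in the first character.
hash_a = "PdTxyz"
hash_b = "0dTxyz"

tokens_a = tokenize(hash_a)
tokens_b = tokenize(hash_b)
vocab = build_vocab([tokens_a, tokens_b])
table = make_embedding_table(len(vocab))

vectors_a = embed(tokens_a, vocab, table)
vectors_b = embed(tokens_b, vocab, table)

print([vocab[t] for t in tokens_a])
print([vocab[t] for t in tokens_b])
# Shared tokens map to identical vectors; only the first position differs.
print(vectors_a[1:] == vectors_b[1:])
```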

Edir Garcia Lazo: When you run these through a neural network, they have the interesting property that they can learn relationships between the actual words. So there is a very famous concept of, like, man is to woman as, like, king is to queen. So it's able to learn that relationship. If I were to ask it, like, man is to woman as king is to? - like, the most vector - the most likely vector to come out of that calculation would be queen because it has somehow, through text, been able to learn the relationship between, like, what gender means or, like, what royalty means, right? And you can do all kinds of, like, vector math that would work on that. And I was just very curious to see if this type of technique would be applicable to malware. And we decided to explore on that. And it turns out that it works out, like, really well. So that's very exciting.
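Editor's note: the "man is to woman as king is to queen" vector math can be shown with a toy example. The two-dimensional vectors below are written by hand so that one axis loosely stands for royalty and the other for gender; real embeddings are learned from text, have many more dimensions, and only approximately satisfy this kind of arithmetic.

```python
import math

# Hand-made vectors: (royalty, gender). Purely illustrative, not learned.
VECTORS = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.0),
}


def analogy(a: str, b: str, c: str, vectors=VECTORS) -> str:
    """Answer 'a is to b as c is to ?' by vector arithmetic: b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    # Pick the closest remaining word to the computed point.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return min(candidates, key=lambda word: math.dist(vectors[word], target))


print(analogy("man", "woman", "king"))
```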

Nic Fillingham: So in your example, I think from figure 2 - excuse me - in the blog post, where you have that story text, where you've just changed the one character - so we showed how the SHA-256 gives the radically different hashes. But now we're shown how the ssdeep gives you very, very similar hashes. In the diagram 4, where we - you show the files, the hash, the embedding of the fuzzy hashes and then the multilayer perceptron, is the multilayer perceptron, is that the thing that ultimately would tell a function, yeah, these two things are really similar and you should treat them the same?

Edir Garcia Lazo: That's right. It's actually a combination of all of those things. Fuzzy hashes would kind of, like, already capture some of the similarity, meaning - like, in this particular example that I show, they are kind of like already observably similar. But then on top of that, then you're representing those sections as words, which can learn, like, even finer details. I think one of the comments that I put on the blog post is like, OK, so why actually do you need to use deep learning for all of this, right? Like, why don't you just compare the fuzzy hashes and say, like, hey, you know, if this is very similar to something that you've seen that I know is malware, then why not just call it malware? And then it turns out...

Nic Fillingham: Why not just call it malware, Edir? Why are you using deep learning?

(LAUGHTER)

Edir Garcia Lazo: Yeah, yeah. No. And it's a completely fair question. And the justification for that is, like, oftentimes there are, like, malicious versions of, like, clean files. So for example, I've seen in the wild, like, a trojanized version of, like, the Discord app. So if you were to see the similarity between the malicious version and the clean version, their fuzzy hashes are actually very similar. And that is just because one of them is portraying itself as the other one, right? So if you just were to compare like, hey, if it's too similar to this that I already know it's malware, then you will constantly be causing false positives on the real Discord.exe, right?

Nic Fillingham: Ah, great example.

Edir Garcia Lazo: Exactly. And this is what's cool about this, because this is not only about how similar those hashes are but also, within those differences, which ones are relevant and which ones actually indicate maliciousness. And that's the part where, like, all of the embeddings and all of the multilayer perceptron would, like, jump in and, like, figure out - and that - you know, I like the name. Like, this is when you comb through the fuzz, right? This is where you're, like...

Nic Fillingham: Yeah.

Edir Garcia Lazo: ...Defining which of these parts are actually irrelevant and which ones are not.
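Editor's note: the false-positive problem described in this exchange can be made concrete with a sketch of the naive rule, "flag anything whose fuzzy hash is close enough to known malware." The hash strings are invented placeholders rather than real ssdeep output, the 0.9 threshold is an assumption, and `difflib.SequenceMatcher` stands in for whatever hash comparison a real system would use.

```python
from difflib import SequenceMatcher


def hash_similarity(a: str, b: str) -> float:
    """Similarity of two hash strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def naive_is_malware(candidate: str, known_malware, threshold: float = 0.9) -> bool:
    """Naive rule: flag anything that closely resembles a known-bad hash."""
    return any(hash_similarity(candidate, known) >= threshold
               for known in known_malware)


# Made-up fuzzy hashes: the trojanized build copies almost all of the clean app.
clean_app = "abcdefghijklmnopqrstuvwxyz0123456789"
trojanized_app = "abcdefghijklmnopqrstuvwxyz012345XYZ9"
known_malware = [trojanized_app]

print(hash_similarity(clean_app, trojanized_app))
# The legitimate application is flagged too: a false positive.
print(naive_is_malware(clean_app, known_malware))
```

Similarity alone cannot separate the two files, which is why the approach in the conversation adds a learned model to weigh which differences actually indicate maliciousness.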

Natalia Godyla: So you mentioned at the beginning of this that you were mostly curious how MLP would work for this particular problem. And throughout the process, you tried several different techniques, not just multilayer perceptron. So can you walk us through how you tested these different techniques, why some worked, why some didn't?

Edir Garcia Lazo: Yeah, yeah, absolutely. Yeah. I do talk about this, and it's just because I tried a lot of different things, and again, out of just sheer curiosity. So I guess I really appreciate the people in my team that support me for, like, doing all these kinds of like crazy scientist things.

Natalia Godyla: (Laughter).

Edir Garcia Lazo: So I tried different architectures. I tried convolutional networks. Convolutional networks are usually used in vision problems - and from the literature, that they just capture - the same way you would go traversing in windows when calculating the fuzzy hash, you would do the same, like, with the whole entity. So it just kind of, like - it worked - most of the approaches worked, but some better than others. I also tried transformers, which are, like, a very new, state-of-the-art...

Nic Fillingham: They're - I think you'll find they're robots in disguise.

(LAUGHTER)

Edir Garcia Lazo: Yes. No, no. They're not entirely that, but...

Natalia Godyla: (Laughter).

Edir Garcia Lazo: ...They are the way that new, state-of-the-art models like OpenAI's GPT-3 work, right? Like, the technique underlying these models really is a thing called, like, the attention mechanism that I will not get into. But if you're, like, into data science and, like, natural language processing, I would highly encourage you to read up on these. I also tried these. And this one was also doing really well. The problem with this is that it just uses a bunch of parameters.

Edir Garcia Lazo: So imagine - in a deep neural network, when you're finally able to train it, like, it actually has a size, which is a number of numbers that it needs to effectively add up and multiply and do calculations on. And this transformer was yielding really good results, but it was also, like, several million parameters (laughter). And it was just not cost effective to run that. Because imagine - every time you're trying to score an entity, then you effectively need to make all of these calculations.
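As a rough illustration of what "a number of numbers" means, the sketch below counts the weights and biases in a fully connected network. The layer sizes are hypothetical and are not the sizes of the models discussed in this episode. They only show how quickly the count, and with it the work needed to score one entity, grows as layers get wider.

```python
def mlp_parameter_count(layer_sizes):
    """Count trainable parameters in a fully connected network.

    Each pair of adjacent layers contributes n_in * n_out weights
    plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for the arithmetic.
    small_network = [64, 128, 64, 1]
    wide_network = [512, 2048, 2048, 1]

    print(mlp_parameter_count(small_network))  # 16641
    print(mlp_parameter_count(wide_network))   # 5249025
```

Every one of those parameters takes part in at least one multiply-and-add each time a file is scored, which is why a model with several million parameters can be too costly to run at scale even when its accuracy is good.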

Natalia Godyla: So we've already blocked malware - like, the GoldMax malware that we discussed earlier - in Microsoft 365 Defender. What is the future of this new approach for our technologies? Are we actively - it seems like we're actively already using fuzzy hashing. But I'm assuming the natural language processing bit is on the newer end. Is that something we're looking to bring to the product?

Edir Garcia Lazo: Yes. So the reality is, like, we're beginning to dabble with this, and we are very excited about it. So there are, like, several things that we want to try. So we have been considering and have been tinkering with things like, again, large language models like these or graph neural networks or reinforcement learning. It just opens up the door to having done the groundwork for this type of model. It's going to help us a lot.

Edir Garcia Lazo: And we are hoping that we will have, like, our - what I call our AlexNet moment - you know, AlexNet moment for security. What do I mean by that? I was earlier talking about the multilayer perceptron. And it turns out that all of that, all of those decades during the '70s and '80s and '90s - there was something that people call the AI winter. There was not very much progress, and there were just, like - people were, like, debating on what - the real techniques or how to move the discipline forward. And then I believe it was in 2012. Then they came up with, like, this state-of-the-art that blew out of the water, like, this metric called ImageNet, which was a data set trying to identify a thousand classes in images. So we're just hoping that we are bound to find that for security. So that means that it's going to be something that is, like, such a big breakthrough that we're going to give the bad guys a real headache.

Edir Garcia Lazo: So there are, like, several ideas on top of that. Some of the ones like I just mentioned or, I guess, also autonomous defense systems. You guys have actually covered this on the podcast. Because I haven't told you, but I am a fan.

(LAUGHTER)

Edir Garcia Lazo: So, like, CyberBattleSim or SimuLand.

Nic Fillingham: Yeah.

Edir Garcia Lazo: So think about, like, things like this, but without the sim part. So that means this is just a foundation in simulation that you end up using and deploying for real - like, making, like, actual blocking and defense choices. We're excited about that as well.

Nic Fillingham: What's next on the horizon for you, Edir? Either - you know, is there anything you're working on at work that you can sort of talk about that's got you really jazzed or, you know, just something in the industry that you're keeping your eye on?

Edir Garcia Lazo: Yeah, yeah, absolutely. I think I already mentioned a couple of these. I have been finding myself working recently on developing a lot of adversarial models, which is just tackling these threat actors and, like, just making sure that they are thwarted with machine learning. So that's something that excites me recently. And yeah, same thing - dabbling and tinkering with large language models like - things like BERT and RoBERTa and things like GPT-3 and applying them to security data sets. I think that has a lot of potential that we have not entirely explored. So I am excited to continue to do that.

Edir Garcia Lazo: So those are things that I'm passionate about, but I also have a little bit of worries (laughter). Things like email malware, like, how prevalent it still is, and it's still the No. 1 way to access systems. And do not quote me on this one because, you know, this is something that I might have read something, but I don't have, like, the hard proof to say so.

Edir Garcia Lazo: Also, the proliferation of, like, high-impact ransomware, like, either human-operated or not. That's something that worries me as well - examples like the Colonial Pipeline. The proliferation also of supply chain attacks, you know, like what happened in Kaseya or like what happened in SolarWinds - things like that also make me a little bit worried. This just tells you - sheds some light about my personality, which is I have more worries than things that I'm excited about.

(LAUGHTER)

Edir Garcia Lazo: But I guess, in general, the lack of international policy in cybersecurity - like, that's something that also - I mean, I guess we're taking steps in the right direction recently. But it's just a big problem. And finally - and I hope this doesn't - yeah, this is one of the parts that I might not - hopefully, this won't do the opposite of inspiring the bad guys into doing this. But also, like, the next step of like polymorphic malware, which is metamorphic malware, which is malware that's able to change its code dynamically and that's able to evade sandboxing, that is, you know, like, packed with unknown tools, that it's encrypted - all kinds of things that makes our lives complicated - like, I am also concerned about, you know, seeing a rise on those. So we will continue to have jobs. Sadly, there's that. But yeah.

Natalia Godyla: It seems like there's a fine line between what worries you and what drives your passion.

Edir Garcia Lazo: Exactly (laughter).

Natalia Godyla: Well, thank you for joining us today, Edir. I know you came on to talk about your blog, and here we are trying to answer all of the big questions of AI. I really appreciate you entertaining those questions, and hopefully we'll have you on the podcast again.

Edir Garcia Lazo: Yeah. Thank you so much.

Nic Fillingham: Thanks, Edir. That was great.

Natalia Godyla: Well, we had a great time unlocking insights into security, from research to artificial intelligence. Keep an eye out for our next episode.

Nic Fillingham: And don't forget to tweet us at @msftsecurity or email us at securityunlocked@microsoft.com with topics you'd like to hear on a future episode. Until then, stay safe.

Natalia Godyla: Stay secure.

HOST(S):

Nic Fillingham likes to ask questions and find out how stuff works. For over 15 years Nic has worked at Microsoft on Xbox, Windows, developer tools, Microsoft 365 and Security. A transplant from Australia, Nic lives just outside of Seattle on a small farm with his family and too many guitars.

Natalia Godyla is an award-winning B2B product marketer and speaker, currently in the Security Product Marketing group at Microsoft. She specializes in cybersecurity marketing and has a Sec+ certification. Fun fact: Natalia is also a published poet and founder of Rebel Data.

Schedule: Wednesdays

Credits: Executive Producer is Bruce Bracken, Producer is Rob Petrillo, Production Manager is Max Solomon, and our Audio Engineer (and magician) is none other than The Great Rich Cerbini.

Creator: Microsoft