Research Saturday 10.30.21
Ep 207 | 10.30.21

Malware sometimes changes its behavior.


Dave Bittner: Hello everyone, and welcome to the CyberWire's Research Saturday. I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down threats and vulnerabilities, solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us.

Tudor Dumitras: So, it started from the observation that goes back ten, fifteen years, that malware, when executed on different hosts, and also on the same host but at different times, it sometimes changes behavior.

Dave Bittner: That's Tudor Dumitras. He's a professor and researcher at the University of Maryland and the Maryland Cybersecurity Center. The research we're discussing today is titled, "When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World."

Tudor Dumitras: This is something that I've wanted to do for a long time, but only recently I was able to collect, in collaboration with an industry partner, a large enough dataset, you know, to analyze this. And this is important to understand, because malware researchers typically collect execution traces in a sandbox – so, that's a controlled lab environment. And they do this to understand what the malware does, to analyze the malware, to figure out if it belongs to a known family, and also to create detection signatures, behavior-based detection signatures.

Tudor Dumitras: The problem is that when malware can do different things on different hosts, this will affect the effectiveness of the conclusions of the malware analysis and the effectiveness of the signatures created for detection. Some people call them "split personalities." So, when malware does different things on different hosts, they are often implemented by the malware authors with intention to evade sandboxes, so do not perform the malicious behavior in the sandbox.

Tudor Dumitras: And we just wanted to understand this. There hadn't really been a large-scale measurement of just how much this behavior changes in the real world. What exactly – what components of the behavior are more likely to change? You know, how does malware change versus benign? And how does this affect malware detection and malware analysis? In this research, in this paper, we worked with a partner to analyze the data set that was recorded on five-and-a-half million real hosts out there that included multiple executions for each sample, multiple traces for each sample.

Dave Bittner: Well, let's walk through the methodology together. I mean, at the outset, how did you all decide to come at this? When you got malware that you suspect is trying to avoid you taking a closer look at it, where do you begin?

Tudor Dumitras: Right. So this is the core of the matter when you know that malware is likely to evade, you know, has the intention and the incentive to evade detection and analysis. How do you go about selecting a sample that is representative? And really the only way to do this is to look at what happens on real hosts, which is also what makes this difficult to do. But we work with an industry partner, which has an antivirus product that runs on end hosts. It collects execution traces, it monitors for a little while what the malware is trying to do, and collects these things in order to perform further analysis to try to figure out how these things actually happen in the real world.

Tudor Dumitras: So, it's important to understand that this is only done as a last line of defense. So, if they can detect the malware through any other means, any other engine, they would just detect it. They would not let it run at all. And similarly, if something is clearly benign or is known to be benign, they would not do anything to it. They will exonerate it. But there is always a gray area, a set of binaries that are suspicious, but still, you cannot be completely certain, so they execute it. And they also stop the execution as soon as the malware tries to do something nasty, as soon as it becomes clear that something bad is happening. But a lot of the initial setup and initial behaviors of the malware are recorded, and this gives us a wealth of data to look at, in particular at the differences in behavior between different hosts of the same malware sample. And because this is, actually we never tried to distribute the malware to the host, we never tried to do this in a lab, this is in the real world and these are actual hosts that are under attack from the malware. So this is what gives us some confidence that these results are representative.

Dave Bittner: Well, can you take us through some examples here of the types of things you were looking for and some of the conclusions you are able to make?

Tudor Dumitras: Let me give you one example. So the Ramnit worm, for example, is a well-known piece of malware, and in the particular variant that we had in our dataset, it tries to exploit a vulnerability – it's an old vulnerability, CVE-2013-3660 – and it does this in order to gain privilege escalation on Windows 7, in particular. When it launches this exploit, what you see in the execution trace is that it creates hundreds of mutexes until the exploit succeeds. So this is part of the exploit execution process. But the worm is smart, so it tries to profile the target. So, if it figures out that the target does not include the vulnerability or it's already running in admin mode, so it doesn't need to do privilege escalation, it doesn't launch the exploit. So, should an analyst run this malware in a sandbox, they would only observe one of these behaviors, depending on what the environment was.

Tudor Dumitras: In general, if you look at executions on different hosts or even on the same hosts, but maybe a few weeks later or a few months later, you may see different behaviors. So, malware performing different registry operations, making different or additional or removing certain API calls, or in some cases, exiting without doing anything. So, like I said, the existence of these split personalities was known, and researchers and practitioners also developed methods to try to discover the existence of these evasive behaviors in malware samples, but it was never measured at scale in the wild before. So, and because of that, it is hard to tell what impact does this really have on the way we do malware analysis and malware detection.

Dave Bittner: Hmm. Now, you all were able to gather – again, as you mentioned with your industry partners – quite a data set here. Can you describe to us, you know, how big was it? How did you go about gathering it, and having that large a data set, what does that provide for you as a researcher?

Tudor Dumitras: Absolutely. So, the dataset, as I mentioned, comes from a collaboration that we had with an industry partner. I, personally, my background is in the industry. I used to work at Symantec Research Labs before I became an academic. So back in the day when Symantec was the largest security vendor. And I really liked to work with folks in the industry to understand what the biggest problems are that they are facing, and also to help them with large data analysis projects like this one.

Tudor Dumitras: So, the data set was collected on 5.4 million real hosts, and it includes multiple execution traces for the same samples. So I think in total we have about seven-and-a-half million execution traces. In some cases, we have hundreds of execution traces per sample. And what these traces are is they come from API traces. So, these are Windows malware that perform API calls in order to download files, to connect to the internet, to set certain registry entries or mutexes. So we recorded the actions that the malware was trying to perform. So, for example, when the malware is trying to create a new file – as, for example, ransomware would do – and each of these actions has a certain parameter. So, for example, the file name or the registry path that's being accessed.

Tudor Dumitras: We have collected this dataset, we parsed it into these actions and parameters. So, each execution is attributed to the process ID that triggered it, including things like thread injection and launching new processes, so we can figure out what was the executable that started all of this. So then we look at the hash of that executable and then, when we have multiple traces of that same hash, we analyze the variability in their behavior.

Tudor Dumitras: So, we do this in a couple of ways. We look at how the number of actions and the types of actions differ, and also, we look at the differences in the parameters. We try to break this down into variability that occurs across hosts and also variability that occurs across time. And then we also try to see if there are – if there is something invariant, something that doesn't really change between executions that, for example, a malware signature could be based on in order to be reliable, and how many executions you would need to see in order to see such a reliable signature. And then we also looked at – we tried to conduct sort of an experiment to demonstrate what would happen if you tried to draw conclusions from a single execution – that is typically the way things are done when traces are collected in a sandbox.
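The grouping step and the search for invariant behavior described here can be illustrated with a minimal Python sketch. It is not the study's actual pipeline: the record layout, the hash and host names, and the helpers `group_traces_by_hash` and `invariant_actions` are all assumptions made for illustration. Each action is modeled as an (action type, parameter) pair, traces are grouped by the hash of the originating executable, and the invariant part of a sample's behavior is taken to be the set of actions present in every one of its traces.

```python
from collections import defaultdict


def group_traces_by_hash(traces):
    """Group (sample_hash, host_id, actions) records by executable hash.

    The record layout is a hypothetical stand-in for real trace data.
    """
    grouped = defaultdict(list)
    for sample_hash, _host_id, actions in traces:
        grouped[sample_hash].append(set(actions))
    return grouped


def invariant_actions(action_sets):
    """Return the actions that appear in every trace of one sample."""
    return set.intersection(*action_sets) if action_sets else set()


# Made-up traces; each action is an (action_type, parameter) pair.
traces = [
    ("hash_a", "host1", [("file_create", "a.tmp"),
                         ("reg_set", r"HKCU\Run\x"),
                         ("mutex_create", "m1")]),
    ("hash_a", "host2", [("file_create", "b.tmp"),
                         ("reg_set", r"HKCU\Run\x")]),
    ("hash_b", "host3", [("file_create", "c.tmp")]),
]

grouped = group_traces_by_hash(traces)
invariants = {h: invariant_actions(sets) for h, sets in grouped.items()}
print(invariants["hash_a"])  # only the registry write survives on both hosts
```

In this toy data the randomized file names differ between hosts, so only the registry write would be a candidate for a signature that holds across executions.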

Dave Bittner: So, what were the results then? I mean, what did you find?

Tudor Dumitras: So, let's start with the behavioral differences themselves. So, first of all, it is interesting that there are many reasons for these behavioral differences. The researchers previously focused primarily on this sandbox evasion behavior as a cause of behavioral differences. But there are many, many other root causes. There are, for example, differences in operating systems and the libraries available, as in the example that I gave you with the Ramnit worm. We also saw that malware may attempt to perform some risky operations that fail on some hosts, and because of that, the subsequent actions will be different on those hosts. Malware may receive different commands from their C&C channel, so then at different points in time, they may do different things or may not do anything at all.

Tudor Dumitras: We also saw that many of these perform an initial installation. So when you run the malware for the first time, you are likely to see different traits than when you run it the second time and the third time. And that's because the initial installation will perform some one-time operations, such as setting certain registry keys, for example. Perhaps not very surprisingly, malware often creates very random file names, so the file name itself may differ quite a bit from one host to another.

Tudor Dumitras: So, the interesting thing about this is that even if you are somehow able to catch sandbox evasion and deal with this in a sandbox, the traces will still not reflect the full range of behaviors that you are likely to encounter in the wild, because there are all these additional reasons for variability.

Tudor Dumitras: And we also saw that benign software also exhibits variability. If you think about it, Windows Update will perform different operations for each update, because it receives different things to install. And also it will differ from one host to another because of differences in the patch levels of those hosts. So, we see variability in benign software as well. However, in malware, there is more, and the variability is significantly higher in terms of the number of actions. And when I say variability, I mean the delta. So, you know, the fact that one host performs one hundred versus one host performs twelve actions, right? So, not the length of the trace, the variability within the traces of the same sample.

Tudor Dumitras: So, this is what we looked at. This was our main metric that we measured, the variability, the per-sample variability, and this is very significantly more for malware in terms of the number of actions that are performed. And the biggest contributor to this are the file creations. So, on some hosts, there are many file creations. On some hosts, there are much fewer file creations. So that's across hosts. And if we look across time, the main way that things vary is in missing actions. So, some actions that you see, at some point, several weeks later, they're not going to be there. In many cases, this is because the malware just stops doing anything.
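The notion of per-sample variability as a delta in the number of actions, rather than trace length, can be sketched as follows. This is an illustrative Python example under stated assumptions, not the metric as defined in the paper: the functions `total_action_delta` and `action_count_range` are hypothetical names, and the spread is measured simply as the maximum count minus the minimum count across the traces of one sample.

```python
from collections import Counter


def total_action_delta(traces):
    """Spread in the total number of actions across one sample's traces."""
    lengths = [len(trace) for trace in traces]
    return max(lengths) - min(lengths)


def action_count_range(traces):
    """Per action type, max count minus min count across the traces."""
    counts = [Counter(action_type for action_type, _ in trace)
              for trace in traces]
    action_types = set().union(*counts)
    return {t: max(c[t] for c in counts) - min(c[t] for c in counts)
            for t in action_types}


# Made-up traces of one sample: 100 actions on one host, 12 on another.
trace_host1 = [("file_create", f"f{i}.tmp") for i in range(100)]
trace_host2 = ([("file_create", f"g{i}.tmp") for i in range(10)]
               + [("reg_set", "key1"), ("reg_set", "key2")])

sample_traces = [trace_host1, trace_host2]
print(total_action_delta(sample_traces))   # 88
print(action_count_range(sample_traces))   # file_create: 90, reg_set: 2
```

Breaking the delta down by action type is what makes it possible to say which kind of action, file creations in this toy case, contributes most to the spread.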

Dave Bittner: This variability, is it inherently risky? In other words, the fact that it has so many options and tries to do many things, does that make it noisier, and make it, you know, increase the possibility that it will be detected?

Tudor Dumitras: Absolutely. I think that the variability itself – this is actually one of the conclusions of our paper – that the variability itself can be a useful signal for detection. And I don't think, as far as I know, I don't think anybody uses it in this way right now. But it potentially could be one useful signal for figuring out if something is likely to be malicious or not.

Tudor Dumitras: In terms of the danger that this poses to the way we conduct business today, in terms of malware analysis and malware detection, we conducted one experiment in our paper with a malware clustering technique, which is often used to determine if an individual sample belongs to a known family. So, companies do this clustering in order to group samples into families based on this behavioral similarity. And the assumption here is that if you observe a certain behavior, then all the other behaviors will also fall into the same cluster of the family. Otherwise, you cannot really conclude that it's the same family if the behaviors are so different. And this is in fact what we observed. So, typically when you do clustering you use only one trace per sample, and then the resulting clusters – at least the most obvious ones – indicate the malware families in your data set.

Tudor Dumitras: In our case, we used a clustering technique that is pretty seminal from, I think, maybe ten years ago. We tried to do the same thing, but with multiple traces for each sample. And we just threw these in without telling the algorithm that these actually belong to the same sample, so they are the same malware. And then what happened was that in thirty-three percent of the samples, there was enough variability across these four traces that traces of the same sample ended up in different clusters – so, that's as if they belonged to different families. And in fact, in one percent, each of the four traces was in a different cluster.

Tudor Dumitras: This doesn't necessarily mean that clustering is useless, but this really indicates that you should be very careful when you draw conclusions from experiments conducted with a single trace per sample. Because this kind of behavior of samples that end up clustered in different clusters, different families – this would not be observed if you only used one sample per trace – one trace per sample, I'm sorry. This suggests that the accuracy of these results that is being reported of mapping samples to families through behavioral clustering, is really lower than previously believed because of this variability. This is just one example, one concrete experiment that I'm telling you about, but this actually has broader implications also for malware detection and for malware analysis. The accuracy of these things is likely to be lower than you might expect if you only look at one trace per sample.
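The two figures quoted from the clustering experiment, the share of samples whose traces land in more than one cluster and the share whose traces all land in different clusters, can be computed from cluster assignments as in the Python sketch below. The clustering algorithm itself is not reproduced; the function `split_statistics`, the sample names, and the cluster IDs are hypothetical, and the toy numbers are not the study's results.

```python
from collections import defaultdict


def split_statistics(assignments):
    """Summarize how often traces of one sample land in different clusters.

    assignments: iterable of (sample_hash, cluster_id), one entry per trace.
    Returns (fraction of samples split across clusters,
             fraction of samples whose traces are all in distinct clusters),
    counting only samples that have at least two traces.
    """
    clusters_by_sample = defaultdict(list)
    for sample_hash, cluster_id in assignments:
        clusters_by_sample[sample_hash].append(cluster_id)

    multi = [c for c in clusters_by_sample.values() if len(c) > 1]
    if not multi:
        return 0.0, 0.0

    split = sum(1 for c in multi if len(set(c)) > 1)
    all_distinct = sum(1 for c in multi if len(set(c)) == len(c))
    return split / len(multi), all_distinct / len(multi)


# Made-up cluster IDs for four samples with four traces each.
toy_assignments = (
    [("s1", 0)] * 4                                # all traces agree
    + [("s2", 0), ("s2", 0), ("s2", 1), ("s2", 1)]  # split in two
    + [("s3", 1), ("s3", 2), ("s3", 3), ("s3", 4)]  # every trace differs
    + [("s4", 2)] * 4                              # all traces agree
)

split_fraction, distinct_fraction = split_statistics(toy_assignments)
print(split_fraction, distinct_fraction)  # 0.5 0.25
```

With a single trace per sample both fractions are zero by construction, which is why this kind of disagreement stays invisible in single-trace experiments.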

Dave Bittner: So, where do you all go next with this? I mean, obviously you were, you know, you're partnering with industry and they will reap some of the benefits of the things that you found here, no doubt. Are there areas – I mean, has this piqued your curiosity? Is there more to be done?

Tudor Dumitras: I think there's a lot more to be done. Like I said, I love working with the industry. And so, first of all, the results of our study are public. We have published our paper in a leading academic research conference. The paper is available and we are also available to answer more questions if anybody is interested. So, other than our individual, our particular collaborator that worked with us on this study, the broader implications or the broader sort of conclusions – the bigger picture, if you want – are that this is something that really should be taken into account when doing malware analysis and malware detection. These behaviors that you can extract from multiple traces.

Tudor Dumitras: In general, companies, organizations that have antivirus products or do malware detection on end hosts in some form, they tend to collect very similar data to the one that we analyzed in this paper. As far as I know, they don't do much with it, but here we try to show just what could be done with it, what you could learn, and how it might affect your bottom line if you don't understand how this variability, which is a real thing in the wild, how is it likely to affect your experimental results?

Tudor Dumitras: I think in terms of going forward, I think one thing that I'm really interested in, in the bigger picture, is this problem that malware experiments can give a false sense of security. And what I mean by that is that we see a lot of academic papers and industry evaluations discussing new malware detection techniques that often report detection rates above ninety percent. And then, invariably, this high level of performance is hard to reach in the real world. The question is, why is that? Part of the answer is that when these techniques are developed and also tested using traces from a sandbox, then they may seem to work better than they really do, because they don't capture this broad range of behaviors that happen in the wild. So this is one reason for this false sense of security. But ultimately, I would like to understand the full picture and how much each potential factor – there are other factors, of course, that contribute to this – and I would like to understand how each of these contributes to this accuracy degradation that folks observe from experiments conducted in the lab to when they deploy their tools in the real world.

Dave Bittner: Our thanks to Tudor Dumitras from the University of Maryland for joining us. The research is titled, "When Malware Changed Its Mind: An Empirical Study of Variable Program Behaviors in the Real World." We'll have a link in the show notes. 

Dave Bittner: The CyberWire Research Saturday is proudly produced in Maryland out of the startup studios of DataTribe, where they're co-building the next generation of cybersecurity teams and technologies. Our amazing CyberWire team is Elliott Peltzman, Tre Hester, Brandon Karpf, Puru Prakash, Justin Sabie, Tim Nodar, Joe Carrigan, Carole Theriault, Ben Yelin, Nick Veliky, Gina Johnson, Bennett Moe, Chris Russell, John Petrik, Jennifer Eiben, Rick Howard, Peter Kilpe, and I'm Dave Bittner. Thanks for listening. We'll see you back here next week.