The Microsoft Threat Intelligence Podcast 5.8.24
Ep 18 | 5.8.24

Behind the Scenes of the XZ vuln with Andres Freund and Thomas Roccia

Transcript

Sherrod DeGrippo: Welcome to The Microsoft Threat Intelligence Podcast. I'm Sherrod DeGrippo. Ever wanted to step into the shadowy realm of digital espionage, cybercrime, social engineering, fraud? Well, each week dive deep with us into the underground. Come hear from Microsoft's elite threat intelligence researchers. Join us as we decode mysteries, expose hidden adversaries, and shape the future of cybersecurity. It might get a little weird. But don't worry. I'm your guide to the back alleys of the threat landscape. Hello, and welcome to The Microsoft Threat Intelligence Podcast. I am joined by Andres Freund and Thomas Roccia. Thomas is a senior security researcher at Microsoft, and Andres is a partner software engineer you've heard of before, I'm sure, because today we are talking about this crazy situation with the XZ backdoor, also known as CVE-2024-3094. Welcome to the podcast, guys.

Andres Freund: Hello.

Thomas Roccia: Hello.

Sherrod DeGrippo: So let's get into this. First of all, as we talked before, before we started recording, it's a pretty technical audience. So, for the most part, they should understand; however, I'm looking at some of the graphics that Tom has put together. And oh, boy; this is very confusing. So I think, Andres, let's start with you. Kind of walk us through what we're talking about here and what people need to know.

Andres Freund: Sure. To start at the top, a good -- bit more than a week ago, I found a security issue in SSHD -- and we can go into how I found that in a sec. And that turns out to not be a problem in the S -- OpenSSH code but in a library called liblzma, which is part of a package called XZ. And it turns out that one of the upstream maintainers actually had introduced that backdoor in a fairly obfuscated way into it about a month ago. And it's a problem because SSH is obviously fairly widely used.

Sherrod DeGrippo: And so let me ask you some things. First, are you part of that open source project? What caused you to start looking around at this stuff?

Andres Freund: I'm not part of either of those projects. I work on open source PostgreSQL. And I found the issue while I was doing some profiling of code. And I was -- there was a new feature that I was reviewing, and we were trying to reduce the overhead in some edge case. And, for that, I looked at -- like, benchmarked before and after. And the performance difference was very small. So I had to quiesce the system, and I was occasionally seeing SSH using a fair amount of CPU. And that shouldn't have -- shouldn't be happening. Like, sure there were these internet scanning logins. But, like, it wasn't actually at a higher rate. So there wasn't any reason for it to use a fair amount of CPU. And then I investigated what was causing that CPU usage. And that, after quite a while, led me down to figuring out that there was -- that backdoor had been introduced.

Sherrod DeGrippo: So that backdoor was using up so much processing power, essentially, that it caused a red flag in terms of things that were using resources on the machine; is that correct?

Andres Freund: Yes. But I -- like, it wouldn't necessarily have been noticeable on a normal -- most systems because, like, as soon as you have Chrome open or something, that also uses a fair amount of CPU. But this is the most -- there was nothing happening. There was no graphical user interface running or anything. So, like, there should have been no CPU usage at all because it was idle. But occasionally there would be these 500 milliseconds, at worst 700 milliseconds or so, of high CPU usage that would then go down again. And that's just not what I was expecting. And so I ran a profiler. And that was the first moment where, like, weird symptoms started to be noticeable because normally when profiling a system where all the debug symbols are installed, you should, like, have attribution of, like, where the time is spent including a symbol name. But, like, for most of it, actually, there was no symbol name visible. So there could be various reasons for that, like, that I forgot to install some or that there was some JIT compilation or something that could all cause this. But none of those should be the case here. So I started looking more at -- at what could be causing this. And, initially, I didn't think it was, like, a backdoor or anything. It was like a weird symptom. And -- but the more time I spent the weirder it got.

Sherrod DeGrippo: Thomas, I know, you've looked at this as well. You have a copy of the backdoored libraries and everything, and you took a look at it. What did -- what was -- what were your findings when you were looking at it?

Thomas Roccia: I think it's a crazy attack. This is super sophisticated. I've never seen anything like that before, and I have been working on several investigations. So I think it's super impressive, the level of sophistication, the details, the technicality of the attack itself. And it's -- and there are still some elements to uncover because it's so complex that we are still discovering new elements from this attack. So it's super crazy. And, actually, I was listening to Andres, and I had a question in my mind. At what point did you realize, like, that it was a backdoor? Like, at what -- what point did you say, Oh, wow. The stuff that I'm looking at now, there is a backdoor?

Andres Freund: I think there wasn't a single point. I was seeing weird things. And then for moments I thought I had the explanation for the weird thing. So then, I was like, okay. Maybe it's not a problem. But then the next weird thing would appear. And, like, partially, that was actually caused by the fact that the build system integration of the backdoor is so weird that it doesn't actually work 100% reliably if you rebuild. So sometimes when I would rebuild the backdoor would vanish. And then I would rebuild again, and the backdoor would reappear. And there's also some concurrency bug in the Makefile. So if you built with too much concurrency, sometimes it wouldn't work. So it would also then vanish again. And then there were all these countermeasures in the backdoor so that, if you run SSHD not from a systemd service but, like, outside of it, the backdoor just deactivates itself unless you do some trickery. But it took me a while to figure out that trickery. But once I had figured out how to keep the backdoor active, even when I was, like, investigating how it worked, then it was very clear that there was an actual problem.

Thomas Roccia: And what did you think at that time?

Andres Freund: I don't quite remember the order of investigation anymore because it was all a bit crazy. And I think fairly late at night and stuff. But I think the first moment where it was just like, oh, there's definitely something weird going on was when the build system would end up building the same file multiple times and replacing it with different content. And there's no -- like, compilers are normally fairly deterministic, leaving timestamps and stuff aside. And when there is like suddenly an object file that was like 50 KB or something larger than before, then it was clear that there was something bad going on. But, at that point, I still thought the problem was a distribution-level problem, not an upstream code problem because if -- like, if you build from the source code, like, using Git, if you check out the code that is in Git, it actually does not have all the components to make the backdoor active. You have to build from the packaged source that they distribute. And even then, if you just use your normal build commands, it will still not be active. You have to build it in a way that is targeted for packaging. So there has to be either like a debian directory with a rules file in it or I think RPM_ARC -- RPM_ARCH has to be set to amd64 or something for it to actually be active. And so, initially, we thought it was just a Debian problem because Git didn't reproduce it. But, at some point, I managed to actually find that the sources from upstream included it. And, at that point, it was clear that it was an upstream problem.
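
The gating Andres describes, where the malicious build steps only fire in a packaging-style build, can be illustrated with a short sketch. This is not the backdoor's actual logic, which lived in obfuscated build scripts; it only mirrors the two conditions mentioned in the answer above. The function name and the accepted RPM_ARCH values are assumptions made for illustration.

```python
import os
from pathlib import Path


def looks_like_packaging_build(source_dir, env=None):
    """Return True if the build resembles a distro packaging build.

    Mirrors the two conditions described in the transcript:
      * a debian/ directory containing a rules file, or
      * RPM_ARCH set in the environment.
    The accepted RPM_ARCH values below are an assumption for illustration.
    """
    env = os.environ if env is None else env
    has_debian_rules = (Path(source_dir) / "debian" / "rules").is_file()
    rpm_arch_matches = env.get("RPM_ARCH", "") in {"x86_64", "amd64"}
    return has_debian_rules or rpm_arch_matches
```

A plain checkout built with ordinary commands satisfies neither condition, which is consistent with the backdoor not reproducing for someone building by hand.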

Sherrod DeGrippo: Does this mean that that that was in every version of SSHD?

Andres Freund: The problem itself is not actually directly in SSHD. There was no bug in SSHD. That's the tricky part. SSHD, if you built it from upstream sources, would also actually not trigger the problem. But many Linux distributions patch SSHD to integrate with systemd startup notifications, which are used to, like, tell systemd when the service has fully started up. And the library that they were using for that actually has a dependency on liblzma, which is where the backdoor was inserted. So only that indirect loading of the library would actually cause the backdoor to be active. So it's very indirect.
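
The indirect dependency described above can be checked on a given system by listing the shared libraries a binary resolves at load time. The sketch below is a hedged illustration: it assumes a Linux host with the standard `ldd` tool, and the helper names are hypothetical. `ldd` reports transitive dependencies, so liblzma appears even when the binary only links it through another library.

```python
import os
import subprocess


def parse_ldd(output):
    """Return the library file names listed in `ldd` output text."""
    names = []
    for line in output.splitlines():
        parts = line.split()
        if parts:
            # The first field is either a soname or an absolute path.
            names.append(os.path.basename(parts[0]))
    return names


def pulls_in_liblzma(ldd_output):
    """True if any resolved library is liblzma (directly or transitively)."""
    return any(name.startswith("liblzma.so") for name in parse_ldd(ldd_output))


def run_ldd(binary_path):
    """Run `ldd` on a trusted binary and return its text output."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `pulls_in_liblzma(run_ldd("/usr/sbin/sshd"))` would report whether the installed daemon loads liblzma; the path is an assumption and varies by distribution. `ldd` should only be run on binaries you trust.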

Sherrod DeGrippo: So I'll ask Thomas because, you know, Thomas and I have worked together on a couple of things at Microsoft because we're both, like, in security groups. And so, Thomas, help me -- help me understand from a threat intelligence perspective here. This seems very convoluted to get this backdoor on systems and to execute and to run. From a threat actor's perspective, how are they counting on this to work? How are they finding backdoored instances to make this work? Because traditional, like, malware to me would have some kind of botnet capability that would phone home to, like, some kind of command and control server or something like that. But this didn't have that. So how did that work?

Thomas Roccia: Yes. So that's a very interesting question. And I'm not sure if we will have the answer someday. But if you think about it, the process of Jia Tan is so sophisticated that that's the result of months and years of working. Like, the people that have been working on this backdoor have a high level of understanding of Linux systems and the compilation process of glibc. It's very, very impressive. And I think -- what is even more impressive is the time behind the delivery of this attack because, with the GitHub commits and the process of developing the trust, whether it came from one author or from several authors, we don't know at this time, it's very interesting because this guy has been working on the project for the past two years. And there are multiple elements that come into the story because there are also some additional personas that have been part of the -- part of the story for making pressure to release this kind of tool, these kind of features at some point, which are very, very interesting. So this is -- from a threat actor perspective, I think that's a very interesting question. There is clearly a level of sophistication, which is unique, or at least, which is something public today. But it's very, very hard to say. I'm not even sure if one day we will know.

Sherrod DeGrippo: Okay. I mean, I know that you've done some tracking back of messages and commit comments and things like that. And, in fact, I think I saw a meme that the threat actor had actually said, like, it was a kitten, and it was, like, please give me a commit bit. And there was sort of a narrative of overwhelm of the original maintainer in terms of their personal life and things. Did you go back and look at those communications? What did you find there?

Thomas Roccia: Yes, I did. So -- so that's what is very interesting because during this process of, obviously, the user Jia T45. I don't remember exactly. So during the process of participating in this open source project, there has been, like, a lot of information, a lot of email exchanges. He tried to, you know, be part of the project, and at some point he tried to lead the project as well. And there are also some additional people that have been part of the -- part of the story because they also put the pressure on the original maintainer to hand the maintenance itself over to Jia. And that's very interesting to observe because, for the past two years, the guy or these guys have been working very carefully to be part of the XZ project and to, ultimately, push this backdoor and release it to the world, if it hadn't been detected by Andres.

Sherrod DeGrippo: So, Andres, let's talk about that 500 milliseconds. So the story, the legend, the lore out there is that you saw 500 milliseconds of latency. And whatever is going on in your brain caused you to be like 500 milliseconds of latency is absolutely not okay. You've told me, though, that that's a myth. So what -- what is the deal with this 500-second -- 500 milliseconds?

Andres Freund: I think the myth goes that's what caused me to look into all of this. But the 500 milliseconds is actually true. Like, this -- it takes 500 milliseconds longer on my hardware to log in or to just even display like the help from SSHD. But I only got there after I already had seen the CPU usage -- and seen that it was, like, having -- like, the profile was showing stuff being executed that there weren't debug symbols for. So it was like a secondary investigative perspective. Like, it was very useful to have that 500 milliseconds because it allowed me to debug when the backdoor was deactivating itself because I was trying to figure out how to reproduce it in a more contained way. But whenever I would start it in the shell, it wouldn't be active. So I started it in the shell, tried to log in initially and then, later, just showed the help and saw how long that took. And then it quits very quickly without needing to attach a debugger or anything. So I could see whether the backdoor was active or not. And that's where the 500 milliseconds played a role.
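
The timing check described above, running a quick command and seeing how long it takes, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure Andres actually used: the helper names, the number of runs, and the 0.4 second threshold are all hypothetical, chosen only because the transcript mentions a delay of roughly 500 milliseconds.

```python
import statistics
import subprocess
import time


def median_runtime(cmd, runs=5):
    """Run `cmd` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output and exit status are irrelevant here; only duration matters.
        subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def is_suspiciously_slow(measured, baseline, threshold=0.4):
    """True if `measured` exceeds `baseline` by more than `threshold` seconds."""
    return measured - baseline > threshold
```

A comparison would measure the same help invocation on a known-good build and on the build under test, then pass both medians to `is_suspiciously_slow`. Using the median keeps a single noisy run from dominating the result.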

Sherrod DeGrippo: Can I ask you about your hardware? Like, so you were talking about the resource usage being an indicator. Is this hardware that you were talking about underpowered, overpowered, anything that would --

Andres Freund: Probably be overpowered.

Sherrod DeGrippo: Overpowered. Okay.

Andres Freund: Two-socket workstation with Cascade Lake Xeon CPUs, so 2 times 10 cores. So it's not new but it's also not underpowered. But because I was seeing -- trying to see those small performance differences --

Sherrod DeGrippo: You were able to see it with that.

Andres Freund: -- how powerful the hardware was because if I was looking at the small performance differences, I couldn't -- like, it didn't matter how many cores the machine had or something. I guess if I had a very -- like, on a laptop, it might have been -- a modern laptop might have been harder to see because, like, the individual cores are a bit faster than on a larger server-class CPU. And so the spikes would have been shorter. I had also disabled, like, Intel turbo, like the Turbo Boost thing where it goes above the baseline CPU performance for a short amount of time. I disabled that to get better reliable -- like, more predictable benchmark results. And that also make the -- made the period where the CPU use is higher, longer. So that helped.

Sherrod DeGrippo: Got it. So let me understand this, too, because, from a timeline perspective, the first commit of that JiaT75, who seems to have been the person to put the backdoor in, was from February 2022. Did that person continue to develop and improve the backdoor over time? Or did they put it in once and were done? Thomas, you seemed to, like, track some of that down.

Thomas Roccia: Yeah. So, basically, during this timeframe, he gained the trust of the main maintainer, probably of the community as well. And he was like a -- like an efficient contributor to the project itself during this time. And he released the backdoor just at the end. So there are some elements before the release of -- of this last month, March 2024. There are some traces that Jia Tan pushed some commits to prepare the attack but not the attack itself, not the backdoor itself. But the backdoor was pushed in March. So that's very interesting to see. And there is also something that I wanted to say. I think that someone, I think that's Brian Krebs, that -- that put that in the light. He mentioned that the email address used by the user JiaT75 is not known other -- in any place other than GitHub or Debian. So for example, he looked for this specific email address in public data leaks. And he didn't find anything. And he mentioned that it's very rare for someone to not be -- to not have at some point his email address leaked in one of the public data leaks. So that's very interesting to know because that potentially means that the persona behind Jia Tan, and the others as well, have been completely created for this specific purpose.

Sherrod DeGrippo: Andres, did you want to say something about the insertion of the backdoor over the timeline?

Andres Freund: Yeah. It was actually introduced in late February, like February 24. There was another release then in early March where they were updating the backdoor. And that was likely to address some of the crashes the backdoor was causing in some systems.

Sherrod DeGrippo: So they were doing like a bug fix on the backdoor.

Andres Freund: Yes.

Sherrod DeGrippo: Great.

Andres Freund: And possibly also a bit more improvements. There were like, I think, some other changes; but we don't yet fully know what exactly the set of changes are. They made some other -- made them more robust in other ways than just the crashes that they had.

Sherrod DeGrippo: This level of code review makes me want to throw up. I cannot like, oh, my God. The amount of diffs, like, I -- yuck. I don't like doing that stuff. You two seem very comfortable with it, and I don't understand why.

Andres Freund: I'm an open source maintainer. Looking at diffs is my life, basically.

Sherrod DeGrippo: No, thank you. That's going to be a no thank you for me. I'm glad that you guys are doing that stuff, though, because I think generally, you know, software developers have a bit more patience for that kind of stuff and open source developers as well. Thomas, did you want to say something else on that?

Thomas Roccia: Yeah, yeah. Andres mentioned that there are actually two versions of the backdoor, and the one in early March has a few tools that could potentially be used for an extension mechanism, meaning there is actually a feature that is looking for a specific file in one of the folders, in the test folder exactly, with a specific signature. So it's not clear, but it looks like it's a mechanism used to extend the future capabilities of the backdoor. At least it could be used for that. So that's very interesting to see the difference between the first version and the second one.

Sherrod DeGrippo: Okay. So let's -- let's talk a little bit more about it. It sounds like if Andres had not found this, that the threat actor that put the backdoor into the library was planning to continue to develop it and add more things to it? Is that what you're saying?

Thomas Roccia: Potentially, yes.

Sherrod DeGrippo: Oh, jeez. Okay. So let's talk software supply chain. Everybody's favorite thing. How concerning is this? It sounds like it's pretty, pretty freaking alarming. I mean, I've been watching the responses on social media and in the news. And people are shook, I guess, is like the best sort of Gen Z vernacular I can come up with because I'm elderly. Like, people are really, really alarmed by this. And so, Thomas, I'll start with you. Like, from a security perspective, what -- what are we looking at here, and how can people protect themselves? Because software supply -- there are two things, I think, that really terrify me. And it's corrupted software supply chain and SIM swapping. I'm horrified by both of those things because they feel very impossible to fix. But, from this perspective, what can organizations do?

Thomas Roccia: Complicated question. So, first of all, I want to say that Andres did a fantastic job on the detection. And this is -- I'm also intrigued to see how this backdoor was detected. It's super crazy. So very big kudos to Andres because this is very impressive. I'm pretty sure that, even if -- if most malware researchers had looked at this, they probably wouldn't have figured out what was going on. So that's very, very impressive, the detection of that. And this is also very concerning because, since it's so sophisticated, so obfuscated, the mechanism behind it, it's very difficult to find. And it's hard to say, but it's -- if we detected this one, how many did we not detect? So, today, I don't know what the good solution for that is because it relies on open source. And open source is supposed to be open. People have access to the code. But, even with that, it's -- like, if you look at this specific attack, it's super complicated to analyze and to understand. So I don't know what's -- what could be the solution. I'm not even sure if there is a solution.

Sherrod DeGrippo: Andres, do you have a point of view on that as a software engineer?

Andres Freund: Obviously, there's no way to prevent something like this from happening with complete certainty. At the same time, there aren't that many projects with that degree of exposure. Like, I'm sure that I could insert -- somebody with some skills could insert a backdoor into, like, random open-source projects. But most of them won't give you this degree of access to this many machines because, like, while the dependency tree of OpenSSH is bigger than we would like, there are some things in there that could be removed and that increase the threat compared to not having all those dependencies. But, at the same time, it was like I think 10 projects or something. And those are much more scrutinized and are going to be more scrutinized from now on than in the past. And we've -- also have caught a few other attempts over the years of introducing backdoors, and earlier in the introduction process. So I think it's not an all-is-lost kind of situation either.

Sherrod DeGrippo: Let me ask you one more thing around the supply chain, which is this. You're -- you've been in open source for a long time. Like, you're technically employed where part of your role is -- is open source work. Do you have advice or, like, wisdom for open-source maintainers or open-source contributors in making sure that they can do the thing that you did, which is find this problematic, vulnerable system?

Andres Freund: I think -- I don't know whether I have good advice. I think there is just basic stuff for engineering, things like if there are anomalies, be curious about what is causing the anomaly. It is -- it's very useful to have experience across multiple domains. I've done stuff on, like, many layers of the stack over the years. I've done some kernel stuff. I have done some library development, database stuff. So that gives you, like, some understanding of how stuff fits together, and that's very helpful. But that's just generic software engineering helpfulness. For maintainers, I think one aspect is to be wary if people are trying to pressure you with guilt. That is generally a very, very common thing to experience as an open source maintainer. There are a number of emails of, like, how have you -- how dare you not have added the feature I need? Or how dare you not have addressed this bug report that didn't contain enough information to actually even reproduce the problem at all. Like, that is a -- your bread and butter, basically, as a maintainer. And, in this case, it was used to pressure the maintainer to relinquish control. So I think, if you get pressured to relinquish control, that's a pretty big warning flag, I think.

Sherrod DeGrippo: Okay. And so, on that, after finding this really -- which is amazing. I think everyone is dumbfounded, not only that that existed but that you found it in the way that you did. And it's very, it's an incredible combination of luck and skill, which we don't see a lot, right? Like, it's the luckiest shot from the most skilled person to be looking at it. It's very strange. And, like, I can't -- I can't believe that you were like, oh, this is weird. And that was enough to click you over to, like, keep going, keep going. That's a very security kind of mindset. How high is your paranoia now compared to like three weeks ago?

Andres Freund: A bit higher.

Sherrod DeGrippo: Okay.

Andres Freund: But just to kind of emphasize how lucky it was. I actually was investigating this thing before, as it turns out, because some automated testing in Postgres had caused -- like, suddenly it started to fail on February, I think, 27 or something. And that was using Valgrind. And it showed that there were some uninitialized memory accesses, or some -- I don't quite remember the details. And I had started looking at it, and the obfuscation in the build system defeated me at that point because I didn't have enough time to look more into it. So I just suppressed the warning for the moment, hoping that the new version would fix it and then stopped there. I'm not sure that I would have -- so, first, with the same set of skills, I didn't find it the first time. And then I'm not sure that the second time I would have actually found it if I hadn't in the back of my mind already been like primed for there's something really weird here.

Sherrod DeGrippo: Thomas, what have you learned through this process? Because I know that you have just been doing essentially, like, digging through old logs and digging through all communications and looking at the code and everything. What kind of takeaways do you feel like you have there?

Thomas Roccia: Well, I think -- I think that that's a wake up call first for the open source community but also for the rest of the community, not only InfoSec. I think that's something that we need to process and to continue to investigate, to learn more about it. And also to draw potential conclusions and to, you know, move forward with that in mind. And this is also a very impressive sign that attackers are reinventing themselves constantly and trying to make something more difficult to analyze since the time I've done so. So, yeah. I believe this is a real wake up call for the open-source community but also for the InfoSec community. And there is a lot of stuff that we still need to understand and to uncover from this attack, not only from a technical perspective but also from a strategic perspective and potentially the threat actor, which is behind.

Andres Freund: It's perhaps also worth noting that it looks like the attacker's timeline had to be accelerated because there were some people arguing that, for security reasons, they wanted to reduce the dependencies of SSHD. And they had been working to do that for a couple of weeks or months. It wasn't yet clear whether it was going anywhere, but I assume that they were seeing that the chance would be going away and were, because of that, accelerating the timeline. And it's possible that, if nobody had done this prospective security work -- there was no concrete threat associated with it when it was being done. If that hadn't been in progress, they might not have introduced the backdoor while it was still kind of slow. And without the slowness, I wouldn't have picked up on this. So I think it shows the value of doing some prospective security work to reduce your exposed code areas and stuff like that and to introduce new security measures.

Sherrod DeGrippo: Thomas, have you seen any other open source projects that are, like, scrambling to do code reviews now?

Thomas Roccia: Yeah. That's a good question. I think I'm even more paranoid than I used to be. So -- though, at the moment, I'm still working on it. So we will see in the future.

Sherrod DeGrippo: All right. We're going to wrap up. Andres, anything final you want to leave the audience with, anything that we should know that we didn't cover? No? No.

Andres Freund: Let's say just continue here.

Thomas Roccia: Again, before we drop out, I just want to highlight the outstanding work from Andres. I think the whole community owe you some meals.

Sherrod DeGrippo: Yes. If you see Andres out, you -- do you drink beer?

Andres Freund: Sometimes but not very often.

Sherrod DeGrippo: Sometimes. What do you do you drink, a cocktail? You drink a bottle of wine? What do you like?

Andres Freund: Sometimes wine. Sometimes scotch.

Thomas Roccia: Oh, scotch.

Sherrod DeGrippo: If you see Andres out there -- you live in the Bay Area, right?

Andres Freund: Yeah.

Sherrod DeGrippo: You see him around the Bay Area, you can buy him a scotch, a nice one, okay? A good one. And tell him thank you for basically saving the ability to communicate over SSH in a variety of Linux distros. All right guys. We'll wrap up. Thank you so much for joining me. I really appreciate it. Thanks for joining us on The Microsoft Threat Intelligence podcast. That was Andres Freund and Thomas Roccia, and we will see you next time. Thanks, guys.

Andres Freund: Bye.

Thomas Roccia: Thank you.

Sherrod DeGrippo: Thanks for listening to The Microsoft Threat Intelligence podcast. We'd love to hear from you. Email us with your ideas at tipodcast@microsoft.com. Every episode we'll decode the threat landscape and arm you with the intelligence you need to take on threat actors. Check us out, msthreatintelpodcast.com for more. And subscribe on your favorite podcast app.