The BlueHat Podcast
Ep 45 | 1.22.25

Refactoring the Windows Kernel with Joe Bialek

Transcript

Nic Fillingham: Since 2005, BlueHat has been where the security research community and Microsoft come together as peers.

Wendy Zenone: To debate and discuss, share and challenge, celebrate and learn.

Nic Fillingham: On the BlueHat Podcast, join me, Nic Fillingham.

Wendy Zenone: And me, Wendy Zenone, for conversations with researchers, responders, and industry leaders, both inside and outside of Microsoft.

Nic Fillingham: Working to secure the planet's technology and create a safer world for all.

Wendy Zenone: And now on with the BlueHat Podcast.

Nic Fillingham: Welcome to the BlueHat Podcast, Joe Bialek.

Joe Bialek: Thank you. Good to be here.

Nic Fillingham: Joe, who are you, and what do you do here?

Joe Bialek: So I am a security engineer at Microsoft. I've been at Microsoft for -- it'll be 13 years in January. And primarily what I do here is work on Windows operating system security. And since security is a broad subject, more specifically, my focus is typically around preventing the exploitation of applications running on Windows and of the Windows operating system itself.

Nic Fillingham: And, Joe, one of the reasons why you are on the podcast today is that you had a very well-attended, very well-regarded and reviewed session at BlueHat 2024, and we're going to talk to you about that session and about that content. That session was called Pointer Problems, Why We're Refactoring the Windows Kernel. Before we jump into that, though, we'd love to learn just a little bit more about your career to date, your time at Microsoft, because I understand you did spend some time in MSRC as well, right? So you were actually working directly on cases as they were coming in from researchers. Is that correct?

Joe Bialek: Yeah. That is correct. Yeah. My back story at Microsoft, it starts off with me as an intern. And so I did an internship at Microsoft when I was still in college. I was working on a team that was working on authentication between servers in Azure. It was security-related but not really hacking, not really the flavor of security I wanted to do, I guess. And so, when I ended up joining Microsoft as a full-time employee after college, I ended up joining a different team than the one that I interned on. And that team became known as the Office 365 Red Team. And, at the time, Microsoft did not really do red teaming. Most of the teams at Microsoft would do application security reviews, so they would look for bugs in code and then report the bugs. But we had a group of rowdy people on the Office 365 Red Team who were finding some pretty interesting and severe bugs. And we kind of decided that we would start going a little bit further with them; and we wouldn't just report them, but we would actually exploit them and try to take control of data centers and try to evade detection from teams that were supposed to be detecting us. And the team just naturally stumbled into this red team role. Like, that was a super fun team to be on. Very informative for me, I think, early in my career. And what I realized when I was on the team was that the aspect that I was most drawn to was actually the low-level Windows stuff. Finding bugs and exploiting the bugs in these online services was fun and interesting, but I started building a bunch of hacking tools for our team to use so that we could take control of servers, and I built these tools to be really sneaky so that we wouldn't get detected when we were taking control of servers. And that was really fun and interesting for me. So, after a couple of years of being on the red team, I ended up switching to the Microsoft Security Response Center. And the reason that I made that move was because that's where there was a whole bunch of people doing interesting low-level security work, and I really wanted to learn how to do that stuff. And I was on that team, yeah, for quite a number of years, actually. I think I left in 2021 maybe. So I was on the team for, I don't know what the math comes out to, eight years or seven years. It was quite a while. It seems like it was quite a while that I was on the team. And, yeah. I handled MSRC cases while I was on the team. And I also started working on mitigations while I was on the team. So we'd see a bunch of cases come in, same general pattern. And I started thinking, hey. I'm tired of working on the same case over and over again. So how about I build something that just stops this from being possible? And then I don't need to work on these cases, and I can work on other stuff that's more interesting. Yeah. And so that was the start of my mitigations journey. That's still what I'm doing today. In 2021, I moved to the Windows Security Team. My job really didn't change. I just felt that, organizationally, it was a better fit for what I was doing. Back when I joined MSRC, Windows did not have a huge security organization. But, when I switched to the Windows Security Team, Windows actually did have a very big, mature, respectable security organization. So, since I was mostly focused on Windows security work, it seemed to make sense to just be on the Windows Security Team. So, yeah. That's kind of been my journey at Microsoft.

Wendy Zenone: I love that you were an intern and then just continued your journey. When you were creating tools to break into things, do you have to get permission to do that? Like, how does that work with the red team? And I know we're going to dive deeper into some other things. But do you just do it and then ask for forgiveness? Or do you let some folks know, like, Hey. We're going to be doing this thing.

Joe Bialek: Well, early on, I think it was do it and ask for forgiveness because, if a red team is not a thing that exists and then you go to some executive and say, Hey. You know, we're going to have this rowdy group of individuals break into your production services that are hosting whatever service for millions of customers, we're going to break into that and take control of it.

Wendy Zenone: You okay with that?

Joe Bialek: What do you think they're going to say to that? Or they're just like, Are you out of your mind? There's no way we want you on our servers. So, yeah. We had air cover from our direct management. And this is all from memory. I mean, this was 13 years ago or so, right? But we had air cover from our management to do this stuff and some level of executive support, but I don't believe we had executive support from the actual product owner. And --

Wendy Zenone: Spicy.

Joe Bialek: These people were quite upset when they found out what we had done, as you can imagine, and were pushing back against what we were doing, saying, Hey. This isn't right. You shouldn't be doing this. Who are you? What gives you the right, you know? Anyways, you know, I think everyone's first reaction to getting hacked is they're just, like, pretty upset about the whole situation and get defensive. And I think that's a super understandable human reaction to it, right? But they're upset because, you know, it's kind of egg on their face. But then when they say we want to shut this down, then the people that sponsored the team in the first place are like, wait a second. Why did you let this team hack into you in the first place? This isn't their fault.

Wendy Zenone: Right.

Joe Bialek: This is on you, right? Like, your job is to protect the service. And these people went and figured out ways that the defenses are lacking and that we need to improve it. So it can be kind of a little bit of a hostile relationship, though, because people feel like you're making them look bad. And, to a certain extent, you are in some ways. There's no sugar coating it, right? If you hack into someone's thing and they wanted it to be secure and they were telling everyone it was secure, then it makes them look like, Hey. You told us one thing, but you're delivering something else. But, over time, the red teams kind of evolve.

Wendy Zenone: But the alternative of someone externally hacking into them sounds a lot worse than an internal team.

Joe Bialek: That's right. And so that's the argument is, hey. At least you know about what we found, and you can go and fix it. If it was someone else who didn't work here, they might not be so nice. You might find out about this problem because everyone's email's leaked to the internet or something along those lines, depending on what the service handled. So people still don't love it. I think that, over time, we had to put a number of processes in place. I think there's some reasonable concerns that people have around, hey. You know, if we end up detecting you, or you end up causing a serv -- we don't want our engineers to have to work all weekend long, be up all night, time away from their families to deal with what is effectively a test. And, yeah. You could argue, well, if this was real, then you'd have to do that. But it isn't real, right? It's not a real attack. And you're not really building friends by making people work weekends because you want to hack into their thing, right?

Wendy Zenone: Right.

Joe Bialek: So I think that, over time, the red team kind of put processes in place to smooth that kind of stuff out to make sure that we can do our testing. But also we're not, like, actually ruining people's work-life balance and whatnot, you know, just in the name of performing a security test.

Nic Fillingham: And that's sort of the ethos, or the principle, really, around spinning up red teams and spinning up sort of bug bounty programs and any sort of vulnerability discovery and response, where you're trying to balance, okay. Let's create some degree of safety where ethical researchers, ethical hackers are able to go out and throw egg on our face but in a safe way that doesn't ultimately compromise end users and customers or expose real data to the world, et cetera, et cetera. And so I think what you've described here is probably a similar journey that maybe the vast majority of red teams, as they get spun up, not just at Microsoft but across the industry, go through, right? I'm curious, what was your -- you said a group of. You described your fellow red teamers as a group of -- you didn't say miscreants. What did you say?

Wendy Zenone: Rowdy.

Joe Bialek: Rowdy, yeah.

Nic Fillingham: Oh. Rowdy. Yes. A rowdy group. You're like, we still need to make a point here to sort of, you know, push our group, push our industry, push our product to integrate safety. And so that sounds like the path that you took. I think this is a great conversation. I'd love to keep going on this, but I also want to talk about your BlueHat session, Joe. So, if we could pivot a little bit, if that's okay, I mentioned the title before. It's up on YouTube right now. We'll make sure that the links are in the show notes. Pointer Problems, Why We're Refactoring the Windows Kernel. Maybe we'll come back to red teaming a little bit later in the conversation. One of the things that Wendy and I like to do on this podcast is we get an opportunity to ask some sort of fundamental questions to ensure that all of our listeners, but really just Wendy and me, know what others are talking about.

Wendy Zenone: Yeah.

Nic Fillingham: We're going to ask you a couple of rudimentary things here to sort of just help sort of set the playing field. So it's probably a big answer, but can you just help set the stage? What is -- in the context of your session, what is a pointer?

Joe Bialek: Yeah. So a pointer is just a value that represents a location in memory. So it's like a street address, how a street address points to a house. A pointer says, here's some location in virtual address space, and it has some data in it, presumably.

Nic Fillingham: And these are often, like, very long hexadecimal strings; is that right?

Joe Bialek: No. The size of a pointer is defined by the processor, and pretty much everyone today is using 64-bit software. And that 64 bit really means 64-bit pointer. That is the difference between a 32- and a 64-bit processor. Old computers use 32-bit pointers. New CPUs tend to use 64-bit pointers, although we can still run emulated old 32-bit software on the new processors.
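
To make the street-address analogy concrete, here is a minimal C sketch (illustrative only, not code from the talk). It prints the address a pointer holds and the pointer's own size, which comes out to 4 bytes when compiled for a 32-bit target and 8 bytes for a 64-bit target:

    #include <stdio.h>

    int main(void)
    {
        int value = 42;     /* some data living at an address in memory */
        int *ptr = &value;  /* a pointer: the "street address" of value */

        /* %p prints the address itself, typically in hexadecimal */
        printf("value lives at %p and holds %d\n", (void *)ptr, *ptr);

        /* 4 bytes on a 32-bit target, 8 bytes on a 64-bit target */
        printf("a pointer here is %zu bytes\n", sizeof ptr);
        return 0;
    }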

Nic Fillingham: Got it. All right. Pointer. Wendy, what's your question? What's your term we're going to have Joe describe?

Wendy Zenone: So many. But refactoring, from what I think it is, and tell me if I'm wrong -- this is the test for me. Now, is this the low-level code, like the C++; and then let's say you're writing Python. Is Python refactor? Okay. You need to just tell me. I'm going to stop guessing. What is refactoring?

Joe Bialek: All right. So, yeah. Refactoring means that we are rewriting the code or at least changing the code in some way, maybe not a full rewrite of the code; but we are doing a lot of changes to the code. And the code that we're talking about in this presentation is C and C++ code.

Wendy Zenone: Why?

Joe Bialek: Why? So the reason that we are changing this code is because the code is written in a way that currently mostly works but is not guaranteed to work based on the C and C++ language specification, which means that the compiler's optimizer could perform clever tricks underneath us to make our code run nice and fast, like we want it to; but those optimizations could break assumptions that the developers who wrote the code had about what the code is doing. And so the code won't actually work correctly if those optimizations are performed. So we need to change the code so that it follows the rules of the programming language such that, no matter how optimized the code becomes, it is still functionally correct.
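
One well-known instance of the hazard Joe is describing is the re-loaded read. In this hypothetical C sketch (the struct and names are invented for illustration), the developer assumes the length field is fetched from shared memory exactly once; but because the language assumes no one else is modifying that memory, the optimizer is allowed to discard the local copy and read the field a second time, turning a validated value into an unvalidated one:

    #include <stddef.h>
    #include <string.h>

    #define MAX_LEN 256

    struct request {
        size_t length;      /* modifiable by an untrusted party */
        char   data[1024];
    };

    /* 'req' points at memory another thread (or user mode) can
     * change while this function runs. */
    void handle_request(struct request *req, char *dst)
    {
        size_t len = req->length;   /* fetch #1, validated below */
        if (len > MAX_LEN)
            return;

        /* Concurrent modification would be a data race (undefined
         * behavior), so the optimizer may discard 'len' and re-load
         * req->length here: a second fetch that skips the check. */
        memcpy(dst, req->data, len);
    }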

Nic Fillingham: And, then, refactoring, my sort of limited understanding of refactoring is that you also -- you can refactor for a bunch of different reasons. One, you can refactor for efficiency. You can refactor because you want to make the code more readable and understandable by humans, right? You want to give maybe variables understandable names as opposed to sort of somewhat random descriptions. In this case, is that what you're trying to do when you're saying you're refactoring the Windows kernel? Is this about efficiency? Is this about readability? Or are there other goals that you're talking about in this particular session?

Joe Bialek: So the primary motivation for these changes is security. There are also readability improvements due to the refactoring that we're doing. I think a lot of the new code is actually much more understandable than the old code. But the primary motivation here is security.

Nic Fillingham: Got it. And, then, one thing that you talked about in the session, and I do really encourage folks to go and watch or listen to Joe's recording on YouTube, is how -- I think you said there was a team of six or seven of you. And you took two weeks, and you went to try and refactor 10,000 things; and you got through 1300 or something. And I'm trying to remember the numbers that you said in the session. But the question that I would love you to explain, if it's not a crazy long answer, is, why do humans need to do this work versus some sort of find and replace rules that you could run across the code, right? If you're just trying to make it a bit more readable, could you not apply some semi- or fully automated processes to do that?

Joe Bialek: It's certainly possible that you might be able to use some sort of automation to do these changes, for some of them, at least. But it really is preferable to have humans do these changes for a handful of reasons. One reason is that, in some cases, the code that currently exists is actually broken. It is insecure, even without compiler optimizations in place. Like, the code is just not correct code. And, as we go and review the code to refactor it, we realize, oh, this is just unsafe. And so we actually have a number of vulnerabilities that have already been fixed as part of this process because we've found those issues as we've done the refactoring work and then gone and opened MSRC cases to have the fixes for these back-ported to all affected versions of Windows. Another reason is that, in a lot of cases, we're not just doing simple find and replace on things. There are actually more changes that need to be done to the code to make the code look right and function the way we want it to and still be efficient. So just saying we're going to do search and replace, in some cases, sure. That makes sense. But, in other places, it doesn't make sense. We want to change the way the code is written a little bit so that it is still efficient. And the only way that we can really make that determination, is this a simple fix, or does this need a little bit more work, is by someone taking a look at the code who understands what the code does and making that judgment call. And once you've gone and taken a look at the code to make that judgment call, if it is one of these find and replace cases, it is very easy to just go and fix it yourself. Automation wouldn't really help you in that case. The fact that we want to have human eyes looking at everything effectively means that we're also going to have humans do all the changes.

Wendy Zenone: If we could just step back a little bit, we're digging into this, which is great. But for the listeners who haven't checked out your talk on YouTube and watched it or listened to it, can you give us a brief overview of the talk that you gave at BlueHat, and then what inspired the talk, and who was this talk intended for? Is it for a specific type of developer, a specific type of security engineer? If you could, just give us a little lay of the land of your BlueHat talk this year.

Joe Bialek: Sure. Yeah. First question, I believe, was -- maybe I'll answer the questions in reverse order because I'm going to need you to remind me what some of the earlier questions were.

Wendy Zenone: Sure.

Joe Bialek: Who was the talk intended for I think was the last question. And --

Wendy Zenone: Yeah.

Joe Bialek: -- this talk is intended for multiple audiences, I think. The first audience is security researchers. Researchers that are looking at Windows kernel mode components are going to be interested in this talk because they're going to be seeing all of these changes that are coming through and probably wondering why these changes are coming through. And they can help us identify areas that we've missed with these changes. Someone might look at this code and say, Hey. You did all this refactoring. I've seen the talk. I know what you're trying to accomplish with this refactoring, but you missed these three spots; or you did things slightly wrong here. So I think that having context for why changes are being made can help you when you're looking at changes to understand if they were done correctly and completely. I think that this sort of talk is also presumably interesting for developers who are working on kernel code or even user mode code because they're going to be impacted by these changes that are coming down the pipeline. This affects not just Microsoft developers but also third party developers. Once we're done getting our own house in order, we are going to be asking that third party developers outside of Microsoft also make similar changes to their drivers, and so I think it's useful for people to know what's coming down the pipeline and why those things are coming down the pipeline.

Nic Fillingham: So, Joe, I think just building on Wendy's question of who is this talk for, so you talked about Windows developers at Microsoft. And then you talked about some external developers, specifically people writing drivers. Are those the only two scenarios, where you have the people building the operating system writing the kernel, and then you have the people that are creating drivers that need to talk to the kernel? Or are there other sort of applications? Are there other sort of bits of technology that this is applicable to? Because whenever I hear kernel, my rudimentary understanding of all this is that, oh, this is the core of an operating system. And the only people that touch the kernel, in terms of who's writing the kernel, are the people that maintain the operating system. And in this case it's Microsoft. But you've talked about third parties that write drivers. So who else touches this?

Joe Bialek: So, for most of these changes, it is kernel developers and driver developers. But we have many thousands of third-party drivers that run on Windows, so that is actually a fairly large ecosystem. Some aspects of the talk definitely apply to all parts of the software stack, though. Some of this stuff applies to user mode, user mode applications. It could apply to boot code, EFI, UEFI, things like that. But certainly aspects of what I talked about apply to user mode applications, which is where most of the code on a computer ends up running. But the talk was definitely more catered towards the kernel audience because that is where we are focusing our efforts right now with all of the tooling and detection support for finding issues and getting issues fixed.

Wendy Zenone: All right. Now for my question of can you give an overview of your BlueHat talk, please.

Joe Bialek: Right. So the talk is about a series of different types of problems that developers have when they are reading through pointers or writing through pointers. And developers oftentimes have a lot of assumptions that, when they are writing C or C++ code, the basic operations that they write in their code translate down to assembly in some particular way. So they might say, hey. I'm reading a single byte from this pointer, and so I expect that is going to be translated by the compiler into a one-byte move instruction. And what the talk was highlighting for people is that that is, indeed, usually what the compiler does. But the compiler is not required to do that. And there are certainly cases where the compiler does not do that. And it can cause correctness issues in your code if the compiler does the thing that you don't expect, which can lead to security problems. It can lead to reliability problems. It can lead to your program just not giving the right output for the input it received. It's just not a good situation to be in. And there are things that you can do in order to make your code do the thing that you're actually expecting, but a lot of developers aren't really aware of what those things are. And it's very hard to detect these situations. And so I was talking about some of the tooling that we've built to help find these bad programming patterns that people have implemented so that we can go and fix them.
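
As a sketch of the point (using the common volatile-cast idiom rather than the specific APIs from the talk): the first function leaves the compiler free to merge, duplicate, or re-load the access, because only the value has to be preserved; the second tells the compiler the access itself is observable, so it must be performed once, as written, and mainstream compilers emit a single one-byte load for it:

    #include <stdint.h>

    uint8_t read_byte(const uint8_t *p)
    {
        /* The compiler must preserve the value, not the access: this
         * read can be merged with neighbors, re-done, or elided. */
        return *p;
    }

    uint8_t read_byte_once(const uint8_t *p)
    {
        /* A volatile access must actually happen, once, as written. */
        return *(const volatile uint8_t *)p;
    }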

Nic Fillingham: Joe, when I was watching your talk, one of the questions that I had -- and, again, I asked very rudimentary questions, in part because I'm not very smart. But one of those questions was -- you guys are meant to laugh at that, by the way. That was a self-deprecating joke. Anyway, thank you for laughing. Why is the output of the compiler so mysterious? Like, why is it that -- because there was a number of times in the session where you said, Oh, you might expect a compiler to do this. And sometimes it will, but sometimes it'll do that. And sometimes it'll do this. And I kept thinking, hang on. Why would a compiler output something different if the code is the same? Shouldn't it be deterministic in that sense? Can you explain why sometimes the compiler behaves one way and sometimes the other?

Joe Bialek: Yeah. Certainly. The answer is, it's very complicated. But --

Nic Fillingham: Okay. Thanks. Moving on. Next question.

Joe Bialek: Broadly, though, the compiler's job is to make your code go as fast as it possibly can while still being correct, right? And, also, the compiler's job is to make your code be small because people don't want to have super bloated code. It's more expensive to ship updates to people. It actually can cause performance problems in and of itself. Small code, fast code, that's what we want, right? When the compiler is evaluating your code, in order to perform these optimizations that it wants to do for you, it has a lot of rules that it has to follow to make sure that it is not breaking your code, right? And so it's constantly evaluating things saying, Hey. Can I do this optimization? Can I do this other one? But the problem for the compiler is that it does not have perfect knowledge of all of the code that is running, right? And so the compiler kind of looks at bite-sized chunks of code and analyzes those chunks and tries to make them go fast. And it turns out that what a bite-sized chunk of code oftentimes means in a compiler is it looks at code one function at a time. So it'll look at a function, and it'll say, okay. I'm going to make this function go really fast. And the compiler might see things in a function where it says, Okay. I see that you're doing operation X. And operation X might be safe. Like, it might not break this optimization that I want to do, but I can't prove that. And so I have to be conservative and just not do that optimization because I want to make your code go fast, but I have to make your code correct. So correctness is mandatory. Performance is best effort. And so why do you see code that looks somewhat similar get compiled so differently sometimes? It all comes down to what else is going on in that function, right? When you look at a code example, you might say, oh, I'm just doing this basic operation. Yeah. But, in the second example, you have some other thing that happens between operation A and operation B, right? And that other thing that you're doing in your function might freak the compiler out and make the compiler say, Okay. I can't perform this optimization that I want to do anymore because I don't know what's going on in here. Or I can't reason completely about what's going on in here. So that's typically what happens is that some code just compiles fine because, due to all the other stuff you have going on in your function, the compiler says, I can't do those optimizations. It's too scary. I can't prove that it's safe. And then you tweak a few things in your function, and all of a sudden the compiler says, Oh. Actually, I can prove that is safe now to do. And so I'm going to do that optimization for you to make your code go faster. And it's completely unobvious to developers, right? You say, I tweaked one random thing in this function; and all of a sudden the whole thing's broken. That one tweak you did all of a sudden allowed the compiler to prove that it thinks that something it's doing is safe.
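
A hypothetical sketch of that effect (the names are invented, and the shared flag is left deliberately non-atomic to show the hazard): in the first loop the compiler sees nothing in the function that could write 'flag', so it may hoist the load out of the loop and spin on a cached value forever; in the second, the call to a function it cannot see into forces it to assume 'flag' may change, so it conservatively re-loads it every iteration:

    int flag;          /* set by another thread; racy, for illustration only */
    void opaque(void); /* defined in another translation unit */

    void wait_hoisted(void)
    {
        /* No visible writes to 'flag' here, so the compiler may read
         * it once and loop on the cached value forever. */
        while (!flag) { }
    }

    void wait_reloaded(void)
    {
        /* 'opaque' might modify 'flag' for all the compiler knows,
         * so 'flag' must be re-loaded on each iteration. */
        while (!flag)
            opaque();
    }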

Nic Fillingham: Thank you. I get that. I want to ask just a follow-up. So, if I have two developer machines side by side and they're identical, and they're running -- everything's identical, and I put the same code into each of them identical, and then I hit Compile, should the output be identical on both? Or are you saying that there could be some slight variation from one to the other where the actual output may differ? Or, in theory, would they be identical?

Joe Bialek: Okay. So you want a simple answer, but nothing with compilers is simple. So the answer is, it depends.

Nic Fillingham: Why not, Joe?

Joe Bialek: Yes. Actually, in a lot of cases, the output would be the same because Visual Studio, I don't know if we have documented support for this, but for Windows, at least, Visual Studio supports reproducible builds. And I believe that other compilers support reproducible builds as well. And what reproducible builds means is that, if you feed me the same input, you will always get the same output. As far as input being the code that you want me to compile and output being the final binary that I spit out, they should be identical. And that's needed for, you know, all the supply chain stuff and whatnot, right? You want to be able to verify that, Hey. You know, someone didn't inject a back door in this code as part of the build process; so I should be able to build the code that was built 10 years ago, or whatever, right, back when this binary was produced. And I should get that same result if I use the same compiler, the same code base. If you're not using reproducible builds, then the binary itself can change. And the way that it changes is up to the compiler. I would say, in general, you wouldn't really expect the optimizations to change that much, at least. But they could, I guess. I mean, if you don't have a guarantee that the builds are reproducible, then it's up in the air what you're going to get, I guess, would be the short answer there.

Nic Fillingham: It sounds like compilers are basically teenagers.

Joe Bialek: Yeah.

Nic Fillingham: Where you might give them the exact same instruction on Monday and they perform one way, and then you give it to them the following Monday and they do something completely different. That was much funnier in my head. Okay. Team, what was your next question?

Wendy Zenone: So we touched on a little bit the refactoring of 10,000 spots in the kernel. Is that done? Or in the two weeks, was it finished? Or do we still have work to do on that?

Joe Bialek: There's still plenty of work to do, but we're making good progress. So the way that we're staging a number of these projects is that we try to start by refactoring the kernel itself. And what I mean by that is the ntos kernel binary, which is the core kernel of the operating system. It's not a driver. And the reason that we like to start work on the kernel itself is because, one, we don't need to worry about implementing driver support, which can be painful for some of these things. The kernel is fully self-contained, so it's easier to implement these technologies just for the kernel to use. By focusing on the kernel first, we can also prove out that these things that we're doing work and are performant and don't have any weird gotchas that we didn't foresee when we were spec'ing out the feature. And we get ahead of all of that stuff prior to exposing it to drivers because, once you expose interfaces to drivers, it becomes a lot more painful to go and make tweaks to things if you realize, oh, we actually want to do things slightly differently. Right now you have -- oh, this driver is depending on it the way that you exported it already. We start with ntos kernel. And once we're happy and once we prove out that it works with ntos kernel, then we can go to every other driver and say, you know, we already have proved that this works in the kernel. In ntos kernel specifically, we have, you know, ported everything. You should be able to as well. And so that's where we are with user mode accessors, specifically, which is one of the things that I talked about. And that's the thing that you were asking about, Wendy, the 10,000 spots that need to be refactored. That is specifically places that are interacting with user mode memory. So I think at the time I gave the talk we had ported about 1300. I think now we are somewhere over 2,000. I don't know exactly where we are over 2,000. But we are -- I think last I looked we're about two-thirds done with ntos kernel and making good progress toward getting all the way done. And once we're all the way done with ntos kernel, then we'll expose it to drivers. And then we'll start working on getting all of our drivers converted. For one of the other topics that I discussed in my talk, though, which was the concurrency issues that we were using KCSAN to detect, we have actually completed our pilot with ntos kernel. And we fixed lots of bugs in ntos kernel that the concurrency sanitizer found. And we have actually scaled that out to drivers across Microsoft now and are making solid progress getting all of our drivers ported and, I don't know, correct, I guess you would say, under the KCSAN rules. So, yeah. We're making good progress. But these projects are huge, and they take a lot of time. So I always go into them assuming that it's a multi-year effort because it almost certainly is always a multi-year effort.
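
For a flavor of what a concurrency sanitizer like KCSAN flags, here is a hypothetical sketch in Windows C (the counter is invented; InterlockedIncrement is the standard Windows atomic primitive): a plain increment compiles to a separate load, add, and store, so two threads can interleave and lose updates, and the sanitizer reports the conflicting plain accesses; the atomic version is a single indivisible read-modify-write:

    #include <windows.h>

    static LONG counter;

    /* Racy: load, add, store can interleave across threads, losing
     * increments; a concurrency sanitizer flags these accesses. */
    void increment_racy(void)
    {
        counter++;
    }

    /* Fixed: one atomic read-modify-write. */
    void increment_safe(void)
    {
        InterlockedIncrement(&counter);
    }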

Wendy Zenone: And to add on to that, how do these changes strengthen security for the average Windows user?

Joe Bialek: Sure. So these changes strengthen security for the average Windows user in a couple of ways. The first way they help improve Windows user security is, because the programming model itself is fundamentally stronger, it makes it less likely that we will make mistakes when we are writing new code. And that's great, right? You don't want to take a new operating system and have it introduce a vulnerability that the old one didn't have. This model aims to prevent that for these specific types of bugs, at least. And the other way that it makes people safer is that we are eliminating classes of issues in the platform that, you know, in some cases, they're not affecting anyone yet. But, as we've talked about, compilers can be seemingly unpredictable sometimes. And so we're reducing the risk that some new compiler optimization that's really great for performance comes along and just makes all of our code unsafe. And that would obviously be a nightmare for Windows users. So we're getting ahead of that problem now. You know, if you think about it, you don't want to build a house on top of a shaky foundation, right? So we're going through and trying to make sure that the Windows kernel mode foundation is absolutely rock solid for people to build apps on top of.

Nic Fillingham: I'd like to ask you, Joe, about researchers. So this is the BlueHat Podcast. BlueHat is obviously focused on security research. With the work that you're doing here, you talked a little bit in the session about how there were some number of cases, 56 or something I think, that were used to sort of identify a broader issue. Moving forward, though, is there a request or guidance or tips for the research community of what they can do to help test out this work and/or further research ways to improve it? Do you want to hear from researchers where they are able to crash the kernel in some of the ways that you're referring to as part of this refactoring work? Or do you want them to go and reverse engineer something and look for things? Like, is there sort of a go-do here maybe for the research community, or perhaps even just a takeaway that is applicable to researchers based on this work that you're doing?

Joe Bialek: In the immediate future, probably not. We're, of course, always happy to -- not that happy but happy to receive reports that you can crash our kernel and certainly want to fix that kind of thing. But with relation to these specific projects, because right now we are just in execution mode trying to get all of our code ported, there's not really a lot of asks that come to mind for me for the research community right this second. If they find problems with the stuff, then absolutely report that to us. In the future, though, I think we'll definitely be interested in working with the research community and seeing if people, for example, can find problems where, hey, you know, forgot to port this area that does interact with user mode memory or other oversights that we might have made as part of the porting. But right now, since we know we're not done with porting, there's not really an ask that we can make to researchers, right, because they'll just send us a bunch of reports that won't be useful for us. We'll probably just say, yeah. We know that's not ported yet. It's on the list. So I don't want to waste anyone's time until we get to a state where we say, all right. We think we're done. Do you agree?

Wendy Zenone: What are some go do's for developers? Like, developers watch your BlueHat talk on YouTube. What should they go do?

Joe Bialek: So I think one thing that developers could start doing now to prepare themselves for the future is with respect to user mode memory accesses. We haven't published the new user mode accessor API set for developers yet. But developers can go and implement their own wrappers around user mode accesses today that, under the covers, just do the same thing that their code was doing. By going and actually annotating that in your code, it'll make it easier for you to adopt the new APIs when we do publish them, which should be as part of the next Windows release, because you'll already have done all the homework to identify the spots that you need to go and convert. And I think the other takeaway for developers is just to be conscious about what optimizations can happen underneath you and ensure that you are actually writing code correctly based on that, because we have a number of cases where developers were not aware of what the compiler could do underneath them. So they thought they were writing correct code. Turns out it's not actually safe. So part of the purpose of the talk was educational: to say, this is what the compiler can do underneath you. And, if you want your code to be safe, you need to be using these APIs, not doing things the way that you were doing before. And the talk does list out some of the APIs that you can use, like for safer lockless programming and whatnot. So that's an immediate takeaway I think developers can have. We don't have the detection tooling published yet for some of this stuff, and we're working on that. You know, we unfortunately don't have the ability to go and tell developers, Here. Go run this tool. It'll find all your problems for you, and then you can go and fix them. We're not quite there. But we're working on getting there, and hopefully by the next Windows release all that stuff will just be available.
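
As a sketch of what such a wrapper might look like in a driver today (hypothetical code, not the forthcoming API set; ProbeForRead and structured exception handling are the existing kernel-mode facilities, and the volatile cast keeps the optimizer from re-reading the user memory):

    #include <ntddk.h>

    /* Hypothetical wrapper: capture a ULONG from user-mode memory
     * exactly once, with probing and exception handling. Annotating
     * call sites this way makes swapping in the real APIs easier. */
    NTSTATUS ReadUserUlong(_In_ const ULONG *UserPtr, _Out_ PULONG Value)
    {
        __try {
            ProbeForRead((PVOID)UserPtr, sizeof(ULONG), TYPE_ALIGNMENT(ULONG));
            /* volatile read: performed once, as written */
            *Value = *(const volatile ULONG *)UserPtr;
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            return GetExceptionCode();
        }
        return STATUS_SUCCESS;
    }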

Wendy Zenone: And what are some lessons learned from the refactoring process that might be applicable to others?

Joe Bialek: So I think one of the things that we had done early on and -- or this is going to get maybe a little into the weeds. Feel free to ask me if I need to explain things, but --

Nic Fillingham: Explain weeds. Oh, sorry.

Joe Bialek: Yeah. So when we had our V team working on making these changes, one of the things that we were looking at doing was changing exception handler scopes in the kernel. And in some of our kernel code we'd look at it and we'd say, okay. You know, you're doing a user mode access, and so you have to put that inside of an exception handler scope. But your exception handling scope is really broad, and it has a whole bunch of other stuff inside of it that doesn't need to be there and probably shouldn't be there. And so we took it upon ourselves to go and clean some of that up, to shrink down some of these exception handling scopes so that, you know, we leave the code in a better spot. And we actually ended up breaking the code in a couple of situations because exception handling scopes are a very nuanced thing. And it was very easy to miss situations like, oh, this function that we called inside of an exception handling scope does have a return value, but it also can throw an exception. And so we would just -- we made a couple of mistakes, and one of the things that we ended up deciding for this project was cleaning up exception handling scopes is a righteous change to make to code, but we don't want to do it at the same time that we're doing all this user mode access porting because we are basically adding more risk to the project. And it's already a very risky project to say we're going to go and refactor 10,000 locations. So adding even more changes that aren't strictly needed for security, like it's more of a righteous change but not strictly necessary, that's just too much risk. And if we tell everyone to clean up their exception handling scopes and they cause all of these failures, then it could give the project a bad name, right? And people might say, Hey. You guys haven't really thought this through. This is causing lots of problems. We shouldn't do this. And so we don't want to strive for perfection and then have it kill the whole project, and then we don't get the goodness that we really need.
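
A hypothetical before-and-after sketch of the mistake Joe describes (CopyFromUser and ProcessBuffer are invented helpers; assume either can raise a structured exception): shrinking the scope looks mechanical, but moving a call that can itself raise outside the __try silently removes its safety net:

    #include <ntddk.h>

    /* Invented helpers; assume either one can raise an exception. */
    VOID CopyFromUser(PVOID Dst, PVOID UserSrc, SIZE_T Len);
    NTSTATUS ProcessBuffer(PVOID Buf);

    NTSTATUS HandleBroadScope(PVOID UserPtr, SIZE_T Length)
    {
        UCHAR buffer[64];   /* length validation elided for brevity */
        NTSTATUS status;

        /* Before: broad scope -- both raising calls are guarded. */
        __try {
            CopyFromUser(buffer, UserPtr, Length);  /* may raise */
            status = ProcessBuffer(buffer);         /* may ALSO raise */
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            status = GetExceptionCode();
        }
        return status;
    }

    NTSTATUS HandleNarrowScopeBuggy(PVOID UserPtr, SIZE_T Length)
    {
        UCHAR buffer[64];

        /* After (buggy): the scope was shrunk to just the user-mode
         * access, but ProcessBuffer can still raise, and now nothing
         * catches it. */
        __try {
            CopyFromUser(buffer, UserPtr, Length);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            return GetExceptionCode();
        }
        return ProcessBuffer(buffer);   /* unguarded raise path */
    }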

Nic Fillingham: That's a great takeaway. Or I should say it's definitely a great takeaway. My follow-up question here is how did you determine that there would be too much risk if you added that to the refactoring process? Like, are there numbers you can crunch that can actually give you sort of a risk score in this, or was it really about whiteboarding it out and deciding we're just going to do this one thing. We're just going to have this one goal and this one task?

Joe Bialek: No. There's no algorithm for this. The way that we determined it is, we said, hey. We had a team of six developers that were all skilled developers. And we were really trying to be super careful, very diligent about this porting to make sure that we did things the right way and we didn't cause problems. And we still made mistakes. And then you kind of extrapolate and you say, not that we want to talk badly about other kernel developers, right? But it's one thing to have a small team of six people that's passionate about the project and deeply cares about this stuff, and it's another thing to go and file bugs against the whole developer ecosystem and say, hey. You need to go and fix this stuff. Here's changes you need to make. Right? The level of diligence that everyone else is likely to put in is probably, on average, going to be less than the amount of diligence that our V team, that is, like, deeply passionate about this stuff, put in. So if we're making mistakes, then it's likely that other people are going to make more mistakes is basically the way that we look at this. And that was enough for us to say: we're messing this up. This is clearly very tricky, and it's just not worth the risk. It's not strictly needed for this project, so we shouldn't add it. If it was just, like, a freebie, like, you can do this. Great way to clean up your code. No risk. Sure. Throw it in. But, like, it is risky; and so we don't want that risk as part of the project.

Nic Fillingham: Yeah. That's a great takeaway, sort of just understanding how to set limits for yourself and how to understand when and where to be conservative. We are coming up on time, Joe. I wanted to ask you a couple of follow-up questions, one that is related and then one that's not. The first, related one is, you mentioned very early in your introduction that you really enjoy this space. And I wanted to ask you, what do you love about this work? What do you love about low-level code? What do you love about the kernel? What is it that gets you out of bed in the morning for this stuff?

Joe Bialek: Sure. Yeah. What gets me out of bed in the morning is my daughter. But what gets me out of bed in the morning for work is I just think this stuff is, I don't know. I find it really fascinating, especially things around memory management. And mitigating bug classes that are very challenging to mitigate is just a very interesting intellectual problem to think through. How do we stop a bad thing from happening while still being compatible, still having good performance, still having a scalable platform? Yeah. It's just very interesting. And there's so many trade-offs that you need to make when you're doing this kind of work. And it just makes it really fun to think about. I don't think it's very fun to work on problems that have easy, obvious answers. And this is a space where, actually, most of the problems don't have answers. There is an answer, but it's not a shippable answer because it would tank performance or have other bad side effects for the platform. It is very interesting to work in a space where success is not guaranteed. Many problems don't have acceptable solutions. But there are problems that do have good solutions if you can find them. And so that's the trick is finding them. And I also just work with really smart, amazing people. I find that I'm happiest when I'm learning, and when you work with really amazing people, then you just constantly feel like you're the idiot in the room. You're always learning, right? And that's the best place to be, as far as I'm concerned, just constantly getting smarter.

Nic Fillingham: I love being the idiot in the room. That's my happy place.

Joe Bialek: It's the best place to be.

Nic Fillingham: This is an audio-only podcast, but sitting behind you on the wall is a Pink Floyd '87-'88 World Tour poster. Is that an original poster, or is it a reprint?

Joe Bialek: I have no clue. My friend went on vacation one year when I was in high school and bought this poster from, I don't know, some shop that he found. And the cardboard was beat up. Like, it wasn't -- he didn't buy it new. But I have no idea if that is an original print. I certainly wasn't at the show. I wasn't born.

Nic Fillingham: This is 1987, so that's giving us a little bit of a clue. But are you a Pink Floyd fan?

Joe Bialek: Yes.

Nic Fillingham: That was going to be my next question.

Joe Bialek: Yes. I'm a Pink Floyd fan.

Nic Fillingham: Lovely. And are you also a Metallica fan? Because I can see over your left shoulder a Nothing Else Matters piece. I can only read those words, but I think it's maybe the lyrics of the song in the shape of -- is it Kirk Hammett's guitar? No. Whose guitar is that?

Joe Bialek: No. It's James Hetfield's. Yeah.

Nic Fillingham: Oh. James Hetfield's. It's an Explorer, right?

Joe Bialek: Yeah. It's an Explorer. That's right.

Nic Fillingham: Beautiful. Are you a guitarist? Are you a rock and roll fan? What do you want to tell us about your life outside of kernel refactoring?

Joe Bialek: I'm a hobbyist guitarist. Based on the number of guitars you have on the wall behind you, I think that you're probably much more serious about it than me. I play for myself.

Nic Fillingham: Oh, this is a green screen.

Joe Bialek: Oh, is it?

Nic Fillingham: No, no. These are real.

Joe Bialek: I was going to say, that's a pretty convincing green screen. No. I play guitar. I don't play guitar super well, but I do it for my own personal enjoyment. I'm into cars. I have a big garage that I just finished building this last year. And I've got a car lift in it, which I've wanted for many years.

Wendy Zenone: That's cool.

Joe Bialek: Yeah. I have a 1987 BMW 3 Series. That was my first car, and I still own it. And it's just become more and more of a track car over the years. So I'm currently stripping the interior out and about to put a roll cage in it and racing seats and all that sort of stuff.

Wendy Zenone: Wow.

Joe Bialek: Make it a little bit safer. Yeah. I do woodworking. I don't know. I like building things. I like doing things myself. I don't like hiring things out. I definitely live by the motto, if you want something done right, do it yourself. So that's what I aim to do.

Wendy Zenone: Are you a climber?

Joe Bialek: Yes, I'm a climber. I do bouldering. I'm terrified of heights, but I like rock climbing.

Nic Fillingham: I only learned about bouldering recently from a buddy of mine, Brian. And it blew my mind. You can rock climb, but you're only four feet above the ground. This is fantastic.

Joe Bialek: Yeah. You can mostly just sit around and hang out with your friends. And then every couple of minutes you go and work on a problem. It's super fun.

Nic Fillingham: Before we let you go, as a guitarist, fellow guitarist, I have to ask. Do you have a dream guitar that you would want, you would one day love, or have you procured your dream guitar?

Joe Bialek: I have a 12-string acoustic guitar that I think is super awesome. As far as my hopes and dreams with guitar, I think that all of my three guitars -- I have a six-string acoustic, I have a 12-string acoustic, and I have an Epiphone Les Paul -- all of these guitars are substantially more capable than I am. Really, I need to get better before I start thinking about new dream guitars because they outclass me handily.

Nic Fillingham: Oh, gosh. I wish -- if I applied that rule to my instrument collection, I wouldn't have any. We'll have to do another follow-up episode on the process of your car overhaul, your guitars, your carpentry. But I think we are coming up on time.

Joe Bialek: Yep.

Wendy Zenone: Before we go, is there somewhere that folks can follow, find you online, read about any updates that you're working on, or any other projects or presentations, conferences?

Joe Bialek: I do have a Bluesky account that maybe I'll start posting more stuff on. It's JosephBialek.bsky.social. It's my first and last name.

Wendy Zenone: Awesome.

Nic Fillingham: Awesome. Joe Bialek, thank you so much for joining us on the BlueHat Podcast. Thank you so much for presenting at BlueHat 2024. And hopefully we will get to see you and speak to you on another episode or another BlueHat conference in the future.

Joe Bialek: Yeah. Sounds good. Thanks for having me.

Wendy Zenone: Thank you. Thank you for joining us for the BlueHat Podcast.

Nic Fillingham: If you have feedback, topic requests, or questions about this episode --

Wendy Zenone: -- please email us at bluehat@microsoft.com. Or message us on Twitter @MSFTBlueHat.

Nic Fillingham: Be sure to subscribe for more conversations and insights from security researchers and responders across the industry --

Wendy Zenone: -- by visiting BlueHatpodcast.com or wherever you get your favorite podcasts.