Research Saturday 10.2.21
Ep 203 | 10.2.21

IoT security and the need for randomness.


Dave Bittner: Hello everyone, and welcome to the CyberWire's Research Saturday. I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down threats and vulnerabilities, solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us.

Dan Petro: I guess to start, we do just pronounce it, "You're doing it wrong." At least when spoken, that seems to be the best way to say it.

Dave Bittner: That's Dan Petro. He's joined by his colleague Allan Cecil from Bishop Fox on their research, "You're doing IoT RNG."

Dan Petro: Yes, this whole thing came about from an engagement that we were once on. So, we do a bunch of what we refer to as "product security reviews" at Bishop Fox that are – we'd like to say that if it breaks when you drop it, then it's a product security review. A lot of hardware sort of engagements there. And so there was an IoT engagement from one of our clients that was making a kind of like a security device that did a lot of cryptography as a part of its normal operations.

Dan Petro: And so, when doing that engagement, we sort of asked the client, like, hey, what are you using as a random number generator? Since it's doing all this crypto, you know, how do you generate keys, things like that? And they replied that like, oh, well, the SoC that we're using – the system-on-a-chip – has a built-in hardware random number generator on the board. That sounds great. And just kind of on a lark we asked them, like, do you mind giving us a bunch of the output from the random number generator? Just tell it to produce a gigabyte of data and just send it over to us. We figured it'd be fine. Like, a hardware random number generator, surely that's the gold standard for generating random numbers. It's a peripheral that does nothing but this one thing, right?

Dan Petro: So, we got the file back, and to our horror, large swathes of it were just zero. We thought that something surely must be terribly wrong here. There must be some buggy code or, like, you know, what happened here that would cause this? And so the research was kind of a follow-on from that to look up, was this just a one-off thing or was it wider than that? It turned out that it was a much wider issue than, you know, just one chip.

Dave Bittner: Yeah, it really is fascinating the way you all dig into this and unpack the sort of, you know, the mystery of how something like this could happen, and I think to your point here, it's sort of a head-scratcher that something like this could happen. Allan, can you give us some of the background here? How do we get to the point where a piece of hardware that's dedicated to generating random numbers is not doing that?

Allan Cecil: That's a really good question. And what's interesting is it's not just the hardware that things are going askew at, but we'll get into that in a bit. The hardware itself generally works in a couple of specific ways, and computers are really, really good at deterministic behavior, because if you did your taxes, and every time you ran the program, you got a different result, you'd be pretty upset. So you want computers to be extremely deterministic, which is fantastic. But on the other hand, it's a real pain in the rear when you need things to not be deterministic, when you need randomness.

Allan Cecil: So, there's a variety of ways of getting randomness. One of them is to use a pseudo random number generator seeded from something – either the time of day, input from a user, or other factors. But ultimately a pseudo random number generator is not going to be perfect, so a lot of times you want to have some other source of entropy that's a little more random. And the hardware devices that do this, especially in the IoT world, are relatively simplistic to understand. For instance, you might have one that relies on an analog NOT gate, one that's unclocked, and at any time you're going to ask it, are you a one or a zero? And it'll respond relatively randomly.
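
To illustrate why a pseudo random number generator seeded from something guessable falls short, here is a minimal Python sketch. It assumes a hypothetical device that seeds a general-purpose generator with its boot time in whole seconds, and an attacker who knows that time to within an hour. The function name and the one-hour window are illustrative assumptions, not details from the research.

```python
import random
import time

def keystream(seed: int, n: int = 8) -> bytes:
    """Return n bytes from Python's (non-cryptographic) generator seeded with `seed`."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n))

# Hypothetical device: seeds from the time of day (whole seconds) at boot.
boot_time = int(time.time())
device_output = keystream(boot_time)

# Hypothetical attacker: knows the boot time to within an hour and
# simply tries every candidate seed in that window.
recovered = None
for guess in range(boot_time - 3600, boot_time + 1):
    if keystream(guess) == device_output:
        recovered = guess
        break

print("seed recovered:", recovered == boot_time)
```

The same seed always yields the same output, so the secrecy of everything derived from the generator is only as good as the secrecy of the seed.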

Allan Cecil: The other way to do it is to have two clocks running at different speeds or slightly different speeds and sample the delta between them. And they'll never necessarily be exactly the same, although they could be. But for the most part, you're generally going to get certainty that you're going to get a random zero or a random one out of that.

Allan Cecil: The big issue with both of these designs is they can only give you so many numbers at a time. If you're asking for a lot of randomness at once, you might have to wait a little bit. There's only so many random numbers that it can give you at once before it exhausts itself and you have to refill that pool. And that's where a lot of things start to go kind of sideways.

Dave Bittner: Hmm. Well, take us through what's going on here then. I mean, it's a combination of things. As you say, it's not just the hardware – it's the way people are using it. Can you walk us through the problem?

Allan Cecil: So, the problem is at multiple levels. One of the things that we discovered in our research is, first of all, the hardware random number generators themselves, even when you use them properly, sometimes aren't giving you truly random distribution of numbers. So, there's that issue. Next step up, you're probably not going to be writing everything from scratch. You're probably going to be using a library. And what we discovered is that some of these libraries had some pretty serious issues, which we'll get into in a second. The next level up is if you're a user, you're probably going to try to use example code. And in some cases, the example code itself didn't work correctly or used bad paradigms – for instance, not checking error codes, or in some cases, misleading the user drastically. Or in other cases, you'd have to read a one-thousand page manual in order to know exactly how to properly call the hardware RNG. So there's all kinds of missteps that someone trying to use a random number generator on an IoT device can go terribly wrong.

Dave Bittner: Yeah, it is fascinating, and it seems like almost a cascading set of possibilities here, where you would think something that on its surface would be as simple as calling – requesting a random number from your system, there wouldn't be as many things along the way that could go wrong. Can we dig into this thing about not checking error codes? I mean, what exactly is that about?

Dan Petro: Yeah, I can field that one. So, this kind of gets to the heart of the name of our presentation. We called it, "You're doing it wrong" specifically because there's maybe not one right way of doing it, but there's definitely many wrong ways of doing it. And this is sort of our way of standing up in front of the IoT industry the best that we could and telling, you know, an entire industry of technology that they're doing it wrong.

Dan Petro: And so, the way that this has been solved in basically every other field is through a cryptographically secure pseudo random number generator subsystem. So, if you are on a server farm, and you need a random number in your Linux-like process, there's an API for this. You can ask Linux, hey, Linux, please give me a random number, I need it to make an encryption key. And it could do that for you securely. We've spent lots of time, we had lots of smart people look at these algorithms, this whole process for doing it, and we've managed to figure this out.
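
For comparison, this is roughly what "asking the operating system" looks like from a program on a server. The sketch uses Python's standard `secrets` and `os` modules, which draw from the operating system's cryptographically secure generator. The variable names and sizes are only examples.

```python
import os
import secrets

# 256-bit symmetric key from the operating system's CSPRNG.
aes_key = secrets.token_bytes(32)

# os.urandom is the lower-level call that also reads from the OS generator.
nonce = os.urandom(12)

# Convenience helper for text tokens (32 hex characters = 16 random bytes).
session_id = secrets.token_hex(16)

print(len(aes_key), len(nonce), len(session_id))
```

The caller never touches the hardware, never sees an "out of entropy" error code, and does not have to decide how to mix entropy sources. That is the kind of subsystem the conversation says is missing on IoT platforms.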

Dan Petro: But this, unfortunately, is not how things work in the IoT space. When you're in an IoT device, you basically just call the hardware random number generator itself, like, you talk to the peripheral. And just like any other piece of hardware, it can potentially fail, because you're not, you know, interfacing with a piece of like a software subsystem – it's an actual piece of hardware, right? Any number of things could have gone wrong. It might be overheating. Maybe the bus got scratched. Jupiter and Saturn just weren't aligned at the time for all we know. Hardware devices can return error codes.

Dan Petro: So the very first thing we looked at was how many people in the real world are checking the error code of these hardware random number generators? And it turns out almost nobody actually checks these error codes in the wild, just by, you know, doing a cursory glance at code available on GitHub. And this doesn't come as any great surprise, being that, like, you know, it's actually really hard to do this properly. This turns out to be a whole can of worms by itself. So there's the kind of level one understanding of this, which is that you're sort of left to your own devices in the IoT space in terms of, you know, talking to the hardware and doing things properly. And most developers wind up doing the easy thing and not the secure thing. But that's not where the story ends. That's really just where the story begins.

Dave Bittner: Before we go any further, just for my own, you know, understanding here to make sure I'm following along properly. So, is part of the issue here that on an IoT device, basically, you don't have an operating system as an intermediary level to get in between you and the hardware to make sure you're getting what you need?

Allan Cecil: Yeah, basically, there are middleware or, like, you could call them IoT operating system-type things, like Contiki or FreeRTOS or other things like that. But they don't currently have a subsystem for CSPRNG – they don't have a cryptographically secure pseudo random number generator setup where you as a user can call that, and instead you as a user are forced to directly call the hardware, which often goes quite sideways.

Dave Bittner: So, if I'm calling out to the hardware and I'm requesting random numbers and the hardware is failing, what happens next? What's coming back to my IoT device?

Allan Cecil: One of the things that's interesting about calling random numbers is a lot of times when you need a random number, you don't need just one. You're generally going to be asking for a lot of randomness all at once. For instance, you might be generating an RSA 2048-bit key. So you're going to need a lot of entropy, a lot of random numbers all at once to make that key. So you're going to call it, and then call it again, because you're only going to get one bit at a time. You're going to have to make a lot of calls to get the number of bits you need. The challenge is, if you call too frequently, you will exhaust the hardware random number generator device's pool of entropy, or pool of random numbers that it can hand you, and it will error. Now there is a way to ask the hardware, hey, have you errored because of whatever circumstance? The problem that we ran into repeatedly is no one checks the error codes. And when you dive into why, it gets kind of interesting. I'm going to let Dan talk about why people can't check the error codes.
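
The failure mode described here can be modeled in a few lines of Python. `FakeHardwareRNG` is a hypothetical stand-in for a peripheral, not any real chip's interface. It assumes the device holds a small pool of entropy and, once the pool is empty, returns a zero byte together with an error flag. Real devices differ in what they return on error.

```python
import secrets
from typing import Tuple

class FakeHardwareRNG:
    """Hypothetical peripheral with a fixed pool of `capacity` random bytes."""

    def __init__(self, capacity: int) -> None:
        self.pool = capacity

    def read_byte(self) -> Tuple[int, bool]:
        """Return (value, ok). When ok is False the value is not random."""
        if self.pool == 0:
            return 0, False          # assumed behavior: zero plus an error flag
        self.pool -= 1
        return secrets.randbelow(256), True

def careless_key(rng: FakeHardwareRNG, n_bytes: int) -> bytes:
    """Build a key while ignoring the error flag, as much example code does."""
    return bytes(rng.read_byte()[0] for _ in range(n_bytes))

rng = FakeHardwareRNG(capacity=64)
key = careless_key(rng, 256)         # 2048 bits requested in one burst
zero_tail = key[64:]
print("bytes that are not random:", len(zero_tail))
print("all zero after exhaustion:", zero_tail == bytes(len(zero_tail)))
```

In this model only the first 64 bytes carry any entropy. Everything after the pool runs dry is zero, and the caller never finds out because the flag is discarded.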

Dan Petro: Yeah, so, the trouble with kind of placing the blame squarely on the user here is that they're just placed into an impossible scenario. So if you imagine trying to write, like, a networking stack on an IoT device using one of these hardware random number generators. So you're, you know, in the TLS stack and you need a crypto key, you need to generate some numbers to make an encryption key to talk to somebody externally, and you call the RNG, the hardware RNG function, and it comes back with an error code. Like, what are you supposed to do with that? One of the things about random numbers is that when you need them, they're sort of critical to the core concept of the thing that you need to do here. You can't just simply handle the error in some abstract way and then move forward without that random number. What does it mean to do TLS without a random number? Like, you just kind of can't. 

Dan Petro: So you're left with really two possibilities here. One is to block – that is, to just halt the entire machine, and most manufacturers will instruct you to just call the RNG again a second time, basically spin looping at one-hundred percent CPU, waiting over and over again for the RNG to be ready – which, in the case of a broken device or a damaged device, it might never return. You might just spin loop forever. That's not a great option. But the second option would be just to quit out. This is to kill the process, stop the networking process. And that's not acceptable, either. Like, are you really supposed to just kill the entire process every time you run out of entropy, which happens quite frequently? Like, that's not a workable scenario, either. So you're sort of left between these two terrible choices. And it's no surprise that developers wind up going with just, well, let's just ignore the error code because things work when I do that.
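
The trade-off between those choices can be sketched in Python. This is one possible compromise under stated assumptions, not a recommendation from the research: retry a bounded number of times, then fail loudly instead of spinning forever or silently using a bad value. `FlakyRNG`, `RNGNotReady`, and `RNGFailure` are hypothetical names invented for the example.

```python
import secrets
import time

class RNGNotReady(Exception):
    """Hypothetical error: the peripheral has no entropy available right now."""

class RNGFailure(Exception):
    """Raised when the peripheral stays unavailable for too long."""

class FlakyRNG:
    """Hypothetical peripheral that is ready on every `ready_every`-th call.
    With ready_every=0 it is never ready, modeling a broken device."""

    def __init__(self, ready_every: int = 3) -> None:
        self.ready_every = ready_every
        self.calls = 0

    def read_word(self) -> int:
        self.calls += 1
        if self.ready_every == 0 or self.calls % self.ready_every != 0:
            raise RNGNotReady()
        return secrets.randbits(32)

def read_word_checked(rng: FlakyRNG, max_attempts: int = 100,
                      delay_s: float = 0.001) -> int:
    """Retry a bounded number of times, then raise instead of looping forever."""
    for _ in range(max_attempts):
        try:
            return rng.read_word()
        except RNGNotReady:
            time.sleep(delay_s)      # yield instead of burning the CPU
    raise RNGFailure(f"RNG not ready after {max_attempts} attempts")

word = read_word_checked(FlakyRNG())
print("got a 32-bit word:", 0 <= word < 2**32)
```

Even this leaves the caller with the original problem: when `RNGFailure` is raised in the middle of a TLS handshake, something still has to give.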

Dave Bittner: Mm-hmm. Now, if I'm a developer and I'm not checking for the error code, is it likely that what's coming back to me looks random enough that I won't necessarily know that it's not truly random?

Allan Cecil: That's one of the problems. A bunch of zeros in a row can be a legitimate answer. It is random, after all, so you could randomly get a whole lot of zeros in a row. So when you glance at it, sometimes it's not immediately obvious. Now, one of the things that was obvious in the data set that Dan looked at was there was a large swath of zeros all in a row, but it's not necessarily always that obvious. Sometimes what you'll get is repeating patterns. Maybe you'll get three zeros every fifty bytes, which is what one of the devices we looked at did consistently for some unknown reason. And we would love to know why it did that. But if you were very casually looking at the data, you wouldn't necessarily notice it unless you were analyzing it very carefully. Even when we went into the trouble of doing statistical analysis, sometimes you had to use the right statistical analysis tests to find the problem. So even if you were glancing at it with statistical analysis tools, sometimes even that wasn't enough to detect, oh, wait a minute, there's actually a problem here.
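
A pattern like "three zeros every fifty bytes" is easy to miss by eye but shows up once the stream is folded at the right period. The Python sketch below builds synthetic data with that flaw and then measures how often each position within a 50-byte block is zero. It assumes the zeros are whole bytes at the start of each block, which is a guess made for the example rather than a detail from the research.

```python
import secrets
from typing import List

def make_flawed_stream(n_blocks: int, period: int = 50, zeros: int = 3) -> bytes:
    """Synthetic stream: each `period`-byte block starts with `zeros` zero bytes."""
    out = bytearray()
    for _ in range(n_blocks):
        out += bytes(zeros) + secrets.token_bytes(period - zeros)
    return bytes(out)

def zero_rate_by_offset(data: bytes, period: int) -> List[float]:
    """Fraction of zero bytes seen at each offset modulo `period`."""
    zeros = [0] * period
    totals = [0] * period
    for i, b in enumerate(data):
        totals[i % period] += 1
        if b == 0:
            zeros[i % period] += 1
    return [z / t for z, t in zip(zeros, totals)]

data = make_flawed_stream(2000)
rates = zero_rate_by_offset(data, period=50)

# A healthy source shows a rate near 1/256 at every offset.
suspicious = [offset for offset, rate in enumerate(rates) if rate > 0.5]
print("offsets that are almost always zero:", suspicious)
```

The catch is that the test only works if you already suspect a period of 50. That is the point made above about needing the right statistical test to find the problem.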

Dan Petro: Yeah, and that kind of gets into, like, how do we actually evaluate the hardware RNGs, because that kind of comes down to a fundamental problem of how do you know randomness when you see it?

Dave Bittner: Mm-hmm.

Dan Petro: At the risk of making this overly philosophical, you do have a really hard issue that you expect a random number, but then if you try to really dig deep down into that question, like, what on earth do you mean by a random number? You start to realize that you're asking for your security device to break the laws of physics. You want it to come up with some number that's not based on anything. You want it to break the laws of causality to create some number that isn't based on anything else. And like then you start talking about quantum mechanics and the whole thing becomes garbage.

Dave Bittner: (Laughs) Right, right.

Dan Petro: The important thing here is actually to not be so overly concerned about whether the number is random or not in some abstract sense, whatever that even means, but rather whether it's predictable or not. And now we can actually talk about it in terms of an adversary who has certain amounts of information and doesn't, and has certain levels of access and doesn't, and is or is not able to predict certain numbers with given inaccuracies. Now, that's actually a problem that we can wrap our heads around.

Dan Petro: So, there's lots of good statistical analysis tools that you can use to see how predictable certain numbers are, that do rather interesting things. One of the common ones out there is Dieharder, a series of tools that have been out for a couple of decades or something like that. You give it a long string of numbers and it'll do things like play games of poker and craps and things like that with the numbers, like basically transforming them into die rolls, poker cards, and then seeing if they line up to the expected distributions that you'd expect from those particular games, in addition to lots of other kinds of statistical randomness tests.
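
The idea of checking output against an expected distribution can be shown with a much simpler test than the ones in such suites. The Python sketch below computes a chi-square statistic over byte frequencies. It is not one of Dieharder's tests, and the pass threshold is a crude rule of thumb (mean plus six standard deviations of a chi-square distribution with 255 degrees of freedom) rather than a proper p-value.

```python
import math
import secrets

def chi_square_bytes(data: bytes) -> float:
    """Chi-square statistic of observed byte counts against a uniform distribution."""
    expected = len(data) / 256
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return sum((c - expected) ** 2 / expected for c in counts)

def looks_uniform(data: bytes, sigmas: float = 6.0) -> bool:
    """Crude check: the statistic should stay near its mean of 255."""
    dof = 255
    return chi_square_bytes(data) < dof + sigmas * math.sqrt(2 * dof)

good = secrets.token_bytes(100_000)
stuck = bytes(b & 0x7F for b in good)   # simulate a top bit stuck at zero

print("OS generator looks uniform:", looks_uniform(good))
print("stuck-bit data looks uniform:", looks_uniform(stuck))
```

A pass here means very little on its own. Data can have perfectly flat byte counts and still be completely predictable, which is why real test suites run many different tests.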

Dave Bittner: Would many people who are putting together systems like this find that, even in its imperfection, that the numbers they're getting back are random enough for their use case?

Dan Petro: It drastically depends on the use case. So, the thing about symmetric crypto keys, so like if you're trying to communicate with somebody and you have an AES key – that's like a 256-bit AES key, right? – and it turns out that you're using a really trash RNG and half the numbers are zero, you're still left with a 128-bit AES key, and that's still strong. You're not going to crack a 128-bit AES key. They're very resilient to those kinds of things, the loss of entropy. That's not necessarily true of other operations. Many different kinds of encryption – in particular, asymmetric cryptography like RSA – use math as their base operations, not just simple algorithms. And so, they can be much more susceptible to low-entropy states and certain sections of numbers being zero. In fact, there was another talk at DEF CON this year called "The mechanics of compromising low entropy RSA keys." So, like, this is a thing that's a real-world threat.
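
The arithmetic behind the symmetric-key claim is worth spelling out. The Python sketch below estimates the worst-case time to try every value of the bits an attacker does not know. The rate of one trillion guesses per second is an arbitrary assumption for illustration, and the model assumes the attacker knows exactly which bits are stuck at zero.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_search(unknown_bits: int, guesses_per_second: float = 1e12) -> float:
    """Worst-case time, in years, to try every value of the unknown key bits."""
    return 2 ** unknown_bits / guesses_per_second / SECONDS_PER_YEAR

# 256-bit key with no loss, with half the bits lost, and with far more lost.
for unknown in (256, 128, 64, 32):
    print(f"{unknown:3d} unknown bits: {years_to_search(unknown):.3g} years")
```

Losing half of a 256-bit key still leaves a search that is out of reach, but each further halving shrinks the search space exponentially, and by 32 unknown bits the search finishes in a fraction of a second at this assumed rate.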

Allan Cecil: What we saw in our data matched what researchers in 2019 saw – that there were millions of low entropy keys on the internet that they found in their research, and they couldn't exactly determine where they came from. But they theorized that maybe it was coming from IoT devices that had very poor hardware RNG devices, with low-entropy key generation.
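
One well-known way that weak RSA keys are found at scale is by looking for moduli that share a prime factor, since two devices with poor randomness may pick the same prime. The Python sketch below shows the idea with toy numbers. It is an illustration of the general technique, not a description of how the researchers mentioned above did their work. The device names are made up, and the primes are small Mersenne primes chosen for readability; real RSA primes are around 1024 bits each.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

# Toy primes (all Mersenne primes).
shared = 2**89 - 1                  # prime that two weak devices both generated
q_a, q_b = 2**107 - 1, 2**127 - 1
p_c, q_c = 2**31 - 1, 2**61 - 1     # a device with unrelated primes

moduli = {
    "device_a": shared * q_a,
    "device_b": shared * q_b,
    "device_c": p_c * q_c,
}

def find_shared_factors(keys: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    """Return every pair of public moduli with a common factor, and that factor."""
    hits = {}
    for (name_a, n_a), (name_b, n_b) in combinations(keys.items(), 2):
        g = math.gcd(n_a, n_b)
        if g > 1:
            hits[(name_a, name_b)] = g
    return hits

for pair, factor in find_shared_factors(moduli).items():
    print(pair, "share a prime; both private keys are now recoverable")
    print("other factor of", pair[0], "is", moduli[pair[0]] // factor)
```

Computing a greatest common divisor is cheap even for full-size keys, so an attacker needs only the public keys. This is why low entropy hurts RSA so much more than it hurts a symmetric key.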

Dave Bittner: So, what are some of the possible solutions here? I mean, if we have a – this is a broad issue, right? I mean, we're talking about something affecting many, many IoT devices, millions, potentially billions of devices out there. Is there a solution?

Allan Cecil: I'm going to steal in my own law, as it were. My law is don't attempt to write RNG code on IoT devices on your own. It's as difficult as trying to write crypto code on your own. No one goes out and tries to write crypto code and gets it right the first time. And you're going to have the same problem with RNG code on IoT devices. Don't try to do it on your own. Instead, you should be using some kind of CSPRNG subsystem. Unfortunately for end-users, that doesn't readily exist right now. So for an end-user, someone who maybe has a smart door lock that perhaps is using the same password as a lot of other smart door locks, I recommend updating quickly. Whenever you see your vendor supplying an update, you should probably be applying that. 

Allan Cecil: For developers, you're going to have to put pressure on the hardware manufacturers and the people who are making these operating systems for these IoT devices to implement a CSPRNG subsystem. Really, right now, if you have to do it on your own, read the manual extraordinarily carefully. You're going to find weird cases where you might have to call thirty-two times in a row, get a number, throw out the next thirty-two calls, get a number, and repeat. So you're going to have to be very, very diligent and double check all of your work. It's very difficult right now.
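
To make the idea of a CSPRNG subsystem concrete, here is a deliberately simplified Python sketch. It mixes several entropy inputs into a key with SHA-256 and produces output with HMAC-SHA256 in counter mode. `ToyCSPRNG` and everything in it are assumptions made for illustration. It is not a vetted design, and it counts seed bytes rather than estimating entropy. A real subsystem should follow a reviewed construction such as the generators in NIST SP 800-90A.

```python
import hashlib
import hmac
import os
import time

class ToyCSPRNG:
    """Illustrative only: pools entropy inputs and expands them with HMAC-SHA256."""

    def __init__(self) -> None:
        self._key = bytes(32)
        self._counter = 0
        self.seeded_bytes = 0       # crude bookkeeping, not an entropy estimate

    def add_entropy(self, source_name: str, data: bytes) -> None:
        """Hash new input into the key; a weak source cannot undo earlier input."""
        self._key = hashlib.sha256(self._key + source_name.encode() + data).digest()
        self.seeded_bytes += len(data)

    def random_bytes(self, n: int) -> bytes:
        """Return n bytes, refusing to run before any seeding has happened."""
        if self.seeded_bytes < 32:
            raise RuntimeError("not enough seed material collected yet")
        out = b""
        while len(out) < n:
            self._counter += 1
            block = self._counter.to_bytes(16, "big")
            out += hmac.new(self._key, block, hashlib.sha256).digest()
        # Replace the key so later state does not reveal earlier output.
        self._key = hmac.new(self._key, b"rekey", hashlib.sha256).digest()
        return out[:n]

gen = ToyCSPRNG()
gen.add_entropy("hw_rng", os.urandom(32))   # stand-in for the hardware RNG
gen.add_entropy("timing", time.perf_counter_ns().to_bytes(8, "big"))
print(gen.random_bytes(16).hex())
```

The design point is that callers never touch the hardware directly. The hardware RNG becomes one input among several, a short stall in the peripheral does not block every request, and a stretch of zeros from one source is mixed with the others instead of landing directly in a key.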

Dave Bittner: Yeah, it's an interesting, fascinating part of your research here, was what you were just describing. That the instructions in the manual are not intuitive, and you could see how many people could overlook that and think that they're getting random numbers when they're not.

Allan Cecil: If I saw that example in code I was trying to work with, I would think it was a bug. I'd think it was a completely mistaken bit of implementation, because who would do that? Who would read and then throw out the next thirty-two numbers? But that's what the manual tells you to do. That was the LPC device, correct?

Dave Bittner: Yeah, yeah. There's another example here that was fascinating. You were looking at the MediaTek 7697, which is, I believe, a system-on-a-chip. And you did some statistical analysis of the random number generator and you ended up with a sawtooth pattern.

Dan Petro: Yeah, the interesting sawtooth pattern on the MediaTek device was always very curious. That was one of our initial devices that we looked into, in fact, as well. And so, that one started leading us down a path of wanting to make our own statistical tests as well, since the very first thing we noticed was, you know, we took these numbers, put them into the existing statistical tests like DieHarder or there's a tool called ByteCircle, and it fails all these tests. But they just kind of tell you, like, pass-fail based off of, you know, existing information to say, like, well, we tried playing a thousand games of craps with this number and with these numbers, and it didn't work – they came out with like bad results. But that doesn't actually tell you what the heck is going wrong, right?

Dan Petro: And so we tried as much as possible to make like pretty charts and graphs and things. And so we postulated, well, what if we graph every byte, zero to 255, and see if every byte happens the same. Because depending on how the actual hardware works, they might create random bits at the byte level, at the bit level, or even at the word level, like 32-bit words. And so that's all just dependent on the hardware, where the hardware might make a single bit and then concatenate it with another single bit. And so basically every bit is independent of each other – at least it should be. And some devices create 32 bits, all at the same time. Basically, it's – you don't have one random bit generator, you have 32 bit generators all kind of concatenated together. And so, who knows? Maybe there are some correlations there.

Dan Petro: So, we basically plot it out on a histogram, and lo and behold, you get this interesting pattern of bytes where some were clearly happening more and less often than others. And it seems like that was likely due to a bit bias where, like, zero was more likely to occur than one was across the distribution. And it kind of creates that pattern when looking at it in terms of bytes. And so that kind of bias is exactly the sort of thing that you would not like to see from a thing that you're, you know, basing your cryptography on. I posit to the audience there, like, you know, you look at that graph, you say, how confident do you feel using this directly for your crypto keys? Like, even if you did do all the software correct, even if you check the error codes and you went through all that process, you're still kind of getting this number, this pattern that would keep cryptographers up at night.
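
A small model shows why a per-bit bias produces a jagged pattern when plotted byte by byte. If each bit is independently zero with probability p, the probability of a byte depends only on how many one bits it contains, so neighboring byte values jump up and down as their bit counts change. In the Python sketch below, the 55 percent bias is an arbitrary assumption chosen for illustration, not a figure measured from any device.

```python
import random
from typing import List

def byte_probability(value: int, p_zero: float) -> float:
    """Probability of a byte whose 8 bits are independent with P(bit = 0) = p_zero."""
    ones = bin(value).count("1")
    return p_zero ** (8 - ones) * (1 - p_zero) ** ones

def biased_bytes(n: int, p_zero: float, seed: int = 1) -> bytes:
    """Simulate n bytes from a bit source with the given bias (seeded for repeatability)."""
    rng = random.Random(seed)
    out = bytearray()
    for _ in range(n):
        b = 0
        for _ in range(8):
            b = (b << 1) | (0 if rng.random() < p_zero else 1)
        out.append(b)
    return bytes(out)

P_ZERO = 0.55                       # assumed bias toward zero
expected: List[float] = [byte_probability(v, P_ZERO) for v in range(256)]

sample = biased_bytes(50_000, P_ZERO)
counts = [0] * 256
for b in sample:
    counts[b] += 1

# Crude text histogram of the first 16 byte values.
for value in range(16):
    print(f"{value:08b} {'#' * (counts[value] // 20)}")
```

In this model the all-zeros byte is the most likely value and the all-ones byte the least likely, with every other value falling into one of seven levels in between. Plotted from 0 to 255, those levels produce a repeating jagged shape.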

Dave Bittner: To what degree, I mean, to what volume are we ringing the alarm here, how serious in the practical real world is this potentially going to affect things?

Allan Cecil: I would say that this affects 35 billion, potentially 35 billion IoT devices out there today. On the one hand, that's an alarming number. On the other hand, this isn't a Heartbleed type attack where every device is immediately at risk. It's much more device-specific and application-specific how you're using those random numbers. So, on the one hand, it's going to take a bit of – how do I say it – tinkering? It's going to take a little bit of specialized work to pull off a particular attack against a specific device. But on the other hand, there's a lot of devices out there that are vulnerable. I'll let Dan expand on that a little bit, too.

Dan Petro: Yeah. We're not used to seeing attacks or vulnerabilities that affect an entire industry in security. Generally, we'll, you know, see somebody wrote some buggy code, there's some library, there's some software, and sometimes it's, you know, particularly bad because a lot of people depend on this piece of software and then, you know, people will fix it. Or maybe they have to, you know, individual users have to patch it on their own. And we're kind of used to this kind of, you know, rinse-and-repeat process in security. What we're not used to seeing is something come out where an entire industry, where the problem is the status quo. It's a programming pattern. It's the way that an entire industry does things. 

Dan Petro: Like, can you imagine if the automotive industry just fundamentally didn't do bounds checking? It would be like – and then every time you ask them why they do it this way, they give you some excuse about, you know, they have strict overhead requirements and they don't have the time to – like, no, like, those are all bad arguments, like, you absolutely have the time and overhead to do this properly. These are important devices. IoT devices are not toys anymore. They're home security devices, they're things you put your body into. There's things that you put into your body that are IoT devices. Like, this stuff is important. We can definitely do it the right way.

Dan Petro: So, that's kind of like the first like level of that. Because we're talking about an entire industry here, remediation is tricky. The IoT industry is not one thing. It doesn't use one library. It doesn't use one piece of hardware. It's very heterogeneous. The good news is that this can be patched in software. The bad news is it must be patched in software, and that IoT devices are notoriously difficult to patch. Many of them have firmware burnt onto the device and have no update capability. Some of them have the capability of updating, but it's not simple to, and many of them do, in fact, have, you know, pretty low computing power, to where you have to at least give it some design consideration to how to solve this sort of problem.

Dave Bittner: I'm curious, you know, as the two of you were making your way through this research, what was it like to realize the scope of what was going on here? I mean, did you have "aha" moments along the way where you kind of looked at each other and said, holy smokes, this just keeps getting bigger?

Allan Cecil: For me, it was more along the lines of, am I doing this wrong? Like, really, seriously, am I properly – I think I'm following all the steps correctly. I took on tackling the STM32 and spent a considerable amount of time doing it incorrectly, not intentionally. I spent a lot of effort trying to make sure I was spin looping properly to make sure I was getting proper random numbers, implemented it incorrectly on accident, didn't realize it, and went down a track of thinking, wow, this device is really producing absolutely garbage random numbers, not understanding what could possibly be going wrong. Discovered a flaw, tried it again, found a different issue, tried it again. We were actively verifying it every step of the way with hard-line, really thorough statistical tests with DieHarder that we were doing it properly. No intern at a third-party firm who's been brought on to help with a late project is going to take that effort. So the fact that we spent measured effort at getting this right and couldn't do it is terrifying. It's just really hard to get right.

Dan Petro: Yeah, they say that things only occur in increments of amounts of zero, one, or infinity. So it's possible that we would look into this problem and there'd be no instances of this vulnerability, or it's possible that we could look into it and there'd be one buggy device out there, right? Or it's possible that they're all buggy. But having exactly two buggy devices would be – that'd be weird and unheard of. So it's kind of like, we knew that there was one instance of this already going into it because we kind of already found that. So once we took out a second IoT device and looked at it and the exact same problem was there too, that was like the major "aha" moment. That was the major breakthrough of, oh crap, I think this is everything. Just because two devices by completely different manufacturers that use entirely different software stacks that have completely different devices built on them have exactly the same issues. Like, now we suddenly realized this is actually everywhere, that this is because of how the industry uses them.

Dave Bittner: Our thanks to Dan Petro and Allan Cecil from Bishop Fox for joining us. The research is titled, "You're doing IoT RNG." We'll have a link in the show notes.

Dave Bittner: The CyberWire Research Saturday is proudly produced in Maryland out of the startup studios of DataTribe, where they're co-building the next generation of cybersecurity teams and technologies. Our amazing CyberWire team is Elliott Peltzman, Tre Hester, Brandon Karpf, Puru Prakash, Justin Sabie, Tim Nodar, Joe Carrigan, Carole Theriault, Ben Yelin, Nick Veliky, Gina Johnson, Bennett Moe, Chris Russell, John Petrik, Jennifer Eiben, Rick Howard, Peter Kilpe, and I'm Dave Bittner. Thanks for listening. We'll see you back here next week.