Using bidirectionality override characters to obscure code.
Dave Bittner: Hello, everyone, and welcome to the CyberWire's Research Saturday. I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down threats and vulnerabilities, solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us.
Nicholas Boucher: So, going back about a year ago, we started working on a totally distinct project where we were attempting to break natural language processing systems.
Dave Bittner: Joining us this week are Nicholas Boucher and Ross Anderson, both from the University of Cambridge. The research is titled, "Trojan Source: Invisible Vulnerabilities."
Nicholas Boucher: Our goal was to create adversarial examples that would cause NLP systems like toxic content classifiers and machine translation systems to break when you gave specific inputs to these systems.
Dave Bittner: That's Nicholas Boucher.
Nicholas Boucher: And there had been lots of work on this in the past, but one of the criticisms we had of past work, or perhaps a shortcoming of prior work, was that all of these adversarial examples – they changed the way that the text looked. That is to say that someone who is using an adversarial example against a natural language processing system would see that it has been rephrased or misspelled or something along these lines. And it was, you know, usually quite clear to the victim that they were given a poisoned example, so to speak. And we thought that we could do better than this, and we stumbled across this idea with a couple of other co-authors here at Cambridge and also in Canada, that we could change the way that strings are encoded, that text is encoded, in a way that would cause natural language processing systems to more or less fall apart and give you very poor performance.
Nicholas Boucher: And once we had put out a paper on this, which is called "Bad Characters," the name of the paper, we started saying, well, gosh, we could probably use these malicious encodings to do other evil things in the various domains of computer science, and compilers and interpreters quickly became our focus. And the story here is that we realized we could use very similar techniques. We could modify the encoding, not of inputs to machine learning in this case, but of inputs to compilers and interpreters in order to cause those compilers and interpreters to output binaries whose logic was different than what a developer may expect. And that led us to the Trojan Source work.
Dave Bittner: Well, can you describe for us what exactly is going on here? I mean, how does this exploit work?
Nicholas Boucher: Yes. The idea is rather simple. One simply encodes source code files in a way that will render differently to a human user – someone who's using, say, a text editor on their computer – than to a compiler or an interpreter, which just ingests the raw bytes of the source code file. And there's a couple little tricks that we use to pull this off, but the primary technique is that we use bidirectionality override control characters. And these are things that exist in specifications like Unicode, for example, which is by far the most common way to encode text these days. And they exist to allow you to override the direction of a text, say, from left-to-right and change it to right-to-left. And these exist because there are many different languages in the world that use different directionality of text. And when you are writing in a multilingual setting, you may choose to write words in a way that is different than the default ordering, if you will. You may want to inject some specifically right-to-left words into left-to-right text or change the standard order that something would be rendered.
Nicholas Boucher: And what we found is that we could use these bidirectionality override control characters to change the way that text is presented on a screen – specifically, source code text. And we found that we could take these characters and we could inject them into comments and into strings inside of source code files, and when we did this, it would cause the source code of the program to be displayed differently than it was actually encoded. And that ultimately leads us to the vulnerability where we craft different logic at the encoding level than we do at the visualization level. And if that logic is cleverly crafted, you could, for example, take the opposite action when a compiler sees something than when a developer sees something.
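To make the gap between encoding and display concrete, here is a minimal Python sketch. It is not taken from the interview: the C-style line, the `isAdmin` name, and both helper functions are hypothetical, and the control characters are written as escapes so they stay visible. The stored characters form a single block comment, so a compiler would see no `if` statement at all, whereas an editor that applies bidirectional reordering may, depending on the renderer, show the `if (isAdmin)` text as though it sat outside the comment.

```python
import unicodedata

# Bidi control characters written as escapes so this listing itself stays visible.
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Bidi classes that Unicode assigns to explicit directional formatting characters.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

# Hypothetical C-style line: in stored order it is one comment from "/*" to "*/".
encoded = "/*" + RLO + " } " + LRI + "if (isAdmin)" + PDI + " " + LRI + " begin admins only */"


def bidi_controls(text):
    """Return the bidi class of every directional control in text, in stored order."""
    return [
        unicodedata.bidirectional(ch)
        for ch in text
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]


def strip_bidi_controls(text):
    """Drop directional controls, leaving the remaining characters in stored order."""
    return "".join(
        ch for ch in text if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES
    )


print(bidi_controls(encoded))        # the invisible characters a compiler ingests
print(strip_bidi_controls(encoded))  # the stored order, with the "if" inside the comment
```

Stripping the controls is only a way to inspect the stored order; it does not reproduce what any particular editor would draw.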
Dave Bittner: Now, Ross, one of the things that I think has captured the public's imagination with your research here is how broadly this affects things. I mean, there are many, many languages that could fall victim to this sort of attack?
Ross Anderson: Well, this attack potentially affects almost every modern computer language, with the possible exception of Haskell. So whether you're writing in Go or Java or Python or C or C++ or C# or whatever, we've come up with examples of code that will look different to a human reviewer than it will look to a compiler. And this has led to a number of interesting effects, as we've been disclosing the vulnerability and trying to get the industry to fix it. Some of the firms to which we disclosed it said this isn't a language problem at all – this is the fault of or the responsibility of the people who sell code that is used in development environments. That was the attitude, for example, taken by Oracle, who has stewardship of the Java language. Other language teams, such as the Rust compiler team, for example, were very enthusiastic about fixing this problem in their language immediately.
Ross Anderson: As for the development environments, GitHub, GitLab, and Atlassian are all on the job. But it's by no means obvious that everybody is. And so, now that the vulnerability has been disclosed, there is the real risk that some bad person would target programs written in a language that hasn't been fixed, such as Java, in a company that isn't using a fixed development environment, and therefore might be able to do something rather nasty. And so for that reason, we thought it prudent to get as much publicity as possible to get across to CIOs and CISOs worldwide that they'd better check the toolchain and see to it that any code that they rely on isn't vulnerable to a supply chain attack.
Dave Bittner: Now, is my understanding correct that the notion of this had come up previously in the past? I don't think anyone has dug into the depth that you all did here, but this as a possibility had been brought up before. Nicholas, is that correct?
Nicholas Boucher: There's been different ways that bidirectionality override characters have been exploited across a number of domains in the past, some of them being programming languages. So, to go through a couple of examples that we found in the wild, one major use is for obfuscating interpreted languages. So, JavaScript, for example, is typically sent to client-side users and their browsers, and a reasonable person may be able to decipher what JavaScript code is doing, and therefore companies may try to obfuscate the code and make it harder to decipher what's going on. It turns out that a couple of the JavaScript obfuscators we found online will inject these bidirectionality override characters in order to make it even harder to read text. You really would need to either strip out these characters or just look at the raw bytes of the text to see what's going on.
Nicholas Boucher: Now, there have also been other, more malicious uses of these bidirectionality overrides in the past. So, for example, there have been use cases in smart contracts. So, particularly on the Ethereum blockchain, we've seen bidirectionality override characters used to swap the arguments passed to different functions. For example, to change the sender and receiver, to swap them in a particular payment. And this is very interesting. We discovered this – came across this example rather late in our work after we had assembled our paper, and it's a very, very malicious use of this particular technique.
Nicholas Boucher: And there are actually a variety of other people who have proposed online, well, gosh, we could use bidirectionality override characters in, say, comments to do precisely that – to swap the order of different arguments. And I think what we are trying to present here is a rather systematic overview of all of this, and we believe kind of inject some novel techniques that one can use for these bidirectionality override characters, particularly in the ways we propose of injecting them into strings and the ways we put them into comments – we break them into three different categories. We call them "commenting out," "stretched strings," and – gosh, what is the other group that we came up with? We put them into three categories, all the same.
Nicholas Boucher: So, and actually, going back even further before the Ethereum example, there had been even prior work. One of the most interesting ones, I think, is that bidirectionality overrides have been used to try and change the file extension or change the way that a file extension of, say, malware is displayed. So if I have some executable file, some .exe, that's sent via email and I want a user to open it and them not to suspect that it's an executable file, I could, for example, inject a right-to-left override in the file name, and I could include ".txt" or some other relatively innocuous file extension in the name, and I could use that character to swap it and make it look like .txt is the overall extension of the file. And it turns out that this has been used to disseminate malware across email going back well more than ten years, which is perhaps a slightly different domain, but shows that within the security setting, it is certainly well-known that bidirectionality overrides can cause problems. But our goal was to present this systematic overview in the compiler setting.
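The file-extension trick can be sketched in a few lines of Python. The file name below is made up, and `naive_rlo_display` is a hypothetical helper that only approximates rendering by reversing everything after the override; a real renderer follows the full Unicode Bidirectional Algorithm, so treat this as an illustration of the idea rather than a model of any specific mail client or file manager.

```python
import os.path

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE, written as an escape so it stays visible


def naive_rlo_display(name):
    """Approximate how a name might be drawn: text after the first RLO is reversed.

    This is a deliberate simplification of bidirectional rendering.
    """
    before, _sep, after = name.partition(RLO)
    return before + after[::-1]


# Hypothetical attachment name as it is stored: the real extension is ".exe".
stored_name = "invoice_" + RLO + "txt.exe"

actual_extension = os.path.splitext(stored_name)[1]
shown_name = naive_rlo_display(stored_name)

print(actual_extension)  # what the operating system acts on
print(shown_name)        # roughly what a user might see
```

The operating system decides how to open the file from the stored name, while the user judges it by the displayed one, which is the same encoding-versus-display gap the source code attacks rely on.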
Dave Bittner: And so, Ross, to what degree did your research find that this is being used out there in the wild? How serious an issue is this today?
Ross Anderson: Well, thanks to a number of development environment maintainers such as GitHub and Rust, we got thousands and thousands of suspect examples of possible abuse of bidi characters, which exist in public repositories, and we found that the great majority of these were just people doing careless programming which involved strings or comments in Hebrew or Arabic. We discovered a significant amount of use for obfuscating JavaScript, but we didn't find anything else of consequence. So, what appears to have happened is that up until now, various people had said, hey, you could do bad things with bidi characters, and then this kind of hadn't been followed through. The people who designed bidi control characters into the Unicode set put in a very quiet warning saying this might be used to do bad stuff, but nobody kind of followed through on that.
Ross Anderson: There was also some work about fifteen years ago around the possible use of strange characters in domain names. We know that Punycode is a standard for getting canonical expressions of domain names to stop this being used in phishing. In other words, there was a substantial vulnerability there, but various people had just looked at various small aspects of it, like the five blind men and the elephant. One thought this is a tree and one thought this is a rope and so on and so forth. And what we've basically contributed, we believe, is firstly to trace out the whole shape of the beast, and second, to motivate the industry to roll up its sleeves and fix it.
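As a small illustration of the Punycode point, the sketch below uses Python's built-in `idna` codec, which produces the ASCII form of an internationalized domain name. The two domain names are made up for the example; the second swaps the Latin "a" for the Cyrillic letter U+0430, which many fonts draw almost identically.

```python
# Two hypothetical names that may look alike on screen.
ascii_name = "apple.example"
lookalike_name = "\u0430pple.example"  # first letter is CYRILLIC SMALL LETTER A

# The built-in "idna" codec converts each label to its ASCII-compatible form.
ascii_encoded = ascii_name.encode("idna")
lookalike_encoded = lookalike_name.encode("idna")

print(ascii_encoded)      # unchanged, already plain ASCII
print(lookalike_encoded)  # gains an "xn--" prefix, so the difference becomes visible
```

Showing the encoded form is one way software can surface a difference that the rendered glyphs hide.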
Dave Bittner: Nicholas, you know, you bring up sort of a fascinating element of this, which is, you know, who takes responsibility for the fix here? Is it the people making the development tools? Is it the developers themselves? Is it, you know, do we go searching for these sorts of things on the endpoint after the fact? What did you all explore as far as that element goes?
Nicholas Boucher: It's a very interesting question as to whose responsibility it is to fix this vulnerability. So, oftentimes we speak in terms of expecting compilers or interpreters to put out patches to mitigate this particular attack, but that is not necessarily the only answer, and in many viewpoints, that may not even be the correct place to patch this. So, some may take the view that compilers exist to implement a particular language specification, and for those following that view, for languages that have formal specifications, the place to fix this would be to add logic or add rules into a language specification, which then would later be implemented by compilers.
Nicholas Boucher: But still, others may say, well, you know, adversarial encoding is really, you know, that might not be the job of a compiler to defend against. That might be in, for example, a static code scanner, in which case you have a variety of different security companies that sell services or even open-source products online that will do static code scanning and potentially be able to expose attacks like this. And perhaps that is the place to prevent something like this.
Nicholas Boucher: But a still different approach that one can take is to say that this isn't perhaps even a problem with compilers, it's a problem with visualization. We have these, say, text editors or perhaps repository frontends, websites that we use to view code online that are visualizing code in a way that is misleading for what that code would actually do if it was ingested into a proper compiler or interpreter. And because of that, perhaps the answer is that we need to fix the way that text is displayed inside of text editors, and we need to add warnings and make these directionality override characters visible in online platforms.
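One way to picture the visualization fix is a display layer that swaps each directional control for a visible marker before showing the text. The Python sketch below is an assumption about how such a layer could work, not a description of any editor or repository frontend; `reveal_bidi_controls` and the `<U+XXXX>` marker format are hypothetical.

```python
import unicodedata

# Bidi classes that Unicode assigns to explicit directional formatting characters.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def reveal_bidi_controls(text):
    """Replace every directional control with a visible <U+XXXX> marker."""
    return "".join(
        f"<U+{ord(ch):04X}>"
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
        else ch
        for ch in text
    )


# Hypothetical line of code with a right-to-left override hidden in a string.
sample = 'print("user\u202e")'
revealed = reveal_bidi_controls(sample)
print(revealed)
```

Because the marker is ordinary left-to-right text, the line is drawn in its stored order and a reviewer can see exactly where the override sits.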
Nicholas Boucher: And any one of these techniques is a perfectly reasonable way to defend against these attacks. But I think the important thing to keep in mind is, you know, if large-scale attacks were to be launched using these techniques, your best strategy is probably a defence-in-depth strategy where you have mitigations in place at each of these layers, because even if we, say, were to patch all of the compilers that we know are affected, it is very likely that there are compilers that we just haven't looked at that are indeed affected, or, you know, of course, the legacy versions of compilers that hang around in certain development environments. And because of this, we would look for things like static code scanners or, you know, adjustments to visualization pipelines to be able to catch those attacks.
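The static-scanning layer mentioned here can be sketched as a pass over source text that reports where directional controls occur. This is a minimal, assumed design in Python rather than any vendor's scanner: `Finding` and `scan_source` are hypothetical names, and a production tool would likely also distinguish legitimate right-to-left text in strings and comments from suspicious placements.

```python
import unicodedata
from typing import List, NamedTuple

# Bidi classes that Unicode assigns to explicit directional formatting characters.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


class Finding(NamedTuple):
    line: int        # 1-based line number
    column: int      # 1-based position of the character within the line
    bidi_class: str  # Unicode bidi class, for example "RLO"


def scan_source(text: str) -> List[Finding]:
    """Report every directional control character found in the given source text."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class in BIDI_CONTROL_CLASSES:
                findings.append(Finding(line_no, col_no, bidi_class))
    return findings


# Hypothetical file contents with one override hidden in a comment.
sample = "x = 1\n# note \u202e hidden\ny = 2\n"
findings = scan_source(sample)
for finding in findings:
    print(f"line {finding.line}, column {finding.column}: {finding.bidi_class}")
```

A check like this could run in continuous integration or as a commit hook, so it still catches the characters when an unpatched or legacy compiler would accept them silently.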
Dave Bittner: Yeah, I mean, it strikes me that, you know, I can envision use cases for this that are actually legitimate. You know, there are times when perhaps you want to obfuscate something for a security reason, so do we eliminate that possibility here? It's really quite interesting, isn't it?
Nicholas Boucher: Well, one of the underlying issues that I think this exposes or draws attention to is that internationalized text is something that is just an inherently challenging problem in computer science. And I think that we have these systems like Unicode, which do a great job of providing very thorough support for a very large number of languages, but there are security issues that arise from these platforms. But you know, what is the answer to that? Certainly, we can't say that everyone needs to use ASCII for everything, that, you know, non-English languages have been disadvantaged in many computing contexts for a very long time, and certainly our solution can't be to regress and say that everyone needs to use, you know, a small number of Latin characters in all of their writing. But at the same time, that means that if we are to use these powerful, internationalized text standards like Unicode, for example, there are these nuances that are very important for all parts of the development pipelines to take into account, lest security vulnerabilities arise.
Dave Bittner: Ross, what are the next steps here? I mean, in your research, you point out that you all went through, you know, proper responsible disclosure to the various developers of the tools that are involved with these languages. Where do you hope your research leads?
Ross Anderson: Well, first, we think there may be other similar vulnerabilities that arise out of the insane amount of complexity that has arisen around modern development environments. And so, we leave that as an open challenge to everybody to look for other stuff that was put in to be helpful. There's no hiding unpleasant stuff under pretty stones. The second thing that we're going to write up is the enormous diversity of the response that we got to coordinated disclosure, because one of the really important things for information security is the rate at which vulnerabilities are fixed once they get disclosed. Because if they don't get fixed quickly, then lots of systems end up being vulnerable. And we discovered that there was a very broad range of responses in the industry to our disclosure. That disclosure was somewhat off the beaten track, because we weren't in a position of saying, you know, hey guys, here's a zero-day vulnerability that allows me to take remote control of one of your systems without human intervention. We are saying here is a vulnerability that allows a bad person to smuggle code into your system, perhaps through the supply chain, in such a way that humans won't notice it. And that's altogether more difficult to deal with.
Ross Anderson: Now, as time goes on, we'll have more and more vulnerabilities that are more conceptually difficult to deal with, and the industrialized processes that a number of the Big Tech firms and others have are going to be less and less able to cope. And it's particularly interesting to see the relatively poor performance of some of the Big Tech companies who had outsourced the vulnerability disclosure process, because if you've hired a subcontractor and told them, you know, you will pay X dollars for bugs of type Y, and we will pay you so many dollars a month to run the service for us, then of course the responders don't have an incentive to put any effort into anything that's even slightly out of the ordinary. So as a result, you may find that a number of companies have got the appearance of a disclosure system, but without really the reality.
Dave Bittner: Nicholas, any thoughts from you there on where you're hoping this leads?
Nicholas Boucher: My hope is that as many compilers and interpreters as possible be patched against this particular vulnerability, and in addition to that, that we continue to see changes in code visualization pipelines and perhaps even notes added to the Unicode standard to very clearly and explicitly say that, you know, this attack pattern is something that we need to watch out for. I think in the bigger picture, what worries me is less the individual developer adversaries that want to exploit something like this, but, you know, perhaps some of the more powerful, advanced persistent threats, if you will. You could imagine that should someone have control of an insider at a particular company or project, or simply have lots of time and opportunity to try lots of different techniques, if they are able to inject a particular backdoor that goes unnoticed, a particular vulnerability, well, you know, we could find ourselves in a situation perhaps similar to some of the supply chain attacks that we've seen in recent months – SolarWinds not that long ago – and seeing something play out where it might not be immediately clear where the vulnerability is in code or how some backdoor got in place, or of course, you know, if that bug happens to be ingested into a compiler's source code itself, you know, we could find ourselves with untrustworthy compilers floating around and it not being immediately clear where these vulnerabilities are. And it's those, you know, slightly more insidious, slightly more difficult-to-plan attack vectors that, to me, represent one of the most scary threats of the Trojan Source work.
Dave Bittner: Our thanks to Nicholas Boucher and Ross Anderson from the University of Cambridge for joining us. The research is titled, "Trojan Source: Invisible Vulnerabilities." We'll have a link in the show notes.
Dave Bittner: The CyberWire Research Saturday is proudly produced in Maryland out of the startup studios of DataTribe, where they're co-building the next generation of cybersecurity teams and technologies. Our amazing CyberWire team is Elliott Peltzman, Tre Hester, Brandon Karpf, Puru Prakash, Justin Sabie, Tim Nodar, Joe Carrigan, Carole Theriault, Ben Yelin, Nick Veliky, Gina Johnson, Bennett Moe, Chris Russell, John Petrik, Jennifer Eiben, Rick Howard, Peter Kilpe, and I'm Dave Bittner. Thanks for listening. We'll see you back here next week.