Dec 9, 2019
evm: ALLSTAR: New Challenge Problems for Static Analysis

evm speaking at the Jailbreak Brewing Company Security Summit on Friday, October 11, 2019. 

Some of the hard research problems in binary static analysis have been reduced to practice (esp. decompilation and function matching). This has made RE incrementally easier, but it remains a challenging, time-consuming, laborious task. Can we build off our existing tools to make the process more streamlined and approachable to the novice? We'll describe new challenge problems for RE research as well as a large public dataset we built to start working on these problems.

(Source: Jailbreak Brewing Company)

Transcript

Male Delegate: [00:00:38:14] So, let's dive right in. So, imagine a world where software reverse engineering is faster, more effective and accessible to a broad variety of people. Imagine you have a tool that produces meaningful, descriptive labels in your disassembly of a completely new binary. Imagine, when you're working on a new binary, you automatically get a well-labeled software architecture diagram, with a description of each module at the start of your project. Imagine that you can hand people, with little or no reverse engineering training, an automated summary with a description of a new binary.

Male Delegate: [00:01:29:23] Today, for the low, low price of 29.95. [LAUGHS] I couldn't say that without it sounding like a sales pitch; I'm, I'm sorry. It's definitely not. It's-- it much-- It's worse than that, this is an academic asking you obtuse, philosophical questions at 9:30 in the morning. But, what I'm going to be talking about today is a concept that I call-- it's a concept I call Jarvis for Code; so it's the idea of, what are, what are we going to do as the next step in reverse engineering research, to make reverse engineering faster, more effective and accessible to more people?

Male Delegate: [00:02:16:17] So, I'm going to talk about the past couple of years in software reverse engineering research, and talk about what problems we should consider solved. I'm going to talk about the gaps between the solved problems, our processes now, and kind of what tools we need to fill the gaps; hopefully motivating us on what problems to work on next. And talk a little bit about how we get started on that.

Male Delegate: [00:02:46:09] This is definitely a-- this is an argument that I'm trying, trying to make, and, so, you can feel free to agree, or disagree with, you know, with any part of that. Hopefully, I'll get through in enough time and would, would love your thoughts and, and feedback about, you know, sort of, each part of this thing.

Male Delegate: [00:03:10:08] Just a quick detour on, on my background and, kind of, our, our background at, at APL. We do vulnerability research on embedded systems. So, we're-- usually, sponsors, they, they bring us devices, equipment, systems and they're asking us, you know, a, a common use case is, "We've got this, this, this piece of equipment, this system, it's going to be deployed on a, on a boat." The, the best one is when they come to us and they say, "It's already, it's already deployed on the boat, does it have vulnerabilities in it?" And, so, we get to, you know, tear it apart, tell them, you know, how it, how it works and what the vulnerable pieces are.

Male Delegate: [00:04:05:10] We're, we're often working in, sort of the-- so, so we're working with, like, these, sort of, combination systems; a lot of times there is a PC, or a mobile device as kind of the, the central control piece. But, you know, lots of things that look like they should be one device, and you take off the, the-- take off the cover and it's got, you know, might have multiple commercial boards, multiple contractor design boards, and we, we get to pull it apart, figure out how it works and then analyze it for vulnerabilities. So, we're often working in that bare metal RTOS environment, where it, it-- which, which is just a little bit more, you know, kind of, difficult than if you're doing application level reverse engineering. It takes us longer to understand the, the interfaces between operating systems, library code and then, you know, find that spot that we're going to target for vulnerability analysis.

Male Delegate: [00:05:10:08] In our world, dynamic analysis is, is rarely possible; sometimes it's possible on the, you know, on PC platforms that are ho-- hooked up to the system. But a lot of times we're working with these, you know, these embedded processors, where we can't get, you know, we can't get JTAG access or debugger access; or, if we can, you know, maybe we can use that to pull firmware, but we can't really use it to get processor traces. So, we tend to focus less on the dynamic analysis piece.

Male Delegate: [00:05:47:01] So, that's just a little bit about, that's just a little bit about our background. I think that gives us a view into some of the most difficult challenges in software reverse engineering and static analysis. If your-- you know, if that's not your area, I hope that you still get something out of this talk. I think there's, hopefully, something here for other kinds of vulnerability analysts and malware analysis. But, that's just sort of my background.

Male Delegate: [00:06:22:15] So, at this point, you know, we've, we've been doing reverse engineering for a really long time. Talking-- I got, got to meet a few folks last night, and talking to folks this morning who have been doing, doing this stuff for decades now. So, you know, just kind of taking a minute to look back, what should we consider, you know, a solved problem?

Male Delegate: [00:06:50:00] So, I'm going to make two arguments on this slide. I'm going to make the argument that these three things are where the research has really focused; in terms of the research that has actually impacted our day-to-day lives in reverse engineering. And, so, I'm going to say that, over the last couple of decades, the main useful research that's come out in reverse engineering is decompilation, function to function matching, and then combined static and dynamic analysis approaches.

Male Delegate: [00:07:33:04] I'm going to make the argument that, I think we should consider decompilation and function to function matching a solved problem. In, in, in, in terms of-- and I'm going to go into those in, in the next couple of slides.

Male Delegate: [00:07:50:06] In, in terms of static and dynamic approaches, a-- as, as I mentioned, si-- since I-- there's, there's been a lot of research in that field, a lot of, a lot of work there. I don't, I don't feel qualified to evaluate the state of the tool sets there; you know, those are, those are things like, tools like, like, Frida, or, or other tools like that, where, where, you know, you can combine, you can combine trace output with your static analysis for-- both for, you know, evaluating code flow, or evaluating-- figuring out data types. But, since I don't do that a lot, I'm not really qualified to evaluate the state of those tool sets; so I'm going to just kind of punt on that one.

Male Delegate: [00:08:49:19] This is just a, a, a slide with a lot of, kind of, my research into where the, where, where the research-- how the research unfolded, sort of, in, in the decompilation space. And I, I have all the references here at the end. I'll post them up after the talk, so that you can look them up and check them out; if you want to get into the, the weeds behind how decompilation works.

Male Delegate: [00:09:21:14] So, it tu-- turns out that the-- and I've, I've got the-- I've got the GHIDRA team, or a couple of folks from the GHIDRA team here today, so I could probably pick their, their brains. They're the one folks I, I didn't get to see what the-- how, how decompilation unfolded behind the scenes there. But, in the, in the public space, everything goes back to this, this doctoral thesis by Cristina Cifuentes. And she was at, at the University of Queensland, I believe. And, back in 1994, she sort of out-- outlined the, the whole process for decompilation, and that, that process is what every decompiler, that's out there today, follows. 

Male Delegate: [00:10:05:18] So, you know, you're probably familiar with Hex-Rays and GHIDRA. The other ones that are out there, that are more popular, are RetDec and JEB. I learned, this week, that there's a Trail of Bits one that works on LLVM. There's a Trail of Bits people here? No Trail of Bits people. Okay. Didn't want to leave them out. And there's another one that works off of the VEX intermediate representation.

Male Delegate: [00:10:43:10] But, in general, the way that decompilers work is, it's the compilation process, you know, in reverse. So, they do some analysis on the code and then they convert the binary code to an intermediate representation. So, GHIDRA uses p-code, using the SLEIGH model. IDA has a thing they just call Microcode. RetDec and the Trail of Bits Rellic work off of the LLVM intermediate representation. And, so the idea is, you convert to the intermediate representation, then you maybe do some passes on that to make it simple. You convert it to your abstract syntax tree, or your abstract representation, and then you generate code.
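The stages described here (lift to an intermediate representation, run simplifying passes, generate code) can be sketched on a toy scale. This is only an illustration under heavy assumptions: the two-operand instruction tuples, the `Assign` IR, and the function names are hypothetical and are not how GHIDRA, IDA, RetDec or any real decompiler represents code, and a flat statement list stands in for the abstract syntax tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assign:
    dst: str      # destination register
    op: str       # "const", "copy", "add", or "mul"
    args: tuple   # operands: register names (str) or constants (int)


def lift(instructions):
    """Stage 1: convert (mnemonic, dst, src) tuples into IR assignments."""
    ir = []
    for mnemonic, dst, src in instructions:
        if mnemonic == "mov":
            kind = "const" if isinstance(src, int) else "copy"
            ir.append(Assign(dst, kind, (src,)))
        elif mnemonic in ("add", "mul"):
            # Two-operand form: dst = dst <op> src
            ir.append(Assign(dst, mnemonic, (dst, src)))
        else:
            raise ValueError(f"unsupported mnemonic: {mnemonic}")
    return ir


def simplify(ir):
    """Stage 2: one simplifying pass (constant propagation and folding)."""
    known = {}   # register -> constant value currently held
    out = []
    for stmt in ir:
        args = tuple(known.get(a, a) if isinstance(a, str) else a
                     for a in stmt.args)
        if stmt.op in ("const", "copy") and isinstance(args[0], int):
            known[stmt.dst] = args[0]
            out.append(Assign(stmt.dst, "const", args))
        elif stmt.op in ("add", "mul") and all(isinstance(a, int) for a in args):
            value = args[0] + args[1] if stmt.op == "add" else args[0] * args[1]
            known[stmt.dst] = value
            out.append(Assign(stmt.dst, "const", (value,)))
        else:
            known.pop(stmt.dst, None)   # value is no longer a known constant
            out.append(Assign(stmt.dst, stmt.op, args))
    return out


def emit(ir):
    """Stage 3: generate C-like statements from the simplified IR."""
    symbols = {"add": "+", "mul": "*"}
    lines = []
    for stmt in ir:
        if stmt.op in ("const", "copy"):
            rhs = str(stmt.args[0])
        else:
            rhs = f"{stmt.args[0]} {symbols[stmt.op]} {stmt.args[1]}"
        lines.append(f"{stmt.dst} = {rhs};")
    return lines


def decompile(instructions):
    return emit(simplify(lift(instructions)))


program = [("mov", "eax", 2), ("mov", "ebx", 3),
           ("add", "eax", "ebx"), ("mul", "eax", "ecx")]
print("\n".join(decompile(program)))
```

Note that the output still uses register names as variable names, which is the "blank code" point made later in the talk: the pipeline recovers structure, not meaning.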

Male Delegate: [00:11:35:13] So, my, my argument-- the reason we should consider this solved is that it-- you know, these things work pretty well for our day-to-day use, you know. I would sort of argue that, both the Hex-Rays decompiler and the, you know, GHIDRA decompiler are, are useful, they work well. They're not perfect, but, you know, they, they, they work well for our day-to-day use.

Male Delegate: [00:12:02:18] The thing about decompilers is, they're always going to produce blank code, right? And, so, the argument that I'd like to make is that, we can continue, and we should continue, to improve decompilation accuracy, and it would be great if, you know, we continue to do that. But, making incremental improvements in the speed of decompilation isn't going to lead to fundamental changes in the speed and effectiveness of how we do reverse engineering.

Male Delegate: [00:12:37:04] Function matching, kind of, likewise. I'll go through this pretty quick. And, I'd sort of consider, again, function matching a solved problem. BinDiff and Diaphora are the big public tools that do this. There's a similar process that goes on in GHIDRA's version matching tool. There's some other things out there, like Kam1n0, and-- IDA's got their Lumina thing now; that's basically doing, kind of, Cloud based, you know, with a back-end function matcher.

Male Delegate: [00:13:21:08] These things, they, they, they generally sort of work the same way. They work with-- you, you take a function, you take your basic blocks, you put it through a, a compression function; so, something that sort of generates a fingerprint for the basic block, and you do a, a graph based comparison, you know, from one function to the other. And sometimes you can zoom out and look at the, the overall code flow graph and compare the context of one function to the context of the other one.
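A minimal sketch of that idea, assuming a very simple fingerprint (a hash of each basic block's mnemonic sequence, ignoring operands) and a Jaccard overlap of fingerprinted blocks and control-flow edges. The function names and the equal weighting are hypothetical, and this is not the algorithm BinDiff, Diaphora or any other named tool actually uses.

```python
import hashlib


def block_fingerprint(instructions):
    """Hash only the mnemonics so register and label choices do not matter."""
    mnemonics = " ".join(ins.split()[0] for ins in instructions)
    return hashlib.sha256(mnemonics.encode()).hexdigest()[:12]


def function_signature(blocks, edges):
    """blocks: {block_id: [instruction strings]}, edges: [(src_id, dst_id)]."""
    fp = {bid: block_fingerprint(ins) for bid, ins in blocks.items()}
    nodes = set(fp.values())
    fp_edges = {(fp[src], fp[dst]) for src, dst in edges}
    return nodes, fp_edges


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(func_a, func_b):
    """Average of block overlap and control-flow edge overlap, in [0, 1]."""
    nodes_a, edges_a = function_signature(*func_a)
    nodes_b, edges_b = function_signature(*func_b)
    return 0.5 * jaccard(nodes_a, nodes_b) + 0.5 * jaccard(edges_a, edges_b)


# The same logic compiled with different registers and labels
func_a = ({0: ["cmp eax, 0", "je L1"],
           1: ["add eax, 1", "ret"],
           2: ["xor eax, eax", "ret"]},
          [(0, 1), (0, 2)])
func_b = ({"a": ["cmp ecx, 0", "je L9"],
           "b": ["add ecx, 1", "ret"],
           "c": ["xor ecx, ecx", "ret"]},
          [("a", "b"), ("a", "c")])
# An unrelated function
func_c = ({0: ["push ebp", "mov ebp, esp"], 1: ["ret"]}, [(0, 1)])

print(similarity(func_a, func_b))  # 1.0
print(similarity(func_a, func_c))  # 0.0
```

Zooming out to compare the context of one function to another, as mentioned above, would apply the same kind of comparison one level up, on the call graph instead of the basic-block graph.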

Male Delegate: [00:13:52:07] And, again, I would argue that these work pretty well in most situations where we can identify-- they work well in situations where you can identify that there's a given library in a piece of code and you have, sort of, the static library that was used in that code. Again, I would argue that incremental improvements here, in function matching, are not really going to lead to fundamental changes in the speed and effectiveness of RE.

Male Delegate: [00:14:23:15] Okay. So, leave the past in the past and talk about, you know-- so, currently, what, what are the gaps-- so, we've got those solved problems. We've got those tool sets. What's the gaps between what we have now-- the tools that, tools that we have now, and the processes that we, that we use?

Male Delegate: [00:14:46:09] So, I'm-- [LAUGHS] I'm constantly-- and, and, I, and I apologize for asking this, kind of, deep philosophical question at, at nine in the morning. I was sort of hoping to have this discussion after everybody had a few beers, and we could talk about philosophical concepts, but we'll, we'll do our best. When you talk to, talk to folks about software reverse engineering, if-- you know, what do you, what do you say when you explain to them what you do for a living; if you're a software reverse engineer? And, I would, I would say, generally, when you talk to people, people are like, they say something-- you'd probably say something like, "Well, I, you know, I kind of stare at code a lot. Yeah, and I kind of, I kind of stare at code and stare some more. And then I understand it and I write a report. Or then I, you know, find a bug and then I fix the bug."

Male Delegate: [00:15:53:04] So, people generally have at least this kind of an idea of the process of reverse engineering. And that's kind of how we explain it to people. Trying to think a little bit deeper. You know, when we're reverse engineering binaries, you know, what are we really doing? What's the process behind just staring at code?

Male Delegate: [00:16:23:17] So, I break it-- I, I break it down like this, and make another argument here. And my-- I'm going to make two arguments. So, my, my first argument is that, when we're reversing code, we're operate-- we operate on at least five levels of abstraction. And we're generally working from the lower levels of abstraction, up to the higher levels; to kind of get, get that higher level understanding of the code and how it works.

Male Delegate: [00:16:54:08] We start out down here at the bottom; kind of, the-- you know, your basic disassembly opcode level. We sort of work our way up towards, you know, the high level language constructs. What is this function doing? We label, you know, a function, we give it a name. Then usually we go a level above that and we kind of look at groupings of functions. So, we sort of organize functions into, you know, these functions are network code. These are crypto code. This is a particular protocol, you know, what have you.

Male Delegate: [00:17:40:08] And, then, eventually, once we have that all kind of figured out, we can look at, sort of, whatever the high level functionality is that we care about. What's the main thread, the app-- you know, the application doing? How is it using all those, those underlying libraries to do what it wants?

Male Delegate: [00:18:00:09] And I-- so, so, that would be my, my argument. And we're also-- you know, so, we're, we're generally moving upwards in levels of abstraction and-- but we're also jumping around back and forth, right? We have an idea in our head, "Okay, I kind of--" I, I have an idea that I want to figure out what this particular thread does. Or I, I want to figure out what this particular algorithm in the code does. And so I'm, I'm, kind of, like, diving down to the bottom, I'm looking at little snippets of code, trying to label functions, and then trying to bubble back up to figure out, you know, what the, what the, the overall question that I'm trying to answer.

Male Delegate: [00:18:41:02] So, my, my second argument here is that, what we're doing across, in, in every level that we're working at, is we're translating code to our natural language, right? You're, you're-- you know, we are, as reverse engineers, we are code to language translators. And, so, you know, when we stare at, at opcodes, at disassembly, you know, we, we make some notes, you know, "This, this moves the, you know, the, the key parameter into this register, whatever." And, you know, we're making notes in our natural language. And then we're labeling functions or parameters, you know, in our natural language. And, ultimately, moving up to, you know, the, the higher level where we can, either, you know, write a report that explains to somebody how this code works. Or, you know, maybe we're, we're then figuring out how to interface to that code.

Male Delegate: [00:19:41:18] Or if we're doing, you know, vulnerability analysis, then we c-- that we're using that as a, as an entry point to, you know, to, to run our automated vulnerability analysis, or fuzzing, or, or what have you. So, the argument is that we are, you know, we are natural language translators.

Male Delegate: [00:20:10:11] And, being natural language translators, you know, I think reverse engineers, we tend to be very highly trained, you know. We usually have a background in something in computing, and then we layer lots of training onto that. And, so, I'm constantly running into, you know, this idea that people are like, "Well, maybe we can automate the process. And maybe we can take reverse engineers out of the loop." You know, "Maybe if we get decompilation perfect, we can, you know, we can, sort of take reverse engineers out of the loop." Mainly from people that are, kind of, frustrated with, you know, trying to retain the trained people that they have, right, because people are, you know-- people have freewill, or at least the appearance of freewill.

Male Delegate: [00:21:05:14] And, so, you know, you, you train somebody up and, you know, and everybody, everybody has this problem. We're, we're training people up, people are, you know, leaving. And, so, folks want to-- like, there-- there's sort of this hu-- you know, we have this sort of human capital problem in, in RE. But, my argument would be, we're never going to take reverse engineering analysts out of the loop. I think-- and I don't think that's a, a-- I don't think many people in this room would disagree with me. But, I think we can make reverse engineering accessible to more people. I think we can make it, not a, a discipline that requires lots of training that's, you know, lay-- layered on.

Male Delegate: [00:21:56:15] Okay. So, how do we do that? What kind of problems do we need to work on to get there? So, this next section is just kind of a laundry list of challenge problems, that I think are really interesting to work on. I think these are the sort of sub-problems that need to be solved, in order to really have a real code to natural language translation capability. So, I mentioned, I call this Jarvis for Code. If you're familiar with-- who's seen the Iron Man movies? Raise your hand. [LAUGHS] Most people. And the rest of you don't want to admit it.

Male Delegate: [00:22:45:10] The-- you know, so, so, Jarvis is like this, this, he's sort of this AI, this, this sort of like expert system, right, that helps. I'm, I'm thinking more when, when Tony Stark is doing his inventing, you know, he's the, sort of, the, the expert system in mechanical, like, mechanical engineering and physics, right? And he's-- Tony is talking to him, and, and Jarvis is giving him information about, you know, whatever, the physics and behind whatever he's trying to do.

Male Delegate: [00:23:20:00] So, you know, Jarvis for code, you know, a, a Watson, you know, is, is a more real, real world thing. I think that's, I, I think that's really possible. I think that's an achievable goal in, you know, ten to 20 years. I, I thi-- I, I see no reason to, to think that we couldn't solve this. And, so, these are the, the, kind of, the next set of challenge problems that I think need to be worked on towards that goal. So, you got variable name prediction is, is something that people are starting.

Male Delegate: [00:24:02:02] And the reason I share these here is, I hope that you'll think about this, I hope that you'll take this back to, you know, your organization, you know. Folks that are more involved with research, you know, think about these problems. I hope that it kind of motivates folks to work on them. These are, you know-- they're difficult things, but this is where I think we need to go.

Male Delegate: [00:24:30:11] So, Variable Name Prediction, the premise. So, give me a blank decompiled function and, you know, can I output meaningful variable names? You know, rename variables in the blank function so that it looks like the original code? There's some initial work from CMU that's currently out there. And they were inspired by work that's out there recovering variable names from obfuscated JavaScript. So, I think their approach kind of showed that it's possible, but they're just kind of getting started there.

Male Delegate: [00:25:19:12] Next one, Statement Commenting. So, that would be, given a blank decompiled code statement, you know, you get a fragment of high level language code, output comments in natural language describing what that does. So, there's existing work out there in automatically labeling code snippets. It's mainly for things like-- think like Stack Exchange, or something like that, where people are posting code snippets. So, there's research out there that attempts to label those code snippets, you know. This is, you know, Python code, or this is networking, this is crypto code; that kind of thing. This could certainly build on, you know, Variable Name Prediction; if Variable Name Prediction was well solved.

Male Delegate: [00:26:15:00] Next one, Function Summarization. So, so, given a blank decompiled function, output comments in natural language that, that, that summarizes the function. This is-- there's, there's a decent amount of research on this for, like, you know, just regular source code, that has labels in it, you know. So, given, you know, piece of source code, just out, output a summary, output an automated summary of, of what that source code does. And, and I've, I've found a number of, of references on that.

Male Delegate: [00:26:55:13] Language summarization, in general, you know, as I, as I, sort of, talk to our natural language processing experts, it's, it's a-- and talking to the natural language processing experts at APL, it's a difficult problem, in general, because of lack of data sets. So, a lot of times in, in natural language, you're-- in natural language processing research, you're, you're taking something that somebody translated by hand, you know, and you have a big data set of, here's a whole bunch of articles, you know, in English and they were hand translated into Spanish, and then I can use th-- that set to, to learn from.

Male Delegate: [00:27:40:19] So, language summarization, in general, is a hard problem, because, you know, there aren't a lot of data sets out there where somebody has taken a big set of written work and then summarized it, right? But, you know, I think that could be really, really powerful, if we could apply that to reverse engineering. Possibly building on the other challenge problems and, you know, working with a big corpus of data and feeding in man pages, or other kinds of documentation.

Male Delegate: [00:28:31:07] So, these are all-- and if you'll, sort of, notice, these are kind of working their way up the levels of abstraction that I kind of laid out before.

Male Delegate: [00:28:40:03] Next one, so, so, Library and Object Organization. This is something I call the, the CodeCut problem and I've, I've done some work on it. The, the problem is-- so, given a, a, a fully linked binary, find the original object file boundaries, you know, in an automated way, or, you know, object files, or, or static library boundaries in the original binary.
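To make the problem statement concrete, here is a toy sketch of one possible heuristic: functions from the same object file sit next to each other in the linked binary and tend to call each other, so a gap between two adjacent functions that few call edges cross is a candidate boundary. This is only an illustrative assumption, not the method the CodeCut tool implements, and the function names and the `max_crossing` threshold are hypothetical.

```python
def crossing_counts(num_functions, calls):
    """For each gap between adjacent functions (in address order), count the
    call edges that span it. calls is a list of (caller_idx, callee_idx)."""
    counts = []
    for gap in range(num_functions - 1):
        crossing = sum(1 for a, b in calls if min(a, b) <= gap < max(a, b))
        counts.append(crossing)
    return counts


def guess_modules(num_functions, calls, max_crossing=0):
    """Split functions 0..n-1 into contiguous groups at every gap that at
    most max_crossing call edges cross."""
    counts = crossing_counts(num_functions, calls)
    modules, current = [], [0]
    for gap, crossing in enumerate(counts):
        if crossing <= max_crossing:
            modules.append(current)   # close the group at this boundary
            current = []
        current.append(gap + 1)
    modules.append(current)
    return modules


# Six functions in address order; 0-2 call each other, 3-5 call each other
calls = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
print(guess_modules(6, calls))  # [[0, 1, 2], [3, 4, 5]]
```

In a real binary, modules call into each other, so a zero threshold would almost never fire; choosing where to cut is the hard part of the problem.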

Male Delegate: [00:29:10:02] So, I did, I did some work on this and have code that's available on, on GitHub that does that. It sol-- it, it does an okay job of solving the problem. But I think this is one that could really use more eyes and more-- you know, I'd, I'd love for people to, to check out the, the tool and figure out, you know, is, is it, is it working? You know, give me, give me feedback. Or, also, just think about other ways to solve that, that problem.

Male Delegate: [00:29:50:02] So, imagine, at that point, if you can solve that problem where you can give meaningful text output at the function level, and then you can also take your binary and automatically locate the sections of functionality. At that point, hopefully, if it all kind of works together, you could have your system output, you know, a text description of each of those modules; each of those objects within the binary, and that would be really cool. So, you could basically, at that point, have an automated software architecture diagram, with a meaningful, you know, description of each piece of code. So, that's, you know, what I think we should work on. Those are the challenge problems.

Male Delegate: [00:30:58:19] I wanted to highlight here, there's two things. So, last year, here at Jailbreak, Sophia d'Antoine talked about her work in doing some foundational stuff here in asm2vec, which is-- you know, the idea here is, in natural language processing, a lot of times you take things like words and you convert them to a vector, to a many dimensional vector, in order to do learning, you know, on just, sort of, that simplified vector space. So, Sophia presented her work there. I think that's foundational and I think that's something that we need to build on.

Male Delegate: [00:31:49:18] At the same time last year, there was another project called code2vec; which is taking a similar approach, but for high level language code, you know, C code. And, so, those are two things that I think we can build on there.
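The "convert it to a vector, then work in the vector space" idea can be illustrated with a deliberately crude stand-in: counting mnemonics instead of learning an embedding. asm2vec and code2vec learn their vectors from data, so this sketch only shows what becomes possible once code is a vector (here, cosine similarity); the vocabulary and token lists are made up for the example.

```python
import math
from collections import Counter


def to_vector(tokens, vocabulary):
    """Map a token sequence to a fixed-length count vector."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocabulary = ["mov", "add", "xor", "call"]          # hypothetical vocabulary
copy_loop = to_vector(["mov", "mov", "add"], vocabulary)
bigger_loop = to_vector(["mov", "add", "mov", "mov", "add", "add"], vocabulary)
crypto_like = to_vector(["xor", "xor", "call"], vocabulary)

print(cosine(copy_loop, bigger_loop) > cosine(copy_loop, crypto_like))  # True
```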

Male Delegate: [00:32:10:07] Okay. So, that's, that's sort of the theoretical stuff and it's-- and kind of, you know, the, the next question is, okay, how do, how do we get started and, it's, it's also kind of like, h-- "Hey man, like, what did you actually do? Are you just asking us a bunch of questions?" Or, you know, "Did you actually do something to, to, to warrant, you know, talking about this today?" So, yeah, yeah, I did. [LAUGHS] I, I ha-- I did get started on this. The--

Male Delegate: [00:32:42:12] So, as I talk to folks and they're doing natural language processing, what constantly comes out is, they're like, "Okay, well, if you're going to do this, you need a big data set to process." Right? You need a large set of labeled data to feed in. And, so when I talk to folks, they're like, "Oh well, you've got data, right? There's, like, source code." You know, "There's lots of source code out there. There's lots of firmware out there." Right? And, it's like, well, yes, we've got lots of source code and lots of firmware, but we actually don't have really any labeled data out there. And what that would look like, would be source code that is compiled, and, during the compilation process, we would save off, you know, debugging output and compilation artifacts, and save binaries with symbols and we would put that all together, and that would be the labeled data set. There's almost zero of that out there, pretty much.

Male Delegate: [00:33:56:21] And, so, ideally, if we were going to do this, it would be cross architecture. From, from my perspective, like, it would be cool, you know, if it works on x86 code. But, ideally, I want something that, that will work on say any architecture that IDA, IDA or GHIDRA supports. We don't want it to be over-trained on x86 code and then not work on those other architectures.

Male Delegate: [00:34:25:17] So, I, I started this, this project to build what I, what I call ALLSTAR, the, the Assembled Labeled Library for Static Analysis Research. And what I'm doing is, we're, we're taking the Debian "jessie" distribution, that's what it's based on, and we're building Debian for, basically-- so, all the packages, all the, all the what do you call that? The, pack-- the packages that actually have, have code in them, essentially. And building them for all the architectures that Debian supports. Kind of overriding the build process, so that we save all of that, that debugging output and symbols.

Male Delegate: [00:35:17:10] So, Debian, they su-- support six architectures. And, to do this, I extended this project called Dockcross, which is, which is a set of docker containers that have cross compilers in them. Pretty, pretty simple technical details are basically just, there-- there's a, a Debian package build command and we're just overriding flags to force it to do intermediate output.
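The talk does not say which build command or flags are overridden, so the following is only a guess at what such an override could look like. It assumes the build goes through `dpkg-buildpackage`, that extra compiler flags are injected with the `DEB_CFLAGS_APPEND` and `DEB_CXXFLAGS_APPEND` environment variables read by `dpkg-buildflags`, and that GCC's `-g` and `-save-temps` are the flags of interest. The function name `build_invocation` is hypothetical, and the sketch only constructs the command and environment without running anything.

```python
import os


def build_invocation(host_arch, extra_flags=("-g", "-save-temps")):
    """Return (command, environment) for a cross build that keeps debug
    info and intermediate compiler output. All flag choices are assumptions."""
    env = dict(os.environ)
    flags = " ".join(extra_flags)
    env["DEB_CFLAGS_APPEND"] = flags      # appended to the package's CFLAGS
    env["DEB_CXXFLAGS_APPEND"] = flags    # same for C++ sources
    env["DEB_BUILD_OPTIONS"] = "nostrip"  # ask the build not to strip symbols
    # -us -uc: do not sign; -b: binary-only build; -a<arch>: target architecture
    command = ["dpkg-buildpackage", "-us", "-uc", "-b", f"-a{host_arch}"]
    return command, env


command, env = build_invocation("armhf")
print(" ".join(command))
# To actually run it inside an unpacked source package directory:
#   subprocess.run(command, env=env, cwd=source_dir, check=True)
```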

Male Delegate: [00:35:51:02] This is running now, so I have it running. It's going to take about six weeks to build. I'm kind of like, I have to, sort of, babysit it and ferret out, like, badly behaved packages. So, hopefully, the next time we build it, it'll go cleanly and go five to six weeks. And I have a high-powered VM that's running all this, and it does 35 to 55 packages an hour.

Male Delegate: [00:36:26:12] So, this is what it'll look like, projected, when it's done. It's going to be about a Terabyte of storage. And it's going to do about 160,000 64-bit x86 binaries, and then about 20 to 22,000 fully cross platform binaries. And, the difference there is-- so, Debian has a build spec that packages are supposed to follow, and if you follow the build spec, then, you know, your package will properly build for all those architectures. It turns out, a lot of them don't. But, you know, I would argue that 20,000 cross platform binaries is a really nice data set to get started with.

Male Delegate: [00:37:31:14] So, if, if you're familiar with some of the research that's out there, just-- both the CMU project and the-- there's also some, some stuff coming-- some papers coming out from the DARPA Muse program, which is they, they were mining lar-- they're also mining, like, sort of, large-- Mic is back? They're, they're, [LAUGHS] they're mining, sort of, you know, large software data sets. Both, both groups use this GitHub approach, where they've-- and, and that's kind of-- I think that's the-- when I talk to people, people are like, "Yeah, why don't you just like pull down large sections of GitHub and build them?" which is not a bad idea, and that's what they've done.

Male Delegate: [00:38:22:06] With GitHub, there's a lot more packages. What these projects have found out, though, is that a lot of the code on GitHub is copy and pasted. And, they estimate up to 70%. So, there's a lot of copy and pasted code. The other part is that GitHub doesn't have a structured build process. They generally just, like, pull down GitHub repositories and run, like, configure and make and sort of hope it works. And it works somewhat, you know. But, ultimately, you know-- one of the papers I was looking at, ultimately, they ended up with about, you know, 20,000 packages that were actually building.

Male Delegate: [00:39:17:08] So, GitHub is also-- so, GitHub is less likely to build for non-x86, you know, platforms; so that's, that's a drawback. The other thing about GitHub is that the licensing is unclear, right? So, like, nobody is really sure-- you know, there's no, like, real way to check, you know, what, what is the licensing for this, you know, for these packages? And, therefore, like, it's unclear, then, if I'm building the binary, legally, am I allowed to, you know, re-host that, or, or, or what? So, like, like, I think for the MUSE program, they, they have a-- they, they have a data set they've put out there that has snippets of GitHub code, but not-- they, they haven't republished their, their, you know, their, their binaries.

Male Delegate: [00:40:11:18] So, the-- with, with Debian, it's all GPL'd. So that means that, you know, I, I can build it and we can put it out there for the community to use. We can re-host it, you know, with no, no legal issues. I'd like to think, too, that the Debian code is like a little bit more serious, or polished on average than, you know, than your average GitHub thing, right? People just, you know, they hack on something and, and throw it on GitHub; so it's, it's kind of a, a mix of, of what you'd find there. But, hopefully, on, on average, the, the Debian code is a, just a little bit, you know, more polished.

Male Delegate: [00:40:53:15] This is-- you can check this out later if you're interested in, you know. And, and, again, this, this, this is a data set that we're hopefully going to put out there for any kind of, you know, any kind of research; not just natural language processing. So, you can, you can check this out later, if, if you're thinking, "Hey, I could really use a, a large set of binaries to, you know, to run my algorithm on."

Male Delegate: [00:41:23:06] This is what a, what a single data record looks like. So, we're putting-- we're outputting, you know, a, a bunch of these, kind of, in-- intermediate debugging output in the, the compilation process. Saving off object files. We're leaving in symbols. We, we save off any system library dependencies, any, any documentation for the package. And then we, we also generate an HTML index, so it can be human browseable, and a JSON index, so that it can be, you know, parsed with, you know, with your program that wants to interface to it.
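
As a rough illustration of the machine-readable side of a record like the one described above, the following Python sketch parses a per-package JSON index and summarizes the binaries it lists. The field names (`package_name`, `binaries`, `object_files`, `has_symbols`, and so on) and the sample record are assumptions made up for this example; the talk does not give the actual schema of the published index.

```python
import json

# Hypothetical index record. Every field name here is an assumption for
# illustration; it is not the schema of the published data set.
SAMPLE_INDEX = """
{
  "package_name": "example-pkg",
  "binaries": [
    {
      "name": "example",
      "file": "example.bin",
      "object_files": ["main.o", "util.o"],
      "has_symbols": true
    }
  ],
  "system_libraries": ["libc.so.6"],
  "docs": ["README"]
}
"""


def summarize_record(index_text):
    """Return one summary row per binary listed in a JSON index record."""
    record = json.loads(index_text)
    summary = []
    # A record without a "binaries" key simply yields no rows.
    for binary in record.get("binaries", []):
        summary.append({
            "package": record.get("package_name"),
            "binary": binary.get("name"),
            "object_count": len(binary.get("object_files", [])),
            "has_symbols": bool(binary.get("has_symbols", False)),
        })
    return summary


if __name__ == "__main__":
    for row in summarize_record(SAMPLE_INDEX):
        print(row)
```

A tool built against the real index would swap in the actual field names, but the shape of the work is likely similar: load the JSON, then walk from package to binary to the saved object files and symbols.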

Male Delegate: [00:42:08:16] That's what it's going to look like when we publish it, which will hopefully be in, in the next couple of months. You've got the-- it's just the splash screen and the, you know, one package on the side, sort of what it looks like.

Male Delegate: [00:42:32:07] Yeah, so, basically, as soon as, as it's built, you know, we'll, we'll work on the open sourcing. Internally, we've got a couple of research projects that are planning to use it. But I hope that everybody can make use of it and, and think about what you could do with a big data set. And, yeah, I hope to see a lot of research on that in the future. That's all I've got.

Male Delegate: [00:43:14:17] I'm not sure. I think I've got a little bit of time. Oh yeah, I want to say, I want to say, certainly thanks to, to Jailbreak, to Tom and Heather for, for having me out. Thank you to, to folks at APL, both in my management that helped me on this, and folks that, that funded the work. I wanted to say a special thanks to Halvar Flake. Halvar, for whatever reason, like, I randomly started talking to him and he, like, answered my, my, my DMs on, on Twitter, and really helped me, sort of, work through brainstorming and, and kind of, en-- encouraging me to, to put this out there. So, I, I just really-- that was, that was awesome, and so I really appreciate him, him doing that.

Male Delegate: [00:44:16:09] Also wanted to thank Igor, Joxean and, and Joan, who helped me track down all of the references, in terms of the historical research on, on how things work. And with that, I'll ta-- I'll take questions. And since you all have suffered enough, here's a, here's a beer meme.

Male Audience Member: [00:44:41:17] When you were working on object file boundary detection, what's the effect of link time optimization?

Male Delegate: [00:44:49:07] Oh, that's a good question. The, the-- so, the question is, what's the effect of link time optimization on object file boundary detection? And link time optimization is, basically, the linker can rip apart the object files and do what it wants. Remove things, you know, remove extra code, or, or whatever. The answer is, I don't have a good answer. I don't see it, I don't see it a lot in the embedded space, which is kind of why I don't have a good answer. A lot of times, it's not necessarily used, unless-- in, in the embedded space, it would get used if people are really trying to s-- their code is a little bit big and they're trying to squeeze it down for size. But I would love for you to test it out. [LAUGHS]

Male Delegate: [00:45:49:10] Any other questions?

Male Audience Member: [00:45:56:01] So, is this mostly just machine learning completely? Or is this-- do you get your hands dirty with like defining rules, or?

Male Delegate: [00:46:09:19] I, I haven't gotten-- so, in-- what I've worked on so far is the CodeCut stuff, which isn't machine learning. I'm, I'm hoping to kind of collaborate in-- internally with our folks at APL that do machine learning and natural language processing, to work on some of those, those things. But there's, there's really a lot of different possible approaches there in, you know. I, I think, ultimately, it's a translation problem, and so there's a big overlap, I think, between, sort of, you know, the-- between machine learning and, you know, and natural language translation, you know. Usually, those two things sort of go together, but I'm, I'm not necessarily limiting it to any par-- you know, particular set of approaches, you know. The-- these are just, you know, the challenge problems that I think people should target.

Male Delegate: [00:47:11:16] I, I should say, I guess, I should repeat the question. The question was, you know, is, is this, you know, specifically, machine learning, is this all machine learning? And so, to summarize, I'm not trying to prescribe ne-- you know, necessarily, an approach there, but I'm sure that's going to involve, you know, a lot of it. And that's, that's why you, right, that's why you would put out a big data set.

Female Audience Member: [00:47:38:22] Have you found that it's difficult to get some of that [INAUDIBLE] some of that [INAUDIBLE] flags and things from people at APL? [INAUDIBLE]

Male Delegate: [00:47:55:03] The question was, do I find it difficult to get source c-- source code to compile, to get people to give, give you source code to compile? The, the nice thing about Debian is that it's all out there and it-- and it's, and it's structured; so I didn't have to, I didn't have to ask anybody [LAUGHS] for it, yeah. The-- and the, the--

Male Delegate: [00:48:22:18] And, another thing, too, with, with our data set is that we've, sometimes with these open source, you know, like Debian, that-- and, and that version in particular, eventually, Debian stops hosting it. But, as, as part of our project, we've completely mirrored all of their code, and then we, we will put that out there as well, you know. So, hopefully, for a very long time, you'll have both the binaries, and you'll still be able to pull all those Debian packages, all the source code packages to look at. So, yeah, no, thankfully, I didn't have to, to, to beg anybody to give me stuff.

Male Delegate: [00:49:11:01] So, yeah, so in this, in this meme, I'm-- if you couldn't figure this out, I'm, I'm junior, right? Like, I-- the-- not, not just because I don't like IPAs, but, you know, also in these memes, like, senior always wins the, the argument anyway, so I always lose the argument, and people are just like, "Here, IPA."

Male Delegate: [00:49:34:15] Alright, thanks everybody. And, so, this is my contact info. And we can talk afterwards. If you follow me on Twitter, I'm, I'm-- I've been putting out updates on ALLSTAR. You can follow me for updates on that when it's published. So, thanks.