Synthesized DNA Malware with Peter Ney.
Dave Bittner: [00:00:03] Hello everyone, and welcome to the CyberWire's Research Saturday.
Dave Bittner: [00:00:07] I'm Dave Bittner, and this is our weekly conversation with researchers and analysts tracking down threats and vulnerabilities and solving some of the hard problems of protecting ourselves in a rapidly evolving cyberspace. Thanks for joining us.
Dave Bittner: [00:00:23] I'd like to tell you a little bit about our sponsor, Cybrary, the people who know how to empower your security team. Cybrary is the learning and assessment tool of choice for IT and security teams at today's top companies. They deliver the kind of hands-on training fifty-five percent of enterprises say is the most important qualification when they're hiring. And once you hire, you want to retain. And Cybrary helps there too, because seventy percent of employees say professional development is a big reason for staying on board. Visit www.cybrary.it/teams, and see what they can do for your organization. Not only is it effective, it's affordable too, costing just about a 12th of what legacy approaches to training would set you back. So contact Cybrary for a demo. That's www.cybrary.it/teams, and tell them the CyberWire sent you.
Peter Ney: [00:01:20] DNA is a biological molecule that's designed to store information. All living things have DNA.
Dave Bittner: [00:01:28] That's Peter Ney. He's a Ph.D. candidate in the Allen School of Computer Science and Engineering at the University of Washington, where he's advised by Professor Tadayoshi Kohno. His current research is focused on understanding computer security risks and emerging technologies like DNA synthesis and sequencing, and the new threats posed by maliciously crafted synthetic DNA. Along with his colleagues at the University of Washington, he's one of the authors of the paper "Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More."
Peter Ney: [00:02:03] DNA is made up of four types of molecules: adenine, cytosine, guanine, and thymine, which we just shorten to A, C, G, and T. And so DNA molecules are just basically a linear sequence of these A's, C's, G's, and T's. And so you can think of it as being very similar to digital data, but instead of having binary data, like zeros and ones, DNA is actually made up of four different types, A's, C's, G's, and T's. So kind of like base-4.
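A minimal sketch of the base-4 analogy, assuming a two-bits-per-base mapping (A=00, C=01, G=10, T=11). The mapping and the function names are illustrative choices, not anything specified in the interview; any fixed assignment of bit pairs to bases would work the same way.

```python
# Hypothetical mapping: two bits per base. The particular assignment is arbitrary.
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(sequence: str) -> bytes:
    """Decode a base string whose length is a multiple of four."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)
```

Under this mapping the single byte `b"A"` (binary 01000001) becomes the four bases `CAAC`, and decoding `CAAC` returns the original byte.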
Peter Ney: [00:02:34] DNA sequencing is just the process of, when you're given a particular DNA molecule, you want to know what is the actual order of these bases in the DNA strand. That's pretty much, at a high level, what DNA sequencing is. It's been around for about 40 years, since the late 1970s. And DNA sequencing at that time was a fairly slow and expensive process.
Peter Ney: [00:02:58] But all this changed in the early 2000s with the development of a new class of technologies, which are kind of broadly referred to as "next generation sequencers." And unlike their predecessors, these sequencers are actually capable of sequencing a massive quantity of DNA all in parallel. So you can do things like sequence an entire human genome, or maybe a hundred human genomes all at once. And so what's happened is that DNA sequencing has gotten really, really cheap since this started. And so, in about 2001, it cost around 100 million dollars to sequence one human genome. And today we can do it for about a thousand.
Dave Bittner: [00:03:36] So, contrast that to me. I think many of us are familiar with some of the consumer DNA sequencing services, you know, that for a hundred dollars you can get your DNA sequenced, and find out your genetic background. How does that compare to what you're talking about with this kind of sequencing?
Peter Ney: [00:03:54] When I'm talking about sequencing, I'm saying given a DNA molecule, I want to know every single base and the order of all the bases in the molecule, so this way, you can think of this like, you know, proper full sequencing.
Dave Bittner: [00:04:04] I see.
Peter Ney: [00:04:04] There are other kinds of techniques that kind of just sequence little, individual bases, but not all of the bases in a DNA molecule. And so that's, for example, if you've heard of 23andMe.
Dave Bittner: [00:04:15] Yeah.
Peter Ney: [00:04:15] That's the kind of sequencing they do. What I'm talking about is typically referred to as full genome sequencing. So it's actually, sort of trying to sequence, you know, every single base in the human genome.
Dave Bittner: [00:04:25] And how much data are we talking about with the full sequence of, say, the human genome?
Peter Ney: [00:04:29] So, the human genome itself is about three billion bases long. So typically when you generate sequencing data, you have lots of redundancy, and so you sequence the same parts of the genome over and over again, maybe upwards of twenty, thirty times. So typically you're talking about generating hundreds of billions of DNA bases, which in terms of storage might be upwards of 20 gigabytes or more. And some of the really high throughput sequencing machines which have been developed, can do, I think they might generate terabytes of data in a single sequencing run.
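The storage estimate can be checked with some back-of-the-envelope arithmetic. The sketch below assumes a genome of roughly three billion bases, thirty-fold redundancy, and two bits per base with no quality scores or metadata. All three are simplifying assumptions; real sequencing files carry more per base and are often compressed, so actual sizes vary.

```python
def sequencing_run_size(genome_bases: int, coverage: int, bits_per_base: int = 2):
    """Return (total bases sequenced, bytes needed to store the bases alone)."""
    total_bases = genome_bases * coverage
    total_bytes = total_bases * bits_per_base // 8
    return total_bases, total_bytes


# Assumed figures: ~3 billion bases, each region sequenced ~30 times.
bases, size = sequencing_run_size(3_000_000_000, 30)
print(f"{bases:,} bases, about {size / 1e9:.1f} GB")
# prints "90,000,000,000 bases, about 22.5 GB"
```

Under these assumptions the bases alone come to a little over 20 gigabytes, in line with the figure quoted above.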
Dave Bittner: [00:05:03] So basically, given that DNA is a way to store information, and these systems are taking the biological thing that is DNA and turning them into computer data, describe to us your approach for trying to exploit that.
Peter Ney: [00:05:18] I would just add onto what you just said, which is that, you know, really what you can, when you think about DNA sequencing, what's actually happening is that it's this kind of intermediary between biological data and digital data. And so, you know, I think we've known for a long time that any time computer systems process digital data, there is the possibility that that data could be used maliciously to target vulnerabilities in that software.
Peter Ney: [00:05:44] And so, since DNA sequencing is just taking these biomolecules and turning them into digital data, we were really wondering, can we actually start by making particular biological DNA samples so that when they're sequenced they would actually end up as malicious sequencing data files.
Dave Bittner: [00:06:03] So how did you determine what you were going to target? Am I right, reading through your research, did you sort of, you know, artificially set up some vulnerabilities within the DNA sequencing software?
Peter Ney: [00:06:15] That's correct. So really, our research was kind of two phases. The first phase, we were interested in kind of a proof of concept, to see whether you could actually, starting all the way with DNA molecules, end up with sequencing files that would target, say, a vulnerability that was discovered in the software. So we were really more interested in kind of trying to understand the limitations of both generating artificial DNA molecules and the sequencing process. And then later on, we actually did kind of a security analysis of existing DNA analysis utilities.
Peter Ney: [00:06:53] We have the ability, it's called de novo DNA synthesis. So we have the ability to make completely artificial DNA molecules that don't derive from biological sources. So in some sense you can think of us as having the ability to write kind of arbitrary DNA sequences. The problem is that our ability to make DNA molecules is somewhat constrained. You can't just make any DNA sequence. There are limitations there, and the DNA sequencing process also has lots of noise and randomness that's just inherent in how sequencing works.
Peter Ney: [00:07:26] And so, it's not totally clear upfront whether you can actually have enough control over the information that's flowing to digital data files to actually create malware. So that was really our research question.
Dave Bittner: [00:07:40] In terms of creating DNA from whole cloth, did you have to deal with the fact that the scanning software was expecting to see certain things?
Peter Ney: [00:07:51] Yeah, so at the end of the day, what you're still getting out of the sequencer is going to look like, basically, DNA sequences. But the thing is that these utilities are doing all sorts of analysis on this data, with all sorts of complicated algorithms, to manipulate it in particular ways. And so, for example, you might, say, generate sequences so that when they're sequenced a particular algorithm gets into a weird state, or processes it in a particular way so that it is, say, processing data that's larger than it would expect. So you can think of, like, buffer overflow vulnerabilities, or different things like that.
Peter Ney: [00:08:28] And I would also point out that you might ask, like, sort of what kind of analysis are you doing too. And I think the idea is that, the data that comes out of these sequencers is really quite raw and isn't very useful by itself, because what you're actually doing is you're taking these long DNA molecules, say, like a full chromosome from a human genome, and you actually break it into little tiny pieces. And so what you're actually doing is sequencing, you know, hundreds of millions of really short DNA molecules. And then, to actually kind of reconstruct larger DNA sequences, or to ask particular biological questions, you're going to do all sorts of analysis and complicated algorithms using these short DNA fragments that you've sequenced.
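The idea of breaking a long molecule into short, overlapping pieces can be sketched in a few lines. This is a toy model with assumed parameters: real fragmentation is random rather than evenly stepped, and a real sequencer does not report where each read came from. The start offsets are kept here only so the redundancy at each position can be counted.

```python
from typing import List, Tuple


def fragment(sequence: str, read_length: int, step: int) -> List[Tuple[int, str]]:
    """Cut a sequence into overlapping reads, keeping each read's start offset."""
    reads = []
    for start in range(0, len(sequence) - read_length + 1, step):
        reads.append((start, sequence[start:start + read_length]))
    return reads


def coverage(reads: List[Tuple[int, str]], length: int) -> List[int]:
    """Count how many reads cover each position of the original sequence."""
    depth = [0] * length
    for start, read in reads:
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth


reads = fragment("ACGTACGTAC", read_length=4, step=2)
print([read for _, read in reads])  # ['ACGT', 'GTAC', 'ACGT', 'GTAC']
print(coverage(reads, 10))          # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

Downstream tools have to work backwards from reads like these, which is where the complicated analysis comes in.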
Dave Bittner: [00:09:10] And one of the things you all discovered in your research was that this software that is used for the DNA analysis was lacking some basic security best practices.
Peter Ney: [00:09:20] Yeah, that would definitely be true. I think it would be, it's helpful too to understand, you know, who is writing this software. This whole space has been changing so much that a lot of the utilities that are used by scientists to analyze DNA data, have actually been written by either biologists or maybe people with some bioinformatics background, but a lot of them--some of them do, but not all of them--don't have kind of formal, kind of software development experience, and a lot of these programs are written in languages like C and C++, that if you're not careful oftentimes contain vulnerabilities. You find that, in some sense, because there probably hasn't been much adversarial pressure on these programs so far, that they are somewhat lacking in security.
Peter Ney: [00:10:07] So we found, for example, many buffer overflow vulnerabilities in these programs, which is mostly what we were looking for. But also that they were using a lot of function calls that are known to create security problems. And just looking at a bunch of metrics, it seems like a broad class of these programs weren't written with security in mind.
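One crude way to gather a metric like this is to count calls to C library functions that are easy to misuse. The sketch below is only an illustration: the list of function names is an assumption, not the list the researchers used, and a plain text scan will also count matches inside comments and string literals.

```python
import re

# Assumed list of C library calls that commonly lead to overflows when misused.
RISKY_CALLS = ("strcpy", "strcat", "sprintf", "gets")
# \b keeps "fgets(" or "vsprintf(" from being counted as "gets(" or "sprintf(".
CALL_PATTERN = re.compile(r"\b(" + "|".join(RISKY_CALLS) + r")\s*\(")


def count_risky_calls(c_source: str) -> dict:
    """Count occurrences of each risky call in a string of C source code."""
    counts = {name: 0 for name in RISKY_CALLS}
    for match in CALL_PATTERN.finditer(c_source):
        counts[match.group(1)] += 1
    return counts
```

A count like this says nothing about whether any given call is exploitable. It only flags code that deserves a closer look.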
Dave Bittner: [00:10:27] It's an interesting lesson I think for security professionals in particular in that, you know, it seems like this was an attack surface that no one had ever really considered before.
Peter Ney: [00:10:36] Yeah, I think so, and I think people have probably thought, well, maybe the traditional security problems you get, like just sending, say, malicious files back and forth, maybe even DNA sequencing data files, I think people have thought about that. But it is interesting that, you know, anytime you have information that eventually ends up in a computer system, you have to consider who's generating that data, where it's coming from, and try to design programs that are robust to it. And so I do think there is kind of a broader lesson, which is that any time you're taking data, you need to think about security.
Peter Ney: [00:11:10] In some sense, DNA is very similar to digital data, because it's discrete, because there's only, you know, the four bases. So it actually, there's a pretty close analog to digital data. And so you have a lot of control over DNA, and what you can create. So, it gives you a lot of control over the types of inputs you can send to these systems.
Dave Bittner: [00:11:29] What are some of the bad things that people could potentially do, you know, exploiting the things that you all have learned?
Peter Ney: [00:11:34] Well, I would just say, I would just start by saying that DNA analysis is getting fairly ubiquitous, and so we're seeing DNA sequencing being used in all sorts of domains, like medicine, so genetic testing, personalized medicine, forensics. The new fields of, sort of, bioengineering, genetically modified organisms. And so there's a lot of different assets and things that attackers might be interested in manipulating, or stealing, or modifying.
Peter Ney: [00:12:01] So, you could imagine, someone could use DNA as a vector to just steal sensitive sequencing data. So this could contain things like intellectual property, or just DNA sequences from individuals. They might also be able to modify in malicious ways, say, genetic tests. So could they, you know, if you control a system that processes DNA data, you could use that to manipulate, say, genetic testing and make people look like they have genetic diseases they don't actually have, or the opposite, you know, mask known genetic diseases.
Peter Ney: [00:12:36] I think forensics is very interesting, because if someone is able to create DNA that they know will eventually be sequenced, so you could think of it like a crime scene, and then that data is then sequenced through some particular workflow, and processed through vulnerable programs, then you could imagine someone manipulating forensic systems for example.
Peter Ney: [00:12:57] We're going to enter a world in the near future where pretty much everyone's genome is going to be sequenced. Sequencing is going to be a very routine procedure, especially as the price of DNA sequencing continues to drop.
Dave Bittner: [00:13:11] One of the things you discovered was a side effect, and that was information leakage. Can you describe that for us?
Peter Ney: [00:13:17] Yeah, so there's, the way these machines work to get so cost effective, is that you typically don't just sequence one sample at a time; you actually sequence many samples at a time. And so what actually happens is, you take, let's say you have five different people whose genomes you want to sequence, you might take these five individuals and pool their genetic data together, and sequence it all at once.
Peter Ney: [00:13:40] But to actually figure out, you know, whose sequence goes with which person, you actually, before sequencing all of the samples have a unique DNA barcode that's added to each sample, so that at the end of sequencing, you can actually kind of figure out which DNA went with which person. We kind of call this "sample multiplexing."
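A simplified sketch of the barcode idea: each pooled read carries a short tag identifying its sample, and the reads are sorted back out by that tag afterwards. The sketch assumes the barcode is a fixed-length prefix on each read and that barcodes must match exactly. Real instruments may read the index separately and tolerate mismatches, and the sample names here are made up.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample.

    reads:    iterable of strings, each starting with a barcode
    barcodes: dict mapping barcode string -> sample name (all the same length)
    """
    barcode_length = len(next(iter(barcodes)))
    by_sample = defaultdict(list)
    for read in reads:
        prefix, insert = read[:barcode_length], read[barcode_length:]
        # Reads with an unrecognized barcode are set aside rather than guessed at.
        sample = barcodes.get(prefix, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


pooled = ["ACGTGGGG", "TGCACCCC", "ACGTTTTT", "AAAAGGGG"]
print(demultiplex(pooled, {"ACGT": "alice", "TGCA": "bob"}))
# {'alice': ['GGGG', 'TTTT'], 'bob': ['CCCC'], 'undetermined': ['GGGG']}
```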
Peter Ney: [00:14:01] The problem is our ability to sort of demultiplex. So you pool all these samples and then, you know, try to separate them out, try to separate all the sequencing data out at the end. The problem is that there is sort of a small amount of data leakage that happens between the samples. And so this is kind of, you can think of this like a side channel. So, you know, if an attacker is capable of sequencing a sample alongside other DNA samples, they might actually be able to influence those other samples in particular ways.
Peter Ney: [00:14:33] So, for example, if there is vulnerable sequencing software that's going to process this data they could push malware, or the other way they could actually read data from other samples. So, because we know that that data from other samples will end up in files that belong to the malicious actor. So, in some sense, you have the ability to both kind of pull data and push data into other sequencing data files. And in our experiments, we were able to find that there was some information leakage. So it's not clear how imminent this threat is, but I think it's something definitely to consider going into the future.
Dave Bittner: [00:15:06] Was that information leakage random or was it something that you were able to control?
Peter Ney: [00:15:11] Yeah, at this point it's fairly random. The thing is, is that if the attacker is able to make a particular DNA sequence, and so that their entire sample was, say, made up of just one DNA sequence, then in some sense, while the particular DNA that is bled over, you can't control that, but since it's all made up of one sequence you'll end up knowing what sequence is going to move into the other samples. You might have control over it, but it is still a fairly random process.
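The point about a single-sequence sample can be illustrated with a toy simulation. Everything here is assumed for illustration: the leak rate, the sample names, and the model in which each read independently lands in a wrong sample with a fixed probability. It is not a model of the actual leakage mechanism.

```python
import random


def pooled_run(samples, leak_rate, seed=0):
    """Simulate demultiplexing in which a small fraction of reads is misassigned.

    samples:   dict mapping sample name -> list of reads
    leak_rate: assumed probability that a read ends up in another sample
    """
    rng = random.Random(seed)
    names = list(samples)
    output = {name: [] for name in names}
    for name in names:
        others = [n for n in names if n != name]
        for read in samples[name]:
            if rng.random() < leak_rate:
                output[rng.choice(others)].append(read)  # read bleeds over
            else:
                output[name].append(read)
    return output


payload = "ACGTACGT"
samples = {
    "attacker": [payload] * 1000,              # whole sample is one sequence
    "victim": ["TTTTGGGG", "CCCCAAAA"] * 100,
}
result = pooled_run(samples, leak_rate=0.05)
leaked = [r for r in result["victim"] if r not in samples["victim"]]
print(set(leaked))  # every foreign read in the victim's file is the payload
```

Which reads leak is random, but because every attacker read is identical, whatever does leak is known in advance.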
Dave Bittner: [00:15:40] So your ability to custom sequence DNA, is that at all a limiting factor in terms of access to that, or price of that?
Peter Ney: [00:15:50] Sequence or create?
Dave Bittner: [00:15:51] To create.
Peter Ney: [00:15:53] Create, yeah so it's a synthesis.
Dave Bittner: [00:15:54] I'm sorry, the synthesis, yeah.
Peter Ney: [00:15:56] Yeah, yeah, so it is really easy actually. So we actually used an outsourced synthesis service. There are many of these companies. And what you do is, you basically go into their web form, so they have a web form with a big open box. You just paste in the DNA sequence you want to order, and they'll ship it to you. So, no, and it costs about a hundred dollars to order our sequence.
Dave Bittner: [00:16:19] Have these people never seen any 1950s science fiction movies?
Peter Ney: [00:16:24] [Laughs] It's a good question. These synthesis services do look for, it's interesting though, they do look for known malicious biomolecules, so say, virus sequences.
Dave Bittner: [00:16:36] Interesting, yeah.
Peter Ney: [00:16:36] You know, so there are certain types of sequences they do look for, but they're certainly not looking for sequences that might contain computer code or computer data.
Dave Bittner: [00:16:45] What's been the reaction so far to your research?
Peter Ney: [00:16:48] You know, I think it's been pretty much what we expected, which is, in some sense, what we demonstrated is really still a proof of concept. There were lots of challenges we encountered. It was still really challenging just to make it work in sort of the most ideal circumstances. So we don't think it's, sort of an imminent threat. But I do think we've gotten people to start thinking about, hey, we're doing all this DNA sequencing, we're sequencing all this really important data, we're going to be doing a lot more sequencing in the future, the technology is changing rapidly, we really need to start thinking about these, sort of novel, sort of vectors that data can start moving into these computer systems.
Peter Ney: [00:17:28] And so, I think it's really more just letting people start thinking about this and not so much that it's sort of imminent right now. But I'm hopeful that in five or ten years maybe when these threats are more, maybe more imminent, that we'll have at least had five or ten years to start shoring up the security of the software that's doing all this DNA processing, you know, before more bad things happen.
Peter Ney: [00:17:49] And I would, one thing I'd mention too, which is really cool, there's some really interesting use cases of DNA sequencing that are on the horizon that really make this I think more relevant. So one really cool use of DNA sequencing is actually using DNA as a method to store digital data. And the reason you would do this, is because DNA is very stable and can last for hundreds or thousands of years, and it has very, very high density. So I have heard, for example, that you could store all the digital data in the world inside of a car if it was stored in DNA. So really what's happening is that we're actually going to be continuing to blur the line between biological and digital data. And so I think there's going to be some really interesting threats and vectors moving into the future.
Dave Bittner: [00:18:35] And what are your thoughts in terms of what needs to be done to protect against the types of exploits that you all have explored?
Peter Ney: [00:18:41] I think the first and most obvious is just common security best practices: don't have buffer overflow vulnerabilities. You know, do security audits of your software, do some input validation. So kind of routine security practices, and start thinking about DNA sequencing software in the same way people think about Internet services, web servers, things like that. And I think that would go a long way, because right now I think these kinds of attacks are challenging, but the software security is so poor that they might actually be possible going into the future.
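As one example of what input validation might look like for sequencing data, the sketch below checks a single record before it is handed to any analysis code. It assumes FASTQ-style four-line records (an identifier line starting with `@`, the bases, a `+` separator line, and a quality string the same length as the bases). The length limit and the function name are hypothetical.

```python
MAX_READ_LENGTH = 300           # hypothetical upper bound for this pipeline
ALLOWED_BASES = set("ACGTN")    # N marks a base the sequencer could not call


def validate_record(header, sequence, separator, quality,
                    max_length=MAX_READ_LENGTH):
    """Raise ValueError if a four-line record is malformed or oversized."""
    if not header.startswith("@"):
        raise ValueError("header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("separator line must start with '+'")
    if not 0 < len(sequence) <= max_length:
        raise ValueError("sequence length out of bounds")
    if not set(sequence) <= ALLOWED_BASES:
        raise ValueError("sequence contains unexpected characters")
    if len(quality) != len(sequence):
        raise ValueError("quality string length must match sequence length")
```

Rejecting oversized or malformed records up front means later code never has to trust a length it did not check.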
Peter Ney: [00:19:16] So I think that's kind of, at least in my opinion, sort of like the first step more than anything else.
Dave Bittner: [00:19:25] Our thanks to Peter Ney from the University of Washington for joining us. If you want to read the complete paper, it's available online. It's called "Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More."
Dave Bittner: [00:19:39] And thanks again to our sponsor Cybrary for making this edition of Research Saturday possible. Visit www.cybrary.it/teams and see what they can do for your organization. Don't forget to check out our CyberWire Daily News Brief and podcast, along with interviews, our glossary, and more on our Web site, thecyberwire.com. The CyberWire Research Saturday is produced by Pratt Street Media. Our coordinating producer is Jennifer Eiben, editor is John Petrik, technical editor is Chris Russell, executive editor is Peter Kilpe, and I'm Dave Bittner. Thanks for listening.