Ep 12 | 4.25.21

Channeling the data avalanche.

Transcript

Dave Bittner: Hello everyone, and welcome to CyberWire X, a series of specials where we highlight important security topics affecting organizations around the world. I'm Dave Bittner. Today's episode is titled 'Channeling The Data Avalanche'. Proliferation of data continues to outstrip our ability to manage and secure it. The gap is growing and alarming, especially given the explosion of non-traditional smart devices generating, storing and sharing information. As Edge Computing grows, more devices are generating and transmitting data than there are human beings walking the planet. High speed generation of data is here to stay. Are we equipped as people, as organizations and as a global community to handle all this information? Current evidence suggests perhaps not.

Dave Bittner: The International Data Corporation predicted in its study, Data Age 2025, that enterprises will need to rely on machine learning, automation and machine-to-machine technologies to stay ahead of the information tsunami, while efficiently determining and iterating on high-value data from the source in order to drive sound business decisions. That sounds reasonable, but many well-known names in the industry are trying, and failing, to solve this problem. The struggle lies in the pivot from big data to fast data: the ability to extract meaningful, actionable intelligence from a sea of information and do it quickly. Most of the solutions available are either prohibitively expensive, not scalable, or both.

Dave Bittner: In this episode of CyberWire X, our guests will discuss present and future threats posed by an unmanageable data avalanche as well as emerging technologies that may lead public and private sector efforts through the developing crisis.

Dave Bittner: A program note: each CyberWire X special features two segments. In the first part of the show we'll hear from industry experts on the topic at hand. And in the second part, we'll hear from our show sponsor for their point of view. And speaking of sponsors, here's a word from our sponsor, Tanium.

Dave Bittner: To start things off, my CyberWire colleague, Rick Howard, speaks with Don Welch from Penn State, about big data, and Steve Winterfeld from Akamai, on data lakes. We'll conclude our program with my conversation with Egon Rinderer from our show sponsor, Tanium, for his insights on what successful organizations are doing to channel the data avalanche.

Dave Bittner: Here's Rick Howard.

Rick Howard: Amazon started the Cloud revolution when it rolled out AWS in 2006. Microsoft followed suit with a competing service in 2010 called Azure. And then Google started to compete in the space with Google Cloud Platform, or GCP, in 2012. Somewhere between that time frame and now, it became exceedingly cheap to store everything in the Cloud, compared to how we used to do it in our own data centers managing large disk farms. And when I say everything, I'm talking petabytes. And just for reference, a petabyte is equivalent to storing just over 13 years of continuously running HDTV video. The crazy part is that some of us are already approaching the storage of exabytes, which is equivalent to the volume of content that Internet users create every day. Unbelievable!

Rick Howard: I mention all of this to highlight the fact that since we can save all that data, many of us are. And countless organizations in academia, the commercial space, and government are pursuing, and I'm using air quotes here, "data lake projects", and are either trying to build machine learning algorithms or run statistical analysis on the data to find solutions to real world problems. I thought it was time to bring in some expertise on these projects to see how they were going, to examine if they were doing anything useful, and to consider the security implications of such a giant undertaking. I'm joined by two CyberWire Hash Table subject matter experts, old Army buddies of mine, who have been in the fields of IT and security since the world was young. The first is Steve Winterfeld, my best friend in the world by the way, and currently the Akamai Advisory CISO. And second is Don Welch, the former Penn State University CISO and now the interim CIO.

Rick Howard: Don's data lake project is a collaborative research effort with other universities, run by a non-profit called Unizin, that is using GCP to store the data. Here's Don.

Don Welch: It's student success. What we want to do is understand when a student is starting to fall behind in a class, so that a counselor can intervene with them. And also, help the counselors understand what the student's chances of success are going in. If you're a Computer Science major and you're taking operating systems and compilers and software engineering in the same semester, we know that's a formula for failure. We look at all the data that we have on them, to include their success in previous courses, and correlate that to know, is this a good combination of classes? So we can do a better job of advising students as they go through their course. And then also be able to alert the student and the professor if they are falling behind, if they're not spending enough time in their online resources. Whatever data we have, there's a lot of factors to try and help students succeed because, as we know, the big deal with student debt is students who don't graduate. And then they have no degree, but they also have student loans to pay off and that's not good for anybody.

Don Welch: If they come to Penn State, we want them to graduate and we work hard to help them graduate. Unizin is the collaborative and I think we have 22 members right now. A lot of large research universities, and we are all putting information into a Unizin database. It's anonymized so that researchers can study it and learn more about how students learn, how students succeed. All different kinds of research problems that you could imagine coming from the student data that's collected. One of the things that's nice about it is we store it in a standard format, so that if we build a tool and Iowa says, "Oh wow, that's great," Iowa could use it, and vice versa, we can use a tool that University of Michigan has developed and we can work together that way because it's a common data store.

Don Welch: Our project is not a machine learning project. We want to understand why the decisions are being made. Basically, it's statistical analysis: when we find things that are pertinent, then we will look at them and determine whether or not we put them in the production equation. One of our concerns is implicit bias or unintentional bias that may come out of machine learning. By knowing exactly what we're doing, we want to make sure that we are warning students when they may be taking on something that will be a problem. But what we don't want to do is discourage students from challenging themselves. That's kind of a fine line to walk. We are not necessarily trusting blind algorithms to figure that kind of thing out. We're using lots of stakeholders and advisors to make sure that we're walking that line as well as we can.

Rick Howard: Steve's data lake project is more of a traditional security vendor effort where Akamai collects telemetry from various products deployed by their customers, as well as collecting data from outside sources. Akamai sells CDN Services, which stands for Content Delivery Network Services, as well as other traditional security tools. Here's Steve.

Steve Winterfeld: We have sets of data around our CDN. We have data around our web application firewall and then around our secure web gateway. A lot of that is one centralized database. Other aspects still are separate and so we've got a data lake and a couple of data pods. We have data in there from our customers across multiple industries. And then we have data that is from outside our customer experience that's used to reinforce our threat intelligence. Some of this is used for analytics that the customer can do through their interface, some of this is used by our threat researchers looking for trends. And it's used in different ways. So some can be direct query, some we're doing more of that machine learning, trying to stay ahead of threat activity.

Rick Howard: In all of these discussions, what comes up a lot are the unique challenges to big data problems that you don't see anywhere else. Here's Steve.

Steve Winterfeld: One of the things that is interesting is where you're trying to manage across multiple iterations of data collectors. We have every customer with very deep ability to customize what they're doing. Are they monitoring, are they managing, are they stopping stuff? Did they configure something to get a huge set of false positives, which they're not interested in cleaning up, because that's behind them? You know, what is that quality of data? As you put out a push, and you have a new capability, you have a line in the sand where the data is going to be different. But as you go to your big data lake, you generally don't think in terms of, when did a push go out, when did customers start changing configurations? And so I think that within the security field is one of the things that makes it a lot more complex.

Rick Howard: Don's challenges come from managing a massive IT project and designing for the long term, and also coordinating across many different stakeholders who may or may not have the same goals that he does. These are not bad goals, just different.

Don Welch: You know, there's a lot of different people who have roles in this. And they're very excited because there's a lot of benefits that come from these, but people accuse Central IT of being slow and being the ones who will always slow things down. But there's a good reason for it. You have to do documentation, you have to do testing, you have to build code that is maintainable. You have to have a decent architecture. Otherwise projects become unwieldy very quickly. And the difference between something you can hack together quickly and something that will stand as an enterprise system for a length of time is pretty significant, and not everybody understands it. And that, I think, is one of the problems: the effort that's required to really do an enterprise-capable project.

Don Welch: We all have slightly different privacy values and security standards and that's one of the reasons for having a privacy team and a security team, to make sure that all the universities who are members of this will be comfortable with what comes out. That they had a chance to be represented. They trust that people are making the best decisions. Because the people involved in the project are not always the people who can say yes or no. Attorneys get involved, people in other parts of the university may be concerned and want to make sure that we've taken all the proper steps to do the right things.

Rick Howard: Steve and I have had a running debate for a while now about whether or not you need to collect victim intelligence in your data lake. And therefore open yourself up to all kinds of compliance violations. My argument is that if you just want to stop adversary groups from being successful, the only data you need to collect is the telemetry about how the adversary group traverses the kill chain. You don't need victim data at all, and your automated systems can easily not collect it. Steve takes a different view and remember, he and I are friends, so name-calling is kind of our thing.

Steve Winterfeld: The bigger question is, what data do you really need? Can I get rid of the customer data and focus on the system data and still achieve our mission? When we're looking at adversary data, it's co-mingled with the victim's. And the victim's is where the privacy issue is. How do you pay attention to one without knowing who they were attacking? In a case of fraud where they're coming in and doing account takeover, I do need to know, because I need to notify the customer. I need to backtrack the fraud, I need to refund the money. I think that's such a narrow use case that it's not valid and you're stupid and ugly.

Steve Winterfeld: I think we're paying more attention to scoping and what do we really need. That's where it matters. If you need it, fine. Then you need to think about encrypting it and protecting it and doing all the right things for it, masking it. However you're going to protect it. In the past, we thought of the threat database as not necessarily a privacy risk which is changing.

Rick Howard: At Penn State University, Don lists two main compliance laws he tangles with: FERPA, or the Family Educational Rights and Privacy Act, and GDPR, or the EU's General Data Protection Regulation. But before I let Don explain the difference between the two, just know that he and I both attended the United States Military Academy, back when dinosaurs ruled the planet. Back then, professors had the habit of posting your grades, complete with name and how poorly you did, right on the wall for everybody to see. One year, I was struggling through a mechanical engineering class and knew that the term-end exam was either going to make me or break me in terms of me having to go to summer school that year. The one positive thing I had going for me was that my class also had a star football player in it who was struggling as much as I was. Since the Academy didn't have a lot of star football players back then, I knew that there was a good chance that he might pass the course. I didn't so much have to pass in the traditional way, I just had to get a better score than him. Sure enough, at the end of the year, the teacher posted a list of typed cadet names from class, along with their grades, sorted from best to worst, and a thick red line indicating everybody above, that passed, and everybody below, that didn't. The red line was under the star football player's name, so he passed. And my name was the one listed just above his.

Rick Howard: But I digress. According to Don, FERPA doesn't allow that kind of shenanigans anymore.

Don Welch: It's all FERPA information. FERPA is the educational one, so in the old days when we were students, the professor used to put our grades up on the door. If they were a nice person your name wouldn't be on there. Obviously, we were all scarred for life because of that. But FERPA says you can't do that anymore. So any educational information has to be protected to a certain level. So it's not a really high bar, it's not like HIPAA or GLB or PII kinds of things.

Rick Howard: But GDPR is a different matter.

Don Welch: We have lots of international students who could invoke GDPR, the right to be forgotten, or obviously data protection laws. If we had a breach and it was exposed, the EU could sanction us under GDPR. One of the things that we have to determine, if somebody makes a right to be forgotten request, is, is this person actually covered under GDPR? And our legal team will help us determine that. And it's what kind of data would we actually have to remove from our system. And under GDPR there are provisions that things that you need for archival or for operational purposes do not fall under that right to be forgotten rule. For example, if you came to school at Penn State and you got an F in Cyber Security class, you do not have the right to have that be forgotten. You can't just say, "Hey, take that information out." That is part of the historical record, it's part of our operations. If you were a student at Penn State and you attended theater productions and athletic events and you bought stuff from the Penn State website, you could have yourself removed from there, from that commercial activity, because that's not core to our mission. Making sure that we understand which data is subject to GDPR and which data is not was an important step to our GDPR compliance program.

Dave Bittner: That's the CyberWire's Rick Howard. He was speaking with Don Welch from Penn State and Steve Winterfeld from Akamai. Coming up next, my conversation with Egon Rinderer from our show sponsor, Tanium.

Egon Rinderer: We were arriving at this point, regardless. I think the last 12 months gave us a little bit of an early warning as to what was happening. And the reason I say that is, we have this convergence of burgeoning technologies right now. And a lot of it is becoming very buzzwordy in the press, that we have things like 5G and people talk about AI and Edge computing and non-traditional computing, containers and Cloud. What's happening though is that we have these areas of technology that are all sort of coming into their own. And if you look at industry statistics, and there's lots of sources out there whether it's IDC or what have you, there's a couple of predictors that we need to pay attention to. One is that we expect to see a two order of magnitude increase in data producing endpoints, data producing devices if you will. Two order of magnitude increase by 2025. That's not very far away. Each one of those things is generating data. And at the same time we have things like 5G. And it's not just 5G, there's a whole generational leap forward in wireless technologies that's happening right now, that allows us to not only have all of these non-traditional computing, lots of things out there producing data, but they can also now be connected, fairly ubiquitously, at a very low cost in a way that's not been possible before.

Egon Rinderer: And so you think about that for a moment, right? And then you think about what the past 12 months meant to us in terms of the way that we go about, today, data instrumentation, collection, centralization, and analysis. And what happened is that overnight we saw this huge shift and a pretty large chunk of the total enterprise endpoints left the enterprise. They went outside the perimeter and they're now remote. Well, a lot of the legacy methods and techniques and platforms and tools that we used for data collection sort of ceased working, or at least worked in a very degraded state, when that happened. And when you boil down what we do in the technology world and the way that we make decisions in IT, it's really all about data collection, right? You have to instrument it, you have to collect it. You've got to be able to have accurate data to be able to make good decisions with. And data accuracy relies on some pretty simple tenets, like it has to be complete, it has to be timely. And completeness and timeliness are relative terms depending on the types of data that you're working with. But all data has a value life to it. If it's very ephemeral data, like for security purposes, that sort of thing, like running processes on endpoints and things, the value life of that sort of data may be seconds, literally.

Egon Rinderer: If it's inventory data, things like that, it may be weeks, it may be months. It varies but we have to take those things into account. And so the way that we do that, by and large today, is we instrument it statically. In other words, we say these are the things that I want to know, and I'm going to build a system of collection for those things and I'm going to put it somewhere and then I'm going to analyze it once it's all centralized. I've got to gather it up first and I've got to make sure I gather everything I could possibly need, put it in one place and then I can do interesting things with it and make decisions based on it.

Egon Rinderer: Well, we have these other tenets in data, so the concepts of volume and velocity. So the first is veracity: it has to be timely, it has to be complete. That feeds veracity. The volume and velocity part is what we're now facing as an industry. So we're going to see this massive leap forward in volume and this huge leap forward in the velocity of data that we have to deal with. And that breaks the old model. It negates the ability to say, I'm going to statically instrument all of this stuff and I'm just going to centralize it all, and then figure out what to do with it, because now suddenly, you've got way more than you could ever hope to centralize. Much more than is valuable. If you, again, go to an IDC statistic, only five percent of the data today, and forget about what we're heading into, but only five percent of the data today that is collected will ever be accessed again. And you think about that for a second, think about the cost between infrastructure and people and resources that goes into data collection and retention today. And we only ever make use of five percent of it. And now extrapolate that out when we're looking at a two order of magnitude increase in data producers and this huge increase in velocity of data; we simply can't afford to do that. So we have to come up with new and innovative ways to think about how do we go about getting to the data that we need and doing analysis on it and iterating on it, right?

Egon Rinderer: So if I ask a question of my data, that's generally not the end of it. That's typically going to drive the next question, and the next and the next, until I've distilled it down to something that I trust enough that I can take some action on it. Whether it's to fix something, to mitigate something, to replace something, whatever that is. But I have to be able to do all of that quickly enough that the data value life hasn't expired by the time I arrive at my conclusion. And that's where it starts to get really, really sketchy with the way that we do things today. So I think moving forward, we've got to look at how do we iterate on that data without having to first centralize it and put it all into one giant place. And let's only centralize the stuff that's of really high value to us.

Dave Bittner: Help me understand here, because on the volume side of the equation, it strikes me that storage has never been cheaper than it is today. And it seems to be heading in that direction. Does that lead to almost a counterintuitive pack-rat kind of attitude where it's so cheap, I might as well just store everything rather than being careful about whether something is worth storing or not?

Egon Rinderer: That's right. And so the answer is yes, it does. And here's the problem. Go back to what I said about that two order of magnitude increase. Again, we are exceptionally bad at understanding just exactly what that means in terms of data volume. So if you want to look at it just in terms of raw storage capacity, the projected increase in stored data between now and 2025 is 84 times. If you apply Moore's law to this, or whatever method you like, if you think about what that means in terms of data produced versus what we'll see in terms of the totality of our ability to store and retain data, it's not going to grow at that same pace. That's the problem, right? If they were growing in parallel with one another, if it was a perfectly matched trend line, we'd be okay. Assuming we could move the data quickly enough to do that. But the fact of the matter is, it's not. We've got to figure out how to distill down before we store. And it's great and it's fantastic for us as an industry that storage does continue to get cheaper and cheaper. It's fully commoditized at this point. But that doesn't mean we want to be thoughtless about the things that we store, because the vast majority of it is noise. And the other thing that we have to take into consideration is that what's important, when I made the comment earlier about let's store what's valuable, well that value proposition changes over time as well.

Egon Rinderer: There may be data out there that I just flat don't care about right now, but if something happens, maybe there's a breach, maybe there's some sort of event, what have you, suddenly that data may become incredibly valuable to me. And that's why you can't statically instrument any longer, because you have to be able to go back to the well and say the situation has changed, I now need to know this and get access to the high value subset of whatever this is and pull it into that central location very quickly, so that you can now take it and count it in your analysis. And so, yes, it's a long answer to your question, the increase in storage is great. It drives bad habits in terms of compelling us to store everything and then deify the volume of data that we have, rather than putting the importance on the value of the data that we've stored.

Dave Bittner: So what do you recommend then? I mean, what are the options folks have to come at this problem?

Egon Rinderer: So this is something that we, as a company, we've spent 14 years thinking about. And if you look at the core tenets of data instrumentation and collection, you've got the concepts of speed, scale and simplicity, and historically you've only ever gotten to have two of those at any given time. And in order to increase one, or improve one, something else has to give. And so you look at that in the modern enterprise today, generally speaking you'll see scale you don't get to pick. Your organization is as big as it is and it's got the growth rate that it has. So if you've got 100,000 endpoints you can't do a whole lot about that. You can't make much of an impact in terms of reducing endpoint count. And if we're honest with ourselves, we know endpoint count is going to increase, right? So then you've got a balance between speed and simplicity. So how complex of an infrastructure do you want to build? How expensive? Let's call it that, because that's really what it amounts to. How expensive of an infrastructure are you willing to invest in, in order to instrument and collect all that data, to get as much speed as you can?

Egon Rinderer: And so, if you're going after information that has a very short value life, which is very common in the security space, then you're going to have to have significant infrastructure to be able to gather that data quickly enough to be meaningful. Otherwise by the time you're doing analysis on it, it's too late. You're driving by looking in your rear-view mirror at that point. And so what most companies do, you'll see whether it's their patching technology or whether it's their security technology or their compliance platform, whatever it is, each one of those platforms has its own infrastructure dedicated to making that platform work, to allowing some top end system to collect data from all of the subordinate systems in the organization. And it costs what it costs and it gets you data as quickly as it can. So in the patching world, oftentimes that can be measured in weeks, right? If there's patches that come out, I think the last statistics I've seen are 20 days to get 80 percent patched. Which is sort of the de facto standard, so people don't really bat an eye at it.

Egon Rinderer: The reality though, and this is really what we set out to do when we started Tanium, was we felt like there's a way to have all three. You can get real time at essentially infinite scale, infinite by today's standards and at least midterm future standards, with no infrastructure. You just have to go about it differently, because again, what we said, and sort of the core tenet of what our technology does, is leave the data at the point of production, and access it as though it's a large distributed database. Be able to ask a question, get an answer, take that answer to feed your next question and your next and your next, until you arrive at an actionable data point, at which you then pivot and take that action immediately. But let's do that whole process measured in seconds, rather than measured in days or weeks or whatever it takes via the traditional method. At the end of the day, it's just a communications model, but once you have that, now you can start applying that model to doing things that are, by today's standards, very pedestrian. So if it's patching or if it's compliance scanning or what have you, you can start doing those things measured in seconds or minutes, rather than days and weeks, right?

Egon Rinderer: There's nothing magic about a particular software vertical. Patching is patching, and compliance is compliance. It's just data. What you've got to do is look at applying different data access and different data instrumentation mechanisms to doing the things we've always done, because doing them the way we've been doing them historically is going to break very, very soon. And in a lot of cases, it's already breaking. And I would point to when we saw 80 percent of our workforce go remote, we lost visibility and control of a massive number of endpoints across the selection of companies and entities out there. And the problem is the systems that were in place, that use that legacy methodology, they don't know what they don't know. All they can report on is what they can see. And so the reports still look good, because, hey, the systems I can talk to I'm gathering data from. So we're in good shape.

Egon Rinderer: The reality is, if you have degraded capabilities to instrument and access data on endpoints when the context of that endpoint changes, when it leaves the perimeter, it goes from physical to virtual to Cloud to container, whatever, then you've got a real problem on your hands. And what we've said for a long time and what I think we're seeing come to fruition now, is look, you've got to be able to instrument the data and interact with it without having to first centralize it, so that you can then centralize only what's important. And make really highly accurate, really timely decisions on data that you have absolute confidence in.

Dave Bittner: Egon Rinderer is Global Vice President of Technology and Federal CTO at Tanium, the sponsor of this show. Our thanks to Don Welch from Penn State and Steve Winterfeld from Akamai for sharing their expertise. And to Tanium's Egon Rinderer for providing his insights and for sponsoring this program. CyberWire X is a production of the CyberWire and is proudly produced in Maryland at the start-up studios of DataTribe, where they're co-building the next generation of cyber security start-ups and technologies. Our senior producer is Jennifer Eiben, our executive editor is Peter Kilpe. I'm Dave Bittner. Thanks for listening.