CSO Perspectives (Pro) 8.9.21
Ep 54 | 8.9.21

Enterprise backups around the Hash Table.


Rick Howard: In our last episode, I described a personal backup and recovery disaster that I barely escaped. And again, I'd like to publicly thank the Best Buy Geek Squad for saving my backside and recovering 20 years of my precious family data.

Commercial: Join the over 5 million already saved. Geek Squad 24 Hour Computer Support Task Force. Call 1-800-geek-squad before it's too late.

Rick Howard: And by the way, I went down a YouTube rabbit hole after a listener turned me on to an entire thread of Geek Squad schemes that were happening about the same time as my personal disaster.

Rick Howard: And we've talked about the dark web on this show before, but this was a very different kind of dark web. Those people were angry, and a little frightening. All I can say is that for my particular experience, the Geek Squad was fabulous. And we'll just leave it at that.

Rick Howard: But we also discussed how there are three different strategies that security practitioners might use to back up their material data and keep it safe: a centralized approach, a decentralized approach, and a DevSecOps approach.

Rick Howard: I came to the conclusion in the last episode that most of us would be using some combination of all three of those, but it's been a while since I had to manage this particular resiliency function at scale. It's a good thing that I can call a couple of subject matter experts to the CyberWire Hash Table to help me out.

Rick Howard: My name is Rick Howard. You are listening to CSO Perspectives, my podcast about the ideas, strategies, and technologies that senior security executives wrestle with on a daily basis. 

Rick Howard: Joining us at the CyberWire Hash Table today are Jerry Archer, the longstanding Sallie Mae CSO and veteran Hash Table member, and Jaclyn Miller, the NTT CISO, and this is her first appearance at the Hash Table, so I'm grateful that she's agreed to join us. 

Rick Howard: I started out by asking both of them about who owns the backup responsibility in most organizations. Both Jerry and Jaclyn use the acronyms, BC and BCP for business continuity plans, and DR and DRP for disaster recovery plans. So listen for those as we go through this. 

Rick Howard: Jerry, you and I are about the same age. When we were closer to the beginning of our career, say the 1990s or the early 2000s, the IT folks generally handled the backup systems as part of their administrative responsibilities. They got guidance from the business continuity and disaster recovery teams, but CSOs didn't own the responsibility. 

Rick Howard: With ransomware having a moment now, has that shifted the thinking? Is this becoming more of a cybersecurity thing or does it still fall to the business continuity and disaster recovery teams? 

Jerry Archer: In my world, I own the business continuity program for the company. But from a disaster recovery perspective, the infrastructure group owns DR. Now we monitor DR and we test DR as part of our normal testing routine, in conjunction with the infrastructure group, but they own the technology around disaster recovery. They figure out the architecture they want. And again, as a security group, we have input into that and we bless it. But they in fact have to implement it. 

Jerry Archer: In our case, we have multiple availability zones. We have multiple regions in AWS and so forth and so on. So we all agreed that that's the best way to do DR. From there, my group does the business continuity planning. So we do all the business impact assessments, all the business recovery plans, and then we work with the business units or, in our case, it's significant process owners to actually test those each and every year. 

Jaclyn Miller: I think it's largely still within the business continuity and disaster recovery programs. But the idea of shared responsibilities around backups and disaster recovery technologies has definitely shifted. 

Jaclyn Miller: I think we're moving away from a centralized model. Back in the day, we used to have a backup team. They would rotate tapes, and sometimes they'd work with the data center teams, or be part of the other security operations, but you had a centralized group that worked on backup technology.

Jaclyn Miller: Now, backup technology is built into other types of technology that are managed, so we're seeing a more holistic IT engineer or cloud engineer that deals with backups. And from the cybersecurity standpoint, we're not responsible for the backup technology, but we are responsible for hooking into it. So thinking about automation and SOAR platforms, those integrations have to be built, and that requires, you know, collaboration and obviously giving up some control from that central model where the backup team, you know, has control over all of the backup technologies. I think that concept is really breaking up. There's more of a shared responsibility matrix going forward.

Rick Howard: Both Jaclyn and Jerry talked about business continuity and disaster recovery as two separate and distinct things. I asked them to describe the difference. 

Jaclyn Miller: I'll start with what a disaster recovery plan is to me. For me, that's focused on a specific landscape or application, or maybe even a type of technology that we're running. It is the technical execution of recovering that information asset. It isn't necessarily how the business operates around that. You may have an asset that is so critical to the business that is going to have wide reaching impacts on how the business operates, but that's not always the case. And I think modularizing your technology recovery strategies really helps to make sure that you're getting down to the level of granular detail you need in a crisis when there's a lot going on. And maybe you need to do multiple types of recoveries in parallel or in sequence. 

Jaclyn Miller: A BCP is the wider strategy about what happens when our business is impacted. How do we operate? How do we communicate with our clients? Who makes the decisions? Appended to the BCP are all of those DRPs, the DR plans, um, that we may need to execute, but depending on the type of impact, we don't always need to execute every single DR plan.

Jerry Archer: Well, they are two separate things that clearly feed into each other. The business decides what the impact of losing the technology would be. So we work with the business to say, okay, here's the impact if you lost your technology, whatever that technology piece might be for that particular business process. What do you need from a recovery perspective? How long should it take? What kind of recovery point objective might you have? Uh, what's your RTO minimums? Those kinds of things. So once we establish that, then that establishes the criteria for building a disaster recovery architecture.

Rick Howard: So your responsibility is the planning and assessment of where all the weaknesses are, and then the disaster recovery folks, uh, design and deploy the system to protect against it. Is that right? 

Jerry Archer: Right. And then, for example, there's a recovery plan, which says, given this disaster recovery architecture, how does the business go about recovering? Is there anything in particular that they would have to do to recover from an outage? For example, if the recovery point objective is, say, one hour, then they have to go through some process to recover that one hour's worth of whatever work they were doing, or it's okay to let it go. One way or the other. So, for example, if they need all kinds of transactional logs to go back and piece together transactions for that one hour, then that would be part of their recovery plan.
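
Jerry's one-hour example can be sketched roughly in Python. This is only an illustration of the idea, not anything from Sallie Mae's actual tooling: the function names, the log format, and the sample timestamps are all hypothetical. It measures how much work sits between the last recovery point and the outage, checks that gap against the recovery point objective, and picks out the logged transactions that a recovery plan would need to piece back together.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point, outage_time):
    """Return how much work falls between the last recovery point and the outage."""
    if outage_time < last_recovery_point:
        raise ValueError("outage cannot precede the last recovery point")
    return outage_time - last_recovery_point


def transactions_to_replay(transaction_log, last_recovery_point, outage_time):
    """Select logged transactions made after the recovery point, up to the outage."""
    return [
        entry for entry in transaction_log
        if last_recovery_point < entry["at"] <= outage_time
    ]


# Hypothetical sample data: a one-hour RPO and three logged transactions.
rpo = timedelta(hours=1)
last_recovery_point = datetime(2021, 8, 9, 9, 0)
outage_time = datetime(2021, 8, 9, 9, 45)
transaction_log = [
    {"id": "t1", "at": datetime(2021, 8, 9, 8, 50)},  # already in the recovery point
    {"id": "t2", "at": datetime(2021, 8, 9, 9, 10)},  # lost, needs replay
    {"id": "t3", "at": datetime(2021, 8, 9, 9, 40)},  # lost, needs replay
]

window = data_loss_window(last_recovery_point, outage_time)
within_rpo = window <= rpo
replay_ids = [
    entry["id"]
    for entry in transactions_to_replay(transaction_log, last_recovery_point, outage_time)
]

print("Loss window:", window, "| within RPO:", within_rpo, "| replay:", replay_ids)
```

Whether the business replays those transactions or decides it is okay to let them go is the decision Jerry says belongs in the recovery plan; the code only shows what is at stake in that window.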

Rick Howard: As I said in the last episode, probably the most important component to an enterprise backup program is to test it to see if it works. The last thing you want to do during a crisis is discover that you don't really know how to restore your lost data in a timely manner, because you never tried to do it before. This is way more complicated to do at an enterprise level than it is to do on your home systems. And if you're going to be thorough about it, you need to take the time to run your executive team through various crisis scenarios to see how they react.

Rick Howard: I've been doing this kind of thing for a long time. And I always learn something new when I hand a crisis situation to an executive team. Of course, I try to anticipate what they're going to do and say, but you just never know until you all get into a room and talk about it. These executive exercises don't have to be that complicated either. You don't have to spend an entire week running the executive team through a crisis drill. You can just schedule an hour-long lunch meeting with the purpose of discussing a specific scenario. Now, a little advice: it's very important that you provide the food, because executives are very busy and you can use the idea of a free sandwich to incentivize their participation.

Jerry Archer: We do bi-annual disaster recovery tests. We don't do all of the environment all at once, but what we will do is significant portions of it twice a year to cover the entire environment. When I say recover, we'll go to a different region and recover every single application, but we won't bring up production in that region, except for some chunk of it, because we don't want to take too much risk. But we spin it all up, and then we validate that at least the section we're worried about is functioning correctly. And then we switch back to the other region. We do that twice a year, so that we're testing everything.

Rick Howard: So is that a push-button kind of thing, like in DevOps, you know, infrastructure as code, or is it a manual checklist and, uh, you know, the team is going through the list of things that had to be moved and turned on and checked? How are you guys doing that?

Jerry Archer: Well, it's pretty much automatic, right? I mean, there's, there's a fair amount of automation, but there's still, you just can't hope that it all works. Right. Right. 

Rick Howard: No, I mean, yeah. Clearly you have to check it. 

Jerry Archer: Yeah. There's still a need for that human in the loop somewhere to say, "yeah. It's all running, right?"

Rick Howard: Yeah. That new thing that we just brought up, is it actually doing what we thought it would do? Yeah.

Jerry Archer: Right. Look, when that happens, it's all hands on deck and everybody's watching everything and making sure that it goes correctly, but for the most part, you know, that's fairly automated. It, you know, and bringing it all up in a different region.

Rick Howard: You guys regularly conduct business continuity exercises with the senior leadership team, right? What does that look like when you do it? How does that work? 

Jerry Archer: We have 64 business resumption plans, BIs and BRPs, the business impact and business resumption plans. Every one of those gets tested every single year. In addition to that, what we generally do is we run cyber exercises where we bring in all the senior executives and do independent, or independently facilitated, cyber exercises where you begin to plan around how you would execute in a ransomware attack, other kinds of outages, malware, you know, any kind of thing that brings your environment down. How do you recover? You know, what does that mean to the businesses? So, yeah, we do those twice a year as well.

Rick Howard: Are you driving that from the bottom up, or do the senior leaders really appreciate that that's what's going on, and are they eager to participate in those exercises?

Jerry Archer: Well, I don't think there's a CEO that is willing to give up like two hours of his day to go sit in a room and recover. But I mean, clearly, you know, a CEO has huge priorities, right? So nobody relishes it, but everyone recognizes the need to do it, and they all do it. In our case, there's a regulatory requirement as well that backs it up. But I would tell you that the support we have is a top-down support. Our CEO and all of our executive vice presidents recognize the need to do these kinds of things so that we're prepared in case something happens.

Jerry Archer: We run a general one, which brings in subject matter expertise in all areas of the company, and then typically what we'll do is we'll run one executive session where we divorce the technology from the equation. I no longer need a cybersecurity expert in the room. I need decisions being made by EVPs and CEOs and boards of directors and things like that. So, what we try to do is level it up to, you know, the 60,000-foot level so that we now get our executives contending with some of the extraordinarily difficult questions that they have to answer.

Rick Howard: Every time I've run one of those with executives, we learned something new that we didn't know about before. How they think about it, what they are worried about. When you did the last ransomware scenario, is there anything specific you guys learned that you hadn't thought of before you had the exercise?

Jerry Archer: When you have in the room the chief operating officer, the CIO, and the CSO, a discussion ensues around what happens if ransomware were effective at what it was doing, and then honestly you start having these conversations. Can we recover? How quickly? Would we ever consider paying? What about law enforcement engagement? How do we do that? What would it mean if that happened? You have to accept how law enforcement would be involved in those situations as well. So if there's any learning, it's that kind of awareness that people start to get around. Okay, we have the FBI involved. We have ransomware. Are we gonna pay? Can we recover? If we recover, what would be, you know, the kind of losses we would have from that recovery? That kind of stuff. Those are all really, really tough questions, and it's not so simple just to say, oh yeah, don't pay the ransom ever. Right? It becomes a question, and clearly you can create scenarios, right?

Jerry Archer: You create these edge cases to force people into thinking in a way that gets them oriented to, you know, in the real world, all kinds of bad things happen to you and the best laid plans, you're never going to execute right out of the book.

Jaclyn Miller: We do the technical failovers, but we also run through business continuity. And one thing we've started to do is really join our BCP tabletop test with our cyber tabletop test. Over the last couple of years, we've done both our ransomware scenario testing and our BCP, you know, included as part of the scenario where we included failovers, to help our executives understand the types of impacts and the strategies that we can deploy to recover from ransomware, both for ourselves and for our clients, when the worst does happen.

Rick Howard: When you did the ransomware tabletops, did you learn anything that you didn't know about before the exercise happened?

Jaclyn Miller: We learn new information every time we run ransomware tabletops, and sometimes we get lucky enough to participate in our clients' ransomware tabletops as well.

Rick Howard: Oh, interesting. 

Jaclyn Miller: So it's great being a fly on the wall for multiple of these. I think one of the things is that, you know, there's a speed to ransomware. Executives don't necessarily understand how fast, and also how slow, it moves across the network, and how you end up relying upon and speaking with your network team when ransomware is moving across the environment.

Jaclyn Miller: Once it's hit every asset and they start to go offline, then it becomes more of a DR exercise. But there's a lot of prerequisites that happen ahead of time. So many of the executives think that they can just pull the trigger and say, execute BCP or DR, you know, 1, 2, 3, and that you can recover. But if you do that without sufficient preparation, without network isolation, then you just end up reinfecting and redoing the ransomware tabletop again, or the incident.
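
Jaclyn's point about prerequisites can be pictured as a simple gate in front of the "execute DR" button. The Python sketch below is a minimal illustration under stated assumptions: network isolation is the one prerequisite she names, while the other two checks (`malware_contained` and `backups_verified_clean`) and all of the function and class names are hypothetical stand-ins for whatever preparation a given organization requires.

```python
from dataclasses import dataclass


@dataclass
class RecoveryReadiness:
    """Hypothetical checklist of what must be true before restoring anything."""
    network_isolated: bool        # the prerequisite named in the discussion
    malware_contained: bool       # assumed example, not from the discussion
    backups_verified_clean: bool  # assumed example, not from the discussion


def blocking_prerequisites(readiness):
    """Return the names of the checklist items that are not yet satisfied."""
    return [name for name, done in vars(readiness).items() if not done]


def can_start_restore(readiness):
    """Allow the DR plan to run only when nothing is blocking it."""
    return not blocking_prerequisites(readiness)


# Example: the team wants to restore, but the network is not isolated yet.
status = RecoveryReadiness(
    network_isolated=False,
    malware_contained=True,
    backups_verified_clean=True,
)
blockers = blocking_prerequisites(status)
ready = can_start_restore(status)

print("Ready to restore:", ready, "| blocked by:", blockers)
```

Restoring while a blocker remains is the situation she describes, where the recovered systems simply get reinfected.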

Jaclyn Miller: So I think that's a frequent learning that comes out of it. And it's always interesting to see how people expect the taxonomy of a ransomware attack to present itself, versus how it actually ends up working out when we do the exercise. 

Rick Howard: For those exercises, does the topic of paying the ransom come up, either with your clients or within NTT? Is that something on the table? We've all said, no, we won't do that ever. Or what's the general thought?

Jaclyn Miller: Uh, I think it is still a topic that gets tap danced around. From a corporate policy standpoint, the answer is often, no, we won't pay it. But when the rubber meets the road, I think a CEO and a CFO are going to reserve the right to reevaluate that position and that decision, depending on the impact. So it's one of those things where it's like, uh, it's like having kids, you can only know it when you do it.

Rick Howard: That's a great analogy. That's so perfect. 

Rick Howard: While I was talking to Jerry, he outlined this evolution of cloud adoption that I thought captured exactly where the industry is going. He talks about it in terms of generations: one, two, and three. The first generation was infrastructure as a service, where we all just lifted and shifted our data center workloads to a cloud environment somewhere. The second generation was the adoption of serverless functions. Sallie Mae uses AWS, so Jerry refers to them as Lambda functions, but the generic term is serverless functions. He also refers to a key serverless function metric called MIPS, which stands for how many millions of instructions per second your computer can process. And then the third generation, or gen three as Jerry puts it, is almost a complete reliance on SaaS applications like SAP and Salesforce, and using serverless functions to collect telemetry and metrics.

Jerry Archer: We are actually what I'd call gen three cloud. Generation one cloud was infrastructure as a service. You pick everything up and you move it to EC2 instances and you run all your stuff in EC2. In generation two, you divorce yourself from the infrastructure model, and you move to Lambda functions. So now all you're worried about is MIPS. You don't know where it runs. You don't know anything else about it. You just know that it's executing. Right? That's basically Lambda. That's gen two. Now you go to the next version, which is what I call gen three. In a gen three world, you're trying to divorce yourself from all of that software development environment, and you're now running big chunks. So, for example, Salesforce, Adobe, other things like that, where you're building your environment based upon huge services that you're buying from a third party.

Jerry Archer: In our case, we use Salesforce quite a bit, and then we use a lot of services within Salesforce. And now you put together your system with big chunks. You've essentially divorced yourself from a lot of the app dev world. Now we still have a few EC2 instances. We still have a lot of Lambda functions, but we're now moving more and more into what I call gen three cloud. 

Rick Howard: Well, I will piggyback on that. You know, at the CyberWire, we're just a small startup, right? Most of the services we get are from some 25 different SaaS applications like Salesforce and SAP and those kinds of things. So I totally agree with that. The old school approach, though, is we don't trust the multiple providers, but I totally agree with you that that's really old school thinking.

Jerry Archer: Well, there's always going to be a need for COBOL programmers, right?

Rick Howard: I can program in COBOL. I've done my, my time in COBOL, right? 

Jerry Archer: I'm pretty. I'm pretty sure once we die, there will be nobody left that knows what COBOL is. 

Rick Howard: That's right. Who could even spell it anymore?

Rick Howard: Jaclyn and Jerry are two seasoned security executives. They understand how complicated it is to establish a robust backup program in any organization. I've mentioned this before on the show, but in one of my old army units, back in the day, we gave memorial plaques to soldiers who were transferring out to new assignments. It was a nice wooden plaque with the unit's crest attached and the outgoing soldier's name and a word of thanks for their hard work engraved. We also engraved the unit's unofficial Latin motto, "Nihil Facile est," rough translation: nothing is easy.

Rick Howard: I'm sure my army bosses wouldn't have appreciated that translation if they knew what the Latin meant, but I think it applies here. It's not easy setting up a backup program, but it's essential to improving your organization's resilience capability. And getting back to our InfoSec first principle barbecue pit in our backyard, the backup brick sits right next to the encryption brick, and both add strength to the baseline resiliency brick that we've been talking about through this entire podcast series.

Rick Howard: And that's a wrap. Next week, we are going to talk about orchestrating the security stack across all of your data islands. You don't want to miss that, but as always, if you agree or disagree with anything I've said or anything our guests have said on this episode, hit me up on LinkedIn or Twitter and we can continue the conversation there.

Rick Howard: The CyberWire's CSO Perspectives is edited by John Petrik and executive produced by Peter Kilpe. Our theme song is by Blue Dot Sessions, remixed by the insanely talented Elliott Peltzman, who also does the show's mixing, sound design, and original score. And, I am Rick Howard. Thanks for listening.