The power of web data in cybersecurity.
Rick Howard: Hey, everybody. Welcome to "CyberWire-X," a series of specials where we highlight important security topics affecting security professionals worldwide. I'm Rick Howard, the chief security officer of N2K and the chief analyst and senior fellow at the CyberWire. On today's episode, my co-host Dave Bittner and I will be discussing the power of web data in cybersecurity - in other words, open-source intelligence. A program note - each "CyberWire-X" special features two segments. In the first part, we'll hear from an industry expert on the topic at hand. And in the second part, we'll hear from our show's sponsor for their point of view. When we come back, Dave and I will be joined at the CyberWire's Hash Table by two subject matter experts to tell us how they think about this kind of open-source intelligence. Come right back.
Rick Howard: Today, we are talking about the public web data domain, which is a fancy way to say that there is a lot of information sitting on websites around the world that is freely available to anybody who has the gumption to collect it and use it for some purpose. When you do that collection, intelligence groups typically refer to it as open-source intelligence, or OSINT. Intelligence groups have been conducting OSINT operations for over a century if you consider books and newspapers to be one source of this kind of information. When U.S. President Harry Truman signed into law the existence of the CIA, the Central Intelligence Agency, in the late 1940s, his idea was that he needed somebody to read the newspapers from around the world and summarize the important parts for a daily brief - OSINT.
Rick Howard: In the modern day, hackers conduct OSINT operations to recon their potential victims by collecting things like email addresses, personal information, IP addresses, software versions, network configurations and, if they're lucky, login credentials for websites and social media platforms. The general name for the tools that hackers use to perform these OSINT operations is scraper tools: automated scripts that scan the victim's website, looking for useful information. And I have to be honest here, when this topic came up, it had never occurred to me that the good guys could use this same kind of OSINT tool to improve the security posture of our organizations or maybe even help contribute to the bottom line of the business. So I asked my good friend Steve Winterfeld to come on to help me understand it. Steve is the advisory CISO for Akamai and a regular guest here at the CyberWire Hash Table. I asked him if he had any experience with these scraper tools for this purpose.
Steve Winterfeld: My first experience, really, with web scrapers was when I was still back at Nordstrom. You know, I wanted certain things to find me. I wanted anything that would amplify what we had to sell to find me. You know, Google wants to find me. And then there were other competitors that I would prefer they not be able to know necessarily what my prices were.
Rick Howard: What was the logic there? Was the impetus to make sure that your websites get seen by all the right search engines so you can sell more stuff? Or was the impetus to protect yourselves from bad guys trying to scrape your website?
Steve Winterfeld: So really, I thought in kind of three groups. I have the desired, which are people that are going to help customers find my products. The second was the frenemies, competitors, people that may be trying to resell mine. And then the final are the cybercriminals.
Rick Howard: OK.
Steve Winterfeld: And so when I think about how I prioritize those, No. 1 is, of course, a great customer experience. And so I want people to find my items. No. 2 is stopping the cybercriminals. And the two most common I think of is coming in and scraping the website itself and doing a mimic site to get people to try to log in to their account. And then when that fails, they log in a second time to the real account page. And finally, on competitors, do I want to feed them false data? Do I want to try to block them? And I don't know how much effort you're willing to put into that.
Rick Howard: Let's go back to the tool itself. We said scraper tool. Can you just talk about how those things worked in the general form? I've never written one or never used one. How do they work?
Steve Winterfeld: We put out a report from Akamai, our threat research group. We talked about the customers in the crosshairs. And we said, the No. 1 thing happening was account takeover, and that was at 42% of that activity. And 39% of activity were web scrapers. And in our analysis for the financial services, a lot of this was trying to come in and just pull an image, HTML image of the web page, so they could go mirror that and make that part of their phishing campaign.
Rick Howard: It's a manual process, or was it - it's a script?
Steve Winterfeld: No, it can be automated.
Rick Howard: Yeah.
Steve Winterfeld: It can be automated. And so it depends on your business model. If you're just attacking one major bank, you can do it manually. And typically for a phishing campaign, you know, you're tailoring it to different organizations, but you can set up a script that would do that for multiple organizations and automate it all.
Rick Howard: I was talking to Brandon Karpf - you know him, one of our operations guys here at the CyberWire - and he needed to do some low-end web scraping for a new product we're going to roll out. And he brought up ChatGPT and said, hey, write me a scraper that could grab URLs, images and other interesting things, and the interface spit out code that he could run as that scraper. That's amazing to me. Amazing.
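As a rough illustration of the kind of small scraper described here, the sketch below pulls link and image URLs out of one HTML page using only the Python standard library. It is not the code from the conversation: the class name `LinkAndImageCollector`, the helper `collect`, the sample HTML, and the `example.com` addresses are all made up for the example, and it parses an in-memory string instead of fetching a live site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageCollector(HTMLParser):
    """Collects absolute link and image URLs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Resolve relative paths against the page's own address.
        if tag == "a" and attr_map.get("href"):
            self.links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag == "img" and attr_map.get("src"):
            self.images.append(urljoin(self.base_url, attr_map["src"]))


def collect(html_text, base_url):
    """Return (links, images) found in html_text."""
    parser = LinkAndImageCollector(base_url)
    parser.feed(html_text)
    return parser.links, parser.images


# Hypothetical page content, standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="/products">Products</a>
  <a href="https://example.org/about">About</a>
  <img src="images/logo.png" alt="logo">
</body></html>
"""

links, images = collect(SAMPLE_HTML, "https://example.com/")
print(links)
print(images)
```

A real scraper would add the download step, error handling, and some politeness toward the site being read, which is where most of the extra work tends to go.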
Steve Winterfeld: There are a lot of very easy-to-use tools, you can go - some of them free, some of them paid for. Some may just go harvest emails.
Rick Howard: Yeah.
Steve Winterfeld: And generally, this is all legal. I think Australia's the only one I know that it's illegal to scrape emails. But generally, this is all legal, aboveboard. It can be done for valid marketing purposes, for commercial purposes. And so all of this, you know, depending on what your business model is, is completely legitimate.
Rick Howard: So let's talk about those use cases again, because you were running them down at the beginning of this. One reason you'd run a scraper is you might want to run it against a bad guy site, right? Because let's say you get a bad URL in a phishing message, and you're not sure whether or not it's malicious or just weird. So you could send your scraper out there to see what that was. That'd be one use case of this. Have you heard that before?
Steve Winterfeld: I think theoretically you could. I don't know that's the technique I would use to validate emails if I understood your use case. For me, this is more when I want to know what an organization is doing. I could go scrape that site every day and have another algorithm telling me when any price has changed.
Rick Howard: So competitive intelligence here is what this is for, right? That's what you're doing it for.
Steve Winterfeld: In a shopping model? Absolutely. Yeah, I'm looking for competitive intelligence. I'm looking for change and what new items are being sold, what prices have changed. If you're putting your prices lower, do I want to put my price lower? Do I want to match prices? And it just gives me a lot of that quick intelligence through an automated tool. I mean, this is the same thing - we used to have spiders or bots that were going out, just creating a map of the internet. Now we've stepped one step up from that, and scrapers are saying not only what is out there, but, let me pull it back in and analyze it through different techniques.
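The "scrape every day and flag what changed" idea can be sketched as a comparison of two price snapshots. This is a minimal sketch under assumptions: the function name `diff_prices`, the product names, and the prices (stored as whole cents) are invented for illustration, and the scraping step that would produce each snapshot is left out.

```python
def diff_prices(previous, current):
    """Compare two {item: price_in_cents} snapshots and report differences."""
    changes = {"new": {}, "removed": {}, "changed": {}}
    for item, price in current.items():
        if item not in previous:
            changes["new"][item] = price
        elif previous[item] != price:
            # Keep both values so the reader can see the direction of the move.
            changes["changed"][item] = (previous[item], price)
    for item, price in previous.items():
        if item not in current:
            changes["removed"][item] = price
    return changes


# Hypothetical snapshots from two consecutive daily scrapes.
yesterday = {"boots": 12000, "scarf": 3500, "jacket": 21000}
today = {"boots": 9900, "scarf": 3500, "gloves": 2500}

report = diff_prices(yesterday, today)
print(report)
```

Here the report would show one new item, one removed item, and one price change, which is the quick competitive signal described above.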
Rick Howard: Where you were mentioning prices, you know, in my experiences, I used to work for security vendors. And we never did that to my knowledge. But I could see us going out and scraping a competitor's website just to see what products they had and what they were naming them and how they were referring and how close it was to ours. I mean, I could see all that, too. But you brought up an interesting question. This is a lot of effort - right? - for getting all that done. And is there bang for the buck there, do you think, or is it just kind of navel gazing?
Steve Winterfeld: Let's say you're going to take a trip to Vegas. Are you going to go just straight to your hotel of choice and say, OK, I tend to stay in Marriotts, go log into Marriott and grab a hotel? Or are you going to go to one of these sites, Kayak or Expedia or one of these sites, and say, I want a hotel in Vegas. Well, those have to go scrape all the different hotels in Vegas and bring those prices back. And now so if I'm in a hotel in Vegas, I want to make sure they're pulling my information. And so I want to optimize them pulling mine and getting the data correct and booking a hotel for me. So in this case, not only is it a great business model, it's something that the people you're scraping from are going to try to optimize to make it easy to pull their data.
Rick Howard: So that's the biz intelligence case. What about improving your security posture if you point these things at your own web infrastructure? I could see us using the scraper tool that would find exposed PII that we didn't know was exposed before, because of just the way it presents the information. Is that a valid use case?
Steve Winterfeld: Potentially, yeah. I don't know a lot of people that are using it for that technique. And like I said, there are so many security capabilities. But that would be a valid model. And then, when you get into, how do you block these, you know, some is just looking for automated tools, looking for those bots. And you can block them. You can slow it down, kind of that tar pit thing. You can feed them false data. You can put a captcha in there. There are some things you can do in design that automatically block or slow down or make it difficult to have bots run through. But again, if, you know, if you're a hotel or a retailer, you don't want to design anything that's going to make it difficult for people to interact with you.
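One simple way to "slow down" automated clients is a per-client request limit. The sketch below is a sliding-window limiter, offered as an assumption about how such a control might look and not as a description of any vendor's bot management. The class name `SlidingWindowLimiter`, the client labels, and the limits are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> recent request times

    def allow(self, client_id, now):
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller could block, delay, or show a captcha
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)

# A hypothetical bot sending four requests in three seconds.
burst = [limiter.allow("bot-1", t) for t in (0, 1, 2, 3)]
# A different client is counted separately.
other = limiter.allow("shopper-1", 3)
# Once the first request ages out, the bot is allowed again.
later = limiter.allow("bot-1", 10)
print(burst, other, later)
```

The trade-off raised in the conversation still applies: limits tight enough to frustrate a scraper can also frustrate a legitimate customer or a desired search engine.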
Rick Howard: One of the things you were talking about there was something called a scraping switch or a scraping shield, and it's basically text on your web page that says, don't scan me. And it's a gentleman's handshake. There's no enforcement there. It's just letting the search engines know, we don't want you to scan this web page.
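One long-standing convention that works as an unenforced "please don't crawl this" notice is the robots.txt file. Whether that is exactly the mechanism meant here is an assumption. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would read such a file. The rules, the `example.com` URLs, and the `HypotheticalPriceBot` name are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: welcome one search engine, keep others off /prices/.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /prices/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

search_engine_ok = parser.can_fetch("Googlebot", "https://example.com/prices/today")
scraper_ok = parser.can_fetch("HypotheticalPriceBot", "https://example.com/prices/today")
scraper_about_ok = parser.can_fetch("HypotheticalPriceBot", "https://example.com/about")

print(search_engine_ok, scraper_ok, scraper_about_ok)
```

Nothing in the file stops a crawler that chooses to ignore it, which matches the "handshake" description: it only works for bots that check the rules and honor them.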
Steve Winterfeld: And then there are some things you can do in design that, you know, if a lot of these bots are standardized, based on assuming you're following standard protocols, you can change some of those protocols to make it less successful. And I apologize for this upfront. Akamai - one of the things we do is detect web scraping with our capabilities and give you the choice on how you want to deal with it. Some of your typical web application and API protections will have this as a feature within there, kind of that bot management capability.
Rick Howard: How would you describe Akamai, a content manager in the cloud?
Steve Winterfeld: So Akamai makes internet go faster, so we do that content distribution.
Rick Howard: So a service or subscription you could buy from a content provider like Akamai is to stop web scraping if I wanted to do it, or stop it for everybody except for Google and other search engines, something like that.
Steve Winterfeld: Correct.
Rick Howard: And then there's other ways you could do it. We were talking about writing your own scraper if you're just trying to do a down and dirty tool. And then there's also third-party tools, commercial tools that do the scraping for you and then present the information in some intelligent way. I'm starting to hear more people use it. Has that been your experience, too?
Steve Winterfeld: Well, yeah. And I think it again goes back to that market of what you're looking for. Retail and hospitality, I think it's becoming fairly common because you do want to make sure people are getting your information, and others where your information may be proprietary, but it still needs to be public facing. Then you see a lot more of the defensive tools, but it's very much a legitimate information gathering. And I think the advantage of a lot of the tools is that post-processing. What are you doing with it after you scrape it off? How are you analyzing it? Because that's where the bulk of the code is in my mind.
Rick Howard: Yeah, I think that's where the commercial versions would come in handy, right? Because they're going to spend some time making that look good and making it useful. If I was writing a scraper myself, I would have none of those skills, right? And so we might scrape a lot of data and then it would sit in some database forever. So bottom line here, Steve, would this be something you'd be talking to CISOs about that they should consider? Or is it pretty much down on your priority list?
Steve Winterfeld: So I think this is a subset of the conversation around bots. We have so much automated activity as we moved to APIs that it's just heightening the amount of automated activity we're having. And so first, how are you getting situational awareness of what the bots are doing? And scraping is one activity. Once you kind of have situational awareness of all the different bots, then putting them in those three categories, those three buckets - desired, frenemy and enemy or cybercriminals - and then having a strategy for each one of those, from optimization all the way to mitigation.
Rick Howard: There's some good stuff in here, Steve. You brought up several things here that I hadn't considered before. The three-bucket idea is a great way to frame the discussion. Next up is Dave Bittner's conversation with Or Lenchner, the CEO of Bright Data.
Or Lenchner: Web data is actually what we're doing right now, and that's also web data. We're recording a podcast which will be available online. What it means is that we have this huge, massive database, probably the largest one in the history of humanity. It's bigger than all of the books in the world, our DNA, whatever. Everything is online. Everything is online. The data on the web is measured in zettabytes today. That means a number with 21 zeroes after it. That's like trying to imagine the size of the universe.
Or Lenchner: So everything that you see on the web is web data. It also has - or at least we differentiate between two types of data, public and non-public. So non-public can be the emails and text that we had prior to this recording, for example. Or even more than that, it can be content that the user intended to be private. For example, content that you can see only if you log in to a certain web page. The other part is public web data, and that's what we're doing at Bright Data. This is practically everything that you can see without doing any login. It's the prices of the products. It's the news that you read. It's the ads that you see, everything. So that's web data.
Dave Bittner: So to what degree does this overlap with or relate to this notion of threat intelligence and cybersecurity?
Or Lenchner: Yeah. So cybersecurity, it all happens online, which means that it all somehow relates to the web. And as the web is structured from data, everything eventually gets back to - you know, in the most fundamental way, to data. Now, a lot of the areas in cybersecurity are not public. We're talking about areas that - you know, again, the example that I gave, like an email client that you're trying to hack or do a phishing attack or something like that. That's not public. But I think that you'll be amazed by the things that you can find in the public web that can help a cybersecurity business to operate better.
Dave Bittner: Well, let's go through that together. I mean, what are some of the main things that folks can benefit from when it comes to securing themselves using this sort of information?
Or Lenchner: Sure. So everything that I'll share is obviously real use cases of real customers. And we serve the largest cybersecurity companies in the world but not just that - also, you know, quote-unquote, "regular companies" that have a cybersecurity department in them. And I'll give a few examples. So I think that the most, maybe, easy to understand use case is those companies that are using us to be perceived as a real victim. And I'll explain.
Or Lenchner: So when you want to investigate if an online content - and, again, I'm talking about only on public web data - is malicious in a way - it can be, you know, a link that will take you into a phishing page or some ads that eventually will try to inject some malware into your device. When you want to do that, you don't want to do that looking as the investigator. You know, those bad guys, the hackers that are creating those malwares - they're pretty sophisticated. If they will think that they're being watched by someone who's looking to find that malware, they know how to look naive. They know how to show you the real legitimate content that - let's say again, for the example - that ad is talking about.
Or Lenchner: So here you got a challenge. You need to be able to click on the ad or to investigate the URL and all of the following URLs and redirect in a way that will look as a real potential victim. With our very, very large, probably the largest proxy network in the world - that's one of the products that we have in Bright Data - you're able to identify yourself as a real user and not as someone that is using, you know, an IP address coming from a large data center. And that's one crucial parameter that you need to make sure that you're doing or using in order to be - you know, to have the characteristics of a real user. And we have really the largest security companies, social media networks, operating systems sometimes that are using this proxy infrastructure to protect their users. That's one example.
Dave Bittner: You know, you mentioned the - just the vast amount of data that's out there. How do you go about making that useful, making that actionable? How do you filter the signal from the noise?
Or Lenchner: Yeah. Well, that's - we see a lot of that. So in the most general way - and then, again, I'll share a specific use case. Everything good and bad is out there. You'll be amazed by the things that you can find in the open public web - not talking about dark web or anything like that. I'm talking about, you know, standard, for example, classified boards that everyone is using that hide inside of them, without even knowing, very, very bad stuff.
Or Lenchner: Now, there are some amazing companies in the world that have this amazing, innovative technology to scan through all of these records and try to find those, you know, anomalies that can suggest that maybe this is a threat. But the first thing that they need to do to - before analyzing all these records is to be able to extract the records. And this is - again, this is one service that we're giving with our data collection tools - just allowing our customers to extract and scrape this public data at huge scale. After they got that in the - you know, in the raw form of the data, changing it from this unstructured data on an HTML page and structuring it into a table that the machine can read, then their machine learning algorithms and AI sometimes can find these threats that are hiding in there.
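The step of turning unstructured HTML into a table a machine can read can be sketched with the Python standard library. This is a small sketch under assumptions and not Bright Data's tooling: the page layout (a `div` with class `listing` holding `span` elements with classes `title` and `price`), the class name `ListingParser`, the helper `to_csv`, and the sample listings are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Turns <div class="listing"> blocks into one dict per listing."""

    FIELDS = ("title", "price")

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # listing currently being read, if any
        self._field = None  # field whose text is expected next

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "listing":
            self._row = {}
        elif tag == "span" and self._row is not None and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ListingParser.FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical classified-board markup.
SAMPLE_HTML = """
<div class="listing"><span class="title">Used bike</span><span class="price">$120</span></div>
<div class="listing"><span class="title">Desk lamp</span><span class="price">$15</span></div>
"""

listing_parser = ListingParser()
listing_parser.feed(SAMPLE_HTML)
table = to_csv(listing_parser.rows)
print(table)
```

Once the records are in rows and columns like this, the anomaly detection described above has something consistent to work on. Real pages are far messier, so production extractors need per-site rules and much more validation.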
Or Lenchner: And - but it's not just the scary things - you know, the malware and phishing and things like that. We have large brand protection companies using us, and they need web data to protect brands. So it's not always something that will, you know, lock your computer and ask for a ransom or inject a virus into your computer. Sometimes it's a brand - let's say a fashion brand, for example - that is being literally abused by fake - people who sell fake products of that brand online or partners and resellers that are not accepting and respecting the brand guidelines and selling it underpriced and things like that. So there's many, many dimensions to cybersecurity. We see that web data also serves the more commercial side, such as brand protection, not just the - like, the pure cybersecurity side of that.
Dave Bittner: Yeah. I mean, it strikes me that one of the issues here is just the vast scale of data that's out there and that your average organization simply doesn't have the resources to gather information in the way that a specialized organization such as yours can do.
Or Lenchner: Exactly. I mean, we're talking about today - and it's growing all the time - roughly 14 billion daily requests that are going on top of our platform every single day. And it's not slowing down. It's growing every month. This is massive scale. You know, you can find gold inside this data. You know, if you are a cybersecurity company, you'll find what you need in that data. If you're an e-commerce brand doing something completely else, you'll find things hidden in this data - and, you know, 30 other industries that we're serving. But that's exactly the issue. You know, again, great companies with great talent, really innovating and - are able to find that one single threat in mountains of data. But first, they need to collect and organize these mountains of data. That's what we're helping them to do.
Dave Bittner: And what is the ideal use case here? I mean, does an organization need to be a certain size? Or do they need to have a certain level of maturity where they're in the right position to make use of the - of this type of data?
Or Lenchner: Not really, because the product range is so wide. So practically anyone can use it. If you're, like, a very talented cybersecurity researcher with high engineering skills or if you just need, you know, an Excel sheet with all of the data, you can get everything in between. And we see that also, you know, in the sizes of the companies. For example, we have the - one of the largest banks in North America is using us to both try and search for specific threats against the bank on the public web, but also to use our proxy platform to pen test - to run penetration tests on its own proprietary tools that they're developing to protect their own users when their own users are logging in to their bank accounts online and things like that.
Or Lenchner: On the other hand, we have, you know, a 10 - a team of 10 employees that just raised seed funds for their new cybersecurity startup. They can also work with us. So, you know, as long as what you need to operate is data, you just need to be able to collect it. And then, you know, you can do the most amazing things with it. By the way, we were never able as a company - as a data company - to think on a use case 'cause we're only focused on data collection. Then you have this group of talented young people that are building a new startup and came up with this new use case that we never imagined. But it's all the same data. It's all the same data.
Dave Bittner: I would imagine that for a lot of organizations, the first time that they see the sorts of data that you collect, that must be an eye-opening experience to the degree that they didn't know what they didn't know.
Or Lenchner: Oh, definitely. Sometimes they have - they think they know. And they need to validate their theory with data. But, you know, in some instances, we just talk with them. They come up with one idea. That's fine. We give them the data. But then we can help because - you know, we serve, like, 15,000 customers. So we can tell them, hey, we have an interesting company. Maybe you should talk with them if they'll want to. And they're doing something similar. And then they say, oh, my God, with the same data, this is what they're doing? That's unbelievable. So definitely. And we're always surprised also - again, the most amazing things that you can do with one data set, you know? You can just take one standard data set, you'll find 30 different use cases that can - from each, you can build an amazing business.
Rick Howard: We'd like to thank our interview guests Or Lenchner, the CEO of Bright Data, and Steve Winterfeld, the Advisory CISO at Akamai, for helping us think about open-source intelligence. And finally, we'd like to thank Bright Data for sponsoring the show. "CyberWire-X" is a production of the CyberWire and is proudly produced in Maryland at the startup studios of DataTribe, where they are co-building the next generation of cybersecurity startups and technologies. Our senior producer is Jennifer Eiben. Our executive editor is Peter Kilpe. And on behalf of my colleague Dave Bittner, this is Rick Howard, signing off. Thanks for listening.