An introduction to this article appeared in the monthly Creating Connections newsletter put together by the women of The CyberWire. This is a guest-written article. The views and opinions expressed in this article are those of the authors, not necessarily the CyberWire, Inc.
Misinformation risk and the security community.
The race for vulnerability discovery is an ongoing challenge for cyber defenders. There have been a variety of tools, technologies, and techniques introduced to efficiently identify a system’s weaknesses as well as to prevent other security holes from opening in the future. Vulnerability management arguably stands at the crux of all cybersecurity problems. It drives the security community to participate in open communication about our attack surfaces.
Cyber Threat Intelligence.
Cyber Threat Intelligence (CTI) is public information about security-related events, vulnerabilities, and exploits. It’s the way the security community keeps itself informed about new attacks. CTI can come in a variety of forms, but it’s most often consumed through social media, news sites, and web forums. Many members of the security community, from single SOC analysts to entire corporations, find such information useful for integrating knowledge about new vulnerabilities into the workplace. The upside and downside of CTI are, however, the same: there’s a vast amount of it online. The good news is the plethora of useful information; the bad news is that analysts cannot parse all of the available CTI into knowledge.
Combining Artificial Intelligence (AI) with traditional cybersecurity methods has shown a path forward to stronger cyber defense systems that can better identify vulnerabilities and mitigate attacks.
In particular, AI technologies such as word embeddings have made it possible to parse and extract meaning from large amounts of online CTI. A single analyst might be able to research ten CTI artifacts a day, whereas an AI-based cyber defense system trained to crawl the web for CTI articles and extract insights from the crawled data can manage a hundred thousand a day. The cyber defense system learns about cybersecurity vulnerabilities and exploits from a corpus of CTI, that is, a text collection of CTI articles. In machine learning terms, this corpus is known as a training set. After training, the model can apply its accumulated knowledge to cybersecurity-specific tasks, just as a human analyst would.
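To make the idea of word embeddings concrete, here is a minimal sketch. It is not the system described in this article: it swaps learned embeddings for plain co-occurrence counts, and the three-sentence corpus, the function names, and the window size are all assumptions chosen for illustration. The point it shows is that words used in similar contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(docs, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus standing in for crawled CTI articles.
toy_corpus = [
    "ransomware encrypts files on the infected host",
    "malware encrypts files on the compromised host",
    "the vendor released a patch for the vulnerability",
]

vecs = cooccurrence_vectors(toy_corpus)
sim_related = cosine(vecs["ransomware"], vecs["malware"])
sim_unrelated = cosine(vecs["ransomware"], vecs["patch"])
print(sim_related > sim_unrelated)
```

In this toy corpus, "ransomware" and "malware" share the same neighbors, so they score as more similar than "ransomware" and "patch". A production system would more likely use dense, learned embeddings trained on a far larger corpus.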
There are clearly many benefits to introducing AI-based cyber defense systems to aid an analyst’s workload. However, by introducing intelligent security software, there is a risk that an adversary may use similar methodologies for their own, large-scale malicious purposes.
Hacking AI-based cyber defense systems.
Unfortunately, there are many ways a cyber defense system can be hacked. My research focused primarily on how the training data used to teach the system about cybersecurity information might be corrupted. Corrupting the training data of a machine learning model is commonly known as a “data poisoning attack.” A data poisoning attack works by forcing the model to learn from incorrect inputs chosen by an adversary. For the cybersecurity community in particular, the adversary can use a poisoning attack to force the model to learn from incorrect or synthesized CTI. Like a human analyst fed incorrect information, a system learning from incorrect information will suffer major performance delays and fail to catch cyber attacks it ought to have detected.
It is important to understand how the corpus of a cyber defense system can be corrupted in order to brainstorm ways to defend against poisoning attacks. Poisoning attacks come in a variety of flavors, from replacing entire training sets with poisoned data, to seeding training sets with false data, to corrupting the data in production. For this experiment, we decided to seed the cybersecurity corpus with incorrect CTI that we generated with a popular AI architecture called the Transformer.
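The seeding flavor of poisoning can be sketched in a few lines. This is an illustration under stated assumptions, not the experiment's actual code: the function name `seed_corpus`, the `poison_rate` parameter, and the placeholder documents are all hypothetical. The sketch mixes enough fake documents into a clean corpus that they make up roughly a chosen fraction of the result.

```python
import random


def seed_corpus(clean_docs, fake_docs, poison_rate, seed=0):
    """Return a shuffled list of (text, is_fake) pairs in which roughly
    `poison_rate` of the entries are fake."""
    if not 0 <= poison_rate < 1:
        raise ValueError("poison_rate must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_fake / (n_clean + n_fake) = poison_rate for n_fake.
    n_fake = round(poison_rate * len(clean_docs) / (1 - poison_rate))
    # Never ask for more fake documents than are available.
    n_fake = min(n_fake, len(fake_docs))
    chosen = rng.sample(fake_docs, n_fake)
    mixed = [(doc, False) for doc in clean_docs] + [(doc, True) for doc in chosen]
    rng.shuffle(mixed)
    return mixed


# Placeholder documents; a real corpus would hold full CTI articles.
clean = [f"real CTI article {i}" for i in range(8)]
fake = [f"generated CTI article {i}" for i in range(5)]

poisoned = seed_corpus(clean, fake, poison_rate=0.2)
n_fake_in_corpus = sum(is_fake for _, is_fake in poisoned)
print(len(poisoned), n_fake_in_corpus)
```

The `is_fake` flag is kept only so the effect of the attack can be measured afterwards. A real attacker would not label the seeded documents, which is what makes this kind of poisoning hard to spot.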
Transformers have a variety of use cases, including machine translation, storytelling, and question-answering systems. The power of transformers has also been used for malicious purposes such as generating and spreading large amounts of misinformation across the web. The security community is far from secure against the misuse of this technology. The fact that a lot of CTI is consumed through social media brings the misinformation risk closer than we may think. The misinformation phenomenon in the cybersecurity community has been heavily underexplored, and my paper aimed to bring the risk to light in hopes of building safer systems that defend cybersecurity operations against misinformation.
I used a popular transformer model called GPT-2 to generate fake open source intelligence for later use in the simulated poisoning attack. GPT-2 is widely used for tasks that require text generation, and is publicly available to repurpose for specific tasks.
Just like human beings, machine learning models can be taught new information by adding to the general knowledge they already have. For example, if entry-level cybersecurity professionals don’t have a lot of knowledge about exploits and vulnerabilities, they can easily learn by reading CTI. Similarly, GPT-2 as a stand-alone model does not know much about cybersecurity. However, by fine-tuning the model with a corpus of CTI, I was able to train the model not only to learn new facts about the cybersecurity domain, but also to generate fake but plausible CTI that could fool even experts in the cybersecurity field. GPT-2 was fine-tuned on a cybersecurity corpus of vulnerability reports and databases, cybersecurity news articles, and Advanced Persistent Threat (APT) reports.
The implications of fake CTI.
We evaluated the generated fake CTI with traditional machine learning evaluation methods as well as through an in-depth human evaluation study conducted by a group of cybersecurity professionals and threat hunters with between five and thirty years of experience. We gave the group a mixed set of generated fake CTI and real CTI samples to classify as “real” or “fake”. The fake CTI chosen for the sample set had incorrect attack vector, product, or adversary information. For this particular study, the human evaluation was integral to determining whether fake CTI could be an issue in operational settings. The results were shocking. Our group rated 78.5% of the fake samples as true, and labeled only 36.8% of the total samples correctly.
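The two figures reported from the human evaluation are different measurements, and a small sketch can make the difference explicit. The labels below are invented purely for illustration and are not the study's data. The function name `evaluate_labels` is also an assumption.

```python
def evaluate_labels(truth, guesses):
    """Compare ground-truth labels with evaluator guesses.

    Both arguments are parallel lists of "real"/"fake" strings.
    Returns (fooled_rate, accuracy):
      fooled_rate = share of fake samples that were labeled "real"
      accuracy    = share of all samples that were labeled correctly
    """
    if len(truth) != len(guesses):
        raise ValueError("truth and guesses must have the same length")
    pairs = list(zip(truth, guesses))
    guesses_on_fakes = [g for t, g in pairs if t == "fake"]
    if not guesses_on_fakes:
        raise ValueError("need at least one fake sample")
    fooled_rate = guesses_on_fakes.count("real") / len(guesses_on_fakes)
    accuracy = sum(t == g for t, g in pairs) / len(pairs)
    return fooled_rate, accuracy


# Invented example labels, NOT the study's data.
truth = ["fake", "fake", "fake", "fake", "real", "real", "real", "real"]
guesses = ["real", "real", "real", "fake", "real", "fake", "fake", "real"]

fooled_rate, accuracy = evaluate_labels(truth, guesses)
print(f"fake samples rated as real: {fooled_rate:.1%}")
print(f"samples labeled correctly:  {accuracy:.1%}")
```

The first number is computed only over the fake samples, while the second is computed over every sample, real and fake alike.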
The fake CTI was used to infiltrate a Cybersecurity Knowledge Graph (CKG), an AI-based cyber defense system that extracts relationships and entities from the cybersecurity corpus. In the simulated poisoning attack, the CKG extracted information from fake and irrelevant data, with severe adverse effects such as returning incorrect reasoning outputs and corrupting dependent AI-based cyber defense systems.
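A toy example shows how a single fake sentence can change what a knowledge graph answers. This is a rough sketch, not the CKG used in the research: the regular-expression extractor, the relation names, and every actor, product, and CVE identifier below are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sentence pattern: "<actor> exploits <CVE id> in <product>."
PATTERN = re.compile(
    r"^(?P<actor>[\w-]+) exploits (?P<cve>CVE-\d{4}-\d+) in (?P<product>[\w -]+?)\.?$"
)


def extract_triples(sentence):
    """Turn one matching sentence into (subject, relation, object) triples."""
    match = PATTERN.match(sentence.strip())
    if not match:
        return []
    return [
        (match["actor"], "exploits", match["cve"]),
        (match["cve"], "affects", match["product"]),
    ]


def build_graph(corpus):
    """Index triples as (subject, relation) -> set of objects."""
    graph = defaultdict(set)
    for sentence in corpus:
        for subj, rel, obj in extract_triples(sentence):
            graph[(subj, rel)].add(obj)
    return graph


# Made-up identifiers; none of these refer to real actors, CVEs, or products.
clean_corpus = ["ExampleActor exploits CVE-0000-0001 in ExampleServer."]
fake_sentence = "ExampleActor exploits CVE-0000-0001 in OtherProduct."

clean_graph = build_graph(clean_corpus)
poisoned_graph = build_graph(clean_corpus + [fake_sentence])

query = ("CVE-0000-0001", "affects")
print(sorted(clean_graph[query]))
print(sorted(poisoned_graph[query]))
```

After the fake sentence is ingested, the same query returns an extra, incorrect product. Any downstream system that trusts the graph would inherit that error.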
The fake CTI easily fooled security experts. This presents a major issue as we move forward with addressing the potential of misinformation affecting the cybersecurity community. Security enthusiasts and researchers of all experience levels are encouraged to better understand ways to defend against poisoning attacks, as well as misinformation in the cybersecurity community in general, before it becomes a major problem for corporations worldwide.
For more information, check out the published research.
About the author:
Priyanka Ranade is a PhD student in Computer Science at UMBC and a Cyber Software Engineer at Northrop Grumman Corporation, working on developing highly secure, interoperable toolkits that address the cutting-edge research problems posed by our government agencies. With a focus in cybersecurity, she has worked on a variety of programs and research activities that serve various US-based agencies in their data management, artificial intelligence, and Internet of Military Things services. She is also an adjunct cybersecurity professor at the University of Maryland, College Park. She is dedicated to the mission of defense and hopes to encourage others to join the exciting and ever-growing workforce.
Contact Information: firstname.lastname@example.org
About Ebiquity Lab:
The UMBC ebiquity Research Group consists of faculty and students from the Departments of Computer Science and Electrical Engineering and Information Systems of University of Maryland, Baltimore County (UMBC), located in Baltimore MD.