Need to know: understanding the CrowdStrike-induced Microsoft outage, impact, response, and lessons learned.

By Brandon Karpf, Executive Editor of N2K CyberWire

Jul 19, 2024

This story is still developing. Check back for major updates.

An in-depth look at the global Microsoft disruptions caused by a misconfigured CrowdStrike update and the subsequent recovery efforts.

For remediation guidance from CrowdStrike, please visit this resource. There have been numerous reported phishing schemes related to this incident. We recommend only using official guidance provided by CrowdStrike support services.

Key Insights

Global impact. A faulty CrowdStrike update caused widespread IT outages affecting Microsoft Windows systems worldwide, disrupting essential services like airlines, hospitals, and banks.
Response and fixes. CrowdStrike and Microsoft quickly identified the issue and deployed fixes, but recovery is taking time, highlighting the challenges in managing IT infrastructure.
Financial and operational consequences. The outages have led to significant financial losses for affected businesses, grounding flights and interrupting emergency services, emphasizing the need for robust IT management.
Industry reactions. Experts stress the importance of rigorous testing and collaboration between vendors and IT professionals to prevent such widespread disruptions in the future.
Cloud vs. on-premises remediation. Remediating cloud-based systems is harder than remediating on-premises machines, because the best available fix requires booting into Safe Mode or the Windows Recovery Environment to remove the offending file.
BitLocker’s role. Environments that use BitLocker face a more complex recovery, because unlocking encrypted disks requires the BitLocker recovery key, which may itself have been stored on affected systems.

Analyzing the global Microsoft outage caused by a CrowdStrike update.

On July 19, 2024, a critical IT outage linked to a faulty CrowdStrike update caused significant disruptions across the globe, impacting businesses and essential services. The issue primarily affected Microsoft Windows systems, resulting in grounded flights, inaccessible emergency services, and halted financial transactions. This article delves into the details of the outage, its causes, and the broader implications for cybersecurity and IT management.

The incident.

In the early hours of Friday, reports of Windows systems displaying Blue Screens of Death (BSODs) started emerging from Australia. Soon, similar issues were reported in the UK, India, Germany, the Netherlands, and the US. Major services were hit: Sky News experienced downtime, and airlines such as United, Delta, and American Airlines suffered severe disruptions that led to a “global ground stop” on flights.

The root cause was traced back to a misconfigured update from CrowdStrike, a prominent cybersecurity firm. The update, intended to enhance security, inadvertently caused widespread system crashes. Engineers from CrowdStrike acknowledged the issue on the company’s Reddit forum, advising affected users on a temporary workaround while a permanent fix was being deployed.

Response from CrowdStrike and Microsoft.

George Kurtz, CEO of CrowdStrike, quickly addressed the situation, confirming that the issue was not due to a cyberattack but a defect in the update. He assured that a fix had been deployed and emphasized that the problem only affected Windows systems, with Mac and Linux remaining unaffected.

Microsoft also responded, acknowledging the issues and working alongside CrowdStrike to resolve them. At the same time, Microsoft was dealing with an unrelated outage of its Azure cloud services, adding to the complexity of the situation.

Impact on businesses and services.

The fallout from the CrowdStrike update was extensive.

Airlines: Over 2,000 flights have been canceled worldwide so far, with passengers facing long delays and manual check-in processes. Airports in Sydney, London, Seoul, and Washington D.C. were particularly affected.
Media: Sky News and other broadcasters experienced downtime, impacting their ability to deliver news.
Healthcare: Hospitals in Germany and the UK reported difficulties in accessing patient records, leading to the cancellation of elective procedures.
Financial Services: Banks like JPMorgan Chase faced delays in processing trades due to employees being unable to log into their systems.

The disruptions highlighted the dependency on IT infrastructure and the cascading effects of a single point of failure.

Challenges of remediating in cloud environments.

Addressing issues in cloud environments presents unique challenges compared to on-premises systems. Cloud platforms like AWS, Azure, and GCP do not support traditional remediation methods such as booting into “safe mode.” Instead, administrators must shut down virtual servers, clone their disks, attach them to another server, and manually remove the offending files before reattaching the disks to the original server. This complex process significantly increases the time and effort required to resolve issues, highlighting the need for robust cloud management practices and contingency planning.
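When many virtual servers need the same manual treatment, the sequence above can be tracked per server with a simple checklist. The following is a minimal sketch only: it calls no AWS, Azure, or GCP API, the names `VmRunbook` and `REMEDIATION_STEPS` are hypothetical, and the step wording is paraphrased from the paragraph above rather than taken from any vendor runbook.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Steps paraphrased from the paragraph above, in the order described there.
REMEDIATION_STEPS = (
    "shut down the virtual server",
    "clone its disk",
    "attach the disk to another server",
    "remove the offending files",
    "reattach the disk to the original server",
)


@dataclass
class VmRunbook:
    """Tracks how far one virtual server is through the manual remediation."""

    vm_name: str  # hypothetical identifier chosen by the operator
    completed: list[str] = field(default_factory=list)

    def next_step(self) -> str | None:
        """Return the next step to perform, or None when all are done."""
        if len(self.completed) < len(REMEDIATION_STEPS):
            return REMEDIATION_STEPS[len(self.completed)]
        return None

    def complete(self, step: str) -> None:
        """Record a finished step, refusing anything done out of order."""
        expected = self.next_step()
        if expected is None:
            raise ValueError(f"{self.vm_name}: all steps already completed")
        if step != expected:
            raise ValueError(f"{self.vm_name}: expected {expected!r}, got {step!r}")
        self.completed.append(step)

    @property
    def done(self) -> bool:
        return self.next_step() is None
```

Refusing out-of-order steps is a deliberate choice here: removing files from a disk that is still attached to a running server, or reattaching before the file is removed, would leave the machine in the same crash loop.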

The role of BitLocker in this incident.

BitLocker, Microsoft’s disk encryption technology, is designed to protect data by providing encryption for entire volumes. While BitLocker is crucial for securing data against unauthorized access, it adds another layer of complexity during recovery efforts. In the context of the CrowdStrike-induced outage, BitLocker could have both helped and hindered the response:

Security benefits. BitLocker would ensure that data on the affected systems remained secure, preventing unauthorized access during the recovery process.
Recovery challenges. The BitLocker recovery key is needed to unlock an encrypted disk before the offending file can be removed. If these keys were stored on systems affected by the outage, recovery becomes harder. Administrators would need to manually unlock each disk with its recovery key, which could be time-consuming and challenging if the keys were not readily accessible.

This dual nature of BitLocker highlights the importance of balancing security and operational efficiency, especially during crisis situations.

Timeline of CrowdStrike’s response, and current remediation steps.

July 19, 2024, 05:20 UTC: CrowdStrike acknowledged widespread reports of BSODs on Windows hosts, indicating that the issue affected multiple sensor versions. The investigation into the cause was initiated, and a Technical Alert (TA) was promised.

July 19, 2024, 05:36 UTC: CrowdStrike posted the TA, identifying the affected regions as EU-1, US-1, US-2, and US-GOV-1.

July 19, 2024, 06:27 UTC: CrowdStrike Engineering identified a content deployment related to the issue and reverted those changes. They provided workaround steps to mitigate the problem:

Boot Windows into Safe Mode or the Windows Recovery Environment.
Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
Locate the file matching “C-00000291*.sys” and delete it.
Boot the host normally.
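The file-removal step above can be sketched in Python as follows. This is an illustration of the logic, not CrowdStrike's or Microsoft's tooling: Python is unlikely to be available in Safe Mode or the Windows Recovery Environment, so a script like this is more plausible when the affected disk is attached to a working machine. The function name `remove_channel_files` and the `dry_run` parameter are hypothetical, and the function defaults to listing matches without deleting anything.

```python
from __future__ import annotations

from pathlib import Path

# File pattern quoted in the workaround steps above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_channel_files(windows_root: Path, dry_run: bool = True) -> list[Path]:
    """Find, and optionally delete, files matching the workaround pattern.

    windows_root is the Windows directory of the affected installation,
    for example Path("C:/Windows") on the host itself, or the equivalent
    folder on a disk attached to another machine.
    """
    driver_dir = windows_root / "System32" / "drivers" / "CrowdStrike"
    if not driver_dir.is_dir():
        # No CrowdStrike directory under this root: nothing to do.
        return []

    matches = sorted(
        path for path in driver_dir.glob(CHANNEL_FILE_PATTERN) if path.is_file()
    )
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_channel_files(Path("C:/Windows"))` would only report what matches; passing `dry_run=False` performs the deletion. Given the phishing and misinformation risk noted elsewhere in this article, any such script should be checked against CrowdStrike's official guidance before use.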

July 19, 2024, 19:30 UTC: Statement from CrowdStrike’s CEO, George Kurtz, apologizing for the outage and detailing ongoing remediation efforts.

July 21, 2024, 21:06 UTC: CrowdStrike announced testing a new remediation technique with customers and advised checking the support portal for updates.

July 22, 2024: Microsoft released a tool to help recover systems affected by the CrowdStrike update, providing instructions for both automatic and manual recovery methods.

Expert commentary.

Cybersecurity experts have weighed in on the implications of this incident. For our full coverage of this incident with expert commentary from Andy Ellis, operating partner at cybersecurity VC firm YL Ventures and prior longtime CISO at Akamai, tune in to today's (July 19th, 2024) episode of the CyberWire Daily podcast.

Evan Dornbush, former NSA cybersecurity expert, noted the increased risk from human factors: "This is of course a phishing attack opportunity. Don't make a bad situation worse. Only follow recommended instructions direct from your CrowdStrike rep. There will be a lot of misinformation about how to reconfigure your computers or which critical system files to delete."

Chris Denbigh-White, Chief Security Officer at Next DLP, stated, "Today’s widespread outage caused by a faulty CrowdStrike update exposes a critical truth: the fundamentals of cybersecurity, while essential, are rarely straightforward. This incident serves as a wake-up call about the importance of rigorous testing and collaboration."

Jake Williams, former NSA hacker and faculty at IANS Research, observed the unmanaged impact of out-of-band software updates: "This highlights the risks of SaaS-based services taking update cycles out of the hands of systems administrators. Many security teams don't realize that their endpoint protection platforms’ signature updates often themselves contain code... For better or worse, CrowdStrike has just shown why this operating model of pushing updates without IT intervention is unsustainable."

Richard Bird, CSO of Traceable, emphasized the human impact: "The ultimate receiver and victim of bad supply chain management and security is the consumer. Corporations need to remember that technology exists to serve human beings."

Itzik Alvas, CEO of Entro Security, warned of increased vulnerability during such outages: "Companies dependent on CrowdStrike might face operational disruptions, delaying their ability to detect and respond to security incidents promptly."

Shawn Waldman, CEO and Founder of Secure Cyber, identified the massive challenge facing remediation teams: "Many global agencies and large organizations have tens to hundreds of thousands of devices spread out across the globe. These entities often lack the capability to quickly and remotely deploy such fixes... many companies should brace for potential extended downtime, possibly lasting days or even weeks."

Lessons and future directions.

The CrowdStrike-Microsoft outage serves as a stark reminder of the complexities in managing IT infrastructure. The incident highlights several key lessons:

Rigorous testing. Before deploying updates, thorough testing in varied environments is crucial to prevent widespread disruptions.
Vendor collaboration. Open communication between vendors, IT professionals, and end-users is essential to manage and mitigate the impact of potential issues.
Backup and redundancy. Organizations must have robust backup systems and redundancy plans to maintain operations during IT failures.
Cloud management. Effective cloud management practices are vital, including understanding the unique challenges of cloud environments and developing appropriate contingency plans.
BitLocker and key management. Proper management and accessibility of BitLocker recovery keys are essential to balance security and efficient recovery during IT incidents.

The global IT outage caused by a CrowdStrike update disrupted essential services, emphasizing the critical nature of cybersecurity and IT management. As businesses and public services continue to rely heavily on digital infrastructure, the need for rigorous testing, effective communication, and robust backup systems becomes ever more vital. This incident should prompt organizations to reassess their IT strategies to enhance resilience against future disruptions.