CrowdStrike is making improvements to error handling and software rollouts.
CrowdStrike has published a post-incident review (PIR) of the catastrophic update that incapacitated 8.5 million Windows machines last week. The detailed review attributes the mishap to a bug in the test software, which failed to properly validate the content update before it was deployed. In response, CrowdStrike is committing to more rigorous testing of its content updates, enhanced error handling, and the implementation of staggered deployments to prevent future incidents.
CrowdStrike’s Falcon software is used by businesses worldwide to defend against malware and security breaches on millions of Windows machines. On Friday, CrowdStrike issued a content configuration update intended to “gather telemetry on possible novel threat techniques.” Such updates are a routine part of how Falcon operates, but this one led to widespread crashes.
The Faulty Update: A Deeper Dive
CrowdStrike typically issues configuration updates in two distinct ways. The first, known as Sensor Content, directly updates the Falcon sensor running at the kernel level in Windows. The second, Rapid Response Content, updates how this sensor detects malware. A seemingly innocuous 40KB Rapid Response Content file was the culprit behind Friday’s widespread crash.
Sensor Content updates are not delivered from the cloud. They typically include AI and machine learning models and aim to improve long-term detection capabilities. They also carry Template Types, which are code that enables new detections and is configured by Rapid Response Content.
Conversely, Rapid Response Content updates are managed via the cloud. CrowdStrike operates its own system for validating content before release, designed to prevent such incidents. Last week, two Rapid Response Content updates, or Template Instances, were released. However, due to a bug in the Content Validator, one of the Template Instances passed validation despite containing problematic data.
While automated and manual testing is standard for Sensor Content and Template Types, it seems Rapid Response Content did not undergo the same level of scrutiny. A previous March deployment of new Template Types had instilled “trust in the checks performed in the Content Validator,” leading to an assumption that the Rapid Response Content would be issue-free.
This oversight resulted in the sensor loading the faulty Rapid Response Content into its Content Interpreter, triggering an out-of-bounds memory exception. “This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD),” explains CrowdStrike.
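To illustrate the class of failure described above, here is a minimal Python sketch of an interpreter that checks bounds before reading a field and rejects bad content instead of crashing. It is not CrowdStrike's code: `read_field`, `interpret`, and the list-of-fields layout are hypothetical names and assumptions made for illustration. In Python an out-of-range index raises an exception rather than corrupting memory, so the sketch only shows the shape of the check, not the kernel-level behavior.

```python
from typing import List, Optional, Sequence


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return fields[index], or None when the index is out of range.

    The explicit range check also rejects negative indexes, which
    Python would otherwise treat as counting from the end.
    """
    if 0 <= index < len(fields):
        return fields[index]
    return None


def interpret(fields: Sequence[str], wanted: Sequence[int]) -> Optional[List[str]]:
    """Hypothetical interpreter step: collect the requested fields.

    If any requested index is out of bounds, the whole content item is
    rejected (None) rather than read past the end of the data.
    """
    values: List[str] = []
    for index in wanted:
        value = read_field(fields, index)
        if value is None:
            return None  # reject the content; do not crash
        values.append(value)
    return values
```

With three fields available, `interpret(["a", "b", "c"], [0, 2])` returns the two requested values, while `interpret(["a", "b", "c"], [0, 3])` returns `None` because index 3 does not exist.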
Learning from the Mistake: Preventative Measures
To ensure this debacle doesn’t recur, CrowdStrike is ramping up its testing procedures for Rapid Response Content. New measures will include local developer testing, content update and rollback testing, stress testing, fuzzing, fault injection, and stability and content interface testing.
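As a rough idea of what fuzzing means here, the Python sketch below feeds random byte strings to a parser and records any input that triggers something other than a clean rejection. The content format (a one-byte count followed by that many one-byte values), `parse_content`, and `fuzz` are all hypothetical and chosen only for illustration. They do not reflect Falcon's actual file format or CrowdStrike's test tooling.

```python
import random
from typing import Callable, List, Tuple


def parse_content(blob: bytes) -> List[int]:
    """Hypothetical parser: first byte is a count, followed by that many values."""
    if not blob:
        raise ValueError("empty content")
    count = blob[0]
    if len(blob) - 1 < count:
        raise ValueError("declared count exceeds payload length")
    return list(blob[1:1 + count])


def fuzz(parser: Callable[[bytes], object],
         iterations: int = 10_000,
         seed: int = 0) -> List[Tuple[bytes, Exception]]:
    """Run the parser on random inputs and collect unexpected failures.

    A ValueError counts as a clean rejection. Any other exception is
    treated as a bug and returned together with the input that caused it.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    failures: List[Tuple[bytes, Exception]] = []
    for _ in range(iterations):
        length = rng.randrange(0, 16)
        blob = bytes(rng.randrange(256) for _ in range(length))
        try:
            parser(blob)
        except ValueError:
            pass  # malformed input was rejected cleanly
        except Exception as exc:
            failures.append((blob, exc))
    return failures
```

A parser that validates its input, like `parse_content`, should come back with an empty failure list. One that trusts the declared count and indexes without checking will be caught within a few iterations.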
CrowdStrike is also upgrading its cloud-based Content Validator to enhance its scrutiny of Rapid Response Content releases. “A new check is in process to guard against this type of problematic content from being deployed in the future,” the company states.
On the driver side, CrowdStrike plans to “enhance existing error handling in the Content Interpreter,” which is a component of the Falcon sensor. CrowdStrike will also adopt staggered deployments of Rapid Response Content, rolling updates out gradually to larger portions of its install base rather than pushing them to all systems at once. Security experts have recommended both the error-handling improvements and the staggered deployments in recent days.
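A staggered deployment can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not CrowdStrike's rollout system: the function names, the percentage stages, and the `is_healthy` callback are all hypothetical. Each host is hashed into a stable bucket from 0 to 99, each stage widens the set of eligible buckets, and the rollout halts as soon as a newly updated host reports unhealthy.

```python
import hashlib
from typing import Callable, Iterable, Sequence, Set, Tuple


def rollout_bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_receive(host_id: str, percent: int) -> bool:
    """True if the host falls inside the first `percent` buckets."""
    return rollout_bucket(host_id) < percent


def staged_rollout(host_ids: Iterable[str],
                   stages: Sequence[int],
                   is_healthy: Callable[[str], bool]) -> Tuple[Set[str], bool]:
    """Deploy stage by stage; stop at the first stage with an unhealthy host.

    Returns the set of hosts that received the update and a flag saying
    whether the rollout ran to completion.
    """
    hosts = list(host_ids)
    deployed: Set[str] = set()
    for percent in stages:
        batch = [h for h in hosts
                 if h not in deployed and should_receive(h, percent)]
        deployed.update(batch)
        if not all(is_healthy(h) for h in batch):
            return deployed, False  # halt before widening the rollout
    return deployed, True
```

Calling `staged_rollout(hosts, [1, 10, 50, 100], is_healthy)` would update roughly 1 percent of hosts first and stop there if any of them fails the health check, which limits how many machines a faulty update can reach.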
Looking Forward: A Commitment to Excellence
The incident has been a critical learning experience for CrowdStrike. By bolstering its testing protocols and deployment strategies, the company aims to regain the trust of its users and improve the reliability of its security software while minimizing risk to its extensive user base.
CrowdStrike says the new measures are intended to guard its customers’ systems against similar incidents in the future.