The CrowdStrike IT Outage: 3 key takeaways to prepare for the next big platform disaster

In the aftermath of the CrowdStrike IT outage that severely impacted airlines, media and banks—understood to have been caused by an automated update to the company’s cybersecurity platform—there is no shortage of hot takes on what went wrong, the implications, the costs involved and more. 

As the dust settles, there’s an imperative to look to the future and leave no stone unturned to prepare for the next potential disaster. 

So what avenues are out there for you and your organisation? What can you start thinking about (or even start actioning) today to reinforce your tech and business resilience? 

1. Really dig into your disaster recovery plan

Business continuity demands a working disaster recovery plan, so you have to take it seriously. Even if you have to revert to paper when things go wrong, as many businesses did, this is a better response than “sorry, our computers are down.”

As a thought exercise: what would happen if all of your Windows computers failed to boot tomorrow morning? Or what if all your internal comms went down, as happened to Meta (then Facebook) during its BGP-related outage in October 2021?

For laptops and PCs affected by the CrowdStrike fault, most customers needed physical access to each machine to fix it. Do you have sufficient IT staff to manually recover every affected device?

Running workstations in the cloud is not necessarily a mitigation for a widespread outage such as this one. There were reports of contention and high latency on cloud providers, as large numbers of customers tried to run the remediation steps. 

When it comes to your disaster recovery plan, take the time to dig as deep as possible. Do you know which systems are most crucial and should be recovered first? Are there clearly defined roles and responsibilities in the case of an outage? 

Now may be as good a time as any to stress test your plan via a war game exercise.

2. Understand your update and patching strategies

Even a short outage can cost enterprise customers a lot of money, not to mention goodwill: the 2017 AWS S3 outage is reported to have cost S&P 500 companies around $150 million. This is why most organisations do not allow direct software updates to their production environments without testing, and will maintain their own control panel and/or update servers to allow for this kind of control. The ability to exert this level of control is often crucial for enterprise software.

On the other hand, organisations like CrowdStrike require fast updates to mitigate threats, such as the channel files which define what counts as suspicious behaviour (the update of which caused this issue). It’s important to understand the potential impact of all updates, whether labelled as code or content updates, and whether instigated by your IT teams or your vendors.

Anti-virus systems such as CrowdStrike are deeply integrated into the operating system (in the case of CrowdStrike, as a kernel-mode driver). This is necessary to watch for the kinds of suspicious activity exhibited by malware, and to load early on in the boot process, before any malware has loaded. The flip-side of this deep integration is that a crash in the driver cannot be contained and will take down the entire operating system.

With such a substantial blast radius, phased rollouts and the proper use of staging environments are critical. Ideally, you shouldn’t apply updates to all of your production environments at once: a bad patch applied everywhere simultaneously can break every one of your computers.
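To make the idea of a phased rollout concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how CrowdStrike or any particular update tool works: the function name `phased_rollout`, the ring structure, and the `apply_update` and `is_healthy` callbacks are all hypothetical placeholders for whatever your own tooling provides. The update goes to one ring of machines at a time, and the rollout halts as soon as a ring shows more failures than you are willing to tolerate.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


def phased_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> tuple[list[str], Optional[int]]:
    """Apply an update ring by ring, halting if a ring looks unhealthy.

    All names here are hypothetical. Returns the hosts that were updated
    and the index of the ring that caused a halt (None if none did).
    """
    updated: list[str] = []
    for index, ring in enumerate(rings):
        for host in ring:
            apply_update(host)
            updated.append(host)
        # Check the whole ring before touching the next, larger one.
        failures = sum(1 for host in ring if not is_healthy(host))
        if ring and failures / len(ring) > max_failure_rate:
            # Stop here: later rings never receive the bad update.
            return updated, index
    return updated, None
```

You might call this with a small canary ring first, for example `[["canary-1"], ["office-1", "office-2"], ["prod-1", "prod-2"]]`, so that a bad update is caught on one machine rather than on all of them. A real rollout would also wait for a soak period between rings, which this sketch leaves out.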

3. Consider serverless offerings

If you are managing a fleet of Windows machines used on-site for your GP surgery receptionists or the screens displaying your flight information at the airport, serverless offerings are not going to help. But for customers managing a fleet of servers running a web application, it’s worth considering offloading more responsibility onto a cloud provider.

With a serverless or containerised platform you’re moving the responsibility for OS-level updates and protection against some threats onto the cloud vendor, leaving you to concentrate on the core functionality. 

For some customers this won’t be cost effective, since the cloud providers need to make their margin somewhere, but for customers running smaller workloads who don’t or can’t maintain a large team to manage their infrastructure, it’s worth considering.

Cloud providers do have outages, such as the June 2023 AWS us-east-1 outage and, much more recently, the July 2024 Azure outage, but they remain very rare. Nevertheless, it may be worth deploying your critical systems across multiple cloud providers, or at least across multiple regions in a single cloud provider, reducing the risk of downtime and enabling workload flexibility.

But like most things, there’s no simple answer; multi-cloud deployments bring their own challenges, including vendor management, additional skills and resources, security considerations and increased costs. 
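As a rough illustration of what spreading a system across regions or providers buys you, the sketch below picks the first healthy deployment from a list in order of preference. Everything in it is an assumption made for the example: the function name `pick_endpoint`, the endpoint URLs, and the `is_up` health check are hypothetical, and in practice this job is usually done by DNS failover or a global load balancer rather than by application code.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


def pick_endpoint(
    endpoints: Iterable[str],
    is_up: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health check passes, in preference order.

    Both the endpoint list and the health check are hypothetical inputs.
    Returns None if every deployment is unavailable.
    """
    for endpoint in endpoints:
        try:
            if is_up(endpoint):
                return endpoint
        except Exception:
            # A health check that errors out is treated the same as "down".
            continue
    return None
```

For example, given `["https://eu-west.example.com", "https://us-east.example.com"]` and a health check that reports the first region as down, the function returns the second. If it returns `None`, you are back to your disaster recovery plan.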

So what next?

A lot more will no doubt come out in the wash in the coming weeks. Until then, the CrowdStrike incident serves as a massive wake up call for everybody. 

Tech infrastructure is business infrastructure – meaning tech and business leaders must use this as a mandate to future-proof their platforms and platform strategy. 


If you’re looking for a friendly (and knowledgeable) ear, drop us a line for some informal support.

Contributors

Special thanks to:

Benji Marshall, Technical Lead, Softwire

Seth Bresnett, Technical Principal, Softwire