Learnings from the CrowdStrike Outage

xkcd cartoon

The recent CrowdStrike bug that shook the world was completely avoidable. The global disruption highlights a critical need for improved reliability engineering practices across organizations. This incident serves as a compelling case study for the use of an immutable infrastructure based design pattern to maintain reliable services at scale.

Mutable vs Immutable infrastructure

Mutable infrastructure refers to machines or servers that can be modified after deployment. Think manually installing an operating system on a machine and downloading all the applications and dependencies.

Immutable infrastructure on the other hand, refers to machines or servers that cannot be modified after deployment. Engineers build an image specifying every version of the software from the operating system to application. These images are then deployed to the machines

Tradeoffs

The mutable infrastructure design pattern means that the system administrators must constantly update the versions of the software running. This can be assisted using tools like Ansible or Puppet to apply changes remotely. However, this also comes with the added possibility of breaking services in an inconsistent state. (think stopping halfway through a software update)

Immutable infrastructure is deployed using a static image that cannot be modified after deployment. This strategy comes with the added complexity of managing images as well as a system to manage the continuous integration and deployment of these images. The most significant improvement for immutable infrastructure is the speed in reverting change. As the images are never deleted, rollbacks can occur quickly requiring no manual effort to bring services back online after a failed upgrade.

Non-Technical Considerations

When considering reliability engineering best practices, Scale and Risk Tolerance come to mind.

There are significant investment costs associated with maintaining reliable systems and the size of the infrastructure is important to consider. For example, a consulting firm with 100 employees doesn’t have the critical mass to make the investment worthwhile.

Next, the tolerance for outages must be low enough that the investment pays dividends. For instance, telecom systems could go down for a day and society would still be able to function (for the most part). Whereas, a nuclear power plant has a much lower tolerance for failure.

Conclusion

This event serves as a wake-up call for organizations to reassess their infrastructure management practices and invest in reliability engineering. By implementing immutable infrastructure companies can significantly enhance their ability to maintain stable and reliable systems in the face of these challenges.