The CrowdStrike Failure Was a Warning

Digital disaster should not happen so easily.

Crucial systems across the world collapsed on Friday, triggered by one mistake in a single company. The CrowdStrike outage hit banks, airlines, and health-care systems. It may end up being the worst information-technology disaster in history.

This was not, however, an unforeseeable freak accident, nor will it be the last of its kind. Instead, the devastation was the inevitable outcome of modern social systems that have been designed for hyper-connected optimization, not decentralized resilience. We have engineered a world in which tiny, localized errors can cause global crisis. This precarious state of affairs is by human design—and can therefore be undone. But we are currently speeding toward much greater calamities than the CrowdStrike debacle.

There is often a trade-off between maximum optimization and resilience. Consider a rudimentary prehistorical social system, in which many humans lived in small, isolated bands. They would never interact with other groups of humans hundreds, let alone thousands, of miles away. What any single person did would have little to no effect on those living elsewhere. It was an inefficient, basic system—but if one part of the human system failed, few others were affected.

Throughout our advancement as a species, from building empires to building machines, social systems have evolved to be more connected and centralized. Eventually, an emperor or a king could make a decision in a far-flung palace, and it would soon affect the lives of potentially millions of people. By the Industrial Revolution, trade routes and supply lines had become global. Disaster in one region could upend economies far away. This connectivity and coordination produced unprecedented innovation and prosperity. It was efficient. But it also amplified social risk.

In the 21st century, the combination of globalization and digitization has created a landscape characterized by the threat of catastrophic, instantaneous risk. Globalization enables large efficiency gains, as with just-in-time manufacturing, where a product can be assembled from carefully managed links in the global supply chain. But those systems lack resilience. Every link must fit together perfectly; the system falls apart if even one link breaks. (This fragility became obvious when one boat blocked the Suez Canal in 2021, causing enormous damage to the global economy.)

Similarly, digital connectivity has unlocked significant innovations. But it has also meant that much of the world’s core operations rely on a tiny subset of companies and the software they develop. A few days ago, most people had never heard of CrowdStrike; now it’s impossible to ignore how many of our most basic forms of social infrastructure are stacked on top of sometimes precarious bits of computer code. It should bewilder us all that the structures governing our lives were just fixed using a method only slightly more sophisticated than “Have you tried turning it off and on again?”

This time, the digital cataclysm was caused by well-intentioned people who made a mistake. That meant the fix came relatively quickly; CrowdStrike knew what had gone wrong. But we may not be so lucky next time. If a malicious actor had attacked CrowdStrike or a similarly essential bit of digital infrastructure, the disaster could have been much worse.

Centuries ago, the philosopher David Hume wrote that we can never be certain that the patterns of the past will remain the patterns of the future. As I argue in my book Fluke, this is especially true in the 21st century. We are gambling more and more of our world on unstable, volatile systems. Worse, we’re gambling with higher stakes in a time of social upheaval and structural change. Can we really trust our species to flawlessly govern unimaginably complex systems—systems we don’t always fully understand—that can be brought down by a single screw-up?

CrowdStrike worked like clockwork—until it suddenly didn’t. And when you’re facing catastrophic risk, close to perfect isn’t good enough. Modern societies have discounted the cost of that risk because our current reward systems are geared toward optimization over resilience. Politicians try to deliver short-term improvements, not long-term planning. Nobody gets reelected by investing in a rainy-day fund. Even worse, for the few politicians who nonetheless focus on long-term planning, their opponents might be the ones who get credit for being prepared when the time comes to use the rainy-day fund. Similarly, business leaders can be hired or fired based on quarterly results. (The short-term focus of social systems is one reason climate change is such a thorny problem to solve. It requires immediate investment to avert a global cataclysm—but we won’t ever know which disasters we averted, because there’s only one version of Earth to observe. Who claims credit when a hurricane doesn’t happen?)

Even though the modern quest for optimization has too often made resilience an afterthought, it is not inevitable that we continue down the risky path we’re on. And making our systems more resilient doesn’t require going back to a disconnected, primitive world, either. Instead, our complex, interconnected societies simply demand that we sacrifice a bit of efficiency in order to allow a little extra slack. In doing so, we can engineer our social systems to survive even when mistakes are made or one node breaks down.

In the case of CrowdStrike, it’s an unwise choice to have so much critical infrastructure riding on one company or one batch of digital code. Societies will be less vulnerable if social systems rely on a more diverse array of digital companies, if those companies are required to follow more stringent testing for updates, and if critical infrastructure has more redundancy so that it can continue operating safely even when one component breaks. For the broader set of risks facing global society beyond digital ones, better regulation is essential to ensure fail-safes, backups, stress testing, and decoupling, so that a problem in one node of a system doesn’t bring down everything else. The CrowdStrike debacle is a clear warning that the modern world is fragile by design. So far, we have decided to make ourselves vulnerable. That means we can decide differently too.
