Are Data Centers Truly ‘Always On’?

The silent, humming heart of modern society is not found in a government building or a financial capital, but within the climate-controlled, fortress-like walls of the world’s data centers, where our collective digital existence is stored, processed, and served. This intricate web of servers and cables underpins nearly every aspect of contemporary life, from global commerce and critical infrastructure to personal communication and entertainment. We operate on the implicit promise that this digital world is perpetual and unwavering, a utility as reliable as running water. Yet, a series of high-profile disasters has pulled back the curtain on this assumption, revealing a complex and often fragile reality. The notion of a data center being “always on” is less a guarantee and more a high-stakes aspiration, one that is constantly tested by flawed designs, external chaos, and the simple, unpredictable nature of human behavior.

When the Cloud Goes Up in Smoke: The Unseen Fragility of Our Digital World

The illusion of digital permanence can shatter in an instant, and two starkly different catastrophes serve as powerful reminders of this vulnerability. In March 2021, a fire tore through an OVHcloud data center in Strasbourg, France, destroying one facility and damaging another. The investigation revealed a chilling irony: a design feature intended for sustainability—a passive cooling system—may have created an airflow that accelerated the blaze, which originated in an uninterruptible power supply (UPS) unit. To compound the disaster, backup generators automatically activated after emergency services cut the main power, dangerously re-energizing the site and hindering firefighting efforts. The OVH fire was a textbook case of an internal failure, where well-intentioned engineering choices created unforeseen and catastrophic liabilities under stress.

In contrast, the Texas winter storm of February 2021 demonstrated that a data center’s most significant threats can lie far beyond its fortified perimeter. Facilities across the state were designed with robust on-site backup generation and enough fuel for approximately 48 hours of autonomous operation. However, the prolonged, statewide freeze paralyzed the region’s infrastructure. Icy, impassable highways made fuel resupply impossible, a critical external dependency that, as one chief operating officer admitted, had been dangerously “undermodeled.” This event exposed a fundamental truth: a data center’s resilience is ultimately bound to the stability of the larger ecosystem it inhabits, including power grids, supply chains, and transportation networks. The promise of uptime ends where the public infrastructure fails.

The Myth of Perpetual Uptime and the Reality of Risk

At the core of these failures lies the concept of the Single Point of Failure (SPOF), a component or dependency whose failure will bring down an entire system. In a world where economies grind to a halt without continuous data availability, the existence of any SPOF is a critical liability. Our societal reliance is absolute; banking, healthcare, logistics, and government services all depend on the seamless operation of digital infrastructure. An outage is no longer a minor inconvenience but a potential catalyst for widespread economic and social disruption. This dependence has cultivated the myth of perpetual uptime, a belief that digital services are infallible.
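
To see why a single point of failure matters so much, consider a simplified availability calculation, sketched below in Python with purely hypothetical figures: a service that depends on a chain of components is only up when every one of them is up, so the weakest link drags the whole system down.

```python
# Illustrative sketch: availability of serially dependent components.
# The figures below are hypothetical, not measurements from any real facility.

def series_availability(availabilities):
    """A service that needs every component is only up when all of them are up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

components = {
    "utility feed": 0.9995,
    "UPS": 0.9999,
    "single chiller plant": 0.998,   # the weak link: a single point of failure
    "network uplink": 0.9998,
}

overall = series_availability(components.values())
print(f"Overall availability: {overall:.4%}")        # falls below the weakest part
print(f"Weakest component:    {min(components.values()):.4%}")
```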

The hard reality, however, is that failures are not just possible; they are an inherent part of any complex engineered system. Vulnerabilities are often built in from the start through design compromises, accumulate over time as systems are modified and updated, or are triggered by the unpredictable collision of external events and internal processes. The promise of 100% uptime is an illusion. Instead, the data center industry operates on a model of calculated risk, constantly balancing the astronomical cost of perfect redundancy against the likelihood and impact of a potential failure. Every data center, no matter how sophisticated, represents a collection of accepted risks.
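
That calculus of accepted risk can be illustrated with a back-of-the-envelope expected-loss comparison; the probabilities and costs in the sketch below are hypothetical placeholders, not industry benchmarks.

```python
# Illustrative sketch of a risk-acceptance decision, using hypothetical numbers.

annual_failure_probability = 0.02      # assumed chance the weak link fails in a given year
outage_cost = 5_000_000                # assumed cost of the resulting outage (USD)
redundancy_capex_per_year = 250_000    # assumed annualized cost of engineering it out

expected_annual_loss = annual_failure_probability * outage_cost

if redundancy_capex_per_year < expected_annual_loss:
    decision = "engineer it out (add redundancy)"
else:
    decision = "formally accept the risk"

print(f"Expected annual loss: ${expected_annual_loss:,.0f}")
print(f"Decision: {decision}")
```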

Deconstructing Resilience: The Tension Between Engineering and Economics

The blueprint for achieving high availability is well-established, focusing on eliminating SPOFs through layers of redundancy. The primary pillars of this strategy are power, cooling, and connectivity. To ensure uninterrupted power, facilities are designed with dual-redundant feeds from separate utility substations, backed by massive on-site diesel generators and battery-based UPS systems. Similarly, cooling, which is essential to prevent servers from overheating, is managed by multiple, independently powered chiller plants and air handling units. Finally, carrier-neutral facilities offer customers access to numerous telecommunications providers, insulating them from an outage on any single network. As Matt Wilkins, Global Director of Design and Engineering at Colt Data Center Services, has noted, this multi-layered approach is the foundational practice for building a fault-tolerant environment.
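
The value of layering redundancy shows up in a simple probability calculation, sketched below under the assumption that the redundant paths fail independently; correlated events such as a regional grid collapse are precisely where that assumption breaks down, which is why on-site generation and multiple carriers are layered on top.

```python
# Illustrative sketch: redundancy under an independence assumption.
# The single-path availability figure is hypothetical.

def parallel_availability(a, n=2):
    """Availability of n independent, redundant paths, any one of which suffices."""
    return 1 - (1 - a) ** n

single_feed = 0.999   # hypothetical availability of one utility feed
print(f"One feed:  {parallel_availability(single_feed, 1):.5%}")
print(f"Two feeds: {parallel_availability(single_feed, 2):.5%}")   # 1 - 0.001^2
```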

Despite these best efforts, the pursuit of absolute, flawless redundancy runs headlong into economic reality. Building a system with no single point of failure is prohibitively expensive, leading to a pragmatic trade-off. During the design phase, engineers conduct exhaustive SPOF assessments to identify every potential weak link. For each one, a critical decision is made: invest the capital to engineer it out with another layer of redundancy, or formally accept the risk. This process is codified by frameworks like the Uptime Institute’s Tier Standard. A Tier IV facility, the highest rating, is designed with a “completely duplicate” infrastructure, meaning an entire power or cooling system can fail without affecting operations. The more common Tier III standard, by contrast, allows for single points of failure during maintenance activities. These tiers are not a guarantee of performance but a shorthand for accepted risk, a certification of a design’s potential that can be undone by operational realities.
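
For a sense of scale, the availability percentages commonly cited alongside the tier levels can be converted into allowable downtime per year, as in the rough sketch below; the Tier Standard itself certifies topology rather than guaranteeing a downtime figure, so treat these numbers as conventional shorthand.

```python
# Rough conversion from an availability target to allowable downtime per year.
# The percentages are the figures commonly cited alongside the tier levels,
# not guarantees from the Tier Standard itself.

MINUTES_PER_YEAR = 365.25 * 24 * 60

commonly_cited_targets = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

for tier, pct in commonly_cited_targets.items():
    downtime_min = (1 - pct / 100) * MINUTES_PER_YEAR
    print(f"{tier}: {pct}% availability ~ {downtime_min:.0f} minutes of downtime per year")
```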

Beyond Simple Breakdowns: The Anatomy of a Cascading Failure

Modern outages are rarely the result of a single, isolated component breakdown. More often, they are the product of cascading failures, where a series of small, seemingly unrelated faults compound and ripple through a system, leading to a major event. These incidents expose latent vulnerabilities that were missed during initial design and testing, revealing how interconnected systems can interact in unforeseen ways. The 2014 outage at the Singapore Stock Exchange (SGX) is a classic case study in this phenomenon. The event was triggered by a minor malfunction in a diesel rotary uninterruptible power supply (DRUPS), which created a frequency mismatch between the primary and backup power sources.

This seemingly small issue had massive consequences. The downstream static transfer switches, designed to transfer the IT load seamlessly, were not configured to handle this specific out-of-phase condition. The result was a power surge that tripped breakers throughout the primary data center, initiating a system-wide failure. This incident, along with similar DRUPS-related outages at major facilities, underscores a critical lesson from experts like Ed Ansett, Global Director of Data Center Technology and Innovation at Ramboll: modern failures are not linear. They are complex, multi-system events involving a confluence of factors, including synchronization faults, deferred maintenance, and subtle configuration errors that lie dormant until a specific set of circumstances brings them to light.
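
One way to picture this non-linearity is as a toy model in which a small fault violates an assumption the downstream equipment was never configured to handle; the sketch below is a deliberate abstraction with hypothetical thresholds, not a reconstruction of the actual SGX event sequence.

```python
# Toy model of a cascading failure: a minor fault pushes conditions outside
# what downstream equipment was configured to handle, and the failure spreads.
# The tolerance value is hypothetical and purely for illustration.

PHASE_TOLERANCE_DEG = 15   # hypothetical limit the transfer switch can handle

def static_transfer_switch(phase_mismatch_deg):
    """Transfer the load cleanly only if the sources are close enough in phase."""
    if phase_mismatch_deg <= PHASE_TOLERANCE_DEG:
        return "load transferred cleanly"
    # Out-of-phase transfer: model the resulting surge as tripping downstream breakers.
    return "surge -> breakers trip -> site-wide outage"

# A minor DRUPS fault nudges the backup source further out of sync with the mains.
for mismatch in (5, 12, 40):
    print(f"phase mismatch {mismatch:>2} deg: {static_transfer_switch(mismatch)}")
```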

The Human Factor: The Most Unpredictable Point of Failure

While technology and design are critical, there is a clear consensus among industry experts that the greatest threat to data center uptime is not a faulty piece of equipment but the people who operate it. The Uptime Institute, which analyzes hundreds of incidents annually, consistently finds that human error is a primary or contributing cause of outages. Madeleine Kudritzki, Vice President at Uptime Institute, emphasizes that the “spectrum of human error is quite broad,” ranging from simple, unintentional mistakes to conscious and deliberate violations of established procedures. The institute’s records are filled with sobering real-world examples: a security guard accidentally triggering an emergency power-off button, a technician dropping a tool into live equipment, or an experienced engineer skipping safety checks out of overconfidence.

Mitigating this risk requires a fundamental shift in organizational culture, moving from a reactive, blame-oriented mindset to a proactive, systems-based approach. Organizational decisions, such as understaffing or a lack of preventative maintenance, directly contribute to the conditions that make human error more likely. The solution is a sustained, top-down commitment to three key areas. First is a dedication to meaningful, continuous training that goes beyond mere compliance and focuses on building deep procedural knowledge and confidence. Second is adopting a data-driven approach to staffing, ensuring that facilities have enough qualified personnel to perform all necessary operational and maintenance tasks without cutting corners.

Finally, and most importantly, organizations must foster a culture of unwavering procedural discipline. This means actively monitoring compliance and treating every minor incident as a valuable learning opportunity rather than a source of embarrassment. The 2022 electrical incident at a Google data center in Iowa, which caused service disruptions and injuries, served as a stark reminder that no organization, regardless of its resources or technical expertise, is immune to these challenges. The ultimate difference between a near-miss and a catastrophic outage is rarely an extra generator or chiller. Instead, it is the seamless integration of robust design, verified operational controls, and a culture that relentlessly pursues continuous improvement and places human performance at the very center of its resilience strategy.
