The data center industry is navigating a pivotal shift as the surge in artificial intelligence demands unprecedented levels of power and cooling. Matilda Bailey, a specialist in infrastructure and next-generation technology solutions, offers a deep dive into the complexities of adapting legacy facilities for modern workloads. Our discussion explores the delicate balance between retrofitting existing structures and committing to brand-new builds, the technical hurdles of high-density cooling, and the evolving requirements of power and networking.
Retrofitting offers a faster, more sustainable path to AI deployment, yet workload requirements are expanding rapidly. Under what specific conditions does a complete rebuild become more cost-effective than a retrofit, and how do you weigh the risk of a new facility becoming obsolete before construction finishes?
A complete rebuild becomes the more cost-effective choice when the gap between a legacy facility’s current capacity and the specific requirements of the intended AI workload is simply too wide to bridge. For example, if a facility’s structural design cannot support the weight of modern high-density racks or if the local grid connection is fundamentally insufficient for the massive energy intake AI requires, the cost of incremental upgrades can quickly exceed the price of starting fresh. We also have to consider the long-term viability of the site; if the physical footprint is too constrained to allow for future modular expansions, you are essentially investing in a dead end. To manage the risk of obsolescence during the years it takes to build a new site, we focus on modularity and “future-proofing” the design so that power and cooling can be scaled as AI models evolve. It is a high-stakes gamble, but building from the ground up allows us to integrate advanced features like liquid cooling loops from day one, which are often difficult to back-fit into older buildings.
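To make that tradeoff concrete, here is a minimal sketch of the total-cost arithmetic behind a retrofit-versus-rebuild decision. The function, cost figures, and planning horizon are all hypothetical placeholders, not real construction estimates:

```python
# Hypothetical retrofit-vs-rebuild comparison. A retrofit that is cheap
# up front but expensive to operate can lose to a rebuild once the
# planning horizon is long enough. All figures are illustrative.

def cheaper_option(retrofit_capex: float,
                   rebuild_capex: float,
                   annual_opex_retrofit: float,
                   annual_opex_rebuild: float,
                   horizon_years: int) -> str:
    """Return which option costs less over the planning horizon."""
    retrofit_total = retrofit_capex + annual_opex_retrofit * horizon_years
    rebuild_total = rebuild_capex + annual_opex_rebuild * horizon_years
    return "retrofit" if retrofit_total <= rebuild_total else "rebuild"

# $40M retrofit running at $18M/yr vs. $120M rebuild running at $8M/yr:
print(cheaper_option(40e6, 120e6, 18e6, 8e6, horizon_years=10))  # -> rebuild
print(cheaper_option(40e6, 120e6, 18e6, 8e6, horizon_years=3))   # -> retrofit
```

The crossover point is exactly the "dead end" dynamic described above: the shorter your confidence in a site's future, the more a retrofit wins; the longer the horizon, the more the rebuild's operational efficiency dominates.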
Large language models create massive heat loads that traditional cooling systems cannot handle. When implementing direct-to-chip cooling or upgrading power infrastructure, what specific technical hurdles do you face, and how do you manage the high upfront costs associated with these high-density hardware deployments?
The primary technical hurdle with direct-to-chip cooling is the sheer complexity of the plumbing and its integration with existing server architecture. Unlike traditional air cooling, where adjustments largely involve fan placement and airflow management, direct-to-chip requires a significant overhaul of the internal rack environment and a constant, reliable flow of coolant to the most heat-intensive components. This transition involves a massive upfront capital expenditure, but we justify it by looking at the long-term operational savings and the ability to pack more computing power into a smaller physical footprint. To manage these costs, we often take a phased approach, upgrading specific “AI zones” within a data center rather than the entire facility at once. This allows us to prove the ROI of high-density hardware before committing to a full-scale infrastructure replacement.
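The phased "AI zone" approach described above boils down to simple payback arithmetic: upgrade one zone, measure the operational savings, and expand only if the capital outlay is recouped quickly enough. A minimal illustration, with entirely hypothetical figures:

```python
# Hypothetical payback calculation for a single "AI zone" upgrade.
# The capex and savings numbers are placeholders, not vendor quotes.

def payback_years(upgrade_capex: float, annual_savings: float) -> float:
    """Years for operational savings to recoup the upgrade cost."""
    return upgrade_capex / annual_savings

# A $5M direct-to-chip retrofit of one zone that cuts cooling energy
# spend by $1.25M per year pays back in four years:
print(payback_years(5e6, 1.25e6))  # -> 4.0
```

If the first zone's measured payback lands under the operator's threshold, that evidence de-risks committing capital to the next zone.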
Local electrical grids often lack the capacity to supply the extreme energy volumes required for AI model training. What steps can be taken to minimize stranded power within existing infrastructure, and in what scenarios should operators consider the significant expense of on-site power generation?
Minimizing stranded power is about increasing the efficiency of the electrical distribution we already have, ensuring that every kilowatt pulled from the grid actually reaches a server rack. This involves auditing the entire power chain to identify bottlenecks where energy is being lost or underutilized, which can provide incremental capacity gains without a total overhaul. However, when the local grid simply cannot keep up with the demand of a large-scale training cluster, we have to look at on-site power generation. This is a significant expense and a logistical challenge, but it becomes necessary in remote areas or in regions where the utility provider has a multi-year backlog for high-voltage connections. It provides the operator with a level of energy independence and reliability that is critical when you are running models that consume more power than a small city.
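The power-chain audit described above can be sketched as a bottleneck calculation: deliverable power is capped by the weakest stage in the chain, and anything provisioned above that cap is stranded. The stage names and capacities below are hypothetical:

```python
# Hypothetical power-chain audit. Each stage (utility feed, UPS, PDU,
# rack busway) has a usable capacity in kW; deliverable power is limited
# by the weakest stage, and provisioned power above that bottleneck is
# "stranded". All capacities are illustrative.

def stranded_power(provisioned_kw: float, stage_capacity_kw: dict) -> tuple:
    """Return (bottleneck stage, kW of stranded capacity)."""
    bottleneck = min(stage_capacity_kw, key=stage_capacity_kw.get)
    deliverable = min(provisioned_kw, stage_capacity_kw[bottleneck])
    return bottleneck, provisioned_kw - deliverable

chain = {"utility_feed": 10000, "ups": 9000, "pdu": 7500, "rack_busway": 8200}
print(stranded_power(10000, chain))  # -> ('pdu', 2500)
```

In this sketch, upgrading only the PDU stage would recover 1,500 kW before the UPS becomes the next bottleneck, which is the kind of incremental capacity gain the audit is meant to surface.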
Standard server racks often struggle with the density and heat dissipation requirements of AI-specific hardware. How should operators approach reconfiguring room layouts to improve airflow, and how does the proximity to existing enterprise-grade network connections dictate the feasibility of a high-bandwidth upgrade?
Reconfiguring a room layout is one of the more accessible tactics, focused on optimizing airflow to prevent hot spots that can throttle hardware performance. We look at widening the aisles or adjusting the orientation of the racks to ensure that the massive amounts of heat generated by GPUs are moved away from the intake vents as quickly as possible. Networking is a deciding factor here as well; if a facility is already close to existing enterprise-grade fiber and high-performing network hubs, upgrading to the low-latency bandwidth required for AI is much more feasible. If you are in a “connectivity desert,” the cost of running new high-speed lines over long distances can be a dealbreaker for a retrofit. That is why we prioritize facilities that already have a strong “networking backbone” in place, as this significantly lowers the barrier to supporting modern AI.
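The "connectivity desert" tradeoff is, at its core, a distance-times-cost feasibility check against the retrofit budget. A minimal sketch, with hypothetical per-kilometer fiber buildout costs:

```python
# Hypothetical feasibility check for a fiber buildout to the nearest
# high-performance network hub. The per-km cost and budget figures are
# illustrative placeholders, not market rates.

def fiber_buildout_fits(distance_km: float, cost_per_km: float,
                        budget: float) -> bool:
    """True if running new fiber to the nearest hub fits the budget."""
    return distance_km * cost_per_km <= budget

# Near an existing hub vs. deep in a "connectivity desert":
print(fiber_buildout_fits(2.0, 150_000, 1_000_000))   # -> True
print(fiber_buildout_fits(40.0, 150_000, 1_000_000))  # -> False
```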
Training a new model requires significantly more energy and lower latency than running inference on a pre-trained one. How does the intended use case change your approach to data center capacity analysis, and what metrics are most critical when determining if legacy hardware can support modern AI?
The distinction between training and inference is the most critical factor in our capacity analysis because the two have vastly different “hunger levels” for resources. Training a model is an intensive, high-energy phase that demands the absolute maximum from your power and cooling systems, whereas inference—running the pre-trained model—is generally more efficient and can sometimes be handled by less specialized environments. When we evaluate legacy hardware, we look at metrics like Power Usage Effectiveness (PUE) and the specific thermal limits of the existing racks to see if they can handle the sustained “sprint” required for training. If the analysis shows that the facility will hit a thermal ceiling or a power cap within minutes of a training run, we know that the site is better suited for inference or needs a major upgrade. It’s all about matching the workload’s metabolic rate to the facility’s ability to feed and cool it.
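The thermal-ceiling and power-cap check described above can be sketched as a simple go/no-go test: PUE scales the IT load up to total facility draw, which must fit under both the grid connection and the cooling plant's heat-rejection capacity. All figures below are hypothetical:

```python
# Hypothetical capacity check for a sustained training workload on a
# legacy site. PUE multiplies IT load into total facility draw; nearly
# all IT power must also be rejected as heat. Figures are illustrative.

def can_sustain_training(it_load_kw: float, pue: float,
                         power_cap_kw: float, cooling_cap_kw: float) -> bool:
    """True if the facility can sustain the training load indefinitely."""
    facility_draw = it_load_kw * pue   # total power pulled from the grid
    heat_to_reject = it_load_kw        # IT power ends up as heat to remove
    return facility_draw <= power_cap_kw and heat_to_reject <= cooling_cap_kw

# A 2 MW training cluster in a legacy hall running at PUE 1.6:
print(can_sustain_training(2000, 1.6, power_cap_kw=3000, cooling_cap_kw=2500))  # -> False
print(can_sustain_training(2000, 1.6, power_cap_kw=3500, cooling_cap_kw=2500))  # -> True
```

A site that fails this check for training may still pass comfortably for inference, whose lower, burstier draw sits well under both caps.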
What is your forecast for AI data center infrastructure?
I believe we are entering an era of “hybrid infrastructure” where the industry will rely on a mix of highly specialized, ground-up “mega-centers” for massive model training and a vast network of retrofitted legacy sites for edge-based inference. We will see a rapid acceleration in liquid cooling adoption, moving from a niche luxury to a standard requirement as rack densities continue to climb. Furthermore, the pressure on the electrical grid will force more operators to become their own power producers, integrating renewable energy and on-site storage to maintain stability. Success will not go to those who just build the biggest facilities, but to those who can most efficiently adapt their existing square footage to meet the specialized, high-velocity demands of the AI era.
