Matilda Bailey has spent her career at the intersection of high-speed connectivity and next-generation infrastructure, witnessing firsthand how the shift toward artificial intelligence is pushing traditional networking to its breaking point. As organizations transition from small-scale pilots to massive deployments spanning thousands of GPUs, the invisible “plumbing” of the data center is suddenly becoming a primary bottleneck. In this conversation, we explore the structural shifts required to support a future where data center capacity is projected to double to 200 GW by 2030, and why the standard tools of the trade are no longer enough to keep these systems humming.
With less than 10% of current infrastructure capable of handling high-density AI loads, how are you managing the gap between existing capacity and the massive demand projected for 2030? What specific architectural trade-offs do you make when a project faces the risk of significant delays?
The reality on the ground is quite jarring because we are staring at a massive 200 GW demand by 2030, yet nearly half of the projects slated for 2026 are already facing potential delays or cancellations. When a project hits a wall, the trade-off usually comes down to a “retrofit versus rebuild” strategy, where we have to decide if we can squeeze AI-dense loads into aging facilities that were never meant to handle that kind of heat or power. It’s a high-stakes game where we might sacrifice some long-term scaling efficiency just to get the compute live in a legacy space. You can feel the tension in the room when we have to tell a client that their existing cooling simply won’t survive the jump to true AI density. We often end up prioritizing immediate, smaller clusters over the ideal “mega-build” just to keep momentum going while the broader grid catches up.
When integrating new AI clusters with legacy infrastructure, protocol mismatches between RoCE and TCP/IP often create friction at network borders. How do you resolve these collisions at scale, and what metrics do you track to ensure inference workloads don’t disrupt traditional application traffic?
These network borders are where the “spilled coffee” moments happen, as different worlds collide and performance starts to tank without an obvious cause. We see a lot of friction because RoCE is designed for that ultra-low latency, lossless environment, while TCP/IP is the rugged workhorse of the traditional enterprise, and they don’t always speak the same language at high speeds. To manage this, we’ve moved away from looking at simple uptime and started obsessing over tail latency—specifically focusing on the slowest 1% of packets that can stall a whole cluster. We also have to be extremely protective of our “elephant flows,” those massive, long-lived data transfers that can easily drown out the smaller, “mice flows” of standard business applications. If we don’t strictly segment that east-west traffic, a single large-scale inference job can effectively paralyze the company’s internal database queries.
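To make that concrete, here is a minimal sketch of the kind of per-class tail-latency reporting described above: flows over a size threshold are treated as elephants, everything else as mice, and p99/p99.9 latency is reported for each class separately. The data model, cutoff, and field names are illustrative assumptions rather than a description of any specific tool.

```python
# Minimal sketch: separate elephant flows from mice by byte volume and report
# tail latency (p99 / p99.9) per class. The cutoff and field names below are
# illustrative assumptions, not values from any particular deployment.
from dataclasses import dataclass

ELEPHANT_BYTES = 1_000_000_000  # hypothetical cutoff: flows over ~1 GB count as elephants

@dataclass
class FlowSample:
    flow_id: str
    bytes_transferred: int
    latency_us: float  # observed latency for this flow's packets, in microseconds

def tail_latency(samples: list[float], pct: float) -> float:
    """Return an approximate percentile (e.g. 99.0) from a list of latency samples."""
    ordered = sorted(samples)
    idx = min(int(len(ordered) * pct / 100.0), len(ordered) - 1)
    return ordered[idx]

def report(flows: list[FlowSample]) -> None:
    elephants = [f.latency_us for f in flows if f.bytes_transferred >= ELEPHANT_BYTES]
    mice = [f.latency_us for f in flows if f.bytes_transferred < ELEPHANT_BYTES]
    for name, lat in (("elephant", elephants), ("mice", mice)):
        if lat:
            print(f"{name}: p99={tail_latency(lat, 99.0):.1f}us "
                  f"p99.9={tail_latency(lat, 99.9):.1f}us samples={len(lat)}")
```

Tracking the two classes separately is what exposes the failure mode described above: elephant latency can look fine while the mice quietly suffer, or vice versa, and a single aggregate number hides both.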
The transition to 800 Gbps interconnects is occurring significantly faster than previous 400 Gbps ramps. What operational challenges does this accelerated deployment pose for your team, and how are you adjusting your procurement and testing cycles to keep pace with this compressed hardware lifecycle?
The move to 800 Gbps is happening at breakneck speed, with tens of millions of port shipments expected within just three years, which is a far more aggressive curve than we ever saw with 400 Gbps. This compressed timeline means our testing cycles have to be ruthlessly efficient; we no longer have the luxury of an eighteen-month evaluation period before the next jump in speed hits the market. Procurement has become a logistical marathon in which we have to secure components while the ink on the standards is still wet. We are essentially forced into a “build-and-learn” cycle where we deploy early-release hardware in non-critical environments to iron out the bugs before the full-scale AI training clusters go live. It’s an exhausting pace that requires a much tighter relationship with our hardware vendors than we had five years ago.
In large clusters, synchronized “elephant flows” create microbursts that can overwhelm switch buffers and stall progress. How do you mitigate tail latency during these simultaneous data transfers, and what step-by-step protocols prevent a single delayed packet from stopping a multi-thousand GPU training cycle?
When you have thousands of GPUs finishing a compute cycle and hitting the network at the exact same microsecond, it’s like a tidal wave hitting a small dam. That microburst effect can instantly overwhelm switch buffers, and in a synchronized training environment, the entire cluster has to wait for the slowest packet to arrive before it can move to the next step. To mitigate this, we focus on near-zero packet loss protocols and fine-tuning our buffer management to handle those sudden, massive surges in traffic. We also implement sophisticated congestion control mechanisms that can signal the GPUs to throttle back slightly before the buffer actually overflows. It’s a delicate balancing act because a single delayed packet isn’t just a minor glitch—it represents thousands of idle GPUs and thousands of dollars in wasted compute time every second.
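As a rough illustration of that congestion-signalling idea, the sketch below mimics ECN-style marking: the switch flags congestion once buffer occupancy crosses a threshold well below capacity, and senders halve their rate on a mark instead of waiting for the buffer to overflow and drop packets. All of the numbers are illustrative assumptions, not tuned values from a real fabric.

```python
# Minimal sketch of ECN-style congestion control: mark once buffer occupancy
# crosses a threshold, and have senders cut their rate on marks instead of
# waiting for the buffer to overflow. Numbers are illustrative assumptions.

BUFFER_CAPACITY = 1000      # packets the switch buffer can hold
ECN_THRESHOLD = 600         # start marking well before the buffer is full
LINE_RATE = 100.0           # packets the switch drains per tick

def step(buffer: int, send_rates: list[float]) -> tuple[int, list[float]]:
    """Advance one tick: enqueue offered load, drain line rate, adjust senders."""
    offered = sum(send_rates)
    buffer = min(BUFFER_CAPACITY, buffer + int(offered))
    marked = buffer > ECN_THRESHOLD            # congestion signal for this tick
    buffer = max(0, buffer - int(LINE_RATE))   # switch drains at line rate
    new_rates = []
    for rate in send_rates:
        if marked:
            new_rates.append(rate * 0.5)       # multiplicative decrease on a mark
        else:
            new_rates.append(rate + 1.0)       # gentle additive increase otherwise
    return buffer, new_rates

# Usage: a synchronized burst from 8 senders; rates back off before drops occur.
buffer, rates = 0, [40.0] * 8
for tick in range(20):
    buffer, rates = step(buffer, rates)
    print(f"tick={tick:2d} buffer={buffer:4d} total_rate={sum(rates):7.1f}")
```

The point of the toy loop is that senders throttle in response to the mark well before the buffer fills, which is what keeps a synchronized burst from turning into the dropped packet that stalls the entire training step.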
Performance often degrades without triggering traditional alerts, leading to “gray failures” and phantom troubleshooting. Why is standard SNMP polling insufficient for detecting these short-lived congestion events, and how are you implementing real-time streaming telemetry to gain better visibility into the transport layer?
Standard SNMP polling is like trying to monitor a high-speed car race by looking at a photo taken every five minutes; you miss all the crashes that happen in between. These “gray failures” are incredibly frustrating because the dashboard stays green, but the users are complaining that the AI models are sluggish or timing out. We are dealing with congestion events that last for milliseconds, which SNMP simply cannot see, leading to hours of what we call “phantom troubleshooting” where the evidence has vanished by the time we look for it. To fix this, we are moving toward real-time streaming telemetry that pushes data constantly rather than waiting to be asked for it. This gives us a high-definition view of the transport layer, allowing us to see those microbursts in the moment and understand exactly why a “healthy” link is actually underperforming.
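A toy example makes the visibility gap obvious. The sketch below synthesizes one second of per-millisecond link utilization containing a 10 ms microburst; a single polled average over that window looks perfectly healthy, while per-sample streaming data surfaces the burst immediately. The trace and thresholds are fabricated for illustration, and a real deployment would subscribe to switch counters over a streaming-telemetry interface rather than generate samples.

```python
# Minimal sketch of why push-based telemetry catches microbursts that coarse
# polling averages away. The traffic trace and thresholds are made up for
# illustration only.

LINK_CAPACITY_GBPS = 800.0
BURST_THRESHOLD = 0.95 * LINK_CAPACITY_GBPS   # flag anything near line rate

def per_ms_samples() -> list[float]:
    """Synthetic per-millisecond utilization: mostly idle with a short burst."""
    quiet = [80.0] * 990                      # ~10% utilization most of the time
    burst = [790.0] * 10                      # a 10 ms microburst near line rate
    return quiet + burst

samples = per_ms_samples()

# Polling-style view: one average over the whole window hides the burst entirely.
poll_average = sum(samples) / len(samples)
print(f"polled average: {poll_average:.1f} Gbps -> looks healthy")

# Streaming view: per-sample inspection surfaces the burst and its duration.
burst_ms = [i for i, s in enumerate(samples) if s >= BURST_THRESHOLD]
if burst_ms:
    print(f"microburst detected: {len(burst_ms)} ms at >= {BURST_THRESHOLD:.0f} Gbps "
          f"(starting at t={burst_ms[0]} ms)")
```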
AI systems operate 24/7 with sustained demand, yet many network operations still rely on manual tickets and fixed maintenance windows. How must NetOps models evolve to reduce alert noise, and what specific automated workflows have you found most effective for managing high-bandwidth, east-west traffic?
The old way of doing things—where a ticket is opened, a human reviews it, and a change is scheduled for 2:00 AM on a Sunday—simply doesn’t work for AI workloads that demand 24/7 sustained performance. Alert noise has become a major crisis; our teams are being drowned in notifications that don’t actually point to a root cause, making it impossible to stay ahead of the curve. We’ve had to lean heavily into automated workflows that can dynamically re-route east-west traffic the moment a bottleneck is detected without waiting for human intervention. The most effective shift has been moving toward “intent-based” networking, where we define the performance parameters we need, and the system automatically tunes the fabric to maintain those levels. This reduces the manual burden and allows our engineers to focus on the architectural evolution of the system rather than constantly “fighting fires” at the CLI.
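To sketch what that closed loop can look like in practice, the example below declares a tail-latency intent, checks each link against it, and drains flows off any violating link onto the least-loaded healthy one without waiting for a ticket. The data model and names are illustrative assumptions, not a description of any vendor's intent-based system.

```python
# Minimal sketch of a closed-loop, intent-style workflow: declare a performance
# target, watch link metrics, and re-route east-west flows away from a link
# that violates the target. Link, Flow names, and the intent value are
# illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Link:
    name: str
    p99_latency_us: float            # latest tail-latency reading for this link
    flows: list[str] = field(default_factory=list)

INTENT_P99_US = 50.0                 # declared intent: keep p99 under 50 microseconds

def remediate(links: list[Link]) -> list[str]:
    """Move flows off any intent-violating link onto the least-loaded healthy link."""
    actions = []
    healthy = [l for l in links if l.p99_latency_us <= INTENT_P99_US]
    for link in links:
        if link.p99_latency_us > INTENT_P99_US and healthy:
            target = min(healthy, key=lambda l: len(l.flows))
            while link.flows:
                flow = link.flows.pop()
                target.flows.append(flow)
                actions.append(f"re-routed {flow}: {link.name} -> {target.name}")
    return actions

# Usage: spine-1 is violating the intent, so its flows drain to spine-2.
fabric = [Link("spine-1", 180.0, ["allreduce-a", "allreduce-b"]),
          Link("spine-2", 22.0, ["db-replication"])]
for action in remediate(fabric):
    print(action)
```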
What is your forecast for the evolution of AI networking infrastructure?
My forecast is that the network will shift from being seen as “plumbing” to being recognized as a first-order limiter of AI performance and overall scaling efficiency. We are going to see a total rethink of how data centers are designed, with a move away from general-purpose fabrics toward highly specialized, low-latency interconnects that are purpose-built for GPU-to-GPU communication. The visibility gap we currently struggle with will close as real-time telemetry becomes the standard, but this will also require a new generation of NetOps talent who are as comfortable with data science and automation as they are with routing protocols. Ultimately, the winners in the AI race won’t just be the ones with the most GPUs, but the ones who can actually keep those GPUs fed with data through a network that never sleeps.
