AWS Rebuilds Its Network for the Age of AI

With AI and cloud computing demanding unprecedented scale, the network is where the digital world runs up against its physical limits. To explore the bleeding-edge innovations happening deep inside the world’s largest data centers, we’re speaking with Matilda Bailey, a networking specialist who focuses on the technologies powering the next generation of hyperscale infrastructure. Our conversation will delve into how exotic technologies like hollow-core fiber are rewriting the rules of data center geography, the custom software needed to manage networks at a scale where traditional algorithms fail, and the deep vertical integration required to shave nanoseconds off latency. We’ll also touch upon the surprisingly complex challenges of physical cabling and a future where liquid cooling becomes the norm for networking gear.

To maintain sub-millisecond latency, availability zones are geographically constrained. How is hollow-core fiber changing this equation, and what are the key cost-benefit trade-offs you evaluate when deciding to deploy this nascent technology in a specific metro area? Please provide some tangible examples.

It’s a fantastic question because it gets right to the heart of physics and finance colliding. For an availability zone to feel like a single, logical facility to a customer, we have to keep that round-trip latency under about half a millisecond. This fundamentally limits how far apart the individual data centers within that zone can be. Hollow-core fiber changes this because light travels through its air-filled core at close to vacuum speed, roughly 45 percent faster than through solid glass, effectively widening the radius we can build in. This gives us immense flexibility, especially in dense metro areas where finding suitable land and power close together is a huge challenge. While the technology is still significantly more expensive, if it’s the only way we can expand in a critical region, it becomes the right trade-off. We’re currently using it very strategically in about five to ten locations where we’ve hit those exact geographic constraints.
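
To make the geometry concrete, here is a back-of-envelope sketch of how the fiber medium changes the reachable radius under a fixed round-trip budget. The 0.5 ms budget comes from the answer above; the effective-index values and the assumption that fiber length equals straight-line distance are illustrative, and this is our sketch, not AWS’s planning model.

```python
# Back-of-envelope: how the fiber medium changes the usable radius of an
# availability zone under a fixed round-trip latency budget.
# Illustrative assumptions: solid-core silica has an effective index of ~1.46,
# hollow-core fiber is close to vacuum speed (~1.003), and fiber path length
# is treated as equal to straight-line distance (real routes are longer).

C_VACUUM_KM_PER_MS = 299.79        # speed of light in vacuum, km per millisecond

RTT_BUDGET_MS = 0.5                # round-trip budget quoted in the interview
ONE_WAY_MS = RTT_BUDGET_MS / 2     # propagation happens in both directions

def max_fiber_run_km(effective_index: float) -> float:
    """Longest one-way fiber run that fits inside the propagation budget."""
    speed_km_per_ms = C_VACUUM_KM_PER_MS / effective_index
    return speed_km_per_ms * ONE_WAY_MS

solid_core = max_fiber_run_km(1.46)     # conventional silica fiber
hollow_core = max_fiber_run_km(1.003)   # air-filled core, near vacuum speed

print(f"solid-core limit : {solid_core:5.1f} km")   # ~51 km
print(f"hollow-core limit: {hollow_core:5.1f} km")  # ~75 km
print(f"radius gain      : {hollow_core / solid_core - 1:.0%}")
```

In practice, transceiver, switching, and queuing delays eat into the same budget, and fiber routes are never straight lines, so real site separations are smaller than these ceilings; but the roughly 45 percent gain in reach is what relaxes the geographic constraint.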

As machine learning workloads demand two to three times more bandwidth per server, traditional control planes can fail. Could you walk us through the specific algorithmic limits you hit and how your custom-built control plane achieves sub-second recovery across hundreds of thousands of links?

This is one of the most critical challenges AI has thrown at us. When you’re deploying servers that need two or three times the bandwidth of their predecessors, the sheer number of devices and optical links in the network just explodes. At that scale, traditional control planes, the brains of the network, simply can’t keep up. You hit fundamental algorithmic limits where recovery times after a failure stretch out, and the network’s convergence slows to a crawl. Around 2020, we realized we had to build something new from the ground up, specifically for these massive ML networks. The result is a control plane that can handle hundreds of thousands of links without hitting those performance cliffs. It delivers sub-second recovery from failures and ensures consistent programming across thousands of devices simultaneously. It’s been so successful that it’s now becoming the foundational control plane for all our networks, not just those dedicated to ML.
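
The interview doesn’t reveal the internals of the custom control plane, but the scaling pressure behind it can be illustrated with a toy model. Everything below is a hypothetical sketch: the detection time, per-link recompute cost, and per-route programming cost are placeholder numbers, not AWS figures, chosen only to show how recovery time grows with topology size.

```python
# Toy model of link-failure recovery time in a centralized-recomputation design.
# All constants are illustrative placeholders, not measurements of any real network.

def recovery_ms(links: int, affected_routes: int,
                detect_ms: float = 50.0,        # time to notice the failed link
                spf_us_per_link: float = 1.0,   # shortest-path recompute cost per link
                program_us_per_route: float = 20.0) -> float:
    """Rough recovery estimate: detection + recomputation + table programming."""
    spf_ms = links * spf_us_per_link / 1000.0
    program_ms = affected_routes * program_us_per_route / 1000.0
    return detect_ms + spf_ms + program_ms

# A modest fabric vs. an ML-scale fabric with roughly 10x the links and routes.
print(f"small fabric : {recovery_ms(links=20_000, affected_routes=5_000):7.1f} ms")
print(f"ML-scale     : {recovery_ms(links=300_000, affected_routes=80_000):7.1f} ms")
```

The point of the sketch is the shape of the curve, not the exact numbers: because recomputation and table programming both scale with topology size, a design that is comfortably sub-second at one generation can quietly cross the threshold at the next, which is why pushing some of the logic down onto the devices themselves, as the answer describes, matters.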

Building networking hardware in-house for 15 years is a significant commitment. Beyond simplifying the supply chain, what specific operational capabilities does this vertical integration unlock? Please share an anecdote where having your own ASIC and OS allowed you to solve a problem off-the-shelf gear couldn’t handle.

The consistency it provides is the biggest advantage. We use the same essential building block—our own ASIC, form factor, and operating system—everywhere from the top-of-rack switch to the core of the internet backbone. A perfect example is the custom control plane we just discussed. It achieves its speed and scale partly by running some of its logic directly on the networking devices themselves. You simply couldn’t achieve that tight integration with off-the-shelf hardware from a third-party vendor. Operationally, it’s a game-changer. When a problem occurs, we can pull the exact telemetry we need, not just what a vendor chooses to expose. We can automate testing and remediation in ways that are deeply tailored to our environment because we control the entire stack. Every small software improvement we make can be instantly scaled across the whole global network.

Standard time protocols can be off by seconds, which is unsuitable for distributed systems. Can you elaborate on the hardware-based approach your team took to deliver nanosecond-level accuracy? What specific technical challenges did you overcome to make this service a reality for customers like Nasdaq?

In large distributed systems, especially in finance or databases, having an inconsistent sense of time is a recipe for disaster. Standard protocols like NTP are software-based and are susceptible to network variability, leading to inaccuracies that can be as large as seconds. To solve this, we had to go to the hardware. We built a dedicated time network that runs in parallel to our data network. In each data center, we have an atomic clock synchronized to GPS, which acts as the ultimate source of truth. From there, specialized devices distribute a precise timing pulse, and custom hardware on every single server receives that pulse. This allows us to achieve nanosecond-level accuracy in hardware, which translates to microsecond-level precision for applications in software. Making this a reality for customers like Nasdaq, who need to run entire exchanges, meant proving that this architecture could deliver the consistency and ordering guarantees that were previously only possible in highly specialized, on-premises environments.
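
To see why the accuracy bound matters for a workload like an exchange, consider when two events on different hosts can be ordered by timestamp alone: only when the gap between them exceeds the combined clock uncertainty. The sketch below is illustrative, not an AWS API, and the error-bound figures are assumptions rather than measurements.

```python
# Whether two timestamped events from different hosts can be safely ordered
# depends on the clock error bound each host reports. Illustrative only.

def can_order(t_a_us: float, err_a_us: float,
              t_b_us: float, err_b_us: float) -> bool:
    """True if event A definitely happened before event B."""
    # A's latest possible true time must precede B's earliest possible true time.
    return (t_a_us + err_a_us) < (t_b_us - err_b_us)

gap_us = 150.0   # two trades arriving 150 microseconds apart

# Software-sync-class uncertainty (say, +/- 10 ms per host): ordering is ambiguous.
print(can_order(0.0, 10_000.0, gap_us, 10_000.0))   # False

# Hardware-assisted sync (say, +/- 10 us per host): ordering is unambiguous.
print(can_order(0.0, 10.0, gap_us, 10.0))           # True
```

The tighter the bound, the more pairs of events can be totally ordered from timestamps alone, without falling back on application-level coordination.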

Hyperscale data centers can have hundreds of thousands of physical links. What are the biggest challenges this creates for cabling, and what specific innovations in cable design, connector technology, and tracking systems have proven most effective for improving deployment speed and long-term reliability?

Cabling at this scale is a massive, and often underappreciated, physical engineering problem. With hundreds of thousands of links in a single building, you run into very real issues with the sheer weight of the copper and fiber, routing it all without creating a tangled mess, and simply deploying it quickly enough to keep up with demand. Maintaining it over the long term is another beast entirely. We’ve invested heavily in innovations to tackle this. Better tracking systems are key, so we know exactly where every single cable goes from end to end. We’ve also driven improvements in cable design to make them lighter and more manageable. But one of the most effective changes has been the move to new connector technologies that can aggregate many individual fibers into a single, robust connection. This dramatically reduces the time it takes to plug everything in and significantly improves reliability by reducing the number of individual failure points.
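
A rough way to see the reliability argument for aggregated connectors, using placeholder numbers rather than real deployment figures: the count of individually handled connection points, and with it the expected number of bad terminations, falls linearly with the aggregation factor.

```python
# Illustrative: how aggregating fibers into multi-fiber connectors reduces
# the number of hand-terminated connection points in a large facility.
# All figures are placeholders, not real deployment data.

fiber_links = 400_000          # individual fiber strands in the building
fibers_per_connector = 16      # e.g., a multi-fiber trunk connector
fault_rate = 0.001             # assumed defect rate per handled connection

individual = fiber_links
aggregated = fiber_links // fibers_per_connector

print(f"connections to handle: {individual:,} -> {aggregated:,}")
print(f"expected defects     : {individual * fault_rate:,.0f} -> {aggregated * fault_rate:,.0f}")
```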

You’ve noted a future where liquid cooling becomes standard for network devices and optics shift toward co-packaged connectors. What are the primary engineering and operational hurdles to overcome for these transitions, and how will they help optimize for “watts per bit” in next-generation data centers?

These two shifts are all about efficiency, specifically optimizing for “watts per bit”—the energy cost to move data. The first big transition is liquid cooling. As servers get hotter, they’re increasingly liquid-cooled, but having air-cooled networking gear right next to them creates a complex and inefficient thermal environment. Moving network devices to liquid cooling simplifies the data center design and offers superior heat removal. The second shift is in optics. For years, the industry has talked about fully co-packaged optics, where the optical components are integrated directly with the switch ASIC. However, this creates reliability and supply chain nightmares. A more practical middle ground is emerging: co-packaged connectors. This approach moves the electrical connection closer to the ASIC for efficiency gains but keeps the optical engine itself modular and replaceable. This gives us the best of both worlds—better performance and lower power consumption without sacrificing the operational flexibility and supplier diversity that is absolutely critical at our scale.
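
“Watts per bit” is simply power drawn divided by data moved, so the metric can be made concrete with a toy comparison. The switch capacity and power figures below are hypothetical round numbers, not vendor or AWS data; they only show how the metric shifts as the electrical path to the optics shortens.

```python
# "Watts per bit" for a switch: total power divided by total throughput,
# expressed in picojoules per bit (1 W per bit/s equals 1 J/bit).
# Capacity and power numbers are hypothetical round figures.

def picojoules_per_bit(power_w: float, capacity_tbps: float) -> float:
    bits_per_second = capacity_tbps * 1e12
    return power_w / bits_per_second * 1e12   # J/bit -> pJ/bit

# The same hypothetical 51.2 Tb/s switch under three optics arrangements.
configs = {
    "pluggable optics":       2_200.0,  # W, illustrative
    "co-packaged connectors": 1_800.0,  # W, illustrative
    "fully co-packaged":      1_500.0,  # W, illustrative
}

for name, power_w in configs.items():
    print(f"{name:24s}: {picojoules_per_bit(power_w, 51.2):4.1f} pJ/bit")
```

Under these assumed figures, the trade described in the answer is accepting the middle row’s efficiency rather than the bottom row’s, in exchange for keeping the optical engine field-replaceable and multi-sourced.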

What is your forecast for multi-cloud networking over the next five years?

My forecast is that it will become increasingly seamless and invisible to the end user. The goal is to get to a point where the network never gets in the way of what a customer wants to build, regardless of where their resources are located. We’ll see continued massive expansion in capacity and bandwidth, driven by AI, which will force even tighter integration between compute, storage, and networking services. Enterprises will demand and receive more sophisticated tools to manage their hybrid environments, but the underlying complexity will be further abstracted away. Ultimately, success means customers don’t have to think about the network at all; it’s just a reliable, high-performance, and secure utility that is always on and always has enough capacity for their next big idea.
