As a veteran networking specialist focusing on the intricate plumbing of next-generation data centers, Matilda Bailey has spent years at the intersection of high-speed interconnects and hardware optimization. With AI workloads now pushing traditional architectures to their breaking point, her expertise in cellular, wireless, and fabric solutions provides a unique lens into how we manage the massive data flows required for modern inference. She understands that the race for AI supremacy isn’t just about who has the most teraflops, but who can feed those processors without drowning in memory costs or latency spikes.
In this discussion, we explore the shifting landscape of data center economics, where high-bandwidth memory has become the primary bottleneck for scaling large language models. We delve into the rising costs of server DRAM, the strategic importance of Compute Express Link (CXL) as a pressure valve for hardware, and the transition toward composable infrastructure. By re-evaluating the memory hierarchy, Matilda illustrates how organizations can move away from rigid, stranded capacity toward a more elastic and cost-effective serving stack.
In many inference workloads, high-bandwidth memory (HBM) limits often lead to underutilized GPUs. How do you distinguish between compute-bound and memory-bound bottlenecks in a live environment, and what specific utilization metrics should teams prioritize to avoid scaling infrastructure unnecessarily?
To truly see what’s happening under the hood, you have to look past simple GPU utilization percentages, which are often deceptive. A GPU might report 90% activity, but if it spends 40% of those cycles waiting for data to arrive from HBM, your expensive silicon is effectively idling in a high-power state. We distinguish these bottlenecks by looking at the arithmetic intensity of the kernels; if your tokens-per-second flatline while memory bandwidth is maxed out, you are firmly in a memory-bound scenario. Teams must prioritize monitoring the KV cache fill rate and memory bus saturation, because once HBM hits its ceiling, the system usually responds by lowering concurrency or sharding requests. This leads to a massive deterioration in unit economics: you are paying for 100% of the compute but benefiting from only a fraction of its throughput.
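The arithmetic-intensity test Matilda describes can be sketched with the roofline model: a kernel is memory-bound when its FLOPs-per-byte falls below the hardware's "ridge point." This is a minimal illustration; the peak-throughput and bandwidth figures below are placeholders, not vendor specifications.

```python
# Roofline-style classification of a kernel as compute- or memory-bound.
# Hardware numbers are illustrative placeholders, not real specs.
PEAK_FLOPS = 989e12      # assumed peak FP16 throughput, FLOP/s
PEAK_HBM_BW = 3.35e12    # assumed HBM bandwidth, bytes/s
RIDGE_POINT = PEAK_FLOPS / PEAK_HBM_BW   # FLOPs/byte at the crossover

def classify_kernel(flops: float, bytes_moved: float) -> str:
    """Compare a kernel's arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"

# Decode-phase attention streams the whole KV cache to emit one token,
# so its intensity is tiny -- firmly memory-bound.
print(classify_kernel(flops=2e9, bytes_moved=1e9))  # memory-bound
```

A kernel sitting below the ridge point will not go faster with more compute; only more bandwidth, or less data movement, helps.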
KV cache growth during long-context conversations often forces a choice between lowering concurrency or increasing hardware footprints. What are the performance implications of offloading these caches to a tiered system involving CPU RAM or SSDs, and how does this affect tail latency for multi-turn dialogues?
When you move KV caches out of the ultra-fast HBM tier and into CPU RAM or SSDs, you are making a calculated trade-off between capacity and speed. Offloading to CPU RAM can significantly expand your effective context length, but it introduces a latency penalty that becomes most apparent during the prefill and decode stages of a conversation. In multi-turn dialogues, this manifests as a spike in p95 and p99 tail latencies, where the first few tokens of a response feel sluggish as the system fetches the historical context from a slower tier. However, if the alternative is an out-of-memory error or refusing new connections, the tiered approach lets the system absorb volatility and maintain availability. It’s a sensory shift for the user, moving from an instant response to one that takes a slight, rhythmic “breath” before the text begins to flow.
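The demotion-and-promotion cycle behind that tail-latency spike can be sketched as a two-tier store: a small "HBM" tier evicts its least-recently-used session into a larger "CPU RAM" tier, and a returning session pays the promotion cost. This is a toy model under assumed names; a real server moves tensors over PCIe, not strings between dicts.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV store: LRU eviction from 'HBM' into 'CPU RAM'."""

    def __init__(self, hbm_slots: int):
        self.hbm_slots = hbm_slots
        self.hbm = OrderedDict()   # fast tier, strictly limited capacity
        self.cpu = {}              # slower, larger overflow tier

    def put(self, session_id, kv_block):
        self.hbm[session_id] = kv_block
        self.hbm.move_to_end(session_id)
        while len(self.hbm) > self.hbm_slots:
            cold_id, cold_block = self.hbm.popitem(last=False)
            self.cpu[cold_id] = cold_block       # demote coldest session

    def get(self, session_id):
        if session_id in self.hbm:               # fast path: no penalty
            self.hbm.move_to_end(session_id)
            return self.hbm[session_id]
        block = self.cpu.pop(session_id)         # slow path: the p99 spike
        self.put(session_id, block)              # promote back into HBM
        return block
```

The slow path in `get` is exactly where a multi-turn user feels the "breath": the session's history is hauled back up a tier before the first token can decode.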
Market trends indicate a widening supply-demand gap and rising contract prices for server DRAM through 2026. What strategic shifts should platform architects make now to mitigate these costs, and how can memory efficiency play a larger role in maintaining a sustainable total cost of ownership?
With TrendForce forecasting steep price increases for server DRAM in early 2026, the era of “just adding more sticks of RAM” as a cheap fix is officially over. Architects need to pivot toward aggressive memory compression techniques and more sophisticated scheduling that treats memory as a scarce, controllable asset rather than a background variable. We are seeing a shift where organizations must audit their fleet to identify “stranded” memory—those gigabytes sitting idle in one node while another node is starved—and implement software-defined memory sharing. By focusing on memory efficiency, you can significantly lower your Total Cost of Ownership (TCO) because you aren’t forced to buy additional $30,000 GPU nodes just to get the extra 80GB of HBM they happen to carry. It’s about being surgical with your hardware allocation and ensuring that every byte of DRAM is actively contributing to token generation.
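The fleet audit Matilda recommends reduces to simple arithmetic: sum the idle DRAM per node and you have the pool that software-defined sharing could reclaim before any new hardware is bought. A minimal sketch, with an invented three-node inventory:

```python
# Sketch of a fleet audit for "stranded" memory. The node inventory
# below is invented for illustration.
fleet = {
    "node-a": {"dram_gb": 1024, "used_gb": 310},
    "node-b": {"dram_gb": 1024, "used_gb": 970},
    "node-c": {"dram_gb": 512,  "used_gb": 120},
}

def stranded_gb(fleet: dict) -> int:
    """DRAM sitting idle across the fleet that pooling could reclaim."""
    return sum(n["dram_gb"] - n["used_gb"] for n in fleet.values())

total = sum(n["dram_gb"] for n in fleet.values())
idle = stranded_gb(fleet)
print(f"{idle} GB of {total} GB stranded ({idle / total:.0%})")
```

Even in this toy fleet, nearly half the DRAM is paid for but idle while node-b runs hot, which is exactly the imbalance that pooling addresses.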
Compute Express Link (CXL) provides a new tier between local memory and storage. How does integrating CXL memory change the way you plan for overflow capacity, and what technical challenges arise when moving from fixed, location-bound HBM to a more elastic memory configuration?
CXL acts as a critical pressure valve that breaks the rigid silos of traditional server architecture by offering a standards-based way to attach memory with near-local latencies. In a traditional setup, HBM is a hard ceiling; if you hit it, your workload crashes or slows to a crawl, but with CXL, you have a high-bandwidth “next tier” that can catch that overflow. The technical challenge is that CXL is not a total “latency eraser,” so you have to design your serving stack to be aware of which data sits in the fast HBM and which can migrate to the CXL-attached pool. It requires a much more intelligent orchestration layer that can handle the transition from location-bound memory to this more elastic, fabric-oriented configuration. You’re essentially moving from a world of fixed buckets to a world of flowing streams, which is much more efficient but requires a deeper understanding of your data’s “warmth.”
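The "warmth"-aware orchestration layer can be sketched as a greedy placement: rank KV blocks by access rate, keep the hottest in HBM until capacity runs out, and let the rest overflow to the CXL-attached pool. The block names, sizes, and access rates are hypothetical, and a real placer would also weigh migration cost.

```python
# Greedy warmth-aware tier assignment. All inputs are hypothetical.
def assign_tiers(blocks, hbm_capacity_gb):
    """blocks: list of (name, size_gb, accesses_per_sec) tuples."""
    ranked = sorted(blocks, key=lambda b: b[2], reverse=True)
    placement, used = {}, 0.0
    for name, size, _ in ranked:
        if used + size <= hbm_capacity_gb:
            placement[name] = "hbm"    # hot data keeps local latency
            used += size
        else:
            placement[name] = "cxl"    # warm data rides the fabric
    return placement

blocks = [("sys_prompt", 4, 900), ("sess_1", 20, 50), ("sess_2", 20, 400)]
print(assign_tiers(blocks, hbm_capacity_gb=30))
```

Here the frequently hit system prompt and the busy session stay local, while the quiet session migrates to the elastic tier, the fixed-buckets-to-flowing-streams shift in miniature.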
Recent hardware standards allow for memory sharing across multiple hosts rather than keeping it stranded. How can teams practically implement data reuse across requests using these fabric-oriented systems, and what does the step-by-step transition to a fully composable AI infrastructure look like for a growing organization?
Implementing data reuse starts with moving toward a shared KV cache architecture where multiple inference workers can tap into a common memory region via a CXL 3.0 fabric. For a growing organization, the first step is usually implementing memory expansion on individual nodes to stop the “out of memory” fire drills, followed by a transition to memory pooling where resources are dynamically allocated. This allows you to serve multiple models or long-running sessions without duplicating the same foundational context data across every single GPU’s local HBM. The final stage is a fully composable environment where your compute, memory, and storage are decoupled, allowing you to scale your memory independently of your GPU count. This transition feels like moving from a rigid, pre-built Lego set to a box of loose bricks; it’s more complex to manage, but the structures you can build are infinitely more aligned with your actual workload needs.
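The shared-KV-cache idea in the first step can be sketched as prefix deduplication: requests that share a token prefix (say, a common system prompt) hash to the same entry, so only the first request pays the prefill. This is an assumed design, with the backing dict standing in for a CXL-pooled memory region.

```python
import hashlib

class SharedPrefixCache:
    """Toy prefix-deduplicating KV store; the dict stands in for a
    fabric-attached memory pool shared by many inference workers."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tokens: tuple) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def get_or_compute(self, tokens: tuple, compute_kv):
        k = self._key(tokens)
        if k not in self._store:
            self._store[k] = compute_kv(tokens)  # first request pays prefill
        return self._store[k]                    # later requests reuse it
```

Every worker that serves a request with the same prefix skips the duplicate prefill, which is precisely the foundational-context reuse that pooling makes possible across GPUs rather than within one.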
What is your forecast for AI infrastructure efficiency?
I forecast that over the next three years, we will see a “Great Decoupling” where memory and compute are no longer sold or scaled as a single, inseparable unit. As DRAM prices climb and HBM remains scarce, the most successful AI platforms will be those that embrace CXL fabrics to create a fluid, tiered memory architecture that spans from the chip to the rack. We will move away from the brute-force method of adding more GPUs to solve memory bottlenecks and instead adopt a sophisticated, software-defined approach to memory management. This will likely result in a 2x to 3x improvement in infrastructure efficiency, as we finally stop leaving expensive silicon sitting idle while waiting for the data it needs to process. The future of AI isn’t just about faster processors; it’s about a smarter, more elastic hierarchy that ensures no byte of memory—and no GPU cycle—is ever wasted.
