Matilda Bailey is a distinguished networking specialist who has spent years at the intersection of traditional on-premises hardware and the rapidly expanding world of next-gen cloud solutions. As enterprises increasingly transition not just their workloads, but their entire network architectures—including BGP sessions and virtual firewalls—into public cloud environments, the complexity of maintaining oversight has reached a breaking point. This conversation explores the systemic shift of moving foundational networking constructs like VPN terminations and transit gateways to the cloud, the technical hurdles of normalizing telemetry across hybrid paths, and the transformative potential of layering generative AI onto foundational network models to streamline complex troubleshooting.
As enterprises migrate network constructs like BGP sessions and virtual firewalls into the cloud, what visibility gaps typically emerge? How does this shift complicate the task of tracing a connectivity problem from an on-premises branch through to a cloud-hosted application?
When organizations migrate their network infrastructure into the cloud, they often lose the granular control they once enjoyed in their own data centers. We are seeing a massive shift where BGP sessions, virtual firewalls, and VPN terminations are following compute into the cloud, yet the observability tools haven’t always kept pace. The primary gap emerges because network teams are forced to jump between different tools—one for the branch SD-WAN, another for the colocation interconnect, and yet another for the cloud-native VPC. This fragmented view makes it incredibly difficult to pinpoint whether a latency spike is happening at the service provider circuit or within the cloud’s own transit gateway. Without a single correlated view, the simple act of tracing a packet becomes a multi-hour investigation involving different teams and disconnected datasets.
A single network path might involve SD-WAN providers, colocation interconnects, and VPC transit gateways. How can teams effectively stitch these disparate data points together, and what specific metrics should they prioritize to identify where a bottleneck actually resides within that end-to-end trace?
To effectively stitch these points together, you need a platform that can ingest and correlate data from every hop in the journey, from the initial SD-WAN branch to the final cloud-hosted application. Teams should prioritize high-fidelity metrics like VPC flow logs and real-time infrastructure change data from the hyperscalers themselves. It is no longer enough to just look at throughput; you have to monitor the health of specific cloud constructs like AWS Direct Connect gateways and Google Cloud Routers. By consolidating these metrics into a single pane of glass, you can see whether a bottleneck is sitting in an Equinix interconnect or whether it is a configuration issue within a VRF. This level of depth is what allows NetOps teams to maintain the same level of AIOps-driven observability they had in traditional data centers.
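As a rough illustration of locating a bottleneck within a stitched end-to-end trace, the Python sketch below compares each segment's observed latency with its expected baseline and reports the segment with the largest excess. The hop labels, the `HopSample` type, and all numbers are hypothetical; a real platform would derive them from flow logs and device telemetry rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopSample:
    hop: str            # hypothetical segment label along the path
    latency_ms: float   # observed latency across this segment
    baseline_ms: float  # expected latency for this segment


def find_bottleneck(path):
    """Return the hop whose latency exceeds its baseline by the largest margin."""
    return max(path, key=lambda h: h.latency_ms - h.baseline_ms)


# Illustrative path: branch SD-WAN -> colocation interconnect -> transit gateway -> VPC
path = [
    HopSample("branch-sdwan", 12.0, 10.0),
    HopSample("colo-interconnect", 48.0, 8.0),
    HopSample("transit-gateway", 6.0, 5.0),
    HopSample("vpc-app-subnet", 3.0, 2.5),
]

worst = find_bottleneck(path)
print(f"{worst.hop}: +{worst.latency_ms - worst.baseline_ms:.1f} ms over baseline")
```

Comparing against a per-segment baseline, rather than raw latency, keeps a naturally long segment such as a WAN circuit from always being reported as the culprit.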
Normalizing telemetry from legacy SNMP and modern cloud APIs requires a robust data normalization layer. How do you ensure this process remains source-agnostic, and what are the step-by-step technical challenges when correlating structured hyperscaler event data with varied on-premises vendor formats?
The secret to remaining source-agnostic is implementing a data hypervisor that sits between the ingestion layer and the analytics engine. This layer acts as a translator, taking the highly structured, API-based event data from hyperscalers and matching it with the more varied, often messy telemetry coming from on-premises SNMP or streaming sources. One of the biggest technical challenges is the sheer volume of vendor variation in on-premises environments, which is actually much more complex to normalize than cloud data. Cloud providers tend to publish very consistent event streams, whereas legacy hardware might report the same metric in five different ways depending on the firmware version. The normalization process must ensure that by the time data reaches the AI engine, the source is irrelevant, allowing for a seamless analysis of the entire hybrid path.
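To make the normalization idea concrete, here is a minimal Python sketch of a source-agnostic translation step. It assumes a hand-maintained alias table that maps each source's field name to one canonical metric and unit. The field names (`ifInOctetsRate`, `in_kbps`, `rxBitsPerSec`, `BytesInPerSecond`), the device labels, and the `Metric` type are illustrative assumptions, not any vendor's or hyperscaler's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    source: str   # where the sample came from; kept for audit only
    device: str
    name: str     # canonical metric name
    value: float  # canonical unit: bits per second


# Hypothetical alias table: raw field name -> (canonical name, multiplier to bits/s)
ALIASES = {
    "ifInOctetsRate": ("rx_bps", 8.0),      # octets/s -> bits/s
    "in_kbps": ("rx_bps", 1000.0),          # kilobits/s -> bits/s
    "rxBitsPerSec": ("rx_bps", 1.0),        # already bits/s
    "BytesInPerSecond": ("rx_bps", 8.0),    # bytes/s -> bits/s
}


def normalize(source, device, raw_name, raw_value):
    """Translate one raw sample into the canonical schema, or fail loudly."""
    try:
        name, factor = ALIASES[raw_name]
    except KeyError:
        raise ValueError(f"unmapped metric: {raw_name!r}") from None
    return Metric(source, device, name, float(raw_value) * factor)


samples = [
    normalize("snmp", "branch-rtr-1", "ifInOctetsRate", 125000),
    normalize("streaming", "branch-rtr-2", "in_kbps", 1000),
    normalize("cloud-api", "tgw-1", "BytesInPerSecond", 125000),
]
for s in samples:
    print(s.device, s.name, s.value)
```

All three samples come out as the same metric in the same unit, so downstream analytics never need to branch on the source. Rejecting unmapped names, instead of passing them through, is a deliberate choice in this sketch: silent pass-through is how inconsistent units reach the analytics layer.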
When hyperscalers publish infrastructure change events, how can organizations distinguish between routine provider maintenance and a configuration error on their own side? Could you share a scenario where correlating these events revealed an unexpected application flow or a hidden network dependency?
Distinguishing between provider-side changes and internal errors requires correlating hyperscaler event streams directly with observed network behavior in real time. If a cloud provider updates a physical link and your latency simultaneously spikes, the platform can flag that as an external dependency issue rather than a self-inflicted configuration error. We have seen cases where an enterprise was performing a migration review and discovered a completely unknown application flow that had never been documented. This revealed a hidden dependency that would have caused a major outage during the migration if it hadn’t been surfaced by correlating the VPC flow logs with the branch telemetry. It’s these “unknown unknowns” that usually cause the most damage during cloud transitions, and having that visibility is a game-changer for risk management.
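One simple way to approximate this correlation is a time-window join between latency spikes and provider change events on the same resource. The Python sketch below assumes both streams have already been normalized to `(timestamp, resource)` pairs. The resource names, the five-minute window, and the labels are hypothetical, and temporal proximity is only a heuristic signal, not proof of cause.

```python
from datetime import datetime, timedelta


def classify_spikes(spikes, provider_events, window=timedelta(minutes=5)):
    """Label each spike by whether a provider change on the same resource occurred within the window."""
    labeled = []
    for spike_time, resource in spikes:
        near_change = any(
            ev_resource == resource and abs(spike_time - ev_time) <= window
            for ev_time, ev_resource in provider_events
        )
        label = "provider-change" if near_change else "investigate-internal"
        labeled.append((spike_time, resource, label))
    return labeled


# Illustrative data only
provider_events = [
    (datetime(2024, 1, 10, 14, 0), "interconnect-a"),
]
spikes = [
    (datetime(2024, 1, 10, 14, 3), "interconnect-a"),   # 3 minutes after the change
    (datetime(2024, 1, 10, 16, 30), "interconnect-a"),  # hours later
    (datetime(2024, 1, 10, 14, 2), "transit-gw-1"),     # different resource
]

results = classify_spikes(spikes, provider_events)
for when, resource, label in results:
    print(when.isoformat(), resource, label)
```

Only the first spike is attributed to the provider change; the other two are left for internal investigation because they fall outside the window or involve a different resource.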
Foundational AI models are increasingly being used to manage network infrastructure. Beyond improving the user interface, how will layering generative AI onto these models transform day-to-day troubleshooting, and what specific steps should teams take to prepare their data for this shift?
Layering generative AI onto foundational network models transforms troubleshooting from a reactive search process into a conversational, proactive diagnostic experience. Instead of manually querying logs, a network engineer can interact with the system to ask complex questions about traffic anomalies and receive “eye-popping” insights that were previously buried in the noise. To prepare for this, teams must first ensure their data is properly normalized and aggregated through a robust hypervisor layer; otherwise, the AI will provide inaccurate results based on fragmented information. We are moving toward a major transformation where the AI doesn’t just show you a graph but actually explains the root cause of a BGP flap or a firewall bottleneck across multi-cloud domains. Building those foundational infrastructure models first is critical before the generative layer can truly provide value in a production environment.
What is your forecast for multi-cloud network observability?
I expect that over the next few years, the distinction between “cloud networking” and “on-premises networking” will completely vanish in favor of a unified NetOps approach. We are heading toward a fall release of technologies in which generative AI becomes the primary interface for managing complex global fabrics, making it possible to automate the resolution of connectivity issues that span three or four different providers. As hyperscalers continue to provide more structured telemetry, the ability to correlate that data with third-party tools like virtual load balancers will become a standard requirement for any enterprise. Ultimately, the winners in this space will be the organizations that can achieve a source-agnostic view, where the network is managed as a single, continuous entity regardless of where the physical or virtual hardware actually resides.
