AI and Talent Gaps Are Reshaping Network Operations

As the landscape of enterprise networking shifts under the weight of artificial intelligence and hybrid cloud architectures, few professionals understand the resulting operational friction as well as Matilda Bailey. A veteran networking specialist with a career dedicated to cellular, wireless, and next-generation solutions, she has spent years observing how infrastructure must evolve to meet the demands of the modern enterprise. In this discussion, we explore the stark reality facing network operations teams today, where traditional management strategies increasingly collide with a talent crisis and the relentless arrival of AI workloads. From the decline in strategic success rates to the urgent pivot toward “Day Two” automation, she provides a detailed look at the pressures redefining the role of the network engineer in 2024 and beyond.

We examine the factors behind these operational struggles, including the widening gap between tool availability and proactive problem detection and the cultural and technical hurdles of managing multi-cloud environments. Our conversation covers the necessity of AI-driven agentic automation, the specific telemetry needs of GPU clusters, and why successful teams are moving away from simple tool consolidation in favor of deep architectural integration.

The most recent benchmarking data suggests a troubling trend: the number of organizations claiming their network operations strategy is “completely successful” has dropped from 42% just two years ago to only 31% today. Based on your observations, what specific operational pressures are causing this significant decline in confidence?

This decline reflects a perfect storm of pressures hitting network operations centers all at once. We are seeing a massive “support gap” where CIOs are pushing for AI transformation, yet they are not giving network operators the budget to fill empty seats or the influence they need over modern hybrid and multi-cloud architectures. Currently, nearly 28% of all network problems are still caused by manual administrative errors, which is a symptom of teams being overworked and stretched too thin. When you combine that with the fact that many existing networks were never built to handle the intense, low-latency demands of AI workloads, it is no wonder that confidence is shaken. Operators are essentially being asked to navigate a high-speed digital future using a roadmap and a crew size that belong to a much simpler era.

A recurring theme in NetOps is the sheer volume of monitoring tools, yet the data shows that 29% of a professional’s day is still lost to troubleshooting. Why hasn’t the presence of four to ten different management tools per organization managed to solve the problem of proactive detection?

The issue isn’t a lack of data, but rather a lack of clarity; we have a chronic case of “tool sprawl” that has persisted for over a decade without significantly improving outcomes. Only about 37% of the alerts generated by these various monitoring tools actually indicate a real, actionable problem, meaning engineers spend their time chasing ghosts or filtering through noise. In fact, despite all these investments, only 58% of network problems are detected proactively before the end-user feels the impact. This inefficiency is exactly why 73% of IT professionals say they are likely to replace a network observability or monitoring tool within the next two years. They aren’t looking for more tools; they are looking for better integration and tools that can actually tell them what is broken before the help desk phone starts ringing.
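
The two figures quoted above are simple ratios, and a small sketch may make them concrete. The `Alert` and `Incident` record shapes and the tool names are hypothetical, and the sample data is invented to mirror the 37% and 58% figures. It is not taken from the benchmarking data itself.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source_tool: str  # hypothetical tool name
    actionable: bool  # True if the alert pointed at a real problem


@dataclass
class Incident:
    detected_by_monitoring: bool  # False means an end user reported it first


def actionable_ratio(alerts: list[Alert]) -> float:
    """Share of alerts that indicated a real, actionable problem."""
    if not alerts:
        return 0.0
    return sum(a.actionable for a in alerts) / len(alerts)


def proactive_detection_rate(incidents: list[Incident]) -> float:
    """Share of incidents caught by monitoring before users noticed."""
    if not incidents:
        return 0.0
    return sum(i.detected_by_monitoring for i in incidents) / len(incidents)


# Invented sample sized to mirror the percentages in the interview.
alerts = [Alert("tool-a", True)] * 37 + [Alert("tool-b", False)] * 63
incidents = [Incident(True)] * 58 + [Incident(False)] * 42

print(f"actionable alerts: {actionable_ratio(alerts):.0%}")
print(f"proactive detection: {proactive_detection_rate(incidents):.0%}")
```

Tracking both numbers per tool, rather than in aggregate, is one way a team might decide which tool is generating the noise.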

The talent shortage seems to have reached a tipping point, with 52% of organizations now struggling to hire experts compared to just 26% in 2022. How is this hiring crisis specifically impacting the ability of teams to implement the automation they so desperately need?

It is a bit of a Catch-22 situation because the skills gap itself has become the primary barrier to solving the labor shortage through automation. About 46% of teams cite a lack of internal skills as the biggest hurdle to automating their workflows, followed closely by 36.4% who struggle with tool limitations or a lack of integration. I recently heard from an architect at a Fortune 500 company who mentioned that their management expects a ten-person team to handle a workload that previously required twenty-five people. Without senior-level engineers who understand how to build and maintain automation pipelines for cloud and security, these smaller teams remain stuck in a manual loop. They simply do not have the breathing room to innovate when they are spending nearly a third of their day just keeping the lights on.

There is a noticeable shift in priority toward “Day Two” operations, with 79% of respondents calling it a high priority. Could you explain why the focus has moved away from initial provisioning and toward the ongoing remediation of network problems?

For a long time, the industry focused on “Day Zero” and “Day One,” which were all about getting the network up and running through provisioning and configuration. However, the real pain is in the ongoing detection, triage, and healing of the network in production, which is what we call Day Two. Organizations are now hunting for “agentic automation”: tools that can reason about network conditions and take autonomous action to remediate incidents without human intervention. We are seeing a massive demand for AI-driven insights, with 54.3% of teams prioritizing automated security response and 44.3% looking for self-healing capabilities. The goal is to move beyond simple scripts and into a realm where the network can optimize its own performance and capacity in real time.
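
The detect, triage, and heal cycle described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions: a static rule table stands in for the reasoning step an agentic tool would perform, and every name here (`PLAYBOOK`, `MAX_AUTONOMOUS_SEVERITY`, the symptom and action strings, the 1 to 5 severity scale) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Incident:
    device: str
    symptom: str   # e.g. "interface_flapping" (hypothetical label)
    severity: int  # 1 (low) to 5 (critical); the scale is an assumption


# Hypothetical playbook mapping a symptom to a remediation action name.
PLAYBOOK: dict[str, str] = {
    "interface_flapping": "bounce_interface",
    "bgp_session_down": "reset_bgp_session",
}

# Assumed guardrail: anything above this severity goes to a human.
MAX_AUTONOMOUS_SEVERITY = 3


def triage(incident: Incident) -> str:
    """Return "remediate" or "escalate" for a detected incident."""
    if incident.symptom not in PLAYBOOK:
        return "escalate"  # no known fix, hand off to an engineer
    if incident.severity > MAX_AUTONOMOUS_SEVERITY:
        return "escalate"  # known fix, but too risky to run unattended
    return "remediate"


def handle(incident: Incident, execute: Callable[[str, str], None]) -> str:
    """Triage the incident and run the playbook action when allowed."""
    decision = triage(incident)
    if decision == "remediate":
        execute(incident.device, PLAYBOOK[incident.symptom])
    return decision


actions_taken: list[tuple[str, str]] = []


def record(device: str, action: str) -> None:
    """Stand-in executor that records the action instead of touching a device."""
    actions_taken.append((device, action))


for inc in [
    Incident("edge-1", "interface_flapping", 2),
    Incident("core-1", "bgp_session_down", 5),
    Incident("edge-2", "unknown_packet_loss", 1),
]:
    print(inc.device, handle(inc, record))
```

Only the first incident is remediated automatically; the other two are escalated, one because of severity and one because no playbook entry exists. Keeping the guardrail separate from the playbook lets a team widen autonomy gradually.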

Multi-cloud environments are now the norm, yet only 36% of organizations feel they are completely effective at managing them. What is causing this friction between network teams and cloud engineering groups?

The friction is both technical and cultural, rooted in the fact that every cloud provider has proprietary networking constructs that don’t always talk to each other. We are seeing a lack of feature parity; an observability tool might be excellent at analyzing data from AWS but fall significantly behind when it comes to Google Cloud Platform or secondary providers. This inconsistency in telemetry makes end-to-end visibility across on-premises and cloud environments nearly impossible for the average team. Furthermore, there is often a lack of integrated IP address management, which creates silos between the people building the cloud apps and the people responsible for the underlying connectivity. Until we have a unified abstraction layer that spans all environments, that 36% effectiveness rating is unlikely to improve.
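
One way to picture the unified abstraction layer mentioned above is a set of small adapters that map each provider's telemetry into one common record. The sketch below is an assumption-laden illustration: `provider-a` and `provider-b`, their field names, and the `FlowRecord` shape are all hypothetical and do not reproduce any real cloud provider's schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FlowRecord:
    """Common shape that every provider's flow data is normalized into."""
    provider: str
    src_ip: str
    dst_ip: str
    bytes_sent: int


def from_provider_a(raw: dict) -> FlowRecord:
    # Hypothetical flat record layout.
    return FlowRecord("provider-a", raw["srcaddr"], raw["dstaddr"], int(raw["bytes"]))


def from_provider_b(raw: dict) -> FlowRecord:
    # Hypothetical nested record layout.
    conn = raw["connection"]
    return FlowRecord("provider-b", conn["src_ip"], conn["dest_ip"], int(raw["bytes_sent"]))


NORMALIZERS: dict[str, Callable[[dict], FlowRecord]] = {
    "provider-a": from_provider_a,
    "provider-b": from_provider_b,
}


def normalize(provider: str, raw: dict) -> FlowRecord:
    """Convert one raw provider record into the common FlowRecord."""
    return NORMALIZERS[provider](raw)


records = [
    normalize("provider-a", {"srcaddr": "10.0.0.5", "dstaddr": "10.1.0.9", "bytes": "1200"}),
    normalize("provider-b", {"connection": {"src_ip": "10.1.0.9", "dest_ip": "10.0.0.5"},
                             "bytes_sent": 800}),
]

# With a common shape, cross-cloud questions become a single query.
total_bytes = sum(r.bytes_sent for r in records)
print(total_bytes)  # 2000
```

Adding another environment then means writing one more adapter and leaving the downstream analysis untouched.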

With nearly half of all enterprises already deploying AI training or inference workloads, why is there such a low level of readiness—only 35%—regarding their current observability tools?

AI workloads are fundamentally different from traditional web traffic because they rely heavily on GPU clusters and have incredibly strict requirements for tail latency. Most current tools aren’t equipped to correlate GPU utilization with network performance, which is a specific enhancement that 34.3% of teams are now asking for. There is also a major push for real-time streaming telemetry to replace traditional polling intervals, as 40.2% of professionals realize that minutes-old data is useless when an AI inference job fails in seconds. To close this gap, 51.3% of organizations are looking for AI-powered troubleshooting that can proactively alert them to performance risks before they derail a project. Essentially, the network is becoming the bottleneck for AI transformation, and the tools haven’t yet caught up to the speed of the silicon.
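
The point about polling intervals can be shown with a synthetic example. In the sketch below, a five-second latency spike falls between two 60-second polls and is never seen, while one-second sampling catches it. The latency values, the spike window, and the threshold are invented for illustration. One-second sampling also only stands in for true push-based streaming telemetry.

```python
def latency_ms_at(second: int) -> float:
    """Synthetic tail latency: 2 ms baseline with a 5-second spike to 80 ms."""
    return 80.0 if 125 <= second < 130 else 2.0


def sample(interval_s: int, duration_s: int) -> list[tuple[int, float]]:
    """Collect (timestamp, latency) pairs at a fixed interval."""
    return [(t, latency_ms_at(t)) for t in range(0, duration_s, interval_s)]


def spikes(samples: list[tuple[int, float]], threshold_ms: float = 20.0) -> list[int]:
    """Timestamps at which latency exceeded the threshold."""
    return [t for t, value in samples if value > threshold_ms]


polled = sample(interval_s=60, duration_s=300)   # samples at 0, 60, 120, 180, 240
streamed = sample(interval_s=1, duration_s=300)  # one sample per second

print("polling saw spikes at:", spikes(polled))      # []
print("streaming saw spikes at:", spikes(streamed))  # [125, 126, 127, 128, 129]
```

The polled view reports a healthy network for the whole window, which is the failure mode described above when an inference job fails within seconds.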

We often hear about the need to consolidate tools to reduce complexity, but your research suggests that successful teams are taking a different approach. What are the high-performing 31% of organizations doing differently than everyone else?

The most successful teams have realized that tool consolidation for the sake of a smaller footprint isn’t actually the answer; instead, they prioritize deep integration. They focus on making sure their tools share data seamlessly and provide unified visibility across both cloud and on-premises infrastructure. These high-performers also hold their observability data to a much stricter accuracy standard to ensure they aren’t wasting time on false positives. They are also early adopters of the Model Context Protocol (MCP), which acts as an abstraction layer across their tool sprawl, allowing AI agents to interact with multiple management systems through a standard interface. By focusing on workflow integration and security insights rather than just reducing the number of logos in their stack, they are able to maintain control despite the increasing complexity.
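
The idea of a standard interface sitting over many management systems can be sketched as a tool registry. This does not implement the Model Context Protocol or use any MCP SDK. It only illustrates the abstraction described above, and the `ToolRegistry` class, the tool names, and the stub adapters are all hypothetical.

```python
from typing import Any, Callable


class ToolRegistry:
    """Single entry point through which an agent discovers and calls tools."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def list_tools(self) -> list[str]:
        return sorted(self._tools)

    def call(self, name: str, **arguments: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**arguments)


# Stub adapters standing in for separate management systems.
def monitoring_get_alerts(site: str) -> list[str]:
    return [f"{site}: high latency on uplink"]


def ipam_lookup(ip: str) -> dict[str, str]:
    return {"ip": ip, "owner": "cloud-apps-team"}


registry = ToolRegistry()
registry.register("monitoring.get_alerts", monitoring_get_alerts)
registry.register("ipam.lookup", ipam_lookup)

print(registry.list_tools())
print(registry.call("ipam.lookup", ip="10.0.0.5"))
```

The agent only needs to know the registry's `list_tools` and `call` methods, so swapping out a monitoring product changes one adapter without changing the agent.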

What is your forecast for the future of network operations over the next three years?

I believe we are entering an era of “Self-Governing Infrastructure” where the role of the network engineer will shift from being a manual troubleshooter to being a curator of AI agents. Within the next three years, the adoption of agentic automation will become the standard, and we will see a significant reduction in the 28% of network problems caused by manual errors. The “NetOps” and “CloudOps” silos will likely merge into a single, unified connectivity team that uses real-time streaming telemetry and AI-driven insights to manage the network as a single entity from the data center to the edge. If organizations can overcome the current talent gap by empowering their existing staff with these reasoning-capable tools, we will finally see those strategic success rates climb back up, even as AI workloads become the primary occupants of our bandwidth.
