Today we’re speaking with Matilda Bailey, a networking specialist who focuses on the latest technologies and trends in cellular, wireless, and next-gen solutions. We’ll be exploring the exciting shift toward “agentic NetOps,” where AI agents are no longer just tools but active partners for network engineers. We will delve into how this human-AI collaboration works during critical incidents, the mechanics behind AI-driven diagnosis, and the impressive real-world results organizations are already seeing. We will also discuss the safeguards for automated remediation and what key performance metrics can tell us about the success of these new systems.
The concept of “agentic NetOps” suggests AI will act alongside human engineers, not just assist them. Could you describe a typical scenario of this human-AI collaboration during a network outage and explain how the roles and responsibilities are divided between the engineer and the AI agent?
Imagine a critical application performance alert comes in at 2 AM. In the past, an on-call engineer would wake up, log in, and start the slow, manual process of data gathering. In an agentic NetOps model, the AI agent is the first responder. It immediately begins its investigation, querying the network’s digital twin, pulling live data, and running initial diagnostic checks from an automation library. By the time the human engineer is online, they aren’t starting from scratch. Instead, they are presented with a dynamic network map showing the AI’s entire reasoning process and its probable root cause. The AI handles the exhaustive, time-consuming data collection and initial analysis, while the engineer’s role shifts to oversight, validation, and strategic decision-making. The human validates the AI’s findings and authorizes the final remediation, especially in high-stakes situations.
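To make that division of labor concrete, here is a minimal, self-contained Python sketch of the first-responder hand-off; the function names, data sources, and sample values are hypothetical stand-ins rather than any vendor's actual API.

```python
# A minimal sketch of the first-responder hand-off described above. The
# data sources, field names, and canned results are illustrative stubs.
def ai_first_response(alert: dict) -> dict:
    """Gather evidence automatically so the engineer never starts from scratch."""
    findings = []
    # In a real system these would query a digital twin, pull live telemetry,
    # and run pre-built checks from an automation library; here they are stubs.
    for source in ("digital_twin_path", "live_device_telemetry", "diagnostic_checks"):
        findings.append({"source": source, "result": f"collected for {alert['app']}"})
    return {
        "alert_id": alert["id"],
        "findings": findings,
        "probable_root_cause": "routing change on the application path (example output)",
        "needs_human_approval": True,  # remediation still requires engineer sign-off
    }


if __name__ == "__main__":
    report = ai_first_response({"id": "INC-2041", "app": "payments-api"})
    print(report["probable_root_cause"])
```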
AI Deep Diagnosis reportedly uses a ReAct framework to both reason and act. Can you walk us through the step-by-step process of how the AI agent uses this framework to diagnose an issue and how it visually presents its reasoning to the IT team on a network map?
The ReAct framework is really the engine behind this intelligence. It operates in a tight, iterative loop of “Reasoning” and “Acting.” When an issue arises, the AI first forms a hypothesis; for example, it might reason that a slow application is due to a routing issue. Then it acts on that reasoning by querying routers along the application path. It analyzes the results of that action, and if the data doesn’t support the hypothesis, it forms a new one, perhaps suspecting a firewall misconfiguration instead, and acts again by checking the firewall rules. This all happens incredibly fast, and for the IT team, it’s not a black box. The entire process is visualized on a network map: you can literally watch as the AI agent traces its diagnostic path, highlighting the devices it’s interrogating and displaying the data it finds at each step. It’s a transparent thought process, allowing engineers to follow along and trust the conclusion.
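For readers who think in code, here is a compact Python sketch of that Reason/Act loop; the stubbed observations, the hypothesis list, and the keyword test for “evidence supports the hypothesis” are illustrative assumptions, not the product’s internal logic.

```python
# A self-contained sketch of the Reason/Act loop: pick a hypothesis, act to
# test it, inspect the observation, and pivot if the evidence doesn't fit.
def act(query):
    """Stand-in for querying live devices; returns canned evidence."""
    canned = {
        "query_routers_on_path": "routes stable, no reconvergence in last 24h",
        "check_firewall_rules": "new deny rule matches the application's port",
    }
    return canned[query]


def react_diagnose():
    # (hypothesis, action that tests it, keyword that would confirm it)
    hypotheses = [
        ("routing issue on the application path", "query_routers_on_path", "flapping"),
        ("firewall misconfiguration", "check_firewall_rules", "deny rule"),
    ]
    trace = []
    for hypothesis, action, confirming_keyword in hypotheses:   # Reason
        observation = act(action)                                # Act
        trace.append({"thought": hypothesis, "action": action, "observation": observation})
        if confirming_keyword in observation:                    # Observe and decide
            return hypothesis, trace                             # probable root cause found
    return None, trace                                           # dead end: keep iterating


root_cause, trace = react_diagnose()
print(root_cause)   # -> firewall misconfiguration
```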
Initial testing showed a 90% success rate in handling real-world network issues, which surpassed expectations. What types of complex problems does this system solve so effectively, and what steps are taken when the AI agent is unable to find the root cause on its first attempt?
The 90% success rate was a genuinely stunning result for the teams involved; they went in hoping for maybe 50% effectiveness, so this was a game-changer. The system excels at those frustrating, multi-domain problems that are a nightmare for human engineers to troubleshoot manually, such as intermittent latency issues that span on-premises data centers and multiple public clouds like AWS and Azure. As for when it doesn’t find the root cause immediately, it’s not a one-shot system. The platform is designed for persistence. If its first line of inquiry hits a dead end, the agent automatically pivots. The underlying ReAct framework enables it to try different diagnostic approaches. It might switch from analyzing device configurations to investigating network traffic paths or checking application health metrics, continuing to iterate until it isolates the problem.
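A rough sketch of that pivoting behavior might look like the following, assuming three illustrative diagnostic strategies tried in order; the strategy names and stub results are invented for the example.

```python
# Sketch of "pivot when a line of inquiry dead-ends": strategies are tried in
# order, and a None result means that approach found nothing conclusive.
def run_strategy(name):
    """Return a root cause if the strategy finds one, else None (dead end)."""
    stub_results = {
        "device_configuration_analysis": None,      # dead end
        "traffic_path_investigation": None,         # dead end
        "application_health_metrics": "overloaded backend pool in one cloud region",
    }
    return stub_results[name]


def diagnose_with_pivots():
    strategies = [
        "device_configuration_analysis",
        "traffic_path_investigation",
        "application_health_metrics",
    ]
    for strategy in strategies:
        result = run_strategy(strategy)
        if result is not None:
            return f"{strategy} isolated the problem: {result}"
    return "escalate to a human engineer"   # no strategy succeeded


print(diagnose_with_pivots())
```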
Since changes cause a majority of network incidents, pre-approved auto-remediation is a significant step. Could you detail how this feature works within established change management processes and what safeguards are in place to ensure an automated fix does not cause unintended secondary problems?
This is a critical point because, as we know, somewhere between 50% and 80% of all network incidents are triggered by a change. Pre-approved auto-remediation integrates directly with an organization’s existing change management framework. It isn’t a rogue agent making changes on the fly. Instead, IT teams can define specific, well-understood issues—like a port that needs to be reset or a standard configuration that needs to be reapplied—and build a corresponding automated fix within a runbook. This runbook is then pre-approved through their normal change control process. The key safeguard is that the automation is confined to these specific, pre-vetted scenarios. The system won’t attempt to automatically fix a novel or complex problem it hasn’t been explicitly trained and authorized to handle, preventing a bad situation from becoming worse.
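A minimal sketch of that guardrail, assuming a hypothetical allowlist of change-controlled runbooks, could look like this; the runbook names and issue signatures are examples only.

```python
# The agent may only execute remediations that map to a runbook already
# approved through change control; anything else is escalated to a human.
PRE_APPROVED_RUNBOOKS = {
    "interface_flapping": "runbook-042-reset-port",
    "config_drift_detected": "runbook-117-reapply-golden-config",
}


def attempt_auto_remediation(issue_signature):
    runbook = PRE_APPROVED_RUNBOOKS.get(issue_signature)
    if runbook is None:
        # Novel or complex problem: do nothing automatically, hand it to a human.
        return f"no pre-approved runbook for '{issue_signature}'; escalating to engineer"
    # Execution is confined to the pre-vetted, change-controlled runbook.
    return f"executing {runbook} (pre-approved via change management)"


print(attempt_auto_remediation("interface_flapping"))
print(attempt_auto_remediation("bgp_route_leak"))   # not pre-approved -> escalate
```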
Organizations have drastically cut security patch rollouts from months to weeks and mean time to response from days to minutes. Please share a specific example of how automation achieves these results and what key metrics a company should track to measure such improvements in its own operations.
These results are transformative. I saw a case with a major airline that was struggling with a mean time to response (MTTR) that often stretched into days for complex issues. By implementing this kind of automation, they slashed that down to just 30 minutes, and their next target is five minutes. Automation achieves this by eliminating the manual data-gathering and validation steps. For security patching, another company went from a four-month rollout for critical vulnerabilities to just two weeks, comfortably inside their four-week compliance window. The key metrics to track are clear: for incidents, MTTR is paramount. For security, it’s patch deployment time, measured from vulnerability disclosure to full implementation. And for changes, tracking the percentage of incidents caused by changes is crucial; one managed service provider even eliminated change-caused incidents entirely.
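As a quick illustration of how those three metrics could be computed from incident and patch records, here is a small Python sketch; the field names and sample data are assumptions for the example, not the figures quoted above.

```python
# Computing MTTR, patch deployment time, and the share of change-caused
# incidents from simple record lists (sample data is illustrative only).
from datetime import datetime

incidents = [
    {"opened": datetime(2024, 5, 1, 2, 0), "resolved": datetime(2024, 5, 1, 2, 30), "caused_by_change": True},
    {"opened": datetime(2024, 5, 3, 9, 0), "resolved": datetime(2024, 5, 3, 9, 20), "caused_by_change": False},
]
patches = [
    {"disclosed": datetime(2024, 4, 1), "fully_deployed": datetime(2024, 4, 15)},
]

# Mean time to response, in minutes.
mttr_minutes = sum(
    (i["resolved"] - i["opened"]).total_seconds() / 60 for i in incidents
) / len(incidents)

# Patch deployment time: vulnerability disclosure to full implementation, in days.
patch_days = sum((p["fully_deployed"] - p["disclosed"]).days for p in patches) / len(patches)

# Share of incidents triggered by a change.
change_caused_pct = 100 * sum(i["caused_by_change"] for i in incidents) / len(incidents)

print(f"MTTR: {mttr_minutes:.0f} min, patch rollout: {patch_days:.0f} days, "
      f"change-caused incidents: {change_caused_pct:.0f}%")
```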
What is your forecast for agentic AI in network operations over the next three to five years?
Over the next three to five years, I believe we will see a fundamental transformation away from the linear, incremental improvements we’ve grown used to. The expectation will no longer be about making engineers slightly more efficient; it will be about creating networks that are largely self-managing and self-healing. Agentic AI will become a standard fixture, acting as a true partner that not only diagnoses and remediates known issues but also proactively identifies potential problems before they impact users. This will free up human engineers from the constant firefighting to focus on higher-level strategic initiatives like network architecture, security posture, and designing for future business needs. The network will truly begin to think and act alongside its human managers.
