Designing a Network MCP Service for Infrastructure Operations
A vendor-agnostic, YANG-model-driven MCP service that enables LLMs to safely interact with network infrastructure — with human operators always in control through structured approval flows and full auditability.
The IT industry is in a transition that will look obvious in hindsight. Software development is absorbing the biggest impact from AI and LLMs right now — code generation, automated testing, agentic workflows that plan and execute entire features. But this is the first wave, not the only one. Network engineering, security operations, cloud infrastructure, database administration — every discipline that involves interacting with systems through structured interfaces will follow the same trajectory.
The pattern is predictable: first, LLMs learn to generate the artifacts (configs, policies, queries). Then, agents learn to execute workflows end-to-end. Finally, the interaction model shifts — instead of engineers typing commands into CLIs and web consoles, they direct LLM agents that interact with those systems on their behalf.
For network engineering, this raises an immediate question: how do you let an LLM agent interact with production network infrastructure without creating a disaster?
The pieces already exist
I’ve been writing about the adjacent ideas for a while. In The Anatomy of an AI Agent, I described how modern agents function like operating systems — the LLM is the CPU, the agent is the OS, and skills and MCPs are the applications. MCP servers are the peripherals that give agents access to external systems through a standardized protocol.
In Building a System for AI-Assisted Engineering, I outlined how structured context engineering makes AI assistants genuinely useful for complex projects. The same principle applies here — an LLM interacting with network infrastructure needs rich, structured context about the network’s state, constraints, and intent.
The challenge I explored in AI Makes the Easy Part Easier is directly relevant: AI can generate syntactically correct configurations, but the hard part of network engineering is everything around the config — topology awareness, vendor interop, failure domain reasoning. A Network MCP service needs to account for this gap.
And as I argued in The AI Usage Dilemma, the solution is not to keep AI away from infrastructure. It’s to build systems where AI handles the mechanical work while human operators retain control over the decisions that matter. That’s exactly what this design aims to achieve.
What a Network MCP service looks like
The core idea is a vendor-agnostic MCP server that sits between an LLM agent and network infrastructure. The LLM can query device state, generate configuration changes, and propose modifications — but every write operation passes through a human approval flow before touching a device. The YANG data modeling language serves as the contract between the LLM and the network, providing a structured, machine-readable schema that both sides understand.
The architecture breaks down into distinct layers, each with a clear responsibility.
MCP tool layer: read, write, dry-run
The MCP server exposes three categories of tools to the LLM, and the separation between them is the foundation of the safety model.
Read tools are always available and require no approval. These cover everything an operator would do to assess the current state: fetching interface status, BGP neighbor tables, routing information, device inventory, VLAN assignments. The LLM can freely query any read-only operational data. This is safe — reading state doesn’t change state.
Write tools always create a pending approval record. The LLM never executes a configuration change directly. Instead, it generates a structured intent — a proposed change described in YANG-modeled data — and submits it for human review. The intent includes the target device, the configuration delta, and the LLM’s reasoning for the change.
Dry-run tools generate a configuration diff without creating an approval record. These are useful for exploration — the LLM can ask “what would this change look like?” and get back a rendered diff showing exactly what NETCONF would push. No side effects, no approval needed.
This three-tier model means the LLM is useful immediately — it can investigate problems, correlate data across devices, and generate proposed fixes. But it can never change anything without a human saying yes.
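To make the separation concrete, here's a minimal sketch of what the three tiers could look like as FastMCP tool definitions (FastMCP appears in the tech stack below). The tool names and the in-memory intent store are illustrative placeholders, not the service's real API:

```python
import uuid
from fastmcp import FastMCP

mcp = FastMCP("network-mcp")

# Hypothetical in-memory stand-in for the real approval backend.
PENDING_INTENTS: dict[str, dict] = {}

@mcp.tool()
def get_interface_state(device: str, interface: str) -> dict:
    """Read tier: always available, no approval. Reading state doesn't change state."""
    # The real service would issue a NETCONF <get> against the device here.
    return {"device": device, "interface": interface, "oper_status": "up"}

@mcp.tool()
def dry_run_change(device: str, yang_delta: dict) -> str:
    """Dry-run tier: render the diff NETCONF would push. No side effects, no record."""
    return f"candidate diff for {device}: {yang_delta}"

@mcp.tool()
def propose_change(device: str, yang_delta: dict, reasoning: str) -> dict:
    """Write tier: never touches the device. Creates a pending intent for human review."""
    intent_id = str(uuid.uuid4())
    PENDING_INTENTS[intent_id] = {
        "device": device,
        "delta": yang_delta,
        "reasoning": reasoning,
        "state": "PENDING",
    }
    return {"intent_id": intent_id, "status": "PENDING"}

if __name__ == "__main__":
    mcp.run()
```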
YANG model registry: the contract layer
YANG is the data modeling language that makes this work. It defines the structure of network configuration and state data in a machine-readable format. Every major network vendor supports it (to varying degrees), and it’s the foundation of NETCONF, RESTCONF, and gNMI.
The MCP server maintains a YANG model registry with a clear hierarchy:
OpenConfig models are the primary source. These are vendor-neutral, community-maintained models that cover the most common network constructs — interfaces, BGP, OSPF, VLANs, ACLs, QoS. When the LLM wants to configure a BGP peer, it works with the OpenConfig BGP model regardless of whether the target device is Cisco, Juniper, or Arista.
IETF models serve as the secondary source. Where OpenConfig doesn’t cover a use case — certain L2VPN constructs, some routing policy features — IETF RFC-defined YANG models fill the gap.
Vendor-native models are the fallback. Some platform-specific features don’t have a vendor-neutral model. Cisco IOS-XE native models, Juniper JunOS models, Arista EOS models — these handle the long tail of vendor-specific configuration.
NAPALM/Netmiko shim handles legacy devices. Plenty of production networks still run devices that don’t support NETCONF or any model-driven API. For these, the MCP server maps the YANG model to CLI commands via NAPALM (for structured getters/setters) or Netmiko (for raw CLI interaction). The LLM still works with YANG models — the translation to CLI happens at the transport layer.
The key insight: YANG is the contract. The LLM doesn’t need to know whether a device speaks NETCONF, gNMI, or only understands CLI commands. It works with the model, and the MCP server handles the translation.
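A sketch of how that resolution might work, with a registry keyed by feature and kept in preference order, and with transport selection handled separately so the LLM never sees it. The module names are real OpenConfig/IETF examples, but the registry shape and helper functions are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class YangModel:
    name: str
    source: str  # "openconfig" | "ietf" | "vendor-native"

# Hypothetical registry: feature -> models in strict preference order
# (OpenConfig first, then IETF, then vendor-native).
REGISTRY: dict[str, list[YangModel]] = {
    "bgp": [
        YangModel("openconfig-bgp", "openconfig"),
        YangModel("Cisco-IOS-XE-bgp", "vendor-native"),
    ],
    "l2vpn": [YangModel("ietf-l2vpn-svc", "ietf")],
}

def resolve_model(feature: str) -> YangModel:
    """Pick the most vendor-neutral model available for a feature."""
    try:
        return REGISTRY[feature][0]
    except (KeyError, IndexError):
        raise LookupError(f"no YANG coverage registered for {feature!r}")

def resolve_transport(device_capabilities: set[str]) -> str:
    """Transport is independent of the model; the LLM never sees this choice."""
    for proto in ("gnmi", "netconf", "restconf"):
        if proto in device_capabilities:
            return proto
    return "cli-shim"  # legacy device: NAPALM/Netmiko translates model to CLI
```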
Approval engine: every write is a pending intent
This is the layer that makes the system safe for production use. Every write operation from the LLM creates a structured intent record that enters an approval workflow.
An intent record captures the full context: who requested the change (which LLM session), what the change is (YANG-modeled configuration delta), why it was requested (the LLM’s reasoning), which device is targeted, and what the rendered diff looks like. Nothing is ambiguous.
Each intent moves through a defined state machine: it enters as pending, an operator approves or rejects it, and an approved intent is then committed to the device, with every transition captured in the audit log.
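Sketched as code, an intent record and its state machine might look like the following. The pending/approved/rejected/committed states follow directly from the flow described here; the FAILED state (a commit error after approval) is an assumption added for completeness:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class IntentState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    COMMITTED = "committed"
    FAILED = "failed"  # assumption: commit error after approval

# Legal transitions; anything else raises.
TRANSITIONS = {
    IntentState.PENDING: {IntentState.APPROVED, IntentState.REJECTED},
    IntentState.APPROVED: {IntentState.COMMITTED, IntentState.FAILED},
}

@dataclass
class Intent:
    session_id: str      # who: which LLM session requested the change
    device: str          # where: the target device
    yang_delta: dict     # what: the YANG-modeled configuration delta
    reasoning: str       # why: the LLM's own justification
    rendered_diff: str   # what it looks like: the diff NETCONF would push
    state: IntentState = IntentState.PENDING
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def transition(self, new_state: IntentState) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```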
Not all changes carry the same risk, so the approval engine classifies each intent:
Low risk — description or interface-label changes, SNMP community rotations. These can be auto-approved based on policy. The operator still sees them in the audit log, but they don’t block the workflow.
Medium risk — adding a static route, modifying an ACL entry, enabling a disabled interface. These require a single operator approval.
High risk — BGP peer changes, OSPF area modifications, VLAN trunk reconfiguration. These require explicit approval plus a confirmation step (“Are you sure? This affects 12 downstream devices.”).
Critical risk — core routing policy changes, MPLS LSP modifications, changes affecting multiple failure domains. These require two-person sign-off — a second operator must independently approve.
The risk classification can be configured per organization based on their change management policies. What’s critical in one network might be medium-risk in another.
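As a sketch, a per-organization policy could map YANG paths to risk tiers, with an intent classified by the highest-risk path it touches. The paths below are illustrative OpenConfig-style paths, and the prefix-matching scheme is one possible design, not a prescription:

```python
from enum import IntEnum

class Risk(IntEnum):
    LOW = 1       # auto-approve per policy, audit-log only
    MEDIUM = 2    # single operator approval
    HIGH = 3      # approval plus explicit confirmation
    CRITICAL = 4  # two-person sign-off

# Hypothetical per-organization policy: YANG path prefixes mapped to risk tiers.
RISK_POLICY: dict[str, Risk] = {
    "/interfaces/interface/config/description": Risk.LOW,
    "/network-instances/network-instance/protocols/protocol/static-routes": Risk.MEDIUM,
    "/network-instances/network-instance/protocols/protocol/bgp/neighbors": Risk.HIGH,
    "/network-instances/network-instance/policy": Risk.CRITICAL,
}

def classify(changed_paths: list[str], default: Risk = Risk.MEDIUM) -> Risk:
    """An intent's risk is the highest tier among all the paths it touches;
    paths the policy doesn't cover fall back to `default`."""
    risk = Risk.LOW
    for path in changed_paths:
        matches = [r for prefix, r in RISK_POLICY.items() if path.startswith(prefix)]
        risk = max(risk, max(matches, default=default))
    return risk
```

Because the policy is plain data, each organization can ship its own mapping without touching the engine.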
Operator UI: the human control plane
The operator interface is where human judgment stays in the loop. It’s a web application with three primary views:
Approval queue — pending intents sorted by risk level and age. Each entry shows the LLM’s proposed change, the rendered diff, the target device, and the reasoning. Operators can approve, reject (with a reason), or request modifications.
Audit log — every action taken through the system, searchable and filterable. Every read query, every write intent, every approval decision, every NETCONF commit. This is the compliance layer — full traceability from LLM request to device state change.
Active sessions — which LLM sessions are currently connected, what they’re querying, whether they have pending intents. Operators can see what the agents are doing in real time and revoke sessions if needed.
The UI uses real-time updates so operators see new approval requests immediately — no polling, no refresh. When an LLM submits a write intent, the operator’s queue updates within seconds.
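One way to get those real-time updates is a WebSocket feed served from the same FastAPI process, with the approval engine publishing events to every connected operator UI. A minimal sketch; the endpoint path and event schema are assumptions:

```python
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

# Hypothetical broadcast hub: one queue per connected operator UI.
subscribers: set[asyncio.Queue] = set()

async def publish_intent_event(event: dict) -> None:
    """Called by the approval engine whenever an intent is created or resolved."""
    for queue in subscribers:
        queue.put_nowait(event)

@app.websocket("/ws/approvals")
async def approvals_feed(websocket: WebSocket):
    await websocket.accept()
    queue: asyncio.Queue = asyncio.Queue()
    subscribers.add(queue)
    try:
        while True:
            # Block until the approval engine publishes; no client-side polling.
            event = await queue.get()
            await websocket.send_json(event)
    except WebSocketDisconnect:
        pass
    finally:
        subscribers.discard(queue)
```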
Notification layer: meeting operators where they are
Not every operator is watching the approval UI at all times. The notification layer ensures the right people see the right changes through the channels they already use.
Slack and Teams — actionable messages for approval requests. An operator can review a diff and approve or reject directly from a Slack message, without opening the web UI. Medium- and high-risk changes push to team channels; critical changes go to designated approvers directly.
Email — for audit trail purposes. Every approved change generates an email summary with the full diff, approver identity, and timestamp. This serves compliance and post-incident review.
PagerDuty — for critical-risk changes that need immediate attention. If a critical intent has been pending for longer than the configured threshold, it escalates through PagerDuty to ensure it doesn’t sit in a queue unnoticed.
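For Slack, the actionable message is a Block Kit payload with Approve/Reject buttons whose `value` carries the intent ID. A rough sketch, assuming a bot token and eliding the interactivity endpoint that receives the button clicks:

```python
import os
import requests

SLACK_API = "https://slack.com/api/chat.postMessage"

def notify_slack(channel: str, intent_id: str, device: str, diff: str, risk: str) -> None:
    """Post an actionable approval request to a channel. Button clicks arrive
    at the app's interactivity endpoint, which is not shown here."""
    blocks = [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": f"*{risk.upper()}* change proposed for `{device}`\n```{diff}```"}},
        {"type": "actions",
         "elements": [
             {"type": "button", "text": {"type": "plain_text", "text": "Approve"},
              "style": "primary", "action_id": "approve", "value": intent_id},
             {"type": "button", "text": {"type": "plain_text", "text": "Reject"},
              "style": "danger", "action_id": "reject", "value": intent_id},
         ]},
    ]
    requests.post(
        SLACK_API,
        headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
        json={"channel": channel, "blocks": blocks},
        timeout=10,
    )
```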
End-to-end flow
Putting it all together, here's what happens when an LLM agent needs to make a network change:

1. The agent uses read tools to assess current state and, optionally, a dry-run to preview the rendered diff.
2. It submits a write intent: the target device, the YANG-modeled configuration delta, and its reasoning.
3. The approval engine classifies the intent's risk level and routes notifications through the appropriate channels.
4. An operator reviews the diff and reasoning, then approves or rejects it (or requests modifications).
5. On approval, the MCP server translates the YANG delta to the device's protocol (NETCONF, gNMI, or the CLI shim) and commits it.
6. Every step, from the original query to the device commit, lands in the audit log, and the agent is notified of the outcome asynchronously.
The LLM doesn’t block while waiting for approval. It receives the intent ID and PENDING status, then continues with other work — answering queries, investigating other issues, preparing additional changes. When the approval comes through (or is rejected), the MCP server notifies the LLM asynchronously. This non-blocking pattern is important: an LLM session shouldn’t stall because a human hasn’t reviewed a change yet.
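One simple way to implement that pattern is a future per intent: the agent registers one when it submits, the approval engine resolves it when an operator decides, and the agent only awaits it when it has nothing better to do. A sketch, with names that are placeholders:

```python
import asyncio

# Hypothetical plumbing: one future per submitted intent.
_outcomes: dict[str, asyncio.Future] = {}

def register_intent(intent_id: str) -> asyncio.Future:
    fut = asyncio.get_running_loop().create_future()
    _outcomes[intent_id] = fut
    return fut

def resolve_intent(intent_id: str, outcome: str) -> None:
    # Called by the approval engine when an operator approves or rejects.
    _outcomes.pop(intent_id).set_result(outcome)

async def agent_session() -> None:
    outcome = register_intent("intent-42")
    # ... answer queries, investigate other issues, prepare more changes ...
    decision = await outcome  # resumes only once the operator decides
    print(f"intent-42 resolved: {decision}")

async def main() -> None:
    task = asyncio.create_task(agent_session())
    await asyncio.sleep(0.1)  # stand-in for human review time
    resolve_intent("intent-42", "APPROVED")
    await task

asyncio.run(main())
```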
Design principles
Five principles drive the architecture:
YANG-first. The data model is the contract between the LLM and the network. The LLM never generates raw CLI commands or unstructured configuration text. It works with YANG-modeled data, and the MCP server handles translation to whatever protocol the target device speaks. This makes the system vendor-agnostic by default.
Read/write separation. Reads are safe and unrestricted. Writes always require human approval. This is a hard boundary, not a configurable option. The LLM can read anything, but it cannot change anything without an operator’s explicit consent.
Trust through auditability. Every interaction is logged — every query, every proposed change, every approval decision, every device commit. If something goes wrong, the audit trail shows exactly what happened, who approved it, and what the LLM’s reasoning was. This is how you build organizational trust in the system over time.
Non-blocking LLM. The approval flow is asynchronous. The LLM submits an intent and moves on. It doesn’t sit in a loop polling for approval status. This keeps the agent productive and avoids wasting compute on waiting.
Legacy-compatible. Not every device has NETCONF. The NAPALM/Netmiko shim ensures the system works with older platforms that only support CLI. The LLM’s interface doesn’t change — it still works with YANG models. Only the transport layer adapts.
Tech stack
The whole thing runs on a single FastAPI application that does double duty — it serves the MCP endpoint via FastMCP and hosts the React/Vite SPA for the operator UI. One process, one deployment. FastAPI handles the approval engine and API routes, while the React app gets built as static assets and served from the same origin. ncclient and pyang handle YANG model parsing, validation, and NETCONF communication. PostgreSQL provides persistence for approval records and audit logs. Notifications go through Slack and PagerDuty APIs. Device inventory can be sourced from NetBox or Nornir, depending on the existing automation stack.
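Wiring that single-process layout together could look roughly like this, assuming FastMCP 2.x's `http_app()` helper for mounting the MCP endpoint inside FastAPI; the mount paths and SPA build directory are illustrative:

```python
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from fastmcp import FastMCP

mcp = FastMCP("network-mcp")
# ... @mcp.tool() definitions as in the earlier sketch ...

# FastMCP produces an ASGI sub-application; FastAPI must adopt its lifespan
# so the MCP session manager starts and stops with the web app.
mcp_app = mcp.http_app(path="/mcp")
app = FastAPI(lifespan=mcp_app.lifespan)

# Approval-engine API routes would be registered here (app.include_router(...)).

app.mount("/llm", mcp_app)  # agents connect at /llm/mcp
app.mount("/", StaticFiles(directory="ui/dist", html=True))  # built React/Vite SPA
```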
Why this matters
The direction of travel is clear. Network infrastructure already exposes structured, machine-readable interfaces — YANG models, NETCONF, RESTCONF, gNMI. These weren’t designed for LLMs, but they’re exactly the kind of well-defined contract that agents work best with. The tooling exists, the protocols are standardized, and the operational patterns (read state, propose change, validate, commit) map directly onto an agent workflow. LLM agents will interact with network infrastructure because the integration surface is already there — it just needs the right control plane in front of it. The question isn’t whether it happens. It’s whether we build that control plane deliberately, or let it happen without guardrails.
A Network MCP service with YANG-driven contracts, mandatory approval flows, and full auditability gives us the best of both: the speed and breadth of LLM-assisted operations, with the control and accountability that production infrastructure demands. The human operator doesn’t disappear — they move from typing commands to reviewing intents. That’s a better use of their expertise.