Building a Network Configuration Linter with Batfish
Using Batfish to validate network configurations before deployment — catching routing loops, unreachable subnets, and policy violations without touching a live device.
The Case for Pre-Deployment Validation
Code has linters. Infrastructure as code has terraform validate. But network configurations? Most teams still validate by deploying to a lab and manually checking. Or worse, deploying to production and hoping.
Batfish changes this. It is an open-source network configuration analysis tool that builds a model of your network from config files and answers questions about it — reachability, routing, ACLs, BGP sessions — without needing a running network.
Setting Up Batfish
Batfish runs as a Docker container with a Python client:
$ docker run -d -p 9997:9997 -p 9996:9996 batfish/allinone
from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints, PathConstraints
bf = Session(host="localhost")
bf.set_network("production")
bf.init_snapshot("/path/to/configs", name="candidate")
Point it at a snapshot directory (Batfish expects the device files in a configs/ subdirectory) and it parses them into a network model. It supports Cisco IOS, IOS-XE, Junos, and Arista EOS, among other vendors.
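Before running any checks, it is worth confirming that Batfish actually understood every file. The following is a minimal sketch, assuming the fileParseStatus question and its File_Name and Status columns. The helper names find_parse_problems and check_snapshot_parsed are hypothetical, not part of pybatfish:

```python
from typing import Iterable, List, Mapping

# Assumption: "PASSED" is the only status that means the file was fully parsed.
# Anything else (for example PARTIALLY_UNRECOGNIZED or FAILED) is reported.
OK_STATUSES = {"PASSED"}


def find_parse_problems(rows: Iterable[Mapping[str, object]]) -> List[str]:
    """Return one message per file whose parse status is not in OK_STATUSES."""
    problems = []
    for row in rows:
        status = str(row["Status"])
        if status not in OK_STATUSES:
            problems.append(f"{row['File_Name']}: {status}")
    return problems


def check_snapshot_parsed(bf) -> List[str]:
    """Ask Batfish for per-file parse status and report anything suspicious.

    `bf` is expected to be a pybatfish Session with a snapshot already loaded.
    """
    frame = bf.q.fileParseStatus().answer().frame()
    # Convert the DataFrame to plain dicts so the filtering logic stays testable.
    return find_parse_problems(frame.to_dict("records"))
```

Keeping the filtering separate from the Batfish call means the logic can be unit-tested with plain dictionaries, without a running container.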
Reachability Checks
The most powerful query is reachability. Given a source and destination, can traffic flow?
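result = bf.q.reachability(
    pathConstraints=PathConstraints(
        # Start from interfaces attached to the source subnet
        startLocation="@connectedTo(10.1.0.0/24)"
    ),
    headers=HeaderConstraints(
        dstIps="10.2.0.0/24",
        applications=["https"]
    ),
    actions="SUCCESS,FAILURE"
).answer()
# Each row is one example flow; a flow can have several traces (e.g. ECMP paths)
for row in result.frame().itertuples():
    print(f"Flow: {row.Flow}")
    for trace in row.Traces:
        # The outcome lives on each trace, not in a separate column
        print(f"Action: {trace.disposition}")
        print(f"Trace: {trace}")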
This tells you not just whether the traffic arrives, but the exact path it takes — every hop, every interface, every ACL evaluation. If the traffic is denied, it shows you exactly which ACL line dropped it.
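For lint output, a full trace is often more detail than a reviewer wants. The sketch below condenses one trace into a few lines and picks out the node that dropped the flow. It assumes the trace exposes disposition and hops, each hop exposes node and steps, and each step exposes an action string such as "DENIED". The function names are hypothetical:

```python
from typing import List, Optional


def summarize_trace(trace) -> str:
    """Render a trace as its disposition plus one indented line per hop and step."""
    lines: List[str] = [f"disposition: {trace.disposition}"]
    for index, hop in enumerate(trace.hops, start=1):
        lines.append(f"  hop {index}: {hop.node}")
        for step in hop.steps:
            lines.append(f"    {step.action}")
    return "\n".join(lines)


def first_denying_node(trace) -> Optional[str]:
    """Return the first node with a DENIED step, or None if nothing was denied."""
    for hop in trace.hops:
        # Assumption: a filter that drops the flow reports the action "DENIED".
        if any(step.action == "DENIED" for step in hop.steps):
            return hop.node
    return None
```

Both functions only read attributes, so they can be exercised with simple stand-in objects in unit tests.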
Building Lint Rules
With Batfish as the engine, we can define lint rules that run against every config change:
def lint_no_default_route_leak(bf):
    """Ensure default routes do not leak between VRFs."""
    routes = bf.q.routes(
        network="0.0.0.0/0",
        protocols="bgp"
    ).answer().frame()
    violations = []
    for _, row in routes.iterrows():
        if row["VRF"] != "default" and row["Next_Hop_IP"] == "0.0.0.0":
            violations.append(
                f"{row['Node']}: default route in VRF {row['VRF']}"
            )
    return violations
def lint_bgp_sessions_established(bf):
    """Verify all configured BGP sessions can establish."""
    sessions = bf.q.bgpSessionStatus().answer().frame()
    violations = []
    for _, row in sessions.iterrows():
        if row["Established_Status"] != "ESTABLISHED":
            violations.append(
                f"{row['Node']}: BGP session to {row['Remote_Node']} "
                f"is {row['Established_Status']}"
            )
    return violations
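def lint_unused_acls(bf):
    """Find ACLs that are defined but not applied to any interface."""
    refs = bf.q.unusedStructures().answer().frame()
    violations = []
    for _, row in refs.iterrows():
        # Structure type names are vendor-specific, e.g. "extended ipv4 access-list"
        structure_type = row["Structure_Type"].lower()
        if "acl" in structure_type or "access-list" in structure_type:
            # Source_Lines holds the file name and line numbers of the definition
            violations.append(
                f"{row['Source_Lines']}: unused ACL '{row['Structure_Name']}'"
            )
    return violations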
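Each rule above takes a session and returns a list of messages, which makes them easy to run as a batch. A minimal sketch of a runner follows. The Finding class, the severity labels, and run_rules are all hypothetical names chosen for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# A rule takes a Batfish session and returns human-readable violation messages.
Rule = Callable[[object], List[str]]


@dataclass(frozen=True)
class Finding:
    rule: str      # name of the rule function that produced the finding
    severity: str  # assumed labels: "critical" or "warning"
    message: str


def run_rules(bf, rules: Iterable[Tuple[Rule, str]]) -> List[Finding]:
    """Run every (rule, severity) pair against the session and collect findings."""
    findings: List[Finding] = []
    for rule, severity in rules:
        try:
            messages = rule(bf)
        except Exception as exc:
            # A crashing rule is reported as critical so it cannot pass silently.
            findings.append(
                Finding(rule.__name__, "critical", f"rule crashed: {exc}")
            )
            continue
        findings.extend(Finding(rule.__name__, severity, m) for m in messages)
    return findings
```

With the rules above, a call might look like run_rules(bf, [(lint_bgp_sessions_established, "critical"), (lint_unused_acls, "warning")]). Treating a crashed rule as critical is a design choice: a typo in a column name should fail the pipeline instead of reporting zero violations.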
CI Pipeline Integration
The lint rules plug into a CI pipeline. When an engineer opens a pull request with config changes, the pipeline:
- Spins up a Batfish container
- Loads the candidate configs
- Runs all lint rules
- Posts results as PR comments
- Blocks merge on any critical violations
# .gitlab-ci.yml
config-lint:
  stage: validate
  image: python:3.11
  services:
    # With the Docker executor the service is reachable as host batfish-allinone
    - batfish/allinone
  script:
    - pip install pybatfish
    - python scripts/lint_configs.py configs/
  rules:
    - changes:
        - configs/**/*
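The last two pipeline steps, posting results and blocking the merge, come down to a comment body and an exit code. The sketch below shows one way the end of scripts/lint_configs.py could handle that. It assumes findings are (severity, message) pairs with "critical" as the blocking severity. The function names are hypothetical, and the actual posting of the comment is left to the CI platform's API:

```python
from typing import List, Sequence, Tuple

# Assumed shape: (severity, message), where severity is "critical" or "warning".
FindingPair = Tuple[str, str]


def render_comment(findings: Sequence[FindingPair]) -> str:
    """Build the text of a PR comment, listing critical findings first."""
    if not findings:
        return "Config lint: no violations found."
    lines: List[str] = [f"Config lint: {len(findings)} violation(s) found."]
    # Sort so that critical findings come first, then alphabetically by message.
    ordered = sorted(findings, key=lambda f: (f[0] != "critical", f[1]))
    for severity, message in ordered:
        lines.append(f"- [{severity}] {message}")
    return "\n".join(lines)


def exit_code(findings: Sequence[FindingPair]) -> int:
    """Return 1 if any finding is critical, so the CI job fails and blocks the merge."""
    return 1 if any(severity == "critical" for severity, _ in findings) else 0
```

The script would print render_comment(...) and finish with sys.exit(exit_code(...)). Warnings then show up in the comment without failing the job.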
What We Catch
In the first month of running the linter, we caught:
- 3 ACL misconfigurations — rules that would have blocked legitimate traffic
- 1 OSPF area mismatch — two interfaces in different areas that should have been in area 0
- 2 unused ACLs — leftover from decommissioned services, now cleaned up
- 1 BGP route leak — a VRF was importing routes from a route target it should not have been
The ACL, OSPF, and BGP findings could each have become a production incident. The linter cost about a week to set up. The math is clear.
Limitations
Batfish models what the configurations imply, not what the hardware does. It simulates the control plane and derives forwarding behavior from the result, but it cannot simulate hardware TCAM limits, queuing behavior, or timing-dependent issues. It also sees only what is in the config files: if an AAA server pushes dynamic ACLs at runtime, Batfish will not see them.
For what it does cover — routing correctness, ACL evaluation, BGP policy — it is the best tool available. Think of it as show ip route for configurations that have not been deployed yet.