A Network Arena for Benchmarking AI Agents on Network Troubleshooting

Zhihao Wang, Alessandro Cornacchia, Alessio Sacco, Franco Galante, Marco Canini, Dingde Jiang

TL;DR

The paper addresses the lack of open, reproducible benchmarks for evaluating AI-driven network troubleshooting agents. It introduces NIKA, a modular benchmark and orchestration framework that couples a curated incident pool with an end-to-end runtime platform and an Agent Access Layer to test detection, localization, and root-cause analysis capabilities. Through experiments with GPT-OSS:20B, GPT-5-mini, and GPT-5 on 900 traces across multiple topologies, the authors show that larger LLMs improve issue detection but still struggle with precise fault localization and RCA, with performance degrading as network size grows. NIKA enables controlled, repeatable evaluation of agent designs, tool interfaces, and state management, and supports open data and extensible backends to accelerate progress in AI-assisted network operations.

Abstract

Agentic systems, powered by Large Language Models (LLMs), assist network engineers with network configuration synthesis and network troubleshooting tasks. For network troubleshooting, progress is hindered by the absence of standardized and accessible benchmarks for evaluating LLM agents in dynamic network settings at low operational effort. We present NIKA, the largest public benchmark to date for LLM-driven network incident diagnosis and troubleshooting. NIKA targets both domain experts and, in particular, AI researchers, providing zero-effort replay of real-world network scenarios and establishing well-defined agent-network interfaces for quick agent prototyping. NIKA comprises hundreds of curated network incidents, spanning five network scenarios, from data centers to ISP networks, and covers 54 representative network issues. Lastly, NIKA is modular and extensible by design, offering APIs to facilitate the integration of new network scenarios and failure cases. We evaluate state-of-the-art LLM agents on NIKA and find that while larger models succeed more often in detecting network issues, they still struggle to localize faults and identify root causes. NIKA is open-source and available to the community: https://github.com/sands-lab/nika.

Paper Structure

This paper contains 23 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Network troubleshooting with an LLM agent.
  • Figure 2: NIKA's architecture. Box legend: blue = provided by NIKA; green = extensible by the developer.
  • Figure 3: NIKA APIs to define and instantiate an incident.
  • Figure 4: GPT-5 agent versus network issue type (see the table of failure cases). Legend: LF: Link Failure, NE: Network node Error, NA: Network under Attack, EF: End-host Failure, MC: MisConfiguration, RC: Resource Contention.
  • Figure 5: Tool invocation distribution comparison between successful and failed troubleshooting.
  • ...and 1 more figure