NetArena: Dynamic Benchmarks for AI Agents in Network Automation

Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu

Abstract

As AI agents expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to static design, show high statistical variance from limited dataset size, and fail to reflect the complexity of production environments. We present NetArena, a dynamic benchmark generation framework for network applications. NetArena introduces a novel abstraction and unified interface that generalize across diverse tasks, enabling dynamic benchmarking despite the heterogeneity of network workloads. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to measure correctness, safety, and latency during execution. We demonstrate NetArena on three representative applications and find that (1) NetArena significantly improves statistical reliability across AI agents, reducing confidence-interval overlap from 85% to 0%, (2) agents achieve only 13-38% average performance (as low as 3%) on large-scale, realistic queries, and (3) NetArena exposes finer-grained behaviors that static, correctness-only benchmarks miss. NetArena also enables use cases such as SFT and RL fine-tuning on network system tasks. Code is available at https://github.com/Froot-NetSys/NetArena.
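
The statistical-reliability claim rests on a general property: a confidence interval on a measured success rate narrows as the number of queries grows, so two agents whose intervals overlap on a small static set can become separable on a large generated one. The sketch below illustrates only that property. The success rates (40% and 50%), the query counts, the normal-approximation interval, and the helper names `proportion_ci` and `overlap` are all assumptions made for illustration; they are not NetArena's data, API, or its definition of confidence-interval overlap.

```python
import math


def proportion_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a success rate p
    observed over n independent queries (illustrative choice of interval)."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def overlap(a, b):
    """Length of the intersection of two intervals; 0.0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


# Hypothetical agents with observed success rates of 40% and 50%.
RATE_A, RATE_B = 0.40, 0.50

results = {}
for n in (50, 5000):  # small static set vs. large generated set (made-up sizes)
    ci_a = proportion_ci(RATE_A, n)
    ci_b = proportion_ci(RATE_B, n)
    results[n] = overlap(ci_a, ci_b)
    print(f"n={n}: A={ci_a}, B={ci_b}, overlap={results[n]:.3f}")
```

With these made-up numbers, the two intervals intersect at 50 queries and are disjoint at 5000, which is the qualitative effect the abstract attributes to on-demand query generation.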

Paper Structure

This paper contains 30 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: NetArena introduces a unified state-action abstraction for a wide range of network and system applications, integrating with real-world emulators to generate dynamic queries and ground truths, and enabling automated correctness and safety evaluation.
  • Figure 2: Real-world network automation examples.
  • Figure 3: NetArena enables dynamic query generation with large query sizes, significantly improving the statistical confidence of agent comparisons.
  • Figure 4: Breakdown analysis across complexity levels (L1–L3; difficulty increases counter-clockwise). Agent strengths vary by application, metric, and task complexity.
  • Figure 5: Performance of supervised fine-tuned (SFT) models at different complexity levels.
  • ...and 4 more figures