Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents
Divyanshu Saxena, Rishikesh Maurya, Xiaoxuan Ou, Gagan Somashekar, Shachee Mishra Gupta, Arun Iyer, Yu Kang, Chetan Bansal, Aditya Akella, Saravan Rajmohan
TL;DR
The paper tackles the challenge of evaluating enterprise-scale LLM agents when requirements keep evolving and ground truth is sparse. It introduces a continuous benchmark generation pipeline that uses developer-authored knowledge bases and reference commits to produce adaptive benchmarks with minimal manual effort. Key contributions include KB-based task descriptions, a method to map migrations to ground-truth diffs, and empirical validation showing high precision and recall and cleaner ground truth than manually constructed benchmarks. This approach enables longitudinal, actionable evaluation and supports rapid, targeted improvements to production-grade AI agents in dynamic enterprise environments.
Abstract
The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluating an AI agent typically involves running it against a fixed set of benchmarks and computing multiple evaluation metrics. While sufficient for simple coding tasks, these benchmarks fall short for enterprise-scale agents, where services and requirements evolve continuously and ground-truth examples are sparse. We propose a benchmark generation process that evolves the benchmarks as requirements change and enables robust evaluation of evolving AI agents. We instantiate this approach in a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi-structured documents in which developers express high-level intent, and uses state-of-the-art LLMs to generate benchmarks from only a small number of such documents. Overall, this process yields a maintainable evaluation framework that enables rapid feedback on agent performance and facilitates targeted improvements.
