Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture

Boming Xia, Qinghua Lu, Liming Zhu, Zhenchang Xing, Dehai Zhao, Hao Zhang

TL;DR

The paper tackles the challenge of evaluating LLM agents whose open-ended, evolving behavior defies traditional fixed benchmarks. It proposes EDDOps, an evaluation-driven development and operations approach that integrates offline and online evaluation into a continuous feedback loop to drive runtime adaptation and governed redevelopment. Guided by a multivocal literature review, the authors derive a process model and a reference architecture that place evaluation at the core of design and operation, to support traceability and safety as agent systems evolve. The work is validated through a tax-assistant caselet and practitioner triangulation, which the authors use to demonstrate practical applicability and architectural adequacy for real-world, dynamic deployments. Together, these contributions offer a systematic framework for safer, more accountable evolution of LLM agents in changing contexts and governance landscapes.

Abstract

Large Language Models (LLMs) have enabled the emergence of LLM agents, systems capable of pursuing under-specified goals and adapting after deployment. Evaluating such agents is challenging because their behavior is open-ended, probabilistic, and shaped by system-level interactions over time. Traditional evaluation methods, built around fixed benchmarks and static test suites, fail to capture emergent behaviors or support continuous adaptation across the lifecycle. To ground a more systematic approach, we conduct a multivocal literature review (MLR) synthesizing academic and industrial evaluation practices. The findings directly inform two empirically derived artifacts: a process model and a reference architecture that embed evaluation as a continuous, governing function rather than a terminal checkpoint. Together they constitute the evaluation-driven development and operations (EDDOps) approach, which unifies offline (development-time) and online (runtime) evaluation within a closed feedback loop. By making evaluation evidence drive both runtime adaptation and governed redevelopment, EDDOps supports safer, more traceable evolution of LLM agents aligned with changing objectives, user needs, and governance constraints.
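
The closed feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Evidence`, `Action`, and `decide`, the score normalization, and both thresholds are hypothetical, chosen only to show how offline and online evaluation evidence might be pooled and routed to either runtime adaptation or governed redevelopment.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of one pass through the evaluation loop."""
    CONTINUE = "continue"
    RUNTIME_ADAPTATION = "runtime_adaptation"
    GOVERNED_REDEVELOPMENT = "governed_redevelopment"


@dataclass
class Evidence:
    """One evaluation result (illustrative structure, not from the paper)."""
    source: str   # "offline" (development-time) or "online" (runtime)
    metric: str   # e.g. "task_success"
    score: float  # assumed normalized to [0, 1], higher is better


def decide(evidence, adapt_below=0.8, redevelop_below=0.5):
    """Route pooled offline and online evidence to an action.

    The thresholds are assumptions for illustration; the paper does not
    prescribe numeric values.
    """
    if not evidence:
        return Action.CONTINUE
    worst = min(e.score for e in evidence)
    if worst < redevelop_below:
        # Severe degradation: escalate to a governed redevelopment cycle.
        return Action.GOVERNED_REDEVELOPMENT
    if worst < adapt_below:
        # Moderate degradation: adjust the running system in place.
        return Action.RUNTIME_ADAPTATION
    return Action.CONTINUE


if __name__ == "__main__":
    pooled = [
        Evidence("offline", "task_success", 0.92),
        Evidence("online", "task_success", 0.71),
    ]
    print(decide(pooled).value)
```

Taking the worst score across both sources is one simple design choice among many; a real system would likely weigh metrics separately and record each decision for traceability.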

Paper Structure

This paper contains 38 sections, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Research Methodology
  • Figure 2: Distribution of Evaluation Efforts Across Lifecycle Stages
  • Figure 3: Distribution of Evaluation Metrics
  • Figure 4: Distribution of Model-Level vs. System-Level Evaluations
  • Figure 5: Comparison of Adaptive vs. Static Evaluations in Academic vs. Grey Literature
  • ...and 4 more figures