One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Chengyu Shen; Yanheng Hou; Minghui Pan; Runming He; Zhen Hao Wong; Meiyi Qiang; Zhou Liu; Hao Liang; Peichao Lai; Zeang Sheng; Wentao Zhang

TL;DR

We present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows, with human-in-the-loop checkpoints for review, editing, and rollback.

Abstract

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.

Paper Structure (37 sections, 9 figures, 3 tables)

Introduction
Related Work
System Design
Framework Overview
NL2Bench
Benchmark Resolution and Configuration
Hierarchical Benchmark Resolution
Unified Configuration and Heterogeneous Data Adaptation
Metric Recommendation and Reporting
Experiments
Case Study
End-to-End Success Rate
Feature-level Comparison
Conclusion
Alignment with Representative Evaluation Frameworks
...and 22 more sections

Figures (9)

Figure 1: One-Eval overview. One-Eval converts a natural-language evaluation request into an executable EvalPlan (NL2Bench), resolves and configures benchmarks by automatic dataset download and schema normalization (BenchResolve), and produces task-aware metrics and a decision-oriented evaluation report (Metrics & Reporting), with human-in-the-loop refinement at key steps.
...and 8 more figures
