From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning

Seungdong Yoa; Sanghyu Yoon; Suhee Yoon; Dongmin Kim; Ye Seul Sim; Junhyun Lee; Woohyung Lim

TL;DR

This work proposes an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems, enabling progressive evaluation of large language models without manually curated datasets.

Abstract

The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in difficulty automatically as more capable agents are substituted into any role, enabling progressive evaluation of large language models without manually curated datasets. Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors that conventional benchmarks fail to reveal. We further advocate evaluating systems along several complementary axes including cross-model pairwise performance and progress between the initial and orchestrator-finalized problems. By shifting the focus from fixed datasets to dynamic protocols, our approach offers a sustainable direction for evaluating ever-evolving language models and introduces a research agenda centered on the co-evolution of agent-centric benchmarks.
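The generate, validate, and solve loop described in the abstract can be sketched as a small control loop. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `Problem`, `run_protocol`, `max_level`, and `max_revisions` are hypothetical, and the three toy agents are deterministic stand-ins for LLM-backed teacher, orchestrator, and student agents.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Problem:
    """Hypothetical container for one generated problem."""
    text: str
    answer: str
    level: int


# Hypothetical agent interfaces; in practice each would wrap an LLM call.
Teacher = Callable[[int, Optional[str]], Problem]
Orchestrator = Callable[[Problem], Tuple[bool, Optional[str]]]
Student = Callable[[Problem], str]


def run_protocol(teacher: Teacher, orchestrator: Orchestrator, student: Student,
                 max_level: int = 3, max_revisions: int = 3):
    """Return (last validated problem, whether the student solved it)."""
    level, final, solved = 0, None, False
    while level <= max_level:
        feedback, problem = None, None
        # The teacher revises an invalid problem until it passes validation
        # (bounded here by the assumed max_revisions budget).
        for _ in range(max_revisions):
            candidate = teacher(level, feedback)
            valid, feedback = orchestrator(candidate)
            if valid:
                problem = candidate
                break
        if problem is None:
            break  # no valid problem within the revision budget
        final = problem
        solved = student(problem) == problem.answer
        if not solved:
            break  # the student failed: this problem is the finalized one
        level += 1  # the student succeeded: request a harder variant
    return final, solved


# Toy agents, for illustration only.
def toy_teacher(level: int, feedback: Optional[str]) -> Problem:
    tag = "revised" if feedback else "draft"
    return Problem(text=f"{tag} problem at level {level}",
                   answer=str(level), level=level)


def toy_orchestrator(problem: Problem) -> Tuple[bool, Optional[str]]:
    # Rejects every first draft to exercise the revision path.
    if problem.text.startswith("draft"):
        return False, "anomaly too weak"
    return True, None


def toy_student(problem: Problem) -> str:
    # Solves only the two easiest levels.
    return str(problem.level) if problem.level < 2 else "?"


if __name__ == "__main__":
    final, solved = run_protocol(toy_teacher, toy_orchestrator, toy_student)
    print(final, solved)
```

In this sketch, difficulty rises only while the student keeps answering correctly, so the returned problem is either the first one the student fails or the hardest one allowed by `max_level`.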

Paper Structure (60 sections, 3 equations, 6 figures, 11 tables)


Introduction
ATAD: Benchmark Protocol Design and Operation
Agent Roles
Protocol Phases
Initialization Phase (Base Problem Generation)
Adaptive Difficulty Scaling Phase
Evaluation Phase
Key Features
Task Design for Text Anomaly Detection
Task Overview and Motivation
Task Taxonomy: Seven Types of Text Anomalies and Reasoning Skills Targeted
Experiments and Results
Evaluation Setup
Overall Performance Evaluation
Valid Difficulty Scaling via Competitive Agents
...and 45 more sections

Figures (6)

Figure 1: Comparison of text anomaly samples. Left: Existing benchmarks include obvious anomalies (e.g., a completely off-topic shift from sports news to economy news) that are clear but too trivial. Right: ATAD examples introduce subtle shifts within context (e.g., from benefits to ethics in healthcare AI), preserving clarity while presenting reasoning-intensive challenges. Our collaborative agents resolve the clarity-difficulty trade-off through iterative task refinement.
Figure 2: Illustration of the overall ATAD protocol. Three agents iteratively interact to generate progressively challenging benchmarks designed to uncover subtle reasoning weaknesses in LLMs.
Figure 3: Examples of the seven task types of text anomalies. With the exception of T2, each task requires identifying a guaranteed anomaly within the sample (e.g., by selecting a sentence or choice), rather than performing a simple binary classification.
Figure 4: Consistency in Benchmark Generation.
Figure 5: Refined after rejection. The left side shows a rejected T1 (Sentence Context Anomaly) problem where the anomaly was conceptually weak and difficult to identify. The Orchestrator’s feedback noted the lack of semantic inconsistency and suggested stronger topic divergence. The revised version (right) introduces a scientifically framed yet incorrect statement about Camus’s influences, resulting in a clearer and more pedagogically effective anomaly. This highlights the Orchestrator’s role in guiding high-difficulty problem construction.
...and 1 more figure
