Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra

TL;DR

This paper tackles the challenge of scalable, on-policy evaluation for multi-turn, tool-using LLM agents by introducing Proxy State-Based Evaluation. It replaces heavy deterministic backends with an LLM-inferred proxy final state, guided by a scenario schema, a state tracker, and automated judges that verify goal completion and detect hallucinations. The framework yields stable, model-differentiating rankings, supports on-policy and off-policy data generation for training, and shows robustness through ablations and persona sensitivity. Practically, it offers a scalable, industry-ready evaluation environment that can accelerate iteration while preserving rigorous state-based assessment.

Abstract

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.
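As a rough illustration of the scenario fields listed in the abstract, the sketch below models a scenario as a plain Python dataclass and checks a proxy final state against the expected one. The class name `Scenario`, its field names, the example order data, and the exact-match function `state_matches` are all hypothetical. In the paper the final-state check is performed by an LLM judge, so the deterministic comparison here is only a toy stand-in.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for the scenario fields named in the abstract."""
    user_goal: str
    user_facts: Dict[str, Any]
    system_facts: Dict[str, Any]
    expected_final_state: Dict[str, Any]
    expected_agent_behavior: List[str]


def state_matches(proxy_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Toy stand-in for the LLM judge: every expected key must appear in the
    proxy state with an equal value. Extra keys in the proxy state are ignored."""
    return all(key in proxy_state and proxy_state[key] == value
               for key, value in expected.items())


# Illustrative scenario; all values are made up for this example.
scenario = Scenario(
    user_goal="Cancel order 123",
    user_facts={"name": "Alex"},
    system_facts={"order_123_status": "active"},
    expected_final_state={"order_123_status": "cancelled"},
    expected_agent_behavior=["Confirm the cancellation with the user first"],
)

# A proxy state as a state tracker might infer it from the interaction trace.
proxy_state = {"order_123_status": "cancelled", "refund_issued": True}
goal_completed = state_matches(proxy_state, scenario.expected_final_state)
```

Keeping the expected final state as structured data is what makes the reward checkable per scenario. An LLM judge would replace `state_matches` wherever values need semantic rather than exact comparison.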

Paper Structure (42 sections, 1 equation, 9 figures, 1 table)

This paper contains 42 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: Overview of the proxy state-based evaluation benchmark. In a multi-turn interaction, an LLM-based user simulator converses with a reasoning agent that plans and executes multi-step tool calls to LLM-based tool simulators. An LLM judge, calibrated with human experts, determines goal completion by checking the final proxy state. The benchmark 1) evaluates the reasoning agent’s ability to achieve goals via multi-turn dialogue and tool calling, and 2) yields conversation data with rewards and supports a leaderboard for comparing reasoning agents.
  • Figure 2: A scenario $z$ specifies user goal $g(z)$ and user facts $u(z)$ (both used by user simulator and LLM judge), system facts $s_0(z)$ (used by tool simulators, state tracker, and LLM judge), expected final state $s^\ast(z)$, and expected agent behavior (both used by LLM judge). Arrows denote inputs. These fields drive the interactive simulation and proxy state-based evaluation.
  • Figure 3: Goal completion rate (GC) on testing scenarios $\mathcal{Z}_{\text{test}}$ across baseline reasoning agents and trained models. Error bars show the bootstrap standard error. Fine-tuning substantially improves the base Qwen3-30B-A3B-Thinking-2507 model. RE: reasoning effort.
  • Figure 4: Ablations on scenario facts increase hallucinations. We randomly remove a fraction of system facts $s_0(z)$ or user facts $u(z)$. Tool hallucination rate and user hallucination rate rise steadily as more facts are removed. Error bars show the bootstrap standard error.
  • Figure 5: User persona sensitivity analysis. Error due to user ($\mathrm{ER}_{\text{user}}$) and user hallucination rate ($\mathrm{HR}_{\text{user}}$) across three personas $p$. More challenging personas increase user-induced errors and user hallucination rates. Error bars denote bootstrap standard error.
  • ...and 4 more figures
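
Several captions above report error bars as bootstrap standard errors. The sketch below shows one common way to compute such an error bar for a goal completion rate. It assumes per-scenario binary outcomes that are resampled with replacement. The function name, the number of resamples, and the example outcomes are hypothetical, and the paper's exact resampling procedure is not given in this excerpt.

```python
import random
import statistics
from typing import Sequence


def bootstrap_standard_error(outcomes: Sequence[int],
                             n_resamples: int = 1000,
                             seed: int = 0) -> float:
    """Estimate the standard error of the mean of `outcomes` by resampling
    with replacement and taking the standard deviation of the resampled means."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(outcomes)
    resampled_means = []
    for _ in range(n_resamples):
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    return statistics.stdev(resampled_means)


# Illustrative data: 1 = goal completed, 0 = not completed (made-up outcomes).
outcomes = [1] * 50 + [0] * 50
gc_rate = sum(outcomes) / len(outcomes)
gc_se = bootstrap_standard_error(outcomes)
```

For binary outcomes the result should land close to the analytic value $\sqrt{p(1-p)/n}$, which gives a quick sanity check on the resampling code.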