Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Penny Chong, Harshavardhan Abichandani, Jiyuan Shen, Atin Ghosh, Min Pyae Moe, Yifan Mai, Daniel Dahlmeier

Abstract

Agent applications are increasingly adopted to automate workflows across diverse tasks. However, because they operate in heterogeneous domains, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups and regex matching, which complicates the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals (such as tool signatures and responses) as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both the turn efficiency and the intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance, with peaks of 8-10% on our proposed metrics, after incorporating the identified error remedies into the agent's design.

Paper Structure

This paper contains 34 sections, 11 equations, 19 figures, 16 tables, 1 algorithm.

Figures (19)

  • Figure 1: Our proposed two-step automated error discovery approach, which identifies common agent errors based on judge and agent inconsistencies. Identical error colors indicate that similar low-level errors are clustered into the same high-level category.
  • Figure 2: Progress curves for selected ToolSandbox samples. (a) search_reminder_with_recency_upcoming: mistral-nemo (non-expert, purple; AUC=0.88, PPT=0.20) vs. gpt-5 (non-expert, blue; AUC=0.61, PPT=0.20). (b) find_current_city_low_battery_mode: mistral-nemo (expert, purple; AUC=0.77) vs. gpt-5 (non-expert, blue; AUC=0.64). (c) add_reminder_content_and_date_and_time: gpt-4.1 (non-expert, green; AUC=0.50) vs. mistral-large (non-expert, red; AUC=0.34).
  • Figure 3: $\mathbb{E}[\mathrm{progress}(i, G_i, \tau_i^l)]$ (dot) and $\mathrm{Var}[\mathrm{progress}(i, G_i, \tau_i^l)]$ (error bar) for the gpt-4o-mini agent on $\tau^2$-bench, using the non-expert user proxy gpt-4.1. Each dot is an agent trajectory from a single trial, and each task sample from the airline domain is evaluated using $n=k=20$ agent trials. We display only three example samples here. Sample 14 belongs to the hard split and the others to the easy split.
  • Figure 4: (a) $\tau^2$-bench airline sample_8: mistral-nemo expert (AUC=1.0, PPT=1.0) vs. gpt-4o-mini expert (AUC=0.982, PPT=0.5), and mistral-nemo non-expert (AUC=0.482, PPT=0.25) vs. gpt-4o-mini non-expert (AUC=0.714, PPT=0.125). (b) $\tau^2$-bench airline sample_8: the original evaluation approach of Barres et al. (2025) with $n=k=20$ trials.
  • Figure 5: $\tau^2$-bench airline sample 14. The blue box shows a truncated task instruction $i \in I$ for the non-expert user proxy gpt-4.1 model. The green boxes contain the truncated dialogues for agent trajectory 6 (left) and agent trajectory 18 (right). The agent model is gpt-4o-mini. The top-right box shows the errors identified. Zoom in for a larger view.
  • ...and 14 more figures
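
The captions above report an AUC for per-sample progress curves and a mean and variance of progress over repeated trials. The following is a minimal sketch of how such quantities could be computed. It assumes that progress after a turn is the fraction of subgoals (grading notes) satisfied so far and that AUC is the mean height of the curve over a fixed turn budget. These definitions and every name in the code are illustrative assumptions, not the paper's own formulas, and PPT is left out because its definition is not given here.

```python
from statistics import mean, pvariance


def progress_curve(achieved_at_turn, num_subgoals, max_turns):
    """Cumulative fraction of subgoals satisfied after each turn 1..max_turns.

    achieved_at_turn: for each satisfied subgoal, the 1-based turn at which it
    was first satisfied; use None for a subgoal that was never satisfied.
    (Hypothetical input format, assumed for this sketch.)
    """
    curve = []
    for t in range(1, max_turns + 1):
        done = sum(1 for turn in achieved_at_turn if turn is not None and turn <= t)
        curve.append(done / num_subgoals)
    return curve


def normalized_auc(curve):
    """Mean height of the progress curve, in [0, 1] (assumed normalization)."""
    return sum(curve) / len(curve)


def trial_stats(progress_per_trial):
    """Mean and population variance of final progress across repeated trials."""
    return mean(progress_per_trial), pvariance(progress_per_trial)


# Two hypothetical agents that both finish all three subgoals within four turns.
early = progress_curve([1, 1, 2], num_subgoals=3, max_turns=4)
late = progress_curve([4, 4, 4], num_subgoals=3, max_turns=4)
print(normalized_auc(early), normalized_auc(late))

# Final progress values from four hypothetical trials of one task sample.
print(trial_stats([1.0, 0.5, 0.5, 1.0]))
```

Under these assumptions, both agents reach the same final progress, but the one that satisfies subgoals earlier gets the larger area. This is the kind of difference a final pass/fail score cannot show.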