Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim; Sangwu Park; Yeonjun In; Sein Kim; Dongha Lee; Chanyoung Park

TL;DR

This work tackles the limitation of final-answer-centric evaluation for tool-augmented LLMs by introducing TRACE, a trajectory-focused framework that assesses reasoning across efficiency, hallucination, and adaptivity without relying on a fixed ground-truth path. It leverages an evidence bank to store step-level evidence, enabling post-hoc, ground-truth-free evaluation of multi-step tool use. The authors validate TRACE on meta-evaluation data (Meta-GTA and Meta-m&m's) built by augmenting the GTA and m&m's benchmarks, and report accurate, scalable assessments even with small open-source LLMs. They also show that TRACE provides deeper insights than existing state-consistency methods such as PIPA. Applying TRACE to diverse agents solving tool-augmented tasks reveals meaningful differences in trajectory quality that final accuracy alone obscures, offering concrete guidance for improving efficiency, reducing hallucinations, and enhancing adaptivity in real-world settings.

Abstract

Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, most of them still evaluate agents only by answer matching. As the number of steps required to resolve a user request increases, however, a proper evaluation of an agent's performance must go beyond the final answer and also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward way to evaluate these aspects is to compare an agent's trajectory with a ground-truth trajectory, but this approach is fundamentally limited because annotating all valid ground-truth trajectories is prohibitively expensive. At the same time, a simple LLM-based evaluator struggles to assess trajectories in detail without ground truth. To evaluate agents along these dimensions without ground-truth trajectories, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent's reasoning trajectory. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and the insights they yield.
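
The evidence-bank idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Step` and `EvidenceBank` classes, the `evaluate` function, the tool names, and both scoring rules are hypothetical stand-ins. In particular, a plain substring check replaces the LLM-based judgment that TRACE relies on, and adaptivity is not modeled at all. The sketch only shows how accumulating evidence from earlier steps lets each later step be checked without a ground-truth trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str    # hypothetical tool name
    args: dict   # arguments the agent passed to the tool
    output: str  # observation returned by the tool


@dataclass
class EvidenceBank:
    """Accumulates what the agent has observed in preceding steps."""
    facts: list = field(default_factory=list)
    calls: set = field(default_factory=set)

    def supports(self, value: str) -> bool:
        # Toy grounding check (assumption): substring match against evidence.
        return any(value in fact for fact in self.facts)

    def record(self, step: Step) -> None:
        self.facts.append(step.output)
        self.calls.add(call_key(step))


def call_key(step: Step) -> tuple:
    # Hashable signature of a tool call, used to detect exact repeats.
    return (step.tool, tuple(sorted((k, str(v)) for k, v in step.args.items())))


def evaluate(request: str, trajectory: list) -> dict:
    # The user request seeds the bank; tool outputs are added step by step.
    bank = EvidenceBank(facts=[request])
    redundant = ungrounded = 0
    for step in trajectory:
        if call_key(step) in bank.calls:
            redundant += 1   # same call repeated: counts against efficiency
        if any(not bank.supports(str(v)) for v in step.args.values()):
            ungrounded += 1  # argument not backed by earlier evidence
        bank.record(step)
    n = max(len(trajectory), 1)
    return {"efficiency": 1 - redundant / n, "hallucination_rate": ungrounded / n}


request = "What is the population of the capital of France?"
trajectory = [
    Step("search", {"query": "capital of France"}, "Paris is the capital of France."),
    Step("search", {"query": "capital of France"}, "Paris is the capital of France."),
    Step("population", {"city": "Paris"}, "Paris: toy population record"),
    Step("population", {"city": "Lyon"}, "Lyon: toy population record"),
]
scores = evaluate(request, trajectory)
print(scores)
```

In this toy run the second `search` call repeats the first, and `Lyon` appears in neither the request nor any earlier output, so one step is flagged under each criterion. A realistic evaluator would need semantic rather than literal matching, which is where an LLM judge reading the evidence bank would come in.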
