Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

TL;DR

This work tackles the limitation of final-answer-centric evaluation for tool-augmented LLMs by introducing TRACE, a trajectory-focused framework that assesses reasoning across efficiency, hallucination, and adaptivity without relying on a fixed ground-truth path. It leverages an evidence bank to store step-level evidence, enabling post-hoc, ground-truth-free evaluation of multi-step tool use. The authors validate TRACE with a meta-evaluation dataset (Meta-GTA and Meta-m&m's) augmented from GTA and m&m's benchmarks, demonstrating accurate, scalable assessments even for small open-source LLMs, and show that TRACE provides deeper insights than existing state-consistency methods like PIPA. Applying TRACE to diverse agents solving tool-augmented tasks reveals meaningful differences in trajectory quality that final accuracy alone obscures, offering concrete guidance for improving efficiency, reducing hallucinations, and enhancing adaptivity in real-world settings.

Abstract

Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, most of them still evaluate agents only by answer matching. As the number of steps required to resolve a user request increases, however, a proper evaluation of an agent's performance must go beyond the final answer and also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward way to evaluate these aspects is to compare an agent's trajectory with the ground-truth trajectory, but this approach is fundamentally limited because annotating all valid ground-truth trajectories is prohibitively expensive. Without ground truth, in turn, a simple LLM-based evaluator struggles to assess trajectories in detail. To evaluate agents along these dimensions without a fixed ground-truth trajectory, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent's reasoning trajectory. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.
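To make the evidence-bank idea concrete, here is a minimal sketch and not the authors' implementation. It assumes that the factual claims in each thought have already been extracted (the `claims` field is hypothetical) and replaces the LLM judge with a naive substring match. The names `Step`, `EvidenceBank`, and `flag_unsupported` are illustrative only. Each thought is checked against tool outputs gathered in earlier steps, so no ground-truth trajectory is needed.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    claims: list[str]   # claims in the agent's thought (hypothetical pre-extraction)
    tool_output: str    # text returned by the tool called at this step


@dataclass
class EvidenceBank:
    outputs: list[str] = field(default_factory=list)

    def add(self, tool_output: str) -> None:
        self.outputs.append(tool_output)

    def supports(self, claim: str) -> bool:
        # Toy stand-in for an LLM judge: case-insensitive substring match.
        return any(claim.lower() in out.lower() for out in self.outputs)


def flag_unsupported(steps: list[Step]) -> list[tuple[int, str]]:
    """Return (step index, claim) pairs not backed by earlier tool outputs."""
    bank = EvidenceBank()
    flagged = []
    for i, step in enumerate(steps):
        # A thought may only rely on evidence gathered before it.
        flagged.extend((i, c) for c in step.claims if not bank.supports(c))
        bank.add(step.tool_output)
    return flagged


trajectory = [
    Step(claims=[], tool_output="OCR result: total price 42 USD"),
    Step(claims=["42 USD"], tool_output="Calculator: 42 * 2 = 84"),
    Step(claims=["84", "tax included"], tool_output=""),
]
flagged = flag_unsupported(trajectory)
print(flagged)
```

In this toy trajectory, only the claim "tax included" in the last step has no supporting tool output, so it is the one flagged as a possible hallucination. A real evaluator would need a semantic check rather than string matching.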

Paper Structure

This paper contains 28 sections, 3 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: An example of agents returning the same answer through different trajectories given the same task.
  • Figure 2: Tool outputs are stored in the evidence bank, which is used to detect hallucinations in each thought and to assess trajectory efficiency after the final answer. Adaptivity is measured following the use of an unavailable tool.
  • Figure 3: Time efficiency comparison of LLM evaluators using TRACE on the Meta-GTA dataset.
  • Figure 4: Model accuracy based on the number of tokens used and dialogue turns.
  • Figure 5: Case study: GPT-4.1 and Qwen-72B both answer correctly, but their trajectories differ in efficiency.
  • ...and 14 more figures