ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

Shirley Kokane, Ming Zhu, Tulika Awalgaonkar, Jianguo Zhang, Thai Hoang, Akshara Prabhakar, Zuxin Liu, Tian Lan, Liangwei Yang, Juntao Tan, Rithesh Murthy, Weiran Yao, Zhiwei Liu, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong, Silvio Savarese

TL;DR

ToolScan addresses the need to diagnose and fix errors in LLM tool use, going beyond final-success metrics. It introduces a 150-query, human-annotated dataset spanning 10 environments and more than 30 tool-use tasks, identifies seven systematic error patterns, and provides a unified evaluation framework with a constructive feedback mechanism. Experiments across multiple open- and closed-weight LLMs show that larger models such as GPT-4 and models fine-tuned for API calls perform better, and that feedback and output-format choices significantly influence error rates. The work offers actionable insights for error mitigation and points to future benchmark extensions with richer environments and action families to improve the robustness of tool use in LLMs.

Abstract

Evaluating Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically give only a success rate without any explanation of the failure cases. To solve this problem, we introduce TOOLSCAN, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using TOOLSCAN, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from TOOLSCAN to guide their error mitigation strategies.
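
To make the idea of labeling error patterns (rather than reporting only a success rate) concrete, the following is a minimal sketch, not the benchmark's actual evaluator. It assumes a model emits a tool call as a JSON object with a "name" field and checks for two of the error types named in the figure captions: an incorrect format and an Incorrect Function Name (IFN). The tool names, the JSON schema, the label strings, and the feedback wording are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical set of valid tool names; the real ToolScan environments
# and their APIs are not reproduced here.
KNOWN_TOOLS = {"get_weather", "search_flights"}


def diagnose(raw_output: str) -> tuple[str, str]:
    """Return (error_label, feedback) for one model output.

    Assumed labels: "incorrect_format", "incorrect_function_name", "ok".
    The feedback string is what a feedback loop might send back to the model.
    """
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "incorrect_format", "Output is not valid JSON; return one JSON object."

    # Valid JSON that is not an object with a string "name" is still a format error.
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return "incorrect_format", 'Output must be an object with a string "name" field.'

    if call["name"] not in KNOWN_TOOLS:
        options = ", ".join(sorted(KNOWN_TOOLS))
        return (
            "incorrect_function_name",
            f"Unknown function {call['name']!r}; choose one of: {options}.",
        )

    return "ok", ""


if __name__ == "__main__":
    for output in ['not json', '{"name": "get_wether"}', '{"name": "get_weather"}']:
        print(diagnose(output))
```

Returning a label together with a feedback message mirrors the two uses described above: the label supports per-pattern error counts, and the message can be passed back to the model so it has a chance to correct itself.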

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Query Generation Workflow for ToolScan. Query and API Info collected from Open Source Toolsets are given as input to an LLM with a system prompt. The LLM is asked to generate augmented queries along with the API Calls required to solve that query.
  • Figure 2: Systemic workflow of an agent interacting with the environment, assessing the prior information and the predefined goal to determine its next action.
  • Figure 3: Left: The percentage of queries with an Incorrect Function Name (IFN) error is higher when queries are augmented with irrelevant terms (Higher is Better). Right: The model success rate is higher when the model is provided with feedback that helps it correct itself (Higher is Better).
  • Figure 4: Comparison of Structured and Unstructured Formats versus Incorrect Format Error (Higher is Better).
  • Figure 5: Environments with similar APIs tend to confuse the models, leading to lower performance (Higher is Better).