ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, Hayato Yamana

TL;DR

ToolBH is a multi-level diagnostic benchmark that studies hallucinations in tool-augmented LLMs along two axes, depth and breadth. In depth, it applies a three-level process (solvability detection, solution planning, and missing-tool analysis); in breadth, it covers three toolset scenarios: missing necessary tools (MNT), potential tools (PT), and limited functionality tools (LFT). The benchmark comprises 700 manually annotated samples across seven tasks and is evaluated on 14 LLMs. The results show that performance depends on more than model size, with training data and response strategy significantly shaping tool-enabled reasoning; judging task solvability remains a key failure mode, especially for open-weight models. The work offers granular error analysis and a scalable evaluation pipeline to guide the development of more robust and reliable tool-aware LLMs.

Abstract

Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community has yet to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM's hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the ToolBH benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o only achieve total scores of 45.3 and 37.0, respectively, on a scale of 100. In this benchmark, larger model parameters do not guarantee better performance; the training data and response strategies also play crucial roles in tool-enhanced LLM scenarios. Our diagnostic analysis indicates that the primary reason for model errors lies in assessing task solvability. Additionally, open-weight models suffer from performance drops with verbose replies, whereas proprietary models excel with longer reasoning.

Paper Structure

This paper contains 45 sections, 2 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: Hallucinated (red) and expected (blue) LLM responses on a tool-use task that has no correct answer. The example is taken from the AgentBoard (arXiv:2401.13178) Tool-Query dataset and tested with ChatGPT. Wrong Tool reflects a common situation in which the LLM uses a provided but incorrect tool for the final answer; Non-existent Tool, on the other hand, is an example of hallucination.
  • Figure 2: The pipeline of the ToolBH benchmark. In breadth, we examine three toolset scenarios (MNT, PT, LFT) that could induce hallucinations. In depth, we employ a multi-level evaluation (solvability detection, solution planning, and missing-tool analysis) to diagnose the causes of hallucinations in LLMs.
  • Figure 3: Performance of the top-3 performing models, compared across the three scenarios and across levels 1 to 3.
  • Figure 4: Error analysis of proprietary and open-weight models. The Y-axis represents the number of error cases.
  • Figure 5: Standardized relationships between response length and performance indicators for open-weight LLMs across three tool availability scenarios. Performance indicators: L1-EM (Level-1 Exact Match), L2-PR (Level-2 Progress Rate), L3-PR (Level-3 Progress Rate), and L3-MS (Level-3 Matching Score). Scenarios: MNT (Missing Necessary Tools), PT (Potential Tools), and LFT (Limited Functionality Tools). Each row depicts a specific performance indicator, while columns represent different scenarios.
  • ...and 13 more figures