Reducing Tool Hallucination via Reliability Alignment

Hongshen Xu, Zichen Zhu, Lei Pan, Zihan Wang, Su Zhu, Da Ma, Ruisheng Cao, Lu Chen, Kai Yu

TL;DR

This paper tackles tool hallucinations in tool-augmented LLMs by defining a taxonomy of tool selection and tool usage errors and by introducing a reliability-focused evaluation framework. It proposes two metrics, Reliable Pass Rate (RePR) and Benefit-Cost Utility, and builds RelyToolBench, which stresses reliability through Missing Parameter and Unmatched Tools subsets. The Relign framework expands the action space with indecisive actions and uses supervised fine-tuning (SFT) and direct preference optimization (DPO), together with a data-synthesis pipeline, to train models to defer, clarify, or switch tools. Empirical results show that Relign lowers tool-hallucination rates, increases reliable task success, and improves efficiency, with better generalization to out-of-distribution (OOD) API tasks. These findings highlight the value of reliability-aligned decision-making in real-world tool use by LLMs.

Abstract

Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations, where models either select inappropriate tools or misuse them, pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To systematically address this issue, we define and categorize tool hallucinations into two main types, tool selection hallucination and tool usage hallucination. To evaluate and mitigate these issues, we introduce RelyToolBench, which integrates specialized test cases and novel metrics to assess hallucination-aware task success and efficiency. Finally, we propose Relign, a reliability alignment framework that expands the tool-use action space to include indecisive actions, allowing LLMs to defer tool use, seek clarification, or adjust tool selection dynamically. Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions.
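
The expanded action space described in the abstract can be illustrated with a minimal sketch. This is a hand-written rule, not the paper's trained policy: the action names (`ASK_USER`, `CHANGE_TOOLS`), the `Tool` class, and the `choose_action` function are all hypothetical and chosen only for illustration. The sketch assumes that a missing required parameter should trigger a clarification request and that an unavailable tool should trigger a request for different tools, mirroring the Missing Parameter and Unmatched Tools cases.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """Hypothetical action space: one decisive action plus two indecisive ones."""
    CALL_TOOL = "call_tool"        # ordinary tool invocation
    ASK_USER = "ask_user"          # indecisive: seek clarification from the user
    CHANGE_TOOLS = "change_tools"  # indecisive: ask for a different tool set


@dataclass
class Tool:
    name: str
    required: set[str] = field(default_factory=set)  # required parameter names


def choose_action(wanted_tool: str, args: dict, tools: list[Tool]) -> tuple[Action, str]:
    """Pick an action instead of guessing a tool name or inventing arguments."""
    by_name = {t.name: t for t in tools}
    tool = by_name.get(wanted_tool)
    if tool is None:
        # No matching tool: calling anyway would be a tool selection hallucination.
        return Action.CHANGE_TOOLS, f"no available tool named {wanted_tool!r}"
    missing = sorted(tool.required - set(args))
    if missing:
        # Filling these in by guesswork would be a tool usage hallucination.
        return Action.ASK_USER, "missing parameters: " + ", ".join(missing)
    return Action.CALL_TOOL, tool.name
```

In this sketch the decision is a fixed rule; in the framework described above, the model itself is trained to prefer such indecisive actions when a reliable tool call is not possible.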

Paper Structure

This paper contains 35 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Different types of tool hallucination.
  • Figure 2: Evaluation process of tool hallucination.
  • Figure 3: Metric comparison between reliable and original pass rate. O, MP, and UT represent the original, missing parameter, and unmatched tools subsets, respectively.
  • Figure 4: The system illustration of Relign.
  • Figure 5: Comparison of performance metrics between the baseline and Relign across three subsets: Original (O), Missing Parameter (MP), and Unmatched Tools (UT).
  • ...and 4 more figures