What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks

Chengrui Huang, Zhengliang Shi, Yuntao Wen, Xiuying Chen, Peng Han, Shen Gao, Shuo Shang

TL;DR

This paper presents a systematic empirical study of the stability of tool-learning frameworks for LLMs, distinguishing internal factors (model choice, decoding, and tool-use framework) from external factors (prompts and toolsets). Using ToolBench with I1-instruction and I1-tool tasks, it shows that even strong models exhibit instability under perturbations, though increased trial-and-error exploration and tree-based tool selection can improve performance at higher costs. Key findings include substantial sensitivity to decoding temperature, inference steps, and toolset order/scale, as well as noticeable gains from customized system prompts for open models. The work offers practical guidance for designing robust tool-use agents and sets the stage for future research into multi-modal and dynamic tool-use environments.

Abstract

Tool learning methods have enhanced the ability of large language models (LLMs) to interact with real-world applications. Many existing works fine-tune LLMs or design prompts to enable LLMs to select appropriate tools and correctly invoke them to meet user requirements. However, previous works have observed that the performance of tool learning varies across tasks, datasets, training settings, and algorithms. Failing to understand the impact of these factors can lead to inconsistent results, inefficient model deployment, and suboptimal tool utilization, ultimately hindering the practical integration and scalability of LLMs in real-world scenarios. Therefore, in this paper, we explore the impact of both internal and external factors on the performance of tool learning frameworks. Through extensive experiments on two benchmark datasets, we draw several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration. We believe our empirical study provides a new perspective for future tool learning research.

Paper Structure (44 sections, 9 figures, 11 tables)

Figures (9)

  • Figure 1: Illustration of various factors that may affect the robustness of tool learning methods.
  • Figure 2: The default tool-use framework in our work. The LLM is guided to iteratively decide which tool to use (Thought), execute the selected tool (Action), and incorporate the execution results into context (Observation) for the next iteration prediction.
  • Figure 3: The overall framework of our work, which benchmarks tool-use models under various scenarios to investigate the internal and external factors that potentially affect their stability.
  • Figure 4: Self-consistency Success Rate of different models.
  • Figure 5: Impact of different foundation models.
  • ...and 4 more figures