RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation

Xiao Liu, Da Yin, Zirui Wu, Yansong Feng

TL;DR

This work introduces RefTool, a two-stage framework that (1) constructs executable tools from structured external references (e.g., textbooks) and (2) uses a hierarchical toolbox to select and apply these tools when solving problems. By grounding tool creation in reference material, RefTool extends LLM reasoning beyond the model's internal knowledge and performs robustly across causality, physics, and chemistry benchmarks. RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% in average accuracy, with substantial cost savings and strong tool reusability, while human evaluation confirms high tool quality and alignment with expert judgments. The study shows the value of external-reference grounding for generalizable, dataset-agnostic tool creation and reports practical gains on complex, knowledge-intensive reasoning tasks.

Abstract

Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models' internal knowledge and fail in domains beyond the LLMs' knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% in average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.
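The two modules described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Tool`, `Toolbox`, `validate`, and `keyword_chooser` names are hypothetical, the chapter/section hierarchy is an assumed layout, and a simple keyword matcher stands in for the LLM that would navigate the toolbox in RefTool. The sketch only shows the general shape of the idea: a tool is kept if it reproduces the reference's illustrative examples, and selection descends the hierarchy one level at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Tool:
    """A candidate tool (hypothetical structure, not from the paper)."""
    name: str
    description: str
    func: Callable[..., float]
    # (args, expected_output) pairs, e.g. worked examples from the reference
    examples: List[Tuple[tuple, float]] = field(default_factory=list)


def validate(tool: Tool, tol: float = 1e-6) -> bool:
    """Accept a tool only if it reproduces every illustrative example."""
    for args, expected in tool.examples:
        try:
            if abs(tool.func(*args) - expected) > tol:
                return False
        except Exception:
            return False
    return True


class Toolbox:
    """Assumed two-level hierarchy: chapter -> section -> list of tools."""

    def __init__(self) -> None:
        self.tree: Dict[str, Dict[str, List[Tool]]] = {}

    def add(self, chapter: str, section: str, tool: Tool) -> bool:
        """Tool creation: store the tool only if validation succeeds."""
        if not validate(tool):
            return False
        self.tree.setdefault(chapter, {}).setdefault(section, []).append(tool)
        return True

    def select(self, choose: Callable[[List[str]], str]) -> Tool:
        """Tool utilization: descend the hierarchy one level at a time.
        `choose` stands in for the LLM's choice among the listed options."""
        chapter = choose(list(self.tree))
        section = choose(list(self.tree[chapter]))
        tools = {t.name: t for t in self.tree[chapter][section]}
        return tools[choose(list(tools))]


def keyword_chooser(question: str) -> Callable[[List[str]], str]:
    """Toy stand-in for an LLM: pick the option sharing the most words."""
    words = set(question.lower().replace("?", "").split())

    def choose(options: List[str]) -> str:
        return max(
            options,
            key=lambda o: len(words & set(o.lower().replace("_", " ").split())),
        )

    return choose


box = Toolbox()
kinetic = Tool("kinetic_energy", "E = 1/2 m v^2",
               lambda m, v: 0.5 * m * v ** 2, [((2.0, 3.0), 9.0)])
faulty = Tool("kinetic_energy_faulty", "missing the 1/2 factor",
              lambda m, v: m * v ** 2, [((2.0, 3.0), 9.0)])
momentum = Tool("momentum", "p = m v",
                lambda m, v: m * v, [((2.0, 3.0), 6.0)])

added_kinetic = box.add("Mechanics", "Energy", kinetic)
added_faulty = box.add("Mechanics", "Energy", faulty)  # fails its example
added_momentum = box.add("Mechanics", "Momentum", momentum)

question = "What is the kinetic energy of a 4 kg mass moving at 2 m/s?"
chosen = box.select(keyword_chooser(question))
result = chosen.func(4.0, 2.0)
```

In this sketch the faulty tool is rejected because it does not reproduce its example, and selection only ever compares the options at one level of the hierarchy, which is the intuition behind navigating a structured toolbox rather than searching a flat list.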

Paper Structure

This paper contains 49 sections, 2 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Overview of the RefTool framework, which consists of two modules: tool creation (left) and tool utilization (right).
  • Figure 2: Example of a generated tool and its corresponding reference content. Code comments are omitted due to space limits.
  • Figure 3: Example case of GPT-4o with (right) and without (left) RefTool.
  • Figure 4: Example case of GPT-4o on a causal problem with (right) and without (left) RefTool. This is the detailed version of Figure 3.
  • Figure 5: Example case of Gemini-1.5-Pro on a physical problem with (right) and without (left) RefTool.
  • ...and 8 more figures