ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Murong Yue, Zhiwei Liu, Liangwei Yang, Jianguo Zhang, Zuxin Liu, Haolin Chen, Ziyu Yao, Silvio Savarese, Caiming Xiong, Shelby Heinecke, Huan Wang

TL;DR

ToolLibGen tackles the scalability bottleneck of tool-augmented LLM reasoning by automatically refactoring fragmented, question-specific tools into a structured Python library. It employs a three-stage pipeline (Question-Specific Tool Creation, Tool Clustering, Tool Aggregation) coordinated by a dual-LLM system plus a multi-agent loop (Coding Agent and Reviewing Agent) to preserve functional fidelity. The resulting library enables more accurate and scalable tool retrieval, improving reasoning performance across science, math, and medical QA, with strong results in both seen and unseen scenarios. The work demonstrates notable gains over baselines and provides a principled design for organizing reusable tools, with reproducibility details and avenues for future co-evolution of tool creation and usage strategies.

Abstract

Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific tools increases.
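
The cluster-then-aggregate flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_library`, `aggregate_cluster`, and the three callables (`assign_cluster`, `write_code`, `review`) are hypothetical names standing in for LLM calls, tools are represented as plain source strings, and the `max_rounds` cap is an assumption not taken from the paper.

```python
from typing import Callable, Dict, List

Tool = str  # in this sketch a tool is just its source code as text


def aggregate_cluster(
    tools: List[Tool],
    write_code: Callable[[List[Tool], List[str]], Tool],
    review: Callable[[Tool, List[Tool]], List[str]],
    max_rounds: int = 3,  # hypothetical cap on revision rounds
) -> Tool:
    """Alternate between a coding step and a reviewing step.

    `write_code` plays the coding agent: it proposes one aggregated tool
    from the cluster's tools plus the reviewer's latest feedback.
    `review` plays the reviewing agent: it returns a list of problems,
    and an empty list means no original capability was lost.
    """
    feedback: List[str] = []
    candidate = write_code(tools, feedback)
    for _ in range(max_rounds):
        feedback = review(candidate, tools)
        if not feedback:
            break  # reviewer accepts the aggregated tool
        candidate = write_code(tools, feedback)
    # If the cap is reached, the last revision is returned without a final review.
    return candidate


def build_library(
    tools: List[Tool],
    assign_cluster: Callable[[Tool], str],
    write_code: Callable[[List[Tool], List[str]], Tool],
    review: Callable[[Tool, List[Tool]], List[str]],
) -> Dict[str, Tool]:
    """Group question-specific tools by topic, then aggregate each group."""
    clusters: Dict[str, List[Tool]] = {}
    for tool in tools:
        clusters.setdefault(assign_cluster(tool), []).append(tool)
    return {
        topic: aggregate_cluster(members, write_code, review)
        for topic, members in clusters.items()
    }
```

Passing the agents in as callables keeps the control flow testable with simple stubs; in a real system each callable would wrap a model call.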

Paper Structure

This paper contains 36 sections, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: We obtained function-related tools $Tool_{1-3}$ from different problems; $Tool_2$ and $Tool_3$ have overlapping functionality. Through ToolLibGen, we integrated the three discrete, question-specific tools into a single class that covers all functionalities and a facade function that covers all possible parameter inputs.
  • Figure 2: The pipeline of our proposed method (icons in the figure distinguish the general LLM from $LLM_{solver}$). The LLM first generates and validates question-specific tools. Then it proposes clusters and assigns each tool to specific clusters. For each cluster, a coding agent abstracts the functionality of tools and writes code, while a reviewing agent validates the code with $LLM_{solver}$ to preserve the original functionalities.
  • Figure 3: (Left) The retriever accuracy changes as the number of questions for tool-making increases from 1k to 20k. (Right) A case study showing how ToolLibGen facilitates tool retrieval.
  • Figure 4: (a) Ablation results showing the effectiveness of our design; (b) Error distribution of GPT-4.1 in reasoning augmented by tools generated by ToolLibGen.
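
To make the aggregation described in the Figure 1 caption concrete, here is a minimal Python sketch of three question-specific tools merged into one class with a facade function. It is an illustration only: the kinematics example, the `tool_*` functions, the `Kinematics` class, and the `kinematics_tool` facade are all hypothetical and do not come from the paper.

```python
# Hypothetical question-specific tools, as might be extracted from three problems.
def tool_1_final_velocity(v0: float, a: float, t: float) -> float:
    return v0 + a * t


def tool_2_displacement_from_rest(a: float, t: float) -> float:
    return 0.5 * a * t ** 2


def tool_3_displacement(v0: float, a: float, t: float) -> float:
    # Overlaps with tool_2, which is the special case v0 = 0.
    return v0 * t + 0.5 * a * t ** 2


class Kinematics:
    """Aggregated tool: shared logic for motion under constant acceleration."""

    def __init__(self, a: float, v0: float = 0.0) -> None:
        self.a = a
        self.v0 = v0

    def velocity(self, t: float) -> float:
        return self.v0 + self.a * t

    def displacement(self, t: float) -> float:
        return self.v0 * t + 0.5 * self.a * t ** 2


def kinematics_tool(quantity: str, a: float, t: float, v0: float = 0.0) -> float:
    """Facade: one entry point that accepts every original parameter combination."""
    model = Kinematics(a=a, v0=v0)
    if quantity == "velocity":
        return model.velocity(t)
    if quantity == "displacement":
        return model.displacement(t)
    raise ValueError(f"unknown quantity: {quantity!r}")
```

A retriever then has a single documented entry point to match against instead of three near-duplicate functions, while each original call remains reproducible, for example `kinematics_tool("displacement", a=2.0, t=3.0)` in place of `tool_2_displacement_from_rest(2.0, 3.0)`.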