Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models
Mengsong Wu, Tong Zhu, Han Han, Xiang Zhang, Wenbiao Shao, Wenliang Chen
TL;DR
This work presents CoTools, a framework that lets frozen language models call external tools efficiently during chain-of-thought reasoning, even when the tool pool is very large and contains unseen tools. It uses three modules (Tool Judge, Tool Retriever, and Tool Calling) that operate on the model's hidden states to decide when to invoke a tool, retrieve the relevant tool, and perform the tool call, all without updating the LLM. Evaluations on numerical reasoning and knowledge-based question answering (KBQA) benchmarks, including the large-scale, unseen-tool dataset SimpleToolQuestions (STQuestions), show that CoTools consistently outperforms baselines such as 0-shot ChatGPT and ToolkenGPT, and that the approach scales to about 1000 tools. An analysis of the hidden-state dimensions that matter for tool selection adds interpretability. The results suggest that using the semantic representations of frozen LLMs for tool selection can significantly extend practical tool usage, with implications for robust, real-world AI agents. The paper also discusses limitations, including the handling of tools with multiple return values and the need for large-scale real-world tool pools for further validation.
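To make the judge-then-retrieve flow concrete, here is a minimal sketch in plain Python. It is an illustration under loud assumptions, not the authors' implementation: the linear-plus-sigmoid judge, the cosine-similarity retriever, the 4-dimensional toy vectors, and every name (`judge_tool_call`, `retrieve_tool`, `step`, `tool_pool`) are hypothetical. The summary above does not specify the real module architectures, and the Tool Calling step (filling in arguments) is omitted entirely.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def judge_tool_call(hidden, weights, bias, threshold=0.5):
    """Hypothetical Tool Judge: a linear scorer with a sigmoid on one hidden state.
    Returns True when the score says a tool should be called at this position."""
    score = 1.0 / (1.0 + math.exp(-(dot(hidden, weights) + bias)))
    return score > threshold

def retrieve_tool(query, tool_pool):
    """Hypothetical Tool Retriever: pick the tool whose vector is most similar
    to the query vector. tool_pool maps tool name -> vector."""
    return max(tool_pool, key=lambda name: cosine(query, tool_pool[name]))

def step(hidden, weights, bias, tool_pool):
    """One decoding position: return a tool name, or None to keep generating text."""
    if not judge_tool_call(hidden, weights, bias):
        return None
    return retrieve_tool(hidden, tool_pool)

# Toy stand-ins for frozen-LLM hidden states and tool representations (assumed values).
weights, bias = [0.5, 0.5, 0.5, 0.5], -0.2
tool_pool = {"add": [1.0, 0.0, 0.0, 0.0], "multiply": [0.0, 1.0, 0.0, 0.0]}

chosen = step([0.1, 0.9, 0.2, 0.0], weights, bias, tool_pool)
skipped = step([0.0, 0.0, 0.0, 0.0], weights, bias, tool_pool)

# An unseen tool is added by inserting its vector; nothing is retrained.
tool_pool["log"] = [0.0, 0.0, 0.0, 1.0]
chosen_unseen = step([0.0, 0.1, 0.0, 0.9], weights, bias, tool_pool)
```

The design point the sketch tries to capture is that retrieval compares vectors rather than classifying over a fixed tool vocabulary, so extending the pool only requires a representation for the new tool.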
Abstract
Tool learning can further broaden the usage scenarios of large language models (LLMs). However, most existing methods either require fine-tuning, after which the model can only use tools seen in the training data, or add tool demonstrations to the prompt, which lowers efficiency. In this paper, we present a new tool learning method, Chain-of-Tools. It makes full use of the powerful semantic representation capability of frozen LLMs to perform tool calling in CoT reasoning with a huge and flexible tool pool that may contain unseen tools. In particular, to validate the effectiveness of our approach in the massive unseen-tool scenario, we construct a new dataset, SimpleToolQuestions. We conduct experiments on two numerical reasoning benchmarks (GSM8K-XL and FuncQA) and two knowledge-based question answering benchmarks (KAMEL and SimpleToolQuestions). Experimental results show that our approach performs better than the baseline. We also identify dimensions of the model output that are critical in tool selection, enhancing model interpretability. Our code and data are available at https://github.com/fairyshine/Chain-of-Tools.
