GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution
Yining Lu, Haoping Yu, Daniel Khashabi
TL;DR
GEAR tackles the challenge of generalizing tool use in augmented language models without task-specific demonstrations or fine-tuning. It separates tool resolution into a grounding stage, in which small language models select a tool using semantic and pattern-based signals, and an execution stage, in which a large language model generates the tool API call. The approach achieves higher grounding precision and better downstream accuracy while reducing expensive LLM usage, and it generalizes to novel tasks, larger tool libraries, and smaller language models. Empirically, GEAR performs strongly on 14 datasets across 6 tasks and is robust to changes in tool library size and user tasks, enabling scalable, generalizable tool integration for real-world applications.
Abstract
Augmenting large language models (LLMs) to use external tools enhances their performance across a variety of tasks. However, prior work over-relies on task-specific demonstrations of tool use, which limits generalizability and raises computational cost because it requires many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that generalizes to various tasks requiring tool use without relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding to small language models (SLMs) and execution to an LLM, while leveraging semantic and pattern-based evaluation at both the question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools, and different SLMs. Despite being more efficient, GEAR achieves higher precision in tool grounding than prior strategies based on LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform counterpart tool-augmented baselines because of better tool use.
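The split between grounding and execution described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the `Tool` class, the word-overlap `semantic_score`, the regex-based `pattern_score`, the weight `alpha`, and the `fake_llm` stub are all hypothetical stand-ins chosen so the example runs without any model. In a real system the two scores would presumably come from a small language model and the call would be generated by an LLM.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    name: str
    description: str     # text used for the (toy) semantic signal
    output_pattern: str  # regex for the kind of answer the tool returns


def words(text: str) -> set:
    """Lowercase alphabetic tokens; digits and punctuation are ignored."""
    return set(re.findall(r"[a-z]+", text.lower()))


def semantic_score(query: str, tool: Tool) -> float:
    """Toy stand-in for SLM similarity: Jaccard overlap of word sets."""
    q, d = words(query), words(tool.description)
    return len(q & d) / len(q | d) if (q | d) else 0.0


def pattern_score(draft_answer: str, tool: Tool) -> float:
    """1.0 if a rough draft answer has the shape of the tool's output, else 0.0."""
    return 1.0 if re.fullmatch(tool.output_pattern, draft_answer.strip()) else 0.0


def ground(query: str, draft_answer: str, tools: List[Tool], alpha: float = 0.5) -> Tool:
    """Grounding stage: pick the tool with the best weighted score (alpha is assumed)."""
    def score(tool: Tool) -> float:
        return alpha * semantic_score(query, tool) + (1 - alpha) * pattern_score(draft_answer, tool)
    return max(tools, key=score)


def execute(query: str, tool: Tool, generate_call: Callable[[str, Tool], str]) -> str:
    """Execution stage: delegate API-call generation to a (stubbed) LLM."""
    return generate_call(query, tool)


# Hypothetical two-tool library for illustration only.
TOOLS = [
    Tool("Calculator", "compute arithmetic sum product of numbers", r"-?\d+(\.\d+)?"),
    Tool("WikiSearch", "look up facts about people places and events", r"[A-Za-z ]+"),
]


def fake_llm(query: str, tool: Tool) -> str:
    """Stub for the LLM; a real system would prompt a model here."""
    return f"{tool.name}({query!r})"


query = "What is the sum of 12 and 30?"
draft = "40"  # an imprecise draft answer still carries the numeric pattern
chosen = ground(query, draft, TOOLS)
call = execute(query, chosen, fake_llm)
```

In this toy setup the draft answer is wrong as arithmetic, but its numeric shape still points to the calculator. That is the intuition behind combining a pattern signal with a semantic one: grounding needs only the form of a plausible answer, so a cheap model can do the selection and the expensive model is called once for execution.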
