GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution

Yining Lu; Haoping Yu; Daniel Khashabi

TL;DR

GEAR tackles the challenge of generalizing tool use in augmented language systems without task-specific demonstrations or fine-tuning. It separates tool use into a grounding stage, in which small language models select a tool using semantic and pattern-based signals, and an execution stage, in which a large language model generates the tool API call. The approach achieves higher grounding precision and better downstream accuracy while reducing expensive LLM usage, and it generalizes to novel tasks, larger tool libraries, and smaller language models. Empirically, GEAR performs strongly on 14 datasets across 6 tasks and is robust to changes in tool library size and user tasks, supporting scalable tool integration in real-world applications.

Abstract

Augmenting large language models (LLMs) to use external tools enhances their performance across a variety of tasks. However, prior work over-relies on task-specific demonstrations of tool use, which limits generalizability and raises computational cost because of the many calls made to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that generalizes to various tasks requiring tool use without relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding to small language models (SLMs) and tool execution to LLMs, while leveraging semantic and pattern-based evaluation at both the question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools, and different SLMs. Despite being more efficient, GEAR achieves higher precision in tool grounding than prior strategies based on LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform their tool-augmented baseline counterparts because of better tool use.

Paper Structure (46 sections, 8 equations, 9 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Tool Use via Fine-tuning.
Tool Use via In-Context Learning.
Embodied Language Model in Robotics.
GEAR: Generalizable and Efficient Augmented Tool Resolution
Semantic Similarity Score
Pattern Similarity Score
Preliminary guess.
Tool-based response.
Scoring the alignment.
Experiment Setup
GEAR Implementation.
Tools.
Datasets.
...and 31 more sections

Figures (9)

Figure 1: GEAR leverages small language models (SLM) to facilitate the process of tool grounding for a given query and has the ability to add and utilize new tools for novel tasks without the need for fine-tuning or extra demonstrations. GEAR utilizes a large language model (LLM) in the tool execution module to ensure the accuracy of the final answer.
Figure 2: GEAR framework. It computes the pattern score by comparing the preliminary answer (in gray line) to tool responses (in green box) and the semantic score by comparing the query to tool descriptions (in blue box). GEAR then grounds the tool with the highest weighted average score and executes it via an LLM to obtain the final answer.
Figure 3: Grounding accuracy of GEAR when the tool library is expanded from 4 to 10 tools (see the grounding results subsection). We incrementally incorporate these tools: Multilingual QA, Timezone Converter, Sleep, Logarithmic Calculator, and Movement Controller.
Figure 4: A comparison of output patterns between SLMs and an LLM. The lines following [Question] represent the output generated by the corresponding model, with patterns (numbers, symbols, and English letters) labeled in different colors. While SLMs tend to be less accurate than the LLM, their responses provide sufficient clues (pattern distribution) about the form of the expected answer.
Figure 5: Averaged GEAR grounding performance over SLM sizes (number of parameters, in log scale) on Arithmetic and Commonsense QA tasks. Each task is evaluated on three datasets. GEAR with an SLM achieves grounding accuracy similar to GEAR with an LLM.
...and 4 more figures
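
The grounding step described in the Figure 2 caption (a semantic score between query and tool description, a pattern score between a preliminary answer and each tool's response, and a weighted average used to pick the tool) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the word-overlap `semantic_score` stands in for whatever similarity model GEAR uses, `pattern_score` is a toy overlap of character-class distributions (numbers, letters, symbols, as in Figure 4), and the function names, the example tools, and the `weight` of 0.5 are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def char_pattern(text: str) -> Counter:
    """Count coarse character classes: numbers, letters, and other symbols."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isspace():
            continue
        if ch.isdigit():
            counts["number"] += 1
        elif ch.isalpha():
            counts["letter"] += 1
        else:
            counts["symbol"] += 1
    return counts


def pattern_score(preliminary: str, tool_response: str) -> float:
    """Toy stand-in: overlap of the two normalized pattern distributions."""
    p, q = char_pattern(preliminary), char_pattern(tool_response)
    p_total, q_total = sum(p.values()), sum(q.values())
    if p_total == 0 or q_total == 0:
        return 0.0
    classes = ("number", "letter", "symbol")
    return sum(min(p[k] / p_total, q[k] / q_total) for k in classes)


def semantic_score(query: str, description: str) -> float:
    """Toy stand-in: Jaccard word overlap instead of a learned similarity."""
    a, b = set(query.lower().split()), set(description.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ground_tool(
    query: str,
    preliminary: str,
    descriptions: Dict[str, str],
    responses: Dict[str, str],
    weight: float = 0.5,  # hypothetical weighting, not a value from the paper
) -> Tuple[str, Dict[str, float]]:
    """Return the tool with the highest weighted average of both scores."""
    scores = {
        name: weight * semantic_score(query, descriptions[name])
        + (1 - weight) * pattern_score(preliminary, responses[name])
        for name in descriptions
    }
    return max(scores, key=scores.get), scores


# Hypothetical two-tool library; the preliminary guess "74" is wrong but
# still has the right pattern (all digits), which is what the score uses.
descriptions = {
    "calculator": "computes arithmetic such as 12 times 7",
    "wiki": "searches an encyclopedia for facts about people and places",
}
responses = {
    "calculator": "84",
    "wiki": "Multiplication is one of the four elementary operations",
}
best, scores = ground_tool("What is 12 times 7?", "74", descriptions, responses)
print(best)
```

In this toy run the calculator is selected even though the preliminary guess is numerically wrong, which mirrors the point made in the Figure 4 caption: the form of a small model's answer can carry enough signal for grounding. In the described framework the selected tool would then be handed to an LLM that generates the actual API call; that step is omitted here.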
