
TOOLVERIFIER: Generalization to New Tools via Self-Verification

Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, Jane Dwivedi-Yu

TL;DR

ToolVerifier introduces a self-verification framework that generalizes tool use to unseen APIs by decomposing tool calls into tool selection and parameter generation, each augmented with contrastive verification questions. A synthetic ToolSelect dataset supports zero-shot tool selection from large tool libraries, while few-shot demonstrations accompany parameter generation; verification questions are generated offline to refine decisions. Across four ToolBench tasks with 17 unseen tools, ToolVerifier achieves a 22% average improvement over few-shot baselines, with verification contributing additional gains (up to 14 points in parameter verification). The approach demonstrates that dual-stage verification and synthetic data can substantially enhance robustness and generalization in tool-based LLM behavior, with practical implications for building general-purpose assistants capable of new-tool adaptation.

Abstract

Teaching language models to use tools is an important milestone towards building general assistants, but remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, language models still struggle with learning how to robustly use new tools from only a few demonstrations. In this work we introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during (1) tool selection; and (2) parameter generation. We construct synthetic, high-quality, self-generated data for this goal using Llama-2 70B, which we intend to release publicly. Extensive experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines, even in scenarios where the distinctions between candidate tools are finely nuanced.
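
The tool selection stage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, and the helper names (`format_tools`, `top_two_tools`, `select_with_verification`) are all hypothetical, and obtaining the second candidate by removing the first pick and asking again is an assumption made for the sketch.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def format_tools(tools: Dict[str, str]) -> str:
    """Render a tool library (name -> description) as a bulleted list."""
    return "\n".join(f"- {name}: {desc}" for name, desc in tools.items())


def top_two_tools(llm: LLM, instruction: str, tools: Dict[str, str]) -> List[str]:
    """Pick the best tool, then pick again with that tool removed (assumed strategy)."""
    picks: List[str] = []
    remaining = dict(tools)
    for _ in range(2):
        prompt = (
            f"Tools:\n{format_tools(remaining)}\n"
            f"Instruction: {instruction}\nBest tool:"
        )
        choice = llm(prompt).strip()
        if choice not in remaining:
            break  # the model named an unknown tool; stop early
        picks.append(choice)
        del remaining[choice]
    return picks


def select_with_verification(
    llm: LLM, instruction: str, tools: Dict[str, str]
) -> Optional[str]:
    """Select a tool, using a contrastive question to decide between the top two."""
    candidates = top_two_tools(llm, instruction, tools)
    if len(candidates) < 2:
        return candidates[0] if candidates else None
    first, second = candidates
    pair = {first: tools[first], second: tools[second]}

    # Step 1: generate a question that contrasts the two candidate tools.
    question = llm(
        f"Tools:\n{format_tools(pair)}\n"
        "Write one question whose answer distinguishes these two tools:"
    ).strip()
    # Step 2: answer the question with respect to the user instruction.
    answer = llm(
        f"Instruction: {instruction}\nQuestion: {question}\nAnswer:"
    ).strip()
    # Step 3: append question and answer to the context and make the final choice.
    final = llm(
        f"Tools:\n{format_tools(pair)}\nInstruction: {instruction}\n"
        f"Verification question: {question}\nVerification answer: {answer}\n"
        "Best tool:"
    ).strip()
    # Fall back to the original top prediction if the final answer is not a candidate.
    return final if final in pair else first
```

Parameter generation could reuse the same pattern, with two candidate values per parameter taking the place of the two candidate tools.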

Paper Structure

This paper contains 51 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Overview of ToolVerifier. Starting with a candidate tool list and a user instruction, ToolVerifier initially identifies the top two tools. Subsequently, it generates a verification question by contrasting the selected tools and answers it. Finally, this information is appended to the context, leading to the final tool choice. The parameter generation follows a similar pipeline, wherein two candidate values are obtained for each parameter (latitude in the above figure). Subsequently, the verification question is used to finalize the parameter value.
  • Figure 2: Illustrative training example from our synthetically constructed tool selection dataset ToolSelect. Given a user instruction and a set of tools to choose from, the output consists of reasoning notes ("Thought") and the final tool selection ("Act").
  • Figure 3: Verification method for tool selection: a contrastive question is generated and then answered to help discern between the top two predicted tools.
  • Figure 4: We analyze various aspects of our synthetic ToolSelect training data, including the ordering of the candidate tool list ("No Shuffle"), difficulty level ("No Hard Data"), and the length of reasoning notes ("Short Reasoning"). We find that longer reasoning notes, difficult samples, and randomly ordered candidate tool lists all contribute to high performance ("Full Data").
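
To make the training format in Figures 2 and 4 concrete, here is a hedged sketch of how one ToolSelect-style example might be represented and rendered. The field names, prompt layout, and the helpers `render` and `parse_act` are assumptions for illustration only; the source specifies just that the input is a user instruction plus a candidate tool list, that the output holds reasoning notes ("Thought") and a final selection ("Act"), and that randomly ordered candidate lists help.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ToolSelectExample:
    """One synthetic training sample (field names are hypothetical)."""
    instruction: str
    tools: Dict[str, str]  # tool name -> description
    thought: str           # reasoning notes
    act: str               # name of the selected tool


def render(example: ToolSelectExample, rng: random.Random) -> Tuple[str, str]:
    """Build a (prompt, target) pair, shuffling the candidate tool order."""
    names = list(example.tools)
    rng.shuffle(names)  # random ordering, in contrast to the "No Shuffle" ablation
    tool_lines = "\n".join(f"{name}: {example.tools[name]}" for name in names)
    prompt = f"Tools:\n{tool_lines}\n\nInstruction: {example.instruction}\n"
    target = f"Thought: {example.thought}\nAct: {example.act}"
    return prompt, target


def parse_act(output: str) -> Optional[str]:
    """Extract the tool name from the 'Act:' line of a model output, if present."""
    for line in output.splitlines():
        if line.startswith("Act:"):
            return line[len("Act:"):].strip()
    return None
```

Passing a seeded `random.Random` keeps the shuffling reproducible, which makes it easier to compare runs with and without shuffling.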