OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang, Jack Hessel

TL;DR

ToolObserver is a simple framework that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. It outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings.

Abstract

Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
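
The refinement loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the opaque tool `tool_a`, the scripted call batches, and the `refine_docs` step (a plain string append standing in for an LLM reflection step) are all hypothetical names and simplifications introduced here for illustration. It only shows the general shape of the idea, which is to execute tool calls, record outputs and errors, and fold those observations back into the documentation until nothing new is learned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


def tool_a(x: int) -> int:
    """Hypothetical opaque tool: its name and docs reveal nothing about its behavior."""
    if x < 0:
        raise ValueError("negative input not supported")
    return x * 2


@dataclass
class Observation:
    tool: str
    args: tuple
    output: str  # stringified result, or an error message


def run_trajectory(tools: Dict[str, Callable], calls: List[Tuple[str, tuple]]) -> List[Observation]:
    """Execute a scripted list of tool calls and record the execution feedback."""
    observations = []
    for name, args in calls:
        try:
            output = repr(tools[name](*args))
        except Exception as exc:  # failures are feedback too
            output = f"ERROR: {exc}"
        observations.append(Observation(name, args, output))
    return observations


def refine_docs(docs: Dict[str, str], observations: List[Observation]) -> Dict[str, str]:
    """Assumed stand-in for an LLM reflection step: append unseen behaviors to the docs."""
    updated = dict(docs)
    for obs in observations:
        note = f"{obs.tool}{obs.args} -> {obs.output}"
        if note not in updated[obs.tool]:
            updated[obs.tool] += "\n- observed: " + note
    return updated


def observe_loop(tools, docs, batches, max_iters=10):
    """Refine docs batch by batch; stop early when a batch teaches nothing new."""
    for calls in batches[:max_iters]:
        updated = refine_docs(docs, run_trajectory(tools, calls))
        if updated == docs:
            break
        docs = updated
    return docs


if __name__ == "__main__":
    batches = [[("tool_a", (3,))], [("tool_a", (-1,))], [("tool_a", (3,))]]
    print(observe_loop({"tool_a": tool_a}, {"tool_a": "No description."}, batches)["tool_a"])
```

In this toy version the calls are scripted in advance; in the setting the abstract describes, an LLM agent would choose the calls while attempting tasks, and the documentation update would be produced by a model rather than by string concatenation.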

Paper Structure

This paper contains 45 sections, 5 equations, 13 figures, 7 tables, 1 algorithm.

Figures (13)

  • Figure 1: LLM agents may struggle when presented with opaque tools -- tools lacking a clear description of their usage best practices or failure modes. To succeed in these settings, we posit that LLM agents must explore tool usage to learn the tools' true behaviors.
  • Figure 2: Examples from each of the environments in OpaqueToolsBench. For Chess and BrowseComp Domains, each setting has a fixed set of tools across all instances (chess and search engines respectively). In BFCL-Opaque, each instance has custom query-dependent tools. The dataset defines the input and output (green). For each instance, the agent makes tool calls iteratively (blue). For BFCL-Opaque and BrowseComp Domains we check for a match with the gold answer. For Chess, we match engine choices in green with optimal engine choices.
  • Figure 3: Performance of ToolObserver on BFCL-Opaque with increasing iterations under three tool settings: (A) Anon. Function Names, (B) Anon. Function Names + Descriptions, (C) Anon. Function Names + Param. Names. Iterations stop after full convergence. These are expanded versions of the results in the table labeled tab:bfcl_full.
  • Figure 4: BFCL Exploration Prompt
  • Figure 5: The BFCL reflection prompt. It is split into three parts: a "pre-prompt", a "middle-prompt", and a "post-prompt". We concatenate them together with the real usage behaviors (i.e., the function calls and their outputs).
  • ...and 8 more figures