ToolFuzz -- Automated Agent Tool Testing

Ivan Milev, Mislav Balunović, Maximilian Baader, Martin Vechev

TL;DR

ToolFuzz addresses the reliability of LLM-based agents that rely on external tools by automatically testing the tool documentation for underspecification, overspecification, and ill-specification. It combines a taint-based fuzzing approach to provoke runtime errors with an invariance-based, synonymous-prompt methodology and an LLM oracle to detect correctness errors, forming two complementary error-detection pathways. The authors validate ToolFuzz on 32 LangChain tools and 35 custom tools, plus two new benchmarks (File Management and GitHub), showing the approach uncovers far more erroneous prompts and reduces false positives relative to prompt-based baselines. They also demonstrate that documentation fixes derived from ToolFuzz can improve agent task performance by about 10% in their benchmarks, highlighting practical benefits for building reliable AI agents.

Abstract

Large Language Model (LLM) Agents leverage the advanced reasoning capabilities of LLMs in real-world applications. To interface with an environment, these agents often rely on tools, such as web search or database APIs. As the agent provides the LLM with tool documentation alongside the user query, the completeness and correctness of this documentation are critical. However, tool documentation is often over-, under-, or ill-specified, impeding the agent's accuracy. Standard software testing approaches struggle to identify these errors as they are expressed in natural language. Thus, despite its importance, there currently exists no automated method to test the tool documentation for agents. To address this issue, we present ToolFuzz, the first method for automated testing of tool documentation. ToolFuzz is designed to discover two types of errors: (1) user queries leading to tool runtime errors and (2) user queries that lead to incorrect agent responses. ToolFuzz can generate a large and diverse set of natural inputs, effectively finding tool description errors at a low false positive rate. Further, we present two straightforward prompt-engineering approaches. We evaluate all three tool testing approaches on 32 common LangChain tools, 35 newly created custom tools, and 2 novel benchmarks to further strengthen the assessment. We find that many publicly available tools suffer from underspecification. Specifically, we show that ToolFuzz identifies 20x more erroneous inputs compared to the prompt-engineering approaches, making it a key component for building reliable AI agents.
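The two error types named in the abstract can be illustrated with a small Python sketch. This is not the ToolFuzz implementation: the toy tool, the toy agent, and both helper functions are hypothetical, and a simple majority vote over answers stands in for the LLM oracle that the paper uses to judge correctness. The first helper reports inputs on which a tool raises (runtime errors); the second reports prompts whose answer disagrees with the majority answer among synonymous prompts (a rough proxy for incorrect agent responses).

```python
from collections import Counter
from typing import Callable, Dict, List


def find_runtime_failures(tool: Callable[[str], str], inputs: List[str]) -> Dict[str, str]:
    """Map each input on which the tool raises to a short error description."""
    failures = {}
    for arg in inputs:
        try:
            tool(arg)
        except Exception as exc:  # any exception counts as a runtime error here
            failures[arg] = f"{type(exc).__name__}: {exc}"
    return failures


def find_inconsistent_prompts(agent: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Return the prompts whose answer differs from the majority answer.

    Assumption: majority vote replaces the LLM oracle described in the paper.
    """
    answers = {p: agent(p).strip().lower() for p in prompts}
    majority, _ = Counter(answers.values()).most_common(1)[0]
    return [p for p, a in answers.items() if a != majority]


# Hypothetical toy tool: its (imagined) documentation says "takes a city name",
# but it only knows two cities and crashes on everything else.
def toy_lookup_tool(city: str) -> str:
    table = {"zurich": "answer-for-zurich", "sofia": "answer-for-sofia"}
    return table[city.lower()]  # raises KeyError for unknown or empty input


# Hypothetical toy agent: naively treats the last word of the prompt as the city.
def toy_agent(prompt: str) -> str:
    city = prompt.rstrip("?").split()[-1]
    try:
        return toy_lookup_tool(city)
    except KeyError:
        return "unknown"


if __name__ == "__main__":
    print(find_runtime_failures(toy_lookup_tool, ["Zurich", "", "Paris"]))
    synonymous = [
        "What is the population of Zurich?",
        "How many people live in Zurich?",
        "Zurich has how many inhabitants?",
    ]
    print(find_inconsistent_prompts(toy_agent, synonymous))
```

In this toy setup the third phrasing is flagged because the agent extracts the wrong argument for the tool, which is the kind of discrepancy a consistency check over synonymous prompts is meant to surface.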

Paper Structure

This paper contains 60 sections, 29 figures, 5 tables.

Figures (29)

  • Figure 1: Overview of the two error detection techniques of ToolFuzz, consisting of (1) a fuzzing based approach and (2) an invariance based approach utilizing consistency checks. Prompts are denoted by $p$ or $p_j$, tool calls by $I_p$ or $I_j$, tool responses by $O_p$ or $O_j$ and agent responses by $a$ or $a_j$.
  • Figure 2: Input/Output overview for open_street_map_search tool evaluated with ToolFuzz. Note that the numbering corresponds to the numbering of the two approaches in Figure 1.
  • Figure 3: Example Implementation of the open-street-map-search tool.
  • Figure 4: High-level system diagram of an LLM (AI) Agent. The flow of the diagram follows the numbering: (1) The User sends a query to the Agent. (2) Along with the user query, the Agent is provided with the descriptions of all tools available to it. (3) With this information, the LLM plans actions. (4) Some of these actions require tool invocations. (5) After the tool calls, an observation is made. (6) Based on the observation, the Agent responds with an answer to the User.
  • Figure 5: This bar chart illustrates the number of erroneous prompts identified by different methods across various tool categories. Each method is depicted as a stacked bar, with the upper segment showing the true positives (correctly identified erroneous prompts) and the lower segment indicating the false positives.
  • ...and 24 more figures