Tools Fail: Detecting Silent Errors in Faulty Tools

Jimin Sun, So Yeon Min, Yingshan Chang, Yonatan Bisk

TL;DR

We introduce a framework for tools that guides an exploration of a model’s ability to detect “silent” tool errors and to reflect on how to plan. This aligns more directly with the increasingly popular use of models as tools.

Abstract

Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.

Paper Structure (42 sections, 3 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: (a) Tool-use Overview: Starting from an input $x$, the LLM generates inputs $i$ for the selected tool and incorporates tool outputs $o$ to predict the final task output $\hat{y}$. The context $c$ is used throughout the task. (b) Correct Calculator: Incorrect tool inputs from the LLM lead to tool failure. The error messages can be leveraged for correction (Refine). (c) Broken Calculator: Tool inputs are correct, but the tool itself silently produces false outputs. (d) ALFRED: The first tool, Object Detector, misidentifies the Tomato in the image as an Apple, leading to error cascades in the next tool, the Action Planner.
  • Figure 2: Prompt for a math problem using tool outputs. The result 25 is perturbed in the Broken scenario: Digit replacement, Magnitude shift, or Sign inversion.
  • Figure 3: Math accuracy of models. The black bar indicates the best accuracy without tool-use; upward (orange) and downward arrows indicate performance with correct and broken tool-use, respectively.
  • Figure 4: The rejection rate on the perturbed calculator outputs with respect to six features.
  • Figure 5: Evaluating two tool outputs in ALFRED -- Action Planner (Left) and Object Detector (Right). The LLM is asked whether to Accept/Reject the tool output, based on the provided image and task context.
  • ...and 7 more figures
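
The three perturbation types named in the Figure 2 caption (digit replacement, magnitude shift, sign inversion) can be illustrated with a minimal Python sketch. This excerpt does not describe how the authors implement them, so everything below is an assumption made for illustration: the function names are hypothetical, the inputs are restricted to nonzero integers, and the magnitude shift is taken to be multiplication by 10 or 100.

```python
import random


def digit_replacement(value: int, rng: random.Random) -> int:
    """Replace one randomly chosen digit with a different digit (assumed behaviour)."""
    sign = -1 if value < 0 else 1
    digits = list(str(abs(value)))
    pos = rng.randrange(len(digits))
    # Exclude the original digit, and avoid a leading zero on multi-digit
    # numbers so the perturbed value keeps the same number of digits.
    choices = [
        d for d in "0123456789"
        if d != digits[pos] and not (pos == 0 and d == "0" and len(digits) > 1)
    ]
    digits[pos] = rng.choice(choices)
    return sign * int("".join(digits))


def magnitude_shift(value: int, rng: random.Random) -> int:
    """Scale the value by 10 or 100 (the shift sizes are an assumption)."""
    return value * 10 ** rng.choice([1, 2])


def sign_inversion(value: int) -> int:
    """Flip the sign of the value."""
    return -value


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    correct = 25  # the example result mentioned in the Figure 2 caption
    print("digit replacement:", digit_replacement(correct, rng))
    print("magnitude shift:  ", magnitude_shift(correct, rng))
    print("sign inversion:   ", sign_inversion(correct))
```

In each case the perturbed value would be returned to the model in place of the correct tool output with no error message, which is what makes the failure "silent" in the sense of Figure 1(c).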