Benchmarking Failures in Tool-Augmented Language Models

Eduardo Treviño; Hugo Contant; James Ngai; Graham Neubig; Zora Zhiruo Wang

TL;DR

The paper introduces FAIL-TALMS, a benchmark for studying tool-augmented language models (TaLMs) under two practical failures: under-specified queries and unavailable tools. It builds a large tool environment (906 tools across 21 categories) and creates three data splits (perfect, under-specified, unavailable) totaling 1,749 examples, plus a No-Tools baseline, and defines metrics for pass rate, information and tool awareness, unexpected success, and skipped queries. Empirical results show that state-of-the-art models largely fail to recognize missing inputs or tools. Claude often achieves higher awareness, but awareness does not consistently translate into higher task success. The Ask-and-Help (AAH) protocol enables real-time human intervention, which substantially improves performance on under-specified tasks, with some models (e.g., Llama-405B) reaching or surpassing their perfect-setting results, but it offers limited gains when tools are unavailable. The work highlights the value of human-in-the-loop strategies for robust TaLM behavior and provides a dataset and methodology for evaluating practical TaLM failures and potential mitigations, while acknowledging limitations such as scalability and privacy concerns.

Abstract

The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find all current models except for Claude struggle to recognize missing tools or information. Further, to study possible mitigation of the failures, we enable real-time human interaction, named the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.
