ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo

TL;DR

ToolHaystack introduces a composable, long-horizon benchmark for stress-testing tool-augmented language models (TALMs) in realistic, noisy multi-session interactions. By interleaving target tool-use tasks with distractors (the haystack) and modeling context recall, information shifts, and missing context, the benchmark exposes robustness gaps that traditional multi-turn tests do not capture. Across 14–17 TALMs, results show substantial performance drops in long-term use, with pronounced sensitivity to distractors, context position, and goal drift, underscoring the need for dedicated long-context evaluation and improved memory-guided reasoning in agentic LLMs. The study also provides a scalable data-generation pipeline (tool collection, sequence construction, dialogue instantiation) and diagnostic analyses to guide future improvements in robust tool use and real-world deployment of TALMs.

Abstract

Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

Paper Structure

This paper contains 60 sections, 1 equation, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: ToolHaystack addresses long-term interactions that include evolving goals, semantic noise, and fragmented context.
  • Figure 2: Real-world interactions between a human and an agent are intertwined, with natural contextual noise accumulating over time.
  • Figure 3: Illustration of the scenarios in ToolHaystack: context recall (CR), information shift (IS), and missing context (MC).
  • Figure 4: Overview of our three-stage dataset generation pipeline.
  • Figure 5: Performance comparison between BFCL (multi-turn) and ToolHaystack (long-term) scores.
  • ...and 3 more figures