Non-Collaborative User Simulators for Tool Agents

Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo

TL;DR

The paper addresses a gap in tool-agent evaluation by introducing a non-collaborative user simulator that models four realistic user behaviors (Unavailable Services, Tangential Conversations, Impatience, and Incomplete Utterances) while preserving task goals. It extends a collaborative, LLM-based simulator that uses a dialogue state tracker, information sharding, and an ending verifier to keep the interaction aligned with the goal. In experiments on MultiWOZ and τ-bench with multiple LLMs, state-of-the-art agents degrade substantially under non-collaborative conditions, and the study analyzes behavior-specific failure modes such as API hallucinations and excessive apologies. The framework extends across benchmarks and domains, offering a practical tool for diagnosing and strengthening tool agents against real-world, non-cooperative user behavior.

Abstract

Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, so they fail to train and test agents against the non-collaborative users found in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and τ-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
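
As a rough illustration of the idea that a simulated user can behave non-collaboratively while still delivering every piece of task information, the following minimal Python sketch splits a goal into shards, emits one shard per turn, and only allows the dialogue to end once all shards have been delivered. It is not the paper's implementation: the names `ShardedGoal`, `user_turn`, and `can_end` are hypothetical, the utterances are fixed templates rather than LLM outputs, and only two of the four behaviors (tangential conversation and impatience) are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ShardedGoal:
    """A user goal split into small pieces of information (hypothetical structure)."""
    shards: list
    delivered: set = field(default_factory=set)

    def next_shard(self):
        # Return (index, text) of the first shard not yet delivered, or None.
        for index, text in enumerate(self.shards):
            if index not in self.delivered:
                return index, text
        return None


def user_turn(goal, behavior="collaborative"):
    """Produce one user utterance that carries the next shard.

    The non-collaborative behavior only decorates the utterance; the shard
    itself is always included, so task information is never lost.
    """
    pending = goal.next_shard()
    if pending is None:
        return "That is everything, thank you."
    index, text = pending
    goal.delivered.add(index)
    if behavior == "tangential":
        # Assumed template for a digression appended to the real request.
        return f"{text} By the way, do you ever get tired of answering questions?"
    if behavior == "impatience":
        # Assumed template for an impatient remark.
        return f"{text} Please hurry, this is taking too long."
    return text


def can_end(goal):
    """Toy ending check: the dialogue may end only after every shard was delivered."""
    return len(goal.delivered) == len(goal.shards)


goal = ShardedGoal([
    "I need a hotel in the north.",
    "It should have free parking.",
    "Book it for 2 nights.",
])
while not can_end(goal):
    print(user_turn(goal, behavior="tangential"))
```

Keeping the behavior as a wrapper around the shard, rather than a replacement for it, is one simple way to make the added difficulty independent of whether the task remains solvable.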

Paper Structure

This paper contains 111 sections, 7 figures, 26 tables.

Figures (7)

  • Figure 1: Overall structure of non-collaborative user simulation environment. This includes the tool agent environment, collaborative user simulator, and non-collaborative user simulation modules.
  • Figure 2: The overall structure of the collaborative user simulator. It illustrates the components used by the user simulator to generate utterances and shows all interactions between the modules.
  • Figure 3: The user simulator adjustment method for each non-collaborative user simulation. This illustrates the entire non-collaborative behavior simulation method we defined.
  • Figure 4: SFT training with collaborative and non-collaborative user simulation.
  • Figure 5: Human evaluation between PBUS and our user simulator.
  • ...and 2 more figures