Non-Collaborative User Simulators for Tool Agents
Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon Kook, Yohan Jo
TL;DR
The paper tackles the gap in tool-agent evaluation by introducing a non-collaborative user simulator that models four realistic user behaviors (Unavailable Services, Tangential Conversations, Impatience, and Incomplete Utterances) while preserving task goals. It builds on collaborative simulators with an LLM-based framework that includes a dialogue state tracker, information sharding, and an ending verifier to ensure goal-aligned interaction. Through experiments on MultiWOZ and τ-bench with multiple LLMs, the study reveals substantial performance degradation for state-of-the-art agents under non-collaborative conditions and analyzes behavior-specific failure modes such as API hallucinations and excessive apologies. The framework demonstrates extensibility across benchmarks and domains, offering a practical tool for diagnosing and strengthening tool agents against real-world, non-cooperative user behavior.
Abstract
Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, so agents are neither trained nor tested against the non-collaborative users they encounter in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and τ-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents' weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services.
