InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context
Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Luckeciano C. Melo
TL;DR
This paper addresses the difficulty of handling ambiguous, underspecified user requests in open-ended dialogue by introducing InfoQuest, a benchmark that evaluates multi-turn interactions with hidden context. InfoQuest generates diverse, ambiguous scenarios from predefined personas and evaluates agents through a three-stage pipeline (initial state, user simulation, verification) within a POMDP-like framework, which lets agents pursue targeted information seeking through clarifying questions. The authors describe a systematic methodology for scenario generation and evaluation, release a dataset of 1,000 scenarios with logs, and report experiments across proprietary and open models. The results show that even strong models struggle to gather information efficiently and tend to default to generic responses. The work offers practical insights into model limitations, supports automatic data generation for self-improvement, and lays groundwork for more interactive, context-aware conversational agents with better clarification capabilities.
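To make the three-stage pipeline concrete, the toy sketch below walks one episode through an initial state, a simulated user, and a verification step. It is not the authors' implementation: the names `Scenario`, `SimulatedUser`, `verify`, and `run_episode` are hypothetical, the agent is a scripted list of questions, and keyword matching stands in for the language-model components a real benchmark would presumably use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Stage 1 (initial state): an ambiguous request plus context hidden from the agent."""
    request: str
    hidden_context: dict[str, str]  # topic keyword -> detail only the user knows


@dataclass
class SimulatedUser:
    """Stage 2 (user simulation): reveals a hidden detail only when asked about its topic."""
    scenario: Scenario
    revealed: dict[str, str] = field(default_factory=dict)

    def reply(self, agent_message: str) -> str:
        text = agent_message.lower()
        for topic, detail in self.scenario.hidden_context.items():
            if topic in text and topic not in self.revealed:
                self.revealed[topic] = detail
                return detail
        # A generic or off-target message uncovers nothing.
        return "I'm not sure what else to add."


def verify(scenario: Scenario, final_answer: str) -> float:
    """Stage 3 (verification): fraction of hidden details reflected in the final answer."""
    details = list(scenario.hidden_context.values())
    hits = sum(1 for d in details if d.lower() in final_answer.lower())
    return hits / len(details)


def run_episode(scenario: Scenario, questions: list[str], max_turns: int = 5) -> float:
    """Run a scripted agent (hypothetical stand-in for a model) and score its answer."""
    user = SimulatedUser(scenario)
    replies = [user.reply(q) for q in questions[:max_turns]]
    final_answer = "Plan based on: " + "; ".join(replies)
    return verify(scenario, final_answer)


if __name__ == "__main__":
    trip = Scenario(
        request="Help me plan a trip.",
        hidden_context={"budget": "under $500", "destination": "Lisbon"},
    )
    print(run_episode(trip, ["What is your budget?", "Any destination in mind?"]))
    print(run_episode(trip, []))  # an agent that never asks recovers nothing
```

In this simplified setting, an agent that answers immediately scores zero because it never observes the hidden context, which mirrors the failure mode the benchmark is designed to expose.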
Abstract
Large language models excel at following explicit instructions, but they often struggle with ambiguous or incomplete user requests, defaulting to verbose, generic responses instead of seeking clarification. We introduce InfoQuest, a multi-turn chat benchmark designed to evaluate how dialogue agents handle hidden context in open-ended user requests. This benchmark presents intentionally ambiguous scenarios that require models to engage in information-seeking dialogue by asking clarifying questions before providing appropriate responses. Our evaluation of both open and closed models reveals that, while proprietary models generally perform better, all current assistants struggle to gather critical information effectively. They often require multiple turns to infer user intent and frequently default to generic responses without proper clarification. We provide a systematic methodology for generating diverse scenarios and evaluating models' information-seeking capabilities, which can be leveraged to automatically generate data for self-improvement. We also offer insights into the current limitations of language models in handling ambiguous requests through multi-turn interactions.
