InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context

Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Luckeciano C. Melo

TL;DR

This paper addresses the difficulty of handling ambiguous, underspecified user requests in open-ended dialogue by introducing InfoQuest, a benchmark built around multi-turn interactions with hidden context. InfoQuest generates diverse, ambiguous scenarios from predefined personas and evaluates agents through a three-stage pipeline (initial state, user simulation, verification) within a POMDP-like framework, in which agents can seek targeted information by asking clarifying questions. The authors describe a systematic methodology, release a dataset of 1,000 scenarios with logs, and report experiments across proprietary and open models, showing that even strong models struggle to gather information efficiently and tend to default to generic responses. The work offers practical insights into model limitations, supports automatic data generation for self-improvement, and lays groundwork for more interactive, context-aware conversational agents with better clarification capabilities.

Abstract

Large language models excel at following explicit instructions, but they often struggle with ambiguous or incomplete user requests, defaulting to verbose, generic responses instead of seeking clarification. We introduce InfoQuest, a multi-turn chat benchmark designed to evaluate how dialogue agents handle hidden context in open-ended user requests. This benchmark presents intentionally ambiguous scenarios that require models to engage in information-seeking dialogue by asking clarifying questions before providing appropriate responses. Our evaluation of both open and closed models reveals that, while proprietary models generally perform better, all current assistants struggle to gather critical information effectively. They often require multiple turns to infer user intent and frequently default to generic responses without proper clarification. We provide a systematic methodology for generating diverse scenarios and evaluating models' information-seeking capabilities, which can be leveraged to automatically generate data for self-improvement. We also offer insights into the current limitations of language models in handling ambiguous requests through multi-turn interactions.

Paper Structure

This paper contains 21 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Naive vs. information-seeking agents handling ambiguous user requests. Left: context. Center: naive agent's verbose response. Right: information-seeking agent's targeted questions.
  • Figure 2: InfoQuest's three-stage benchmark construction process. Left: initial state generation by selecting personas and creating ambiguous messages. Center: user setting with persona traits, goals, obstacles and constraints. Right: generation of a checklist to evaluate information gathering.
  • Figure 3: Average cumulative reward of diverse dialogue agents on InfoQuest. Top row: performance across all episodes. Bottom row: performance on the worst 25% of episodes, where performance gaps become more apparent. Models are grouped by category: large & reasoning models (left), mid-sized proprietary models (center), and open models (right). While all methods achieve non-trivial performance, there remains significant room for improvement in handling hidden context in open-ended conversations.
  • Figure 4: Conversation length distribution for top 25% of episodes. Claude 3.7 Sonnet and Gemini 1.5 Flash demonstrate superior efficiency, typically resolving queries in 6 turns for most high-performing episodes. In contrast, Gemini 2.5 Pro (thinking) frequently requires the maximum number of turns. Notably, all models exceed the ideal 5-turn threshold, often reaching the 10-turn conversation limit.
  • Figure 5: Extended Conversation Performance of Falcon3-7B-Instruct. The model's average cumulative reward per turn plateaus below the maximum, even after 30 turns, highlighting persistent challenges in sustaining effective information-seeking strategies throughout prolonged dialogues.
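The two rows of Figure 3 report an average over all episodes and an average over the worst 25% of episodes. A minimal sketch of how such a pair of summaries could be computed is given below. It assumes each episode is reduced to a single final cumulative reward; the function name `worst_fraction_mean` and the reward values are hypothetical and are not taken from the paper or its released logs.

```python
from statistics import mean


def worst_fraction_mean(episode_rewards, fraction=0.25):
    """Mean reward over the lowest-scoring `fraction` of episodes.

    Hypothetical helper: the paper's exact aggregation is not given here.
    """
    if not episode_rewards:
        raise ValueError("episode_rewards must be non-empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(episode_rewards)  # ascending, so the worst episodes come first
    k = max(1, int(len(ranked) * fraction))  # always keep at least one episode
    return mean(ranked[:k])


# Illustrative rewards only (made up for this sketch, not results from the paper).
rewards = [0.9, 0.2, 0.7, 0.4, 1.0, 0.6, 0.8, 0.1]

overall_average = mean(rewards)               # analogue of the top row
worst_quarter = worst_fraction_mean(rewards)  # analogue of the bottom row
print(f"all episodes: {overall_average:.3f}, worst 25%: {worst_quarter:.3f}")
```

Reporting the lower tail next to the overall mean is useful because two agents with similar averages can differ sharply on the episodes they handle worst, which is the gap the bottom row of Figure 3 is meant to expose.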