Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

Maximillian Chen, Ruoxi Sun, Tomas Pfister, Sercan Ö. Arık

TL;DR

This work tackles the challenge of learning pragmatic, ambiguity-resolving behavior in multi-turn LLM conversations under data constraints. It introduces Action-Based Contrastive Self-Training (ACT), a quasi-online, DPO-inspired method that trains dialogue policies by contrasting action choices and leveraging on-policy trajectory simulations. Across three tasks (PACIFIC, Abg-CoQA, and AmbigSQL), ACT outperforms standard tuning approaches such as supervised fine-tuning and DPO in data-efficient settings and improves multi-turn task completion, even when no action labels are available. The results support the importance of action-level contrastive learning and trajectory-aware supervision for conversational agents, and the paper also proposes a workflow for evaluating how well LLMs handle ambiguity.

Abstract

Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.

Paper Structure

This paper contains 60 sections, 2 equations, 7 figures, 31 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Simplified example of ambiguity in tabular-grounded conversational question answering, based on PACIFIC (Deng et al., 2022). A conversational agent should recognize when there is ambiguity and ask a clarifying question to reach a more accurate final answer.
  • Figure 2: ACT greatly outperforms standard tuning approaches in data-efficient settings for conversational modeling, as exemplified here on PACIFIC.
  • Figure 3: Overview of the tuning phase of ACT. For each initial contrastive pairing from $D_{pref}$ (constructed as in Sec. \ref{sec:preference_data_construction}), we sample an on-policy response from the model being tuned. After evaluating the sampled response's trajectory, we update the contrastive pairing by replacing either the existing winning response or the existing losing response. The model policy is then updated using the objective in Eq. \ref{eq:dpo_objective}.
  • Figure : Building Contrastive Action Pairs
  • Figure : ACT: Action-Based Contrastive Self-Training
  • ...and 2 more figures
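
The tuning step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. `PreferencePair`, `update_pair`, and the `trajectory_succeeds` callback are hypothetical names. The rule "a successful trajectory replaces the winning response, otherwise the sample replaces the losing response" is one plausible reading of the caption. `dpo_loss` is the standard per-example DPO loss, which may differ in detail from the objective the paper gives as its DPO equation.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class PreferencePair:
    """One contrastive pairing from D_pref (hypothetical container)."""
    prompt: str
    winning: str
    losing: str


def update_pair(
    pair: PreferencePair,
    sampled: str,
    trajectory_succeeds: Callable[[str, str], bool],
) -> PreferencePair:
    """Swap an on-policy sample into the pair based on its trajectory outcome.

    Assumption: `trajectory_succeeds(prompt, response)` stands in for the
    trajectory evaluation; how it is computed is not specified here.
    """
    if trajectory_succeeds(pair.prompt, sampled):
        return replace(pair, winning=sampled)  # sample becomes the new winner
    return replace(pair, losing=sampled)       # sample becomes the new loser


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    Inputs are sequence log-probabilities of the winning (w) and losing (l)
    responses under the tuned policy and under the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In a full loop, the log-probabilities would come from scoring the updated pair with the policy being tuned and with a frozen reference model; here they are plain floats so that the sketch stays self-contained.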