Learning Steerable Clarification Policies with Collaborative Self-play

Jonathan Berant, Maximillian Chen, Adam Fisch, Reza Aghajani, Fantine Huot, Mirella Lapata, Jacob Eisenstein

TL;DR

This work tackles uncertainty in AI assistants by learning steerable grounding policies that adapt to context and user preferences. It introduces a cost-conditioned training objective and uses Reinforced Self-Training (ReST) with collaborative user–assistant self-play to teach a single model to decide when to clarify, enumerate interpretations, or answer directly. Across AmbigQA and Pacific benchmarks, the steerable policy improves cost-efficient accuracy, reduces unnecessary clarifications, and generalizes to unseen cost coefficients. The approach lays groundwork for context-sensitive interaction strategies and suggests extensions to richer grounding modalities and downstream tool-use control.

Abstract

To handle underspecified or ambiguous queries, AI assistants need a policy for managing their uncertainty to determine (a) when to guess the user intent and answer directly, (b) when to enumerate and answer multiple possible intents, and (c) when to ask a clarifying question. However, such policies depend on context, including factors such as user preferences or modality. For example, enumerating multiple possible user intentions is cumbersome on small screens or in a voice setting. In this work, we propose to train steerable policies for managing this uncertainty using self-play. Given two agents, one simulating a user and the other an AI assistant, we generate conversations where the user issues a potentially ambiguous query, and the assistant needs to determine how to respond. Importantly, the model takes as input the numerical cost of each clarification question and of each generated word, and is asked to take the action that will maximize its final reward, which is the cost-penalized accuracy. We use Reinforced Self-Training (ReST) to train our model to achieve high reward and show that this leads to a steerable policy that changes its behavior predictably conditioned on the provided costs, leading to higher reward and accuracy. Moreover, our procedure also generalizes to numerical cost values that were unobserved at training time.
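The cost-penalized reward described in the abstract can be illustrated with a small sketch. The abstract only states that each clarification question and each generated word carries a numerical cost; the linear form below (accuracy minus alpha per clarification question minus beta per word), the `Rollout` structure, the function names, and all numbers are assumptions made for illustration, not the paper's implementation or results.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One simulated conversation outcome (hypothetical structure)."""
    strategy: str            # "direct", "multi_answer", or "clarify"
    accuracy: float          # assumed expected correctness of the final answer
    num_clarifications: int  # clarification questions asked by the assistant
    num_words: int           # words generated by the assistant


def cost_penalized_reward(r: Rollout, alpha: float, beta: float) -> float:
    # Assumed linear penalty: alpha per clarification question, beta per word.
    return r.accuracy - alpha * r.num_clarifications - beta * r.num_words


def best_strategy(rollouts: list[Rollout], alpha: float, beta: float) -> str:
    # Pick the strategy whose rollout has the highest reward under these costs.
    return max(rollouts, key=lambda r: cost_penalized_reward(r, alpha, beta)).strategy


# Made-up numbers for illustration only; not taken from the paper.
CANDIDATES = [
    Rollout("direct", accuracy=0.5, num_clarifications=0, num_words=5),
    Rollout("multi_answer", accuracy=0.9, num_clarifications=0, num_words=40),
    Rollout("clarify", accuracy=0.9, num_clarifications=1, num_words=15),
]

if __name__ == "__main__":
    for alpha, beta in [(0.01, 0.01), (0.5, 0.001), (0.5, 0.05)]:
        print(alpha, beta, best_strategy(CANDIDATES, alpha, beta))
```

Under these assumed numbers, the preferred strategy shifts from asking a clarification question (low turn cost), to enumerating interpretations (low per-word cost), to a direct guess (both costs high), which is the kind of cost-dependent behavior a steerable policy is meant to learn.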

Paper Structure

This paper contains 30 sections, 4 equations, 11 figures, 3 tables, and 1 algorithm.

Figures (11)

  • Figure 1: We train a steerable AI assistant that changes its behavior given cost coefficients associated with different dimensions of the conversation. The model is trained to maximize cost-penalized accuracy, with costs specified by $\alpha$ and $\beta$. In this example, the model asks a clarification question when the cost of additional conversation turns is low (left), enumerates multiple interpretations of the question when the cost of generating a long response is low (middle), and answers directly with an educated guess when the cost of both is high (right).
  • Figure 2: An example rollout. The environment samples an interpretation for an ambiguous query and passes the ambiguous query $q$ and its unambiguous interpretation $i$ to the user. The user passes $q$ to the assistant, which issues a clarification question, and after getting a clarification response outputs a multi-answer (answer that covers multiple interpretations), which is used to formulate the final answer. The user simulator uses its knowledge of $i$ to take the multi-answer and choose the part relevant for the answer.
  • Figure 3: An example where the assistant has access to a private context (table in this case), which is necessary for determining ambiguity of the query. The user does not have access to this context.
  • Figure 4: A Pacific example where SGP asks a clarification question that leads to higher reward compared to the Prompted baseline. Context is omitted for brevity; column names from the table are in bold. The table contains the column Transition costs and project assets twice: once under the category Other Current Assets and once under the category Other Assets. The answer is "2020" for the correct category, and "2018" for the wrong one.
  • Figure 5: Fraction of rollouts with a clarification question (left), fraction of rollouts with a multi-answer (middle), and average length of the final assistant answer (right) for Prompted, Prompted-COT, and SGP on the development sets of AmbigQA (top) and Pacific (bottom) for different values of cost coefficients. The $x$-axis for $\alpha$ is in square-root scale and for $\beta$ in logarithmic scale. In both datasets, SGP reduces the fraction of clarifications with higher $\alpha$ and reduces the fraction of multi-answers with higher $\beta$, as expected. Prompted and Prompted-COT do not exhibit this correct sensitivity to values of cost coefficients.
  • ...and 6 more figures