Asking the Right Question at the Right Time: Human and Model Uncertainty Guidance to Ask Clarification Questions

Alberto Testoni, Raquel Fernández

TL;DR

This work examines whether model uncertainty aligns with human uncertainty in the collaborative CoDraw task and finds only a weak correspondence between the two. It shows that using human clarification decisions as supervision may not be the best way to resolve model uncertainty, and introduces QDrawer, an uncertainty-based clarification module that generates template-based questions when the model is uncertain. Compared to baselines, QDrawer achieves substantial gains in task success and calibration (e.g., size accuracy up to 87.3% and a similarity score (SS) of 3.40, with improved ECE and Brier scores). The results support incorporating self-assessed uncertainty into dialogue systems to improve grounding and resolve underspecification, with implications for broader vision-language collaboration tasks.

Abstract

Clarification questions are an essential dialogue tool to signal misunderstanding, ambiguities, and under-specification in language use. While humans are able to resolve uncertainty by asking questions from childhood, modern dialogue systems struggle to generate effective questions. To make progress in this direction, in this work we take a collaborative dialogue task as a testbed and study how model uncertainty relates to human uncertainty -- an as yet under-explored problem. We show that model uncertainty does not mirror human clarification-seeking behavior, which suggests that using human clarification questions as supervision for deciding when to ask may not be the most effective way to resolve model uncertainty. To address this issue, we propose an approach to generating clarification questions based on model uncertainty estimation, compare it to several alternatives, and show that it leads to significant improvements in terms of task success. Our findings highlight the importance of equipping dialogue systems with the ability to assess their own uncertainty and exploit it in interaction.

Paper Structure (24 sections, 1 equation, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of our experimental setup in CoDraw. After receiving an instruction from the Teller, the Drawer agent selects the clipart(s) to draw, together with their attributes. If the entropy over an attribute (size in the figure) exceeds a threshold $\theta$, then the Drawer asks a clarification question. The question-answer pair is added to the dialogue history before performing the next drawing action. Different human players may react differently (bottom-left box); the agent decides whether to ask for clarification on the basis of its own uncertainty, independently from the clarification decisions of human players.
  • Figure 2: Average Precision (AP, green) of a logistic regression model in predicting when human players asked clarification questions based on the uncertainty extracted from a Drawer trained for a variable number of epochs ($x$-axis). The plot also shows the similarity score (SS, blue) achieved by the Drawer, and its uncertainty on clipart selection (normalized and reversed clipart scores), and on the size and orientation attributes (as measured by the entropy H). For each epoch, we report the average and variance of 5 Drawer models trained with different seeds. Note that the $y$-axis has two scales: we report SS on the left, while all the other metrics refer to the right side.
  • Figure 3: Example of clarification exchanges (in green) with questions generated by QDrawer and answers by the Teller. The images on the right depict the QDrawer’s canvas after each round of conversation.
  • Figure 4: Effect of the average number of questions per dialogue (considering all dialogues in the test set) on the size accuracy. We compare the uncertainty-guided QDrawer with a version that asks questions in random turns and one that asks questions in the turns selected by an external Decider model.
  • Figure 5: Examples of dialogues and clarification questions asked by different models. T stands for Teller, SD for Silent Drawer, and QD for QDrawer. Clarification exchanges are highlighted in green.
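
The decision rule described in the Figure 1 caption (ask a clarification question when the entropy over an attribute exceeds a threshold $\theta$) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the question templates, the example probabilities, and the choice to ask about the single most uncertain attribute are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical question templates, one per attribute (not taken from the paper).
TEMPLATES: Dict[str, str] = {
    "size": "What size is the {clipart}?",
    "orientation": "Which way is the {clipart} facing?",
}


def maybe_ask(clipart: str,
              attribute_probs: Dict[str, List[float]],
              theta: float) -> Optional[str]:
    """Return a templated question about the most uncertain attribute
    whose entropy exceeds theta, or None if the model is confident enough."""
    best_attr, best_h = None, theta
    for attr, probs in attribute_probs.items():
        h = entropy(probs)
        # Only attributes with a template and entropy above the current
        # maximum (initially theta) are candidates.
        if attr in TEMPLATES and h > best_h:
            best_attr, best_h = attr, h
    if best_attr is None:
        return None
    return TEMPLATES[best_attr].format(clipart=clipart)


if __name__ == "__main__":
    # Assumed predicted distributions for one clipart: the model is unsure
    # about the size but fairly sure about the orientation.
    predicted = {
        "size": [1 / 3, 1 / 3, 1 / 3],
        "orientation": [0.95, 0.05],
    }
    print(maybe_ask("tree", predicted, theta=0.8))
```

In this sketch the returned question would be appended, together with the Teller's answer, to the dialogue history before the next drawing action; the value of `theta` controls how often questions are asked and would need to be tuned on held-out data.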