Common Ground Tracking in Multimodal Dialogue

Ibrahim Khebour, Kenneth Lai, Mariah Bradford, Yifan Zhu, Richard Brutti, Christopher Tam, Jingxuan Tu, Benjamin Ibarra, Nathaniel Blanchard, Nikhil Krishnaswamy, James Pustejovsky

TL;DR

This work tackles common ground tracking (CGT) in multi-party task-oriented dialogue, extending beyond traditional dialogue state tracking by modeling the shared belief space and questions under discussion (QUDs). It introduces a formal Common Ground Structure (CGS) with three banks ($QBank$, $EBank$, and $FBank$) and integrates an evidence-based dynamic epistemic logic to update beliefs as the dialogue unfolds. The authors augment the Weights Task Dataset with gesture, action, and CG annotations and build a multimodal pipeline (move classifier, propositional extractor, and closure rules) to predict and propagate common-ground content, yielding a new benchmark for CGT. Results show that multimodal information improves CGT in several groups, though performance is highly group-dependent. Future work should address per-speaker banks and cross-encoder propositional extraction to improve robustness and scalability.

Abstract

Within Dialogue Modeling research in AI and NLP, considerable attention has been paid to "dialogue state tracking" (DST), which is the ability to update the representations of the speaker's needs at each turn in the dialogue by taking into account the past dialogue moves and history. Less studied but just as important to dialogue modeling, however, is "common ground tracking" (CGT), which identifies the shared belief space held by all of the participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and "questions under discussion" (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.
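
To make the idea of model outputs cascading into update operations more concrete, the following is a minimal Python sketch of a three-bank common ground structure. Only the bank names (QBank, EBank, FBank) come from the summary above. The move labels (`STATEMENT`, `ACCEPT`, `DOUBT`), the specific promotion and demotion rules, the class name `CommonGroundStructure`, and the proposition string `"red = blue"` are illustrative assumptions, not the paper's actual closure rules.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CommonGroundStructure:
    """Toy three-bank structure; the update rules are assumptions for illustration."""

    qbank: set[str] = field(default_factory=set)  # propositions still under discussion
    ebank: set[str] = field(default_factory=set)  # propositions with some evidence
    fbank: set[str] = field(default_factory=set)  # propositions accepted by the group

    def update(self, move: str, prop: str) -> None:
        """Apply one predicted (move, proposition) pair to the banks."""
        if move == "STATEMENT":
            # Assumed rule: asserting p lifts it out of QBank into EBank,
            # unless the group has already accepted it.
            self.qbank.discard(prop)
            if prop not in self.fbank:
                self.ebank.add(prop)
        elif move == "ACCEPT":
            # Assumed rule: accepting p promotes it from EBank to FBank.
            if prop in self.ebank:
                self.ebank.discard(prop)
                self.fbank.add(prop)
        elif move == "DOUBT":
            # Assumed rule: doubting p sends it back to QBank.
            self.ebank.discard(prop)
            self.fbank.discard(prop)
            self.qbank.add(prop)
        else:
            raise ValueError(f"unknown move: {move!r}")


# Hypothetical walk-through: one participant states a proposition, another accepts it.
cgs = CommonGroundStructure(qbank={"red = blue"})
cgs.update("STATEMENT", "red = blue")
cgs.update("ACCEPT", "red = blue")
print(cgs.fbank)
```

In this sketch a proposition lives in at most one bank at a time, which keeps the three sets directly comparable against per-bank ground truth.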

Paper Structure

This paper contains 19 sections, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Sample still from the Weights Task Dataset showing communication with multiple modalities. The accompanying utterance at this time is "Put the twenty on there; take off a ten".
  • Figure 2: Example dialogue. Participant 3 (right) says "looks like they're fairly equal" after placing the red and blue blocks on different sides of the scale. We refer back to this example elsewhere in the paper.
  • Figure 3: Move classifier architecture.
  • Figure 4: DSC for each bank aggregated across groups, plotted vs. utterance, using all modalities in the move classifier. [L]: propositional extraction performed using the multimodal CGA method. [R]: propositional extraction performed using the language-only Dense Paraphrase (DP) method.
  • Figure 5: Still of annotation procedure using ELAN.
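
The Figure 4 caption reports DSC per bank. Assuming DSC denotes the Sørensen-Dice coefficient between a predicted bank and its ground-truth counterpart, a minimal sketch of the per-bank score could look like the following. The function name `dice`, the example propositions, and the convention of returning 1.0 when both sets are empty are assumptions made for illustration.

```python
def dice(predicted: set, gold: set) -> float:
    """Sørensen-Dice coefficient between two sets of propositions."""
    if not predicted and not gold:
        # Assumed convention: two empty banks count as a perfect match.
        return 1.0
    overlap = len(predicted & gold)
    return 2 * overlap / (len(predicted) + len(gold))


# Hypothetical predicted vs. ground-truth bank contents at one utterance.
predicted_fbank = {"red = 10", "blue = 10"}
gold_fbank = {"blue = 10", "green = 20"}
print(dice(predicted_fbank, gold_fbank))  # 0.5
```

Computing this score after every utterance for each of the three banks would produce one curve per bank, which matches the "plotted vs. utterance" layout the caption describes.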