DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

Jia Li, Yangchen Yu, Yin Chen, Yu Zhang, Peng Jia, Yunbo Xu, Ziqiang Li, Meng Wang, Richang Hong

TL;DR

The paper addresses frame-level engagement estimation in multi-person conversations using a language-independent approach. It introduces a Dialogue-Aware Transformer (DAT) with Modality-Group Fusion (MGF) to independently fuse audio and visual cues per participant, and a Dialogue-Aware Encoder (DAE) that incorporates conversational partner information via cross-attention. In experiments on NoXi Base, NoXi-Add, and MPIIGI (MultiMediate'24), the method achieves state-of-the-art CCC results (e.g., 0.76 on NoXi Base and an average of 0.64 across datasets), with ablations showing clear gains from MGF and DAE and from their combination. The approach highlights the importance of partner cues and modality-specific fusion for robust engagement estimation in real-world dialogues, providing a solid baseline for future dialogue-centric multimodal research.

Abstract

Engagement estimation plays a crucial role in understanding human social behaviors and is attracting growing research interest in fields such as affective computing and human-computer interaction. In this paper, we propose a Dialogue-Aware Transformer framework (DAT) with Modality-Group Fusion (MGF), which relies solely on audio-visual input and is language-independent, for estimating human engagement in conversations. Specifically, our method employs a modality-group fusion strategy that independently fuses audio and visual features within each modality for each person before inferring the entire audio-visual content. This strategy significantly enhances the model's performance and robustness. Additionally, to better estimate the target participant's engagement levels, the introduced Dialogue-Aware Transformer considers both the participant's behavior and cues from their conversational partners. Our method was rigorously tested in the Multi-Domain Engagement Estimation Challenge held by MultiMediate'24, demonstrating notable improvements in engagement-level regression precision over the baseline model. Notably, our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
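The results above are reported as CCC, the concordance correlation coefficient, which rewards predictions that both correlate with the labels and match their mean and scale. The following is a minimal sketch of the standard definition, using population (biased) variances; the function name `concordance_cc` is an illustrative choice and this is not the challenge's official evaluation script.

```python
from statistics import fmean


def concordance_cc(pred, target):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, target) / (var(pred) + var(target) + (mean_pred - mean_target)^2)
    Population (biased) variance and covariance are used here; that choice is an assumption.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_t = fmean(pred), fmean(target)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


if __name__ == "__main__":
    labels = [0.1, 0.4, 0.5, 0.9]
    print(concordance_cc(labels, labels))                     # perfect agreement
    print(concordance_cc([x + 0.3 for x in labels], labels))  # same shape, shifted mean
```

Unlike Pearson correlation, a constant offset between predictions and labels lowers the score, which is why the shifted example scores below the perfect-agreement one.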

Paper Structure

This paper contains 14 sections, 14 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The task of engagement estimation [muller2024multimediate] aims to predict the continuous values of a target participant's engagement level frame-by-frame in a conversation. Contextual segmentation is performed on the entire recording sequence to include more contextual information around a specific time period for prediction.
  • Figure 2: Overall architecture of the proposed method. Our DAT consists of two main modules: Modality-Group Fusion and Dialogue-Aware Encoder. First, the Modality-Group Fusion module processes audio and visual features for both the participant and the partner. Each feature is processed by a Transformer before fusion. Next, the Dialogue-Aware Encoder uses cross-attention to combine and encode information from both participants, focusing on contextual interactions to enhance engagement prediction. Finally, an MLP predicts continuous engagement levels frame-by-frame from the encoded features.
  • Figure 3: Ablation on the unified mapping space dimension $d$.
  • Figure 4: Real-time fitting with our method; the selected interval length is fixed at 5000 samples.
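The cross-attention step described for the Dialogue-Aware Encoder in Figure 2 can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections, where the target participant's frames act as queries and the partner's frames act as keys and values. The function names and toy feature values are hypothetical, and the paper's actual encoder (layer count, heads, projections, residual connections) is not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: T_q frames of the target participant, each a list of d floats.
    keys, values: T_k frames of the partner (keys have d floats each).
    Returns T_q output frames, each a weighted average of the partner's values.
    """
    d = len(queries[0])
    scale = math.sqrt(d)
    outputs = []
    for q in queries:
        # Similarity of this participant frame to every partner frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Mix the partner's value vectors according to the attention weights.
        dim_v = len(values[0])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        )
    return outputs


if __name__ == "__main__":
    participant = [[1.0, 0.0], [0.0, 1.0]]          # toy fused features, 2 frames
    partner = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy fused features, 3 frames
    for frame in cross_attention(participant, partner, partner):
        print([round(x, 3) for x in frame])
```

Each output frame is a convex combination of the partner's frames, so the participant's representation is enriched with whichever partner frames it most resembles. A trained model would add learned query, key, and value projections on top of this.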