From Natural Alignment to Conditional Controllability in Multimodal Dialogue

Zeyu Jin, Songtao Zhou, Haoyu Wang, Minghao Tian, Kaifeng Yun, Zhuo Chen, Xiaoyu Qin, Jia Jia

Abstract

The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities such as speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline that curates dialogues from movies and TV series with fine-grained annotations of interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal the limitations of current frameworks in replicating the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.

Paper Structure

This paper contains 29 sections, 1 equation, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: An example dialogue clip from MM-Dia and MM-Dia-Bench with hierarchical (sentence- and dialogue-level) annotations, featuring rich multimodal dialogue interaction details. The right panel depicts three multimodal dialogue generation tasks involving text ($\mathcal{T}$), audio ($\mathcal{A}$), vision ($\mathcal{V}$), and dialogue style ($\mathcal{D}$), demonstrating both explicit control (Task 1) and implicit control (Tasks 2 and 3).
  • Figure 2: Framework of the Movie/TV-sourced in-the-wild data curation pipeline for multimodal dialogue extraction with fine-grained interaction-level annotations.
  • Figure 4: Three bad/good cases of subtitle alignment: (a) edited movie segments, (b) altered movie playback speed, and (c) potential usability with a time translation. In each case, the upper plot shows the discrepancy between anchor-point start times in the subtitle and the ASR result, and the lower plot shows the duration discrepancy. (A minimal sketch of this kind of alignment check follows this list.)
  • ...and 4 more figures
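
To make the alignment check behind Figure 4 concrete, the sketch below compares anchor-point start times and durations between a subtitle track and ASR output, and accepts a clip only when the residual error is explained by a constant time translation. This is a minimal Python sketch under assumed inputs; the function name, thresholds, and (start, end) span format are illustrative and not the paper's implementation.

    from statistics import median, pstdev

    def check_subtitle_alignment(sub_spans, asr_spans,
                                 max_jitter=0.3, max_duration_gap=0.3):
        # sub_spans / asr_spans: (start, end) times in seconds for the same
        # anchor sentences, in the same order; thresholds are illustrative.
        start_offsets = [a_s - s_s
                         for (s_s, _), (a_s, _) in zip(sub_spans, asr_spans)]
        duration_gaps = [abs((a_e - a_s) - (s_e - s_s))
                         for (s_s, s_e), (a_s, a_e) in zip(sub_spans, asr_spans)]

        shift = median(start_offsets)    # candidate constant time translation
        jitter = pstdev(start_offsets)   # spread of the offsets around that shift

        if jitter <= max_jitter and max(duration_gaps) <= max_duration_gap:
            # Case (c): usable after shifting every subtitle cue by `shift`.
            return "usable_after_shift", shift
        # Cases (a)/(b): cut segments or altered speed; no single shift fits.
        return "unusable", None

Under this reading, clips in case (c) would be retained after shifting all subtitle timestamps by the returned offset, while cases (a) and (b) would require re-alignment by other means or be dropped.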