MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu

TL;DR

MAViD introduces a Conductor-Creator framework that decouples multimodal dialogue understanding from joint audio-visual (AV) content generation. The Conductor takes text, audio, and video as input and produces speech and motion instructions, while the Creator uses an autoregressive-diffusion hybrid to generate long-duration, synchronized AV content, aided by a fusion module that links consecutive clips and modalities. The approach targets two limitations of prior work: two-stage AV generation and short clip lengths. Experiments show that MAViD can produce ~30-second AV content with consistent identity, timbre, and tone, including general environmental sounds.

Abstract

We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.

Paper Structure

This paper contains 11 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of MAViD. Left: The Conductor-Creator architecture. The Conductor takes users' inquiries across text, audio, and video as input, understands them, and outputs textual instructions. To achieve fine-grained control over video generation, these textual instructions are further decoupled into speech-oriented and motion-oriented instructions. In the Creator, the decoupled instructions guide the joint audio-video generation. Specifically, we employ a structure combining autoregressive (AR) and diffusion models to model long sequences and maintain visual quality. Right: To ensure consistency and coherence in long-sequence joint audio-video generation, we propose a fusion module that integrates features from both the AR and diffusion models. The figure illustrates the use of attention for interaction among interleaved audio and video clips, with yellow parts indicating where attention needs to be computed.
  • Figure 2: Visual comparison of long videos. For each method, we generate approximately 600 frames of video using the same audio and video prompts. Ovi [low2025ovi] and Universe-1 [wang2025universe] (Uni.) employ multiple rounds of inference, using the last frame of the previous clip as the reference image. Different audio colors represent variations in timbre and tone.
  • Figure 3: Ablation experiment of the fusion module.
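
The Figure 1 caption describes attention computed among interleaved audio and video clips, with only some positions attended. The exact pattern is not given in this summary, so the following is only a minimal sketch under one assumed pattern: block-causal attention, in which every token of a clip may attend to all tokens (audio and video) of the same clip and of earlier clips. The function name, its parameters, and the per-clip token counts are hypothetical and are not taken from the paper.

```python
def interleaved_block_causal_mask(num_clips, audio_len, video_len):
    """Build a boolean attention mask over an interleaved token sequence.

    ASSUMPTION (not from the paper): tokens are laid out as
    [audio clip 0, video clip 0, audio clip 1, video clip 1, ...],
    and a query token may attend to any key token whose clip index
    is less than or equal to its own (block-causal over clips).
    """
    # Label every position with its modality and clip index.
    labels = []
    for clip in range(num_clips):
        labels += [("audio", clip)] * audio_len
        labels += [("video", clip)] * video_len

    # mask[q][k] is True where attention from query q to key k is computed.
    mask = [
        [q_clip >= k_clip for (_, k_clip) in labels]
        for (_, q_clip) in labels
    ]
    return mask, labels


if __name__ == "__main__":
    mask, labels = interleaved_block_causal_mask(num_clips=2, audio_len=1, video_len=2)
    # '#' marks positions where attention is computed, '.' marks masked positions.
    for row in mask:
        print("".join("#" if allowed else "." for allowed in row))
```

Under this assumption, audio and video tokens of the same clip see each other fully, which is one simple way to couple the two modalities, while later clips can look back at earlier ones to keep identity and timbre consistent. The paper's fusion module may use a different pattern.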