
InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

Dongwei Pan, Longwei Guo, Jiazhi Guan, Luying Huang, Yiding Li, Haojie Liu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou

Abstract

Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and to provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that synthesizes naturalistic interactive dynamics by querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework distills linguistic intent from audio to determine the timing and appropriateness of reactions. To improve lip synchronization and spatial consistency under extreme head poses, we further propose Role-aware Dyadic Gaussian Guidance (RoDG). Finally, we introduce a dedicated evaluation suite with newly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.

Paper Structure

This paper contains 36 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Dyadic-Conversational Video Generated by InterDyad. Our method generates visual-audio synchronized conversational videos conditioned on a single reference frame of two subjects and dual-track driving audio, either replicating interaction patterns from a reference video (yellow results) or synthesizing plausible interactions directly from the driving audio (blue results).
  • Figure 2: Overview of our InterDyad framework. Top: The overall architecture is constructed by sequentially stacking Transformer blocks and trained via denoising to synthesize dyadic conversational videos with synchronized audio-visual human dynamics and coherent inter-subject interactions. Bottom: We illustrate the interactivity injection mechanism, which leverages switchable multimodal inputs to enable rich and controllable synthesis of interactive motion patterns.
  • Figure 3: RoDG. RoDG uses dual-track VAD to assign Speaker and Listener over time, boosting audio guidance on the speaker's lip Gaussian while suppressing the listener to avoid cross-talk.
  • Figure 4: DI-Sync. The audio union of prosodic emphasis and the video union of reactive behaviors are extracted by an MLLM, and the Temporal Intersection of Union (TIoU) between them is then calculated.
  • Figure 5: Qualitative comparison with existing baselines. The figure illustrates the generated inter-subject dynamics from different methods, where the numbers in red dots indicate three different timesteps. Accumulated motion heatmaps are provided in the bottom row to visualize reaction intensity.
  • ...and 2 more figures
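
The Figure 4 caption describes DI-Sync as a temporal overlap score between two sets of time intervals: one from the audio (prosodic emphasis) and one from the video (reactive behaviors). The following is a minimal sketch of that computation, not the authors' implementation. It assumes TIoU is the overlap duration divided by the union duration of the two interval sets, and that the intervals are already available as `(start, end)` pairs in seconds. The MLLM extraction step is not reproduced, and all function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Return the union of the intervals as a sorted list of disjoint spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if end <= start:
            continue  # skip empty or inverted spans
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> float:
    """Total duration covered by a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection of two sorted, disjoint interval lists (two-pointer sweep)."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever span finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def temporal_iou(audio: List[Interval], video: List[Interval]) -> float:
    """Assumed TIoU: overlap duration over union duration of the two unions."""
    audio_union = merge_intervals(audio)
    video_union = merge_intervals(video)
    inter = total_length(intersect(audio_union, video_union))
    union = total_length(audio_union) + total_length(video_union) - inter
    return inter / union if union > 0 else 0.0


# Hand-written example intervals (illustrative only, not from the paper).
audio_emphasis = [(0.0, 2.0), (1.0, 3.0), (5.0, 6.0)]
video_reactions = [(2.0, 4.0), (5.5, 7.0)]
score = temporal_iou(audio_emphasis, video_reactions)
```

Under this assumed definition, a score near 1 means the listener's reactions coincide with the speaker's emphasized speech, while a score near 0 means the two rarely overlap.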
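
The Figure 3 caption states that RoDG assigns Speaker and Listener roles from dual-track VAD, then boosts audio guidance on the speaker's lip Gaussian and suppresses it for the listener. The sketch below illustrates that idea for a single frame under loudly stated assumptions: a subject is the speaker whenever its own VAD track is active, the "lip Gaussian" is an isotropic 2D Gaussian centered on a given lip position, and the boost and suppress factors (`2.0` and `0.0`) are placeholder values. None of these specifics come from the paper, and every name here is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def assign_roles(vad_a: Sequence[bool], vad_b: Sequence[bool]) -> List[Tuple[str, str]]:
    """Per-frame roles for subjects (A, B) from two VAD tracks.

    Assumption: a subject is the speaker whenever its own track is active.
    """
    return [
        ("speaker" if a else "listener", "speaker" if b else "listener")
        for a, b in zip(vad_a, vad_b)
    ]


def guidance_scale(role: str, boost: float = 2.0, suppress: float = 0.0) -> float:
    """Role-dependent multiplier for audio guidance (placeholder values)."""
    return boost if role == "speaker" else suppress


def lip_gaussian(height: int, width: int, center: Tuple[int, int], sigma: float) -> List[List[float]]:
    """Isotropic 2D Gaussian peaking at 1.0 on the (row, col) lip center."""
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def role_aware_guidance_map(
    height: int,
    width: int,
    lip_centers: Sequence[Tuple[int, int]],
    roles: Sequence[str],
    sigma: float = 2.0,
) -> List[List[float]]:
    """Sum each subject's lip Gaussian, weighted by its role-dependent scale."""
    total = [[0.0] * width for _ in range(height)]
    for center, role in zip(lip_centers, roles):
        scale = guidance_scale(role)
        gauss = lip_gaussian(height, width, center, sigma)
        for y in range(height):
            for x in range(width):
                total[y][x] += scale * gauss[y][x]
    return total


# Illustrative two-frame example: A speaks in frame 0, B speaks in frame 1.
roles = assign_roles([True, False], [False, True])
frame0_map = role_aware_guidance_map(8, 16, [(4, 3), (4, 12)], roles[0])
```

In frame 0 the resulting map peaks at subject A's lip center and is close to zero around subject B's, which is one way to read the caption's goal of avoiding cross-talk between the two audio tracks.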