Talking Together: Synthesizing Co-Located 3D Conversations from Audio

Mengyi Shan, Shouchieh Chang, Ziqian Bai, Shichen Liu, Yinda Zhang, Luchuan Song, Rohit Pandey, Sean Fanello, Zeng Huang

TL;DR

This work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues, and it significantly outperforms existing baselines in perceived realism and interaction coherence.

Abstract

We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker role embeddings and inter-speaker cross-attention mechanisms designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.
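
As an illustration of what a mutual-gaze objective could look like, the sketch below penalizes the angle between each participant's gaze direction and the direction toward the other participant's eyes. It is not the paper's formulation, which is not given in this summary. The function name `mutual_gaze_loss`, the per-frame inputs (eye positions and gaze directions in a shared 3D frame), and the cosine-based penalty are all assumptions made for the example.

```python
import math


def _normalize(v):
    """Return v scaled to unit length; reject zero-length vectors."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [c / norm for c in v]


def _cosine(u, v):
    """Cosine of the angle between two 3D vectors."""
    u, v = _normalize(u), _normalize(v)
    return sum(a * b for a, b in zip(u, v))


def mutual_gaze_loss(eye_a, gaze_a, eye_b, gaze_b):
    """Hypothetical single-frame mutual-gaze penalty (an assumption, not the paper's loss).

    eye_a, eye_b:   3D eye positions of the two participants in a shared frame.
    gaze_a, gaze_b: 3D gaze direction vectors (any non-zero length).

    Returns 0.0 when each gaze points exactly at the other participant's eyes
    and grows to 4.0 when both participants look directly away.
    """
    a_to_b = [b - a for a, b in zip(eye_a, eye_b)]
    b_to_a = [-c for c in a_to_b]
    return (1.0 - _cosine(gaze_a, a_to_b)) + (1.0 - _cosine(gaze_b, b_to_a))
```

A term of this kind only makes sense once both heads live in one coordinate system, which is why modeling relative position and orientation is a precondition for supervising eye contact at all.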

Paper Structure

This paper contains 29 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Our method in context. Left: Prior work generates isolated participants, such as Speaker-Only models that lack listener reactions, Listener-Only models that do not model multi-round interactions, or Conversation models that resemble a "video conference" call, failing to model a shared 3D space, spatial layout, or real, physically meaningful eye interaction. Right: Our method takes a single mixed conversational audio stream and a text-based environment prompt to generate the complete, co-located 3D performance of both participants. Our model is the first to explicitly synthesize this crucial spatial relationship, enabling realistic outputs with natural mutual gaze and a controllable spatial layout.
  • Figure 1: Comparison of 3D talking head datasets. Existing datasets differ in scale, interaction diversity, and lip accuracy. Our Dyadic Conversation and Synthetic Dubbing datasets combine large-scale interactive scenes with accurate lip motion and identity consistency, enabling joint learning of interaction and high-fidelity speech animation.
  • Figure 2: An overview of our dual-stream diffusion architecture. The model employs a shared U-Net backbone to process the noisy input streams for both participants in parallel. It is conditioned on features from the mixed audio, learnable role embeddings, and the speaker probability masks. Dual-speaker cross-attention layers within the decoder allow the two streams to exchange information, modeling the interaction dynamics. The network outputs the predicted 3D animation parameters for expression ($\boldsymbol{\psi}$), rotation ($\boldsymbol{\theta}$), and translation ($\boldsymbol{t}$). For conversation data, we apply losses to all expression, rotation, and translation predictions, with an auxiliary eye gaze loss on samples with large-scale head movements. For synthetic data, we fine-tune with losses only on the speaker's lip parameters.
  • Figure 3: Our Data Curation Pipeline. This figure illustrates our two-pronged approach to dataset creation. (a) Dyadic Conversation Dataset: We process raw conversational videos through various filters to reconstruct 3D facial parameters and extract speaker masks. (b) Synthetic Dubbing Dataset: We generate clean pseudo-conversations by taking single-person videos, applying random cuts, and re-assembling them into new alternating-speaker audio tracks. (c) Dataset Example: A sample of 3D reconstructions overlaid on videos from our curated dyadic dataset.
  • Figure 4: Distribution visualization of the conversational dataset.
  • ...and 1 more figures
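
The synthetic dubbing step described in Figure 3(b), cutting single-person recordings at random points and re-assembling them into an alternating-speaker track, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: tracks are plain Python lists of samples, the helper name `make_pseudo_conversation` and its parameters are hypothetical, and the returned 0/1 mask is only a stand-in for the speaker masks mentioned in the captions.

```python
import random


def make_pseudo_conversation(track_a, track_b, min_len, max_len, seed=0):
    """Interleave random-length cuts from two single-speaker tracks.

    Hypothetical helper (not from the paper). Returns (mixed, mask), where
    mask[i] is 0 if sample i came from track_a and 1 if it came from track_b.
    Assembly stops as soon as the speaker whose turn it is has no samples left.
    """
    if min_len < 1 or max_len < min_len:
        raise ValueError("require 1 <= min_len <= max_len")

    rng = random.Random(seed)      # seeded for reproducible cuts
    tracks = (track_a, track_b)
    cursors = [0, 0]               # next unread sample in each track
    mixed, mask = [], []
    speaker = 0                    # speaker A takes the first turn

    while cursors[speaker] < len(tracks[speaker]):
        cut = rng.randint(min_len, max_len)          # random turn length
        start = cursors[speaker]
        segment = tracks[speaker][start:start + cut]
        mixed.extend(segment)
        mask.extend([speaker] * len(segment))
        cursors[speaker] = start + len(segment)
        speaker = 1 - speaker                        # hand over the turn

    return mixed, mask


# Usage: two toy "recordings" with distinguishable samples.
mixed, mask = make_pseudo_conversation(["a"] * 10, ["b"] * 10, 2, 4, seed=1)
```

Because each cut comes from a clean single-speaker source, the active speaker is known exactly at every sample, which is consistent with the caption's description of these tracks as clean pseudo-conversations.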