DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani

TL;DR

DiTaiListener addresses the challenge of generating high-fidelity, controllable listener video in dyadic conversations with a diffusion-transformer pipeline that directly synthesizes listener portraits in pixel space. It introduces a Causal Temporal Multimodal Adapter (CTM-Adapter) to fuse speaker audio $X_s$, speaker motion $X_m$, listener identity $X_i$, and optional text $X_t$ under temporal causality, and develops a long-form pipeline with DiTaiListener-Gen and DiTaiListener-Edit to produce seamless, extended videos. The approach achieves state-of-the-art results on RealTalk and ViCo in photorealism and motion representations, as validated by quantitative metrics and user studies, and supports text-guided customization of listener behavior. By enabling end-to-end, high-fidelity, and controllable listener synthesis, this work advances realistic virtual avatars for interactive systems, social robotics, and HCI, while noting areas for improvement in inference efficiency and ethical safeguards. Future work includes expanding behavioral diversity, speeding up sampling, and enriching contextual cues for more responsive interactions.

Abstract

Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on ViCo) spaces. User studies confirm the superior performance of DiTaiListener: participants clearly preferred the model in terms of feedback, diversity, and smoothness, and it outperformed competitors by a significant margin.
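
To illustrate what integrating speaker input "in a causal manner" could mean, the following is a minimal sketch of masked cross-attention in which listener frame i may only attend to speaker features at steps 0..i. It is not the CTM-Adapter itself: the one-feature-per-frame alignment, the single attention head, and the helper names `causal_mask` and `masked_cross_attention` are all assumptions made for illustration.

```python
import math


def causal_mask(num_frames):
    """mask[i][j] is True when listener frame i may see speaker step j.

    Assumes (hypothetically) one speaker feature per listener frame.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_cross_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention restricted by `mask`.

    queries: listener-frame vectors; keys/values: speaker-step vectors.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Keep only the speaker steps this frame is allowed to attend to.
        allowed = [j for j in range(len(keys)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed steps.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed)) / total
            for d in range(len(values[0]))
        ])
    return outputs
```

Under this masking, changing a speaker feature at a later step leaves the outputs for all earlier listener frames unchanged, which is the property the word "causal" refers to here.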

Paper Structure

This paper contains 15 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of DiTaiListener. a) Given the listener's appearance (reference frame), the speaker's motion (encoded via EMOCA 3DMM coefficients), the speaker's speech (Wav2Vec2), and an input text control, DiTaiListener learns to generate listener face and head motions in pixel space through a video diffusion model powered by a modified DiT. b) We introduce a Causal Temporal Multimodal Adapter for seamless integration of multimodal speaker input in a temporally causal manner. c) Our long video generation pipeline consists of two video generation models. DiTaiListener-Gen generates video blocks, which are fused by the DiTaiListener-Edit model. DiTaiListener-Edit smooths the transition between two blocks, improving smoothness and reducing computational cost compared to existing long video generation strategies, e.g., prompt traveling and teacher forcing.
  • Figure 2: Qualitative Comparison on ViCo test set. Our method generates high-quality, photorealistic facial images with diverse and natural social behaviors, including head movements and blinks, whereas baseline methods often produce less varied and expressive responses.
  • Figure 3: Listener generation from DiTaiListener on out-of-domain identities. Our method can integrate expressions from text conditions and synthesize diverse responses to the speakers.
  • Figure 4: Qualitative comparison of long video generation. Our method generates smoother videos with fewer transition artifacts compared to prompt traveling and teacher forcing methods.
  • Figure 5: An example screenshot of the user study survey. The methods are anonymized as A, B, C, D, E, and their order is randomized.
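
The two-model long-video pipeline in Figure 1c (blocks generated independently, then transitions between adjacent blocks refined) can be sketched as a frame-range planner. This is only an illustration of the scheduling idea: the function name `plan_segments`, the parameters `block_len` and `transition_len`, and the choice to center each transition window on a block boundary are assumptions, and the actual block and window sizes are not given in this summary.

```python
def plan_segments(total_frames, block_len, transition_len):
    """Plan frame ranges for a generate-then-refine long-video pipeline.

    Returns (blocks, transitions); every range is (start, end), end exclusive.
    Blocks would be produced by a generation model; each transition window
    straddles the boundary between two adjacent blocks and would be handed
    to a video-to-video refinement model. All sizes are hypothetical.
    """
    if total_frames <= 0 or block_len <= 0:
        raise ValueError("total_frames and block_len must be positive")
    if transition_len < 0 or transition_len > block_len:
        raise ValueError("transition_len must be in [0, block_len]")

    # Non-overlapping blocks; the last one may be shorter.
    blocks = [
        (start, min(start + block_len, total_frames))
        for start in range(0, total_frames, block_len)
    ]

    # One refinement window per boundary, centered on the boundary frame.
    half = transition_len // 2
    transitions = []
    for _, boundary in blocks[:-1]:
        start = max(0, boundary - half)
        end = min(total_frames, start + transition_len)
        transitions.append((start, end))
    return blocks, transitions


blocks, transitions = plan_segments(total_frames=100, block_len=40, transition_len=8)
print(blocks)       # [(0, 40), (40, 80), (80, 100)]
print(transitions)  # [(36, 44), (76, 84)]
```

Because only the short windows around block boundaries are reprocessed, a design along these lines touches far fewer frames in the second pass than regenerating the full video would.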