DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion
Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani
TL;DR
DiTaiListener addresses the challenge of generating high-fidelity, controllable listener videos in dyadic conversations with a diffusion-transformer pipeline that synthesizes listener portraits directly in pixel space. It introduces a Causal Temporal Multimodal Adapter (CTM-Adapter) that fuses speaker audio $X_s$, speaker motion $X_m$, listener identity $X_i$, and optional text $X_t$ under temporal causality, and a long-form pipeline in which DiTaiListener-Gen and DiTaiListener-Edit together produce seamless, extended videos. The approach achieves state-of-the-art results on RealTalk and ViCo in both photorealism and motion representation, as validated by quantitative metrics and user studies, and supports text-guided customization of listener behavior. By enabling end-to-end, high-fidelity, and controllable listener synthesis, this work advances realistic virtual avatars for interactive systems, social robotics, and HCI, while noting room for improvement in inference efficiency and ethical safeguards. Future work includes expanding behavioral diversity, speeding up sampling, and enriching contextual cues for more responsive interactions.
Abstract
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input into the video generation process in a causal manner to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a video-to-video diffusion model for transition refinement. It fuses the short video segments produced by DiTaiListener-Gen into smooth, continuous videos, keeping facial expressions and image quality temporally consistent across segment boundaries. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (a 73.8% improvement in FID on RealTalk) and motion representation (a 6.1% improvement in the FD metric on ViCo). User studies confirm this result: participants clearly preferred DiTaiListener in terms of feedback, diversity, and smoothness, by a significant margin over competing methods.
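To make the idea of integrating speaker input "in a causal manner" concrete, the following is a minimal sketch of causally masked cross-attention, in which the listener feature at frame t attends only to speaker features at frames up to and including t. It is an illustration under stated assumptions, not the CTM-Adapter itself: the function names `causal_mask` and `causal_cross_attention` are hypothetical, listener and speaker frames are assumed to be aligned one-to-one, and the learned projections, multi-head structure, and multimodal fusion of a real adapter are omitted.

```python
import math


def causal_mask(num_listener_frames, num_speaker_frames):
    """Return mask[t][s], True when listener frame t may attend to speaker frame s.

    Assumption for this sketch: frames are aligned one-to-one, so "causal"
    means s <= t (no access to future speaker frames).
    """
    return [[s <= t for s in range(num_speaker_frames)]
            for t in range(num_listener_frames)]


def causal_cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention restricted to past and present frames.

    queries: listener features, one vector per frame.
    keys, values: speaker features, one vector per frame (non-empty).
    """
    dim = len(queries[0])
    mask = causal_mask(len(queries), len(keys))
    outputs = []
    for t, query in enumerate(queries):
        # Keep only the speaker frames that the mask allows for frame t.
        allowed = [s for s in range(len(keys)) if mask[t][s]]
        scores = [
            sum(a * b for a, b in zip(query, keys[s])) / math.sqrt(dim)
            for s in allowed
        ]
        # Numerically stable softmax over the allowed frames.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[s][d] for w, s in zip(weights, allowed))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy example: three aligned frames with two-dimensional features.
listener = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speaker_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speaker_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = causal_cross_attention(listener, speaker_keys, speaker_values)
```

In this toy setup the first listener frame can see only the first speaker frame, so its output equals that frame's value vector, and changing a later speaker frame leaves all earlier listener outputs unchanged. That is the property a causal design relies on to keep listener responses from anticipating speaker behavior that has not yet occurred.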
