DiffListener: Discrete Diffusion Model for Listener Generation

Siyeol Jung; Taehwan Kim

TL;DR

DiffListener addresses the listener head generation problem by removing autoregressive dependence and employing a discrete diffusion process over a fixed codebook learned by a VQ-VAE. It conditions listener motion on rich speaker cues, including facial expressions, audio, and text, enhanced with a novel facial differential modality to capture temporal rhythms. The approach combines a two-stage VQ-VAE quantization of listener motions with a VQ-Diffusion framework and a fusion network that integrates multimodal speaker information, enabling longer, identity-specific listener sequences with maintained coherence. Empirical results on Trevor and Stephen demonstrate state-of-the-art L2, FD, and P-FD metrics and favorable human judgments, while ablations confirm the value of differential facial information and textual context for natural, context-aware reactions.

Abstract

The listener head generation (LHG) task aims to generate natural nonverbal listener responses based on the speaker's multimodal cues. Prior work either relies on limited modalities (e.g., audio and facial information) or employs autoregressive approaches, which suffer from limitations such as accumulating prediction errors. To address these limitations, we propose DiffListener, a discrete-diffusion-based approach for non-autoregressive listener head generation. Our model takes the speaker's facial information, audio, and text as inputs, and additionally incorporates facial differential information to represent the temporal dynamics of expressions and movements. With this explicit modeling of facial dynamics, DiffListener can generate coherent reaction sequences in a non-autoregressive manner. Through comprehensive experiments, DiffListener demonstrates state-of-the-art performance in both quantitative and qualitative evaluations. A user study shows that DiffListener generates natural, context-aware listener reactions that are well synchronized with the speaker. The code and demo videos are available at https://siyeoljung.github.io/DiffListener
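
As a rough illustration of what "facial differential information" could look like in practice, the sketch below computes frame-to-frame differences of per-frame facial feature vectors. This is only an assumption about the general idea (a first-order temporal difference); the abstract does not specify the exact formulation, and the function name `facial_differential`, the zero-padding of the first frame, and the toy values are all hypothetical.

```python
from typing import List, Sequence


def facial_differential(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return frame-to-frame changes of per-frame facial feature vectors.

    ASSUMPTION: a plain first-order difference, delta[t] = frames[t] - frames[t-1].
    The first frame has no predecessor, so its row is all zeros, which keeps
    the output the same length as the input sequence.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimensionality")
    diffs = [[0.0] * dim]  # zero row for frame 0
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for p, c in zip(prev, cur)])
    return diffs


# Toy 2-D feature track over four frames (values are made up).
speaker = [[0.0, 1.0], [0.5, 1.0], [1.5, 0.0], [1.5, 0.0]]
delta = facial_differential(speaker)
```

In this toy track the differences are largest between the second and third frames and vanish once the face stops moving, which is the kind of temporal rhythm such a signal is meant to expose to a non-autoregressive generator.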
