CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation

Xi Liu, Ying Guo, Cheng Zhen, Tong Li, Yingying Ao, Pengfei Yan

TL;DR

CustomListener addresses the limitation of fixed-emotion-driven listener models by enabling text-guided, user-customizable listener generation. The framework combines a Static to Dynamic Portrait (SDP) module for speaker-listener coordination with a Past Guided Generation (PGG) module to maintain long-term coherence, implemented atop a diffusion-based generator conditioned on time-varying portrait tokens and motion priors. Extensive experiments on ViCo and RealTalk show state-of-the-art performance in motion realism, synchronization with the speaker, and cross-segment consistency, validated by both quantitative metrics and user studies. The approach leverages GPT-driven text priors, RoBERTa encodings, 3DMM-based representations, and a PIRenderer-based rendering pipeline to deliver controllable, interactive listening heads with practical applicability in virtual interactions and HCI settings.

Abstract

Listening head generation aims to synthesize a non-verbal, responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversation. Applications of listener agents in virtual interaction have motivated many works on diverse and fine-grained motion generation. However, these methods can only manipulate motions through simple emotional labels and cannot freely control the listener's motions. Because listener agents should have human-like attributes (e.g., identity, personality) that users can freely customize, this restriction limits their realism. In this paper, we propose a user-friendly framework called CustomListener for listener generation guided by free-form text priors. To achieve speaker-listener coordination, we design a Static to Dynamic Portrait (SDP) module, which interacts with speaker information to transform the static text into dynamic portrait tokens with completion rhythm and amplitude information. To achieve coherence between segments, we design a Past Guided Generation (PGG) module that maintains the consistency of customized listener attributes through a motion prior, and we use a diffusion-based structure conditioned on the portrait tokens and the motion prior to realize controllable generation. To train and evaluate our model, we constructed two text-annotated listening head datasets based on ViCo and RealTalk, which provide text-video paired labels. Extensive experiments verify the effectiveness of our model.

Paper Structure (46 sections, 11 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: The process of text-guided listener generation in our CustomListener. The text prior provides the basic portrait style of the listener, which is input into CustomListener and combined with the speaker's information to obtain the listener's motions.
  • Figure 2: Overall framework of CustomListener. Given the text prior providing the listener's static portrait style, the SDP module transforms the static portrait into a dynamic one. Then, in the PGG module, the dynamic portrait tokens are combined with the motion prior generated by the Past-guided Module and used as conditions of the diffusion-based structure to realize controllable generation. The 'C' in the figure denotes concatenation, and the small pink squares denote the diffusion time-step token.
  • Figure 3: Illustration of Audio-text Responsive Interaction. We first generate a weight matrix by responsive interaction between the audio features and the static portrait token (SP Token), and then generate the time-dependent portrait token (TP Token) guided by the weight matrix.
  • Figure 4: Visual results produced by CustomListener. All listener videos are generated conditioned on different pre-set text priors, the same speaker video (the 1st row) and the same reference listener image.
  • Figure 5: Qualitative comparisons with PCH [pchg] and RLHG [vico], conditioned on the same speaker and the same listener reference image.
  • ...and 9 more figures
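
The Figure 3 caption describes a two-step flow: a weight matrix is computed from the audio features and the static portrait tokens, and that matrix then guides the time-dependent portrait tokens. The sketch below is only an illustration of that flow, not the paper's implementation. It assumes scaled dot-product attention with a softmax as the "responsive interaction", which the caption does not specify, and the function name `time_dependent_tokens` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_dependent_tokens(audio_feats, sp_tokens):
    """Hypothetical sketch of the Figure 3 flow.

    audio_feats: T audio frames, each a d-dim list (speaker side).
    sp_tokens:   N static portrait tokens, each a d-dim list (text side).
    Returns (weights, tp_tokens): a T x N weight matrix and T d-dim
    time-dependent tokens, one per audio frame.

    ASSUMPTION: scaled dot-product attention stands in for the paper's
    interaction operator, which the caption does not define.
    """
    dim = len(sp_tokens[0])
    scale = math.sqrt(dim)
    weights, tp_tokens = [], []
    for frame in audio_feats:
        # Similarity of this audio frame to every static portrait token.
        scores = [
            sum(a * s for a, s in zip(frame, token)) / scale
            for token in sp_tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted mix of static tokens gives the token for this frame.
        tp_tokens.append([
            sum(row[n] * sp_tokens[n][k] for n in range(len(sp_tokens)))
            for k in range(dim)
        ])
    return weights, tp_tokens


# Toy inputs: 3 audio frames and 2 static portrait tokens, all 2-dim.
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sp_tokens = [[2.0, 0.0], [0.0, 2.0]]
weights, tp_tokens = time_dependent_tokens(audio_feats, sp_tokens)
```

Each row of `weights` sums to one, so every time-dependent token is a convex combination of the static tokens. That is one simple way for a fixed text description to vary over time with the speaker's audio.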