
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng, Bo Zheng

TL;DR

UniLS is an end-to-end framework for audio-driven avatars that both speak and listen. It addresses stiff listening motion with a two-stage training paradigm: it first learns an internal motion prior with an audio-free autoregressive generator, then fine-tunes with dual-track audio to modulate this prior for speaking and listening. The approach achieves state-of-the-art speaking accuracy and improves listening diversity and naturalness, with gains of up to 44.1% in distributional listening metrics, while maintaining real-time performance. The result is a practical, high-fidelity framework for interactive digital humans driven solely by audio.

Abstract

Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation requires the speaker's motion as an extra input to produce the listener's motion. That design is not end-to-end, which hinders real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven only by dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
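The two-stage idea in the abstract can be illustrated with a deliberately tiny toy model. The sketch below is not the UniLS architecture: it replaces the autoregressive generator with a scalar AR(1) model, replaces audio conditioning with two linear gains, and keeps the prior fixed in the second stage instead of fine-tuning it. All names (`simulate`, `fit_prior`, `fit_audio_gains`) and all data are hypothetical and synthetic.

```python
import random


def simulate(rng, n, a, b_own=0.0, b_other=0.0, noise=0.1):
    """Synthetic 1-D motion: AR(1) prior plus optional audio-driven terms."""
    own = [rng.gauss(0, 1) for _ in range(n)]      # stand-in for own speech features
    other = [rng.gauss(0, 1) for _ in range(n)]    # stand-in for the partner's speech
    motion = [0.0]
    for t in range(1, n):
        motion.append(a * motion[-1] + b_own * own[t]
                      + b_other * other[t] + rng.gauss(0, noise))
    return motion, own, other


def fit_prior(motion):
    """Stage 1 (toy): least-squares AR(1) coefficient, using no audio at all."""
    num = sum(cur * prev for prev, cur in zip(motion, motion[1:]))
    den = sum(prev * prev for prev in motion[:-1])
    return num / den


def fit_audio_gains(motion, own, other, a):
    """Stage 2 (toy): fit two audio gains on what the fixed prior leaves unexplained."""
    resid = [cur - a * prev for prev, cur in zip(motion, motion[1:])]
    s, o = own[1:], other[1:]
    ss = sum(x * x for x in s)
    oo = sum(x * x for x in o)
    so = sum(x * y for x, y in zip(s, o))
    sr = sum(x * y for x, y in zip(s, resid))
    orr = sum(x * y for x, y in zip(o, resid))
    det = ss * oo - so * so                        # 2x2 normal equations
    return (sr * oo - so * orr) / det, (ss * orr - so * sr) / det


rng = random.Random(0)
# Stage 1 data: motion with no audio influence (assumed true prior a = 0.9).
free_motion, _, _ = simulate(rng, 5000, a=0.9)
a_hat = fit_prior(free_motion)
# Stage 2 data: paired clips where both audio tracks nudge the motion.
conv_motion, own, other = simulate(rng, 5000, a=0.9, b_own=0.5, b_other=0.1)
g_own, g_other = fit_audio_gains(conv_motion, own, other, a_hat)
print(f"prior={a_hat:.3f} own_gain={g_own:.3f} other_gain={g_other:.3f}")
```

The point of the toy is the ordering: the prior is estimated from audio-free data first, so the weak partner-audio gain only has to explain what the prior leaves over.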

Paper Structure

This paper contains 22 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison between previous methods and our proposed approach. Most previous studies remain one-way, i.e., speak-only or listen-only. The previous speak–listen method [peng2025dualtalk] requires generating speaker A’s facial sequence before producing speaker B’s motions. This speaker-A generation step makes it non-end-to-end and prevents real-time use. In contrast, our method provides an end-to-end framework for unified, real-time speak–listen motion generation.
  • Figure 2: Correlation between facial expression parameters [FLAME2017] and corresponding audio features [baevski2020wav2vec]. For speaking, the audio is the speaker’s own speech. For listening, the audio comes from the other speaker’s speech.
  • Figure 3: Overview of our two-stage training strategy. Stage 1 trains an autoregressive free generator on unpaired multi-scenario video data without using audio. Given past motions and a style embedding, the model predicts future free motion chunks. Stage 2 fine-tunes the generator on paired conversational clips by conditioning on the audio of speaker A and speaker B through cross-attention, producing audio-driven speak–listen motions.
  • Figure 4: Qualitative comparison on listening motions. Red rectangles highlight motion stiffness over time. Additional qualitative evaluation results are available in the supplementary materials.
  • Figure 5: Qualitative comparison on speaking motions. Our method shows better alignment with the ground truth in expression style and lip synchronization. Additional qualitative evaluation results are available in the supplementary materials.
  • ...and 1 more figure
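The kind of comparison described in the Figure 2 caption can be sketched as a Pearson correlation between a motion track and the audio paired with it. The example below uses purely synthetic data with made-up coefficients, so its numbers say nothing about the paper's measurements; it only shows how a motion that mostly follows its own history ends up weakly correlated with audio. The names `pearson`, `speaking`, and `listening` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)
n = 2000
own_audio = [rng.gauss(0, 1) for _ in range(n)]    # speaker's own speech feature
other_audio = [rng.gauss(0, 1) for _ in range(n)]  # partner's speech feature

# Assumed speaking motion: follows the speaker's own audio closely.
speaking = [0.8 * a + rng.gauss(0, 0.3) for a in own_audio]

# Assumed listening motion: mostly its own history, lightly nudged by the partner.
listening = [0.0]
for t in range(1, n):
    listening.append(0.9 * listening[-1] + 0.1 * other_audio[t] + rng.gauss(0, 0.3))

corr_speak = pearson(own_audio, speaking)
corr_listen = pearson(other_audio, listening)
print(f"speaking vs own audio:    {corr_speak:.2f}")
print(f"listening vs other audio: {corr_listen:.2f}")
```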