Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation

Dingcheng Zhen, Shunshun Yin, Shiyang Qin, Hou Yi, Ziwei Zhang, Siyuan Liu, Gan Qi, Ming Tao

TL;DR

Teller addresses real-time, audio-driven talking head animation with an autoregressive streaming framework that jointly models facial and body motion. It splits generation into two stages. Facial Motion Latent Generation (FMLG) uses residual vector quantization to convert motion latents into discrete tokens and an AR transformer to map live audio to those tokens. An Efficient Temporal Module (ETM), based on a 3D U-Net, then refines the result temporally. The approach runs at up to 25 FPS and generates one second of video in 0.92 s, versus 20.93 s for the diffusion-based Hallo. The authors report superior fidelity and lip synchronization on the HDTF and RAVDESS benchmarks, supported by quantitative and human evaluations. The work points to a practical, efficient path toward realistic, interactive talking heads and shows compatibility with existing multimodal language models.
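To make the residual vector quantization step concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the two-stage codebooks, the 2-D latent, and the function names (`nearest`, `rvq_encode`, `rvq_decode`) are illustrative assumptions. It shows only the general idea, in which each stage quantizes the residual left by the previous stage and emits one discrete token.

```python
# Toy residual vector quantization (RVQ) in plain Python.
# The codebooks and the 2-D latent below are illustrative assumptions,
# not values or code from the Teller paper.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(latent, codebooks):
    """Emit one token per stage; each stage quantizes the previous residual."""
    residual = list(latent)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the latent by summing the selected entry from every stage."""
    recon = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        recon = [r + c for r, c in zip(recon, codebook[idx])]
    return recon


TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.5, 0.5], [0.5, -0.5], [-0.5, 0.5]],    # fine stage
]

tokens = rvq_encode([1.4, 0.6], TOY_CODEBOOKS)
recon = rvq_decode(tokens, TOY_CODEBOOKS)
print(tokens, recon)
```

In this toy setup the latent `[1.4, 0.6]` becomes the token pair `[3, 2]`, which decodes to `[1.5, 0.5]`. A learned model would train its codebooks and operate on much higher-dimensional motion latents, but the token sequence it hands to the AR transformer has the same form: one index per quantization stage.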

Abstract

In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (Hallo 20.93s vs. Teller 0.92s for one second of video generation), and achieves a real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially in small movements, as validated by human evaluations with a significant margin in quality and realism.
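The timing figures quoted above can be turned into a quick back-of-the-envelope check. The sketch below uses only the numbers from the abstract (20.93 s and 0.92 s per second of video, 25 FPS). The helper names `real_time_factor` and `frame_budget_ms` are introduced here for illustration, and the calculation assumes the quoted times are averages over one second of output.

```python
# Figures quoted in the abstract: seconds of compute per one second of video.
HALLO_SECONDS = 20.93
TELLER_SECONDS = 0.92


def real_time_factor(compute_seconds, video_seconds=1.0):
    """Compute time divided by video duration; below 1.0 keeps pace with playback."""
    return compute_seconds / video_seconds


def frame_budget_ms(fps):
    """Average time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


speedup = HALLO_SECONDS / TELLER_SECONDS
print(f"speedup over Hallo: {speedup:.2f}x")
print(f"Teller real-time factor: {real_time_factor(TELLER_SECONDS):.2f}")
print(f"frame budget at 25 FPS: {frame_budget_ms(25):.0f} ms")
```

Under these assumptions the speedup is roughly 22.75x. Teller's real-time factor of 0.92 is below 1.0, which is what streaming requires: 0.92 s spread over 25 frames is about 36.8 ms per frame, inside the 40 ms budget that 25 FPS allows.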

Paper Structure

This paper contains 14 sections, 11 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: The Teller framework is the first autoregressive framework for real-time, audio-driven portrait animation, achieving up to 25 FPS while preserving realistic body part and accessory movements. A demo can be found at https://teller-avatar.github.io/.
  • Figure 2: Overall framework of our proposed Teller for real-time streaming audio-driven portrait animation.
  • Figure 3: Teller follows the AR transformer architecture, but each input consists of a pair of tokens and the model predicts a pair of tokens for each output position.
  • Figure 4: Qualitative comparison with existing approaches on the RAVDESS dataset for the 'angry' and 'disgust' emotion cases. Videos are available in the supplementary materials.
  • Figure 5: Qualitative comparison with existing approaches on the HDTF dataset. Videos are available in the supplementary materials.
  • ...and 6 more figures