SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

Tan Yu, Qian Qiao, Le Shen, Ke Zhou, Jincheng Hu, Dian Sheng, Bo Hu, Haoming Qin, Jun Gao, Changhai Zhou, Shunshun Yin, Siyuan Liu

TL;DR

SoulX-FlashHead introduces a 1.3B-parameter diffusion-transformer framework for real-time, infinite-length talking-head generation, addressing both audio feature instability during streaming and error accumulation in long autoregressive sequences. The method combines Streaming-Aware Spatiotemporal Pre-training, which uses a Temporal Audio Context Cache, with Oracle-Guided Bidirectional Distillation to achieve high fidelity and stable lip-sync, backed by the VividHead dataset and the TalkVivid-scale data pipeline. Empirical results on HDTF and VFHQ show state-of-the-art performance, with a real-time Lite variant reaching 96 FPS on a single NVIDIA RTX 4090 and a Pro variant delivering higher visual quality. This work enables fast, coherent interactive avatars at scale, while acknowledging ethical considerations and the need for safeguards against misuse.

Abstract

Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
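The abstract does not spell out how the Temporal Audio Context Cache works internally. As a rough illustration of the general idea only (giving a feature extractor more than the current short fragment to look at), the sketch below keeps a rolling buffer of recent audio samples and prepends it to each incoming chunk. The class name `AudioContextCache`, the method `extend_with_context`, and the sample-level buffering are all assumptions made for this example, not the authors' implementation.

```python
from collections import deque
from typing import Deque, List, Sequence


class AudioContextCache:
    """Rolling buffer of recent audio samples (hypothetical helper, not the paper's code)."""

    def __init__(self, context_samples: int) -> None:
        if context_samples < 0:
            raise ValueError("context_samples must be non-negative")
        # A deque with maxlen drops the oldest samples automatically once full.
        self._history: Deque[float] = deque(maxlen=context_samples)

    def extend_with_context(self, chunk: Sequence[float]) -> List[float]:
        """Return cached history followed by the new chunk, then update the cache."""
        window = list(self._history) + list(chunk)
        self._history.extend(chunk)
        return window


# Usage: three short streaming chunks with a 4-sample context budget.
cache = AudioContextCache(context_samples=4)
first = cache.extend_with_context([1.0, 2.0, 3.0])   # no history yet
second = cache.extend_with_context([4.0, 5.0, 6.0])  # history holds 1, 2, 3
third = cache.extend_with_context([7.0, 8.0])        # history holds 3, 4, 5, 6
print(first, second, third)
```

In a real pipeline the window returned here would be passed to an audio encoder, and only the features aligned with the new chunk would be kept; that step is omitted because the source does not describe it.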

Paper Structure (13 sections, 5 equations, 3 figures, 4 tables)

This paper contains 13 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of our comprehensive data filtering pipeline. We obtain 782 hours of high-quality audio–video data from 10k hours
  • Figure 2: Framework Overview of SoulX-FlashHead. (a) Stage 1: Streaming-Aware Spatiotemporal Pre-training. We employ a Temporal Audio Context Cache to stabilize feature extraction from short streaming audio and utilize channel-wise concatenation for robust reference image injection. (b) Stage 2: Oracle-Guided Bidirectional Distillation. To mitigate error accumulation, the Student generates autoregressively conditioned on its own historical predictions, while the Teacher utilizes Ground Truth motion frames as an "Oracle" guide. The model is optimized via a Stochastic Truncation Strategy using DMD and latent regression losses.
  • Figure 3: Qualitative comparison on 60-second video generation at 25 fps. Yellow dashed regions illustrate lip-synchronization mismatches in motion-based methods like Ditto and SadTalker, while green indicators point to severe error accumulation and identity drift in Hallo3. Red boxes reveal holistic inconsistencies where elements like headgear separate from the subject due to the lack of unified pixel latent space modeling. In contrast, SoulX-FlashHead maintains robust lip synchronization, structural integrity, and holistic consistency throughout the sequence.
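
The Figure 2 caption contrasts a Student that is conditioned on its own historical predictions with a Teacher that is conditioned on ground-truth motion frames. The toy sketch below illustrates only that conditioning difference, under strong assumptions: each frame is a single float, the "model" is a hypothetical one-step predictor with a fixed bias of 0.5, and the true motion advances by exactly 1 per frame. It does not model the DMD loss, the latent regression loss, or the Stochastic Truncation Strategy.

```python
from typing import Callable, List

Frame = float  # stand-in for a latent motion frame (assumption for this toy)


def rollout(step: Callable[[Frame], Frame], first: Frame,
            ground_truth: List[Frame], use_oracle: bool) -> List[Frame]:
    """Generate len(ground_truth) frames one at a time.

    use_oracle=True conditions each step on the ground-truth previous frame
    (teacher-style); use_oracle=False conditions on the model's own previous
    output (student-style autoregression).
    """
    outputs: List[Frame] = []
    prev = first
    for t in range(len(ground_truth)):
        pred = step(prev)
        outputs.append(pred)
        prev = ground_truth[t] if use_oracle else pred
    return outputs


def biased_step(prev: Frame) -> Frame:
    """Hypothetical imperfect predictor: true motion is +1.0, the model adds +0.5."""
    return prev + 1.0 + 0.5


ground_truth = [1.0, 2.0, 3.0, 4.0]  # true frames following a start frame of 0.0
teacher = rollout(biased_step, 0.0, ground_truth, use_oracle=True)
student = rollout(biased_step, 0.0, ground_truth, use_oracle=False)

# Per-frame absolute error against the ground truth.
teacher_err = [abs(p - g) for p, g in zip(teacher, ground_truth)]
student_err = [abs(p - g) for p, g in zip(student, ground_truth)]
print("teacher error:", teacher_err)
print("student error:", student_err)
```

In this toy setting the oracle-conditioned rollout keeps a constant per-frame error, while the self-conditioned rollout's error grows with every frame, which is the kind of error accumulation the distillation stage is described as mitigating.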