SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads
Tan Yu, Qian Qiao, Le Shen, Ke Zhou, Jincheng Hu, Dian Sheng, Bo Hu, Haoming Qin, Jun Gao, Changhai Zhou, Shunshun Yin, Siyuan Liu
TL;DR
SoulX-FlashHead introduces a 1.3B-parameter diffusion-transformer framework for real-time, infinite-length talking-head generation. It targets two problems: unstable audio features during streaming and error accumulation in long autoregressive sequences. The method combines Streaming-Aware Spatiotemporal Pre-training with a Temporal Audio Context Cache and Oracle-Guided Bidirectional Distillation to achieve high fidelity and stable lip-sync, backed by the VividHead dataset and the TalkVivid-scale data pipeline. Empirical results on HDTF and VFHQ show state-of-the-art performance, with a real-time Lite variant reaching 96 FPS on a single NVIDIA RTX 4090 and a Pro variant delivering higher visual quality. This work enables fast, coherent interactive avatars at scale, while acknowledging ethical considerations and the need for safeguards against misuse.
Abstract
Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
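The abstract does not spell out how the Temporal Audio Context Cache works. As a rough illustration of the general idea only (keeping recent audio so that a short incoming fragment is never encoded in isolation), the sketch below implements a simple rolling sample buffer. The class name `AudioContextCache`, the method `extend_with_context`, and the fixed-size sample window are all assumptions made for this example, not the authors' implementation.

```python
from collections import deque
from typing import Deque, List, Sequence


class AudioContextCache:
    """Hypothetical rolling buffer holding the most recent audio samples."""

    def __init__(self, context_samples: int) -> None:
        if context_samples < 0:
            raise ValueError("context_samples must be non-negative")
        # A bounded deque silently drops the oldest samples once it is full.
        self._buffer: Deque[float] = deque(maxlen=context_samples)

    def extend_with_context(self, chunk: Sequence[float]) -> List[float]:
        """Return cached context followed by the new chunk, then update the cache."""
        window = list(self._buffer) + list(chunk)
        self._buffer.extend(chunk)
        return window


# Toy usage with integer "samples" so the behaviour is easy to follow.
cache = AudioContextCache(context_samples=4)
first = cache.extend_with_context([1, 2, 3])   # no history yet
second = cache.extend_with_context([4, 5, 6])  # prefixed by the first chunk
third = cache.extend_with_context([7])         # prefixed by the last 4 samples
```

In a streaming setting, the returned window (rather than the bare fragment) would be passed to the audio encoder, so that each short fragment is processed together with its preceding context. How the actual system selects, sizes, and encodes that context is not described in the abstract.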
