SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin

Abstract

Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.

Paper Structure

This paper contains 30 sections, 20 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: This phenomenon is observed when a causal attention mask is directly applied to a pretrained non-AR diffusion model, indicating that Neighbor Forcing effectively mitigates the mismatch between AR and non-AR diffusion.
  • Figure 2: Overall training pipeline of SoulX-LiveAct. Left: The architecture of the DiT block. Right: The training pipeline consists of two stages: (i) training with step-aligned noisy references and a diffusion loss computed at the same step, and (ii) joint training of ConvKV memory and step distillation.
  • Figure 3: Memory mechanism. Left: The KV states corresponding to the long-term memory are compressed via a lightweight 1D convolution operator. Right: The final inference pipeline unifies the Neighbor Forcing formulation and the ConvKV Memory mechanism.
  • Figure 4: Qualitative comparison of lip-motion accuracy and emotion–action coordination. SoulX-LiveAct achieves more precise lip–phoneme alignment and maintains coherent facial expressions and body movements under emotion–action interactions, while baseline methods show misalignment or temporal jitter.
  • Figure 5: Long-video consistency comparison. Red boxes indicate identity drift (notably severe in OmniAvatar), while yellow boxes highlight fine-grained detail inconsistencies (e.g., missing accessories in InfiniteTalk and Live-Avatar). SoulX-LiveAct maintains stable identity and details throughout the sequence.
  • ...and 2 more figures
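
The abstract and the Figure 3 caption describe compressing cached keys and values with a lightweight 1D convolution so that memory stays bounded during streaming. The sketch below is only an illustration of that general idea, not the paper's implementation. The class name ConvKVMemorySketch, the fixed averaging kernel (a real system would presumably learn the weights), the split into an uncompressed recent window and a compressed long-term buffer, and the re-compression rule for the oldest slots are all assumptions made for this example.

```python
def conv1d_compress(seq, kernel, stride):
    """Strided 1D convolution along the time axis of a list of vectors."""
    k = len(kernel)
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        dim = len(window[0])
        out.append([sum(kernel[j] * window[j][d] for j in range(k))
                    for d in range(dim)])
    return out


class ConvKVMemorySketch:
    """Hypothetical bounded cache for one stream of key (or value) vectors."""

    def __init__(self, kernel, recent_size, long_size):
        if len(kernel) < 2 or long_size < len(kernel):
            raise ValueError("need len(kernel) >= 2 and long_size >= len(kernel)")
        self.kernel = kernel
        self.recent_size = recent_size  # assumed uncompressed window
        self.long_size = long_size      # assumed cap on compressed slots
        self.recent = []
        self.long_term = []

    def append(self, vec):
        k = len(self.kernel)
        self.recent.append(vec)
        # Once k entries have aged out of the recent window, fold them
        # into a single long-term slot.
        if len(self.recent) >= self.recent_size + k:
            oldest, self.recent = self.recent[:k], self.recent[k:]
            self.long_term.extend(conv1d_compress(oldest, self.kernel, stride=k))
        # Keep the long-term buffer within its cap by re-compressing
        # its k oldest slots into one.
        while len(self.long_term) > self.long_size:
            oldest, rest = self.long_term[:k], self.long_term[k:]
            self.long_term = conv1d_compress(oldest, self.kernel, stride=k) + rest

    def __len__(self):
        return len(self.long_term) + len(self.recent)


if __name__ == "__main__":
    memory = ConvKVMemorySketch(kernel=[0.5, 0.5], recent_size=4, long_size=3)
    for t in range(1000):
        memory.append([float(t)])
    print(len(memory))  # stays bounded no matter how many steps are appended
```

Under these assumptions the cache never holds more than recent_size + len(kernel) - 1 + long_size entries, which is the constant-memory property the abstract refers to. Older content is averaged more times than newer content, so this toy version trades detail in the distant past for a fixed footprint.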