AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu

Abstract

Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step at constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame with a 1.3B-parameter student model, which is fast enough for real-time streaming. Our project page is available at: https://cuiliyuan121.github.io/AvatarForcing/
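
The windowed emission loop described in the abstract can be illustrated with a toy bookkeeping sketch. It tracks only block indices and integer noise levels, not latents or a model; the function name `stream_blocks`, the staggered warm-start of the window, and the "drop one noise level per pass" rule are all assumptions made for illustration and are not taken from the paper.

```python
from collections import deque


def stream_blocks(num_emit, window_len):
    """Toy sliding-window schedule (hypothetical, not the authors' code).

    Noise levels are integers: 0 means clean, window_len means pure noise.
    The window starts staggered (cleaner on the left, noisier on the right),
    which is an assumed warm-start state.
    """
    window = deque((block_id, block_id + 1) for block_id in range(window_len))
    next_id = window_len
    emitted = []
    block_passes = 0  # how many block updates were performed in total

    while len(emitted) < num_emit:
        # One joint pass over the window: every block drops one noise level.
        window = deque((bid, level - 1) for bid, level in window)
        block_passes += window_len

        # The leftmost block is now clean, so it is emitted.
        bid, level = window.popleft()
        assert level == 0
        emitted.append(bid)

        # Slide the window: append a fresh, fully noisy block on the right.
        window.append((next_id, window_len))
        next_id += 1

    return emitted, block_passes


emitted, block_passes = stream_blocks(num_emit=5, window_len=3)
print(emitted, block_passes)
```

In this sketch every pass touches exactly `window_len` blocks and emits exactly one, so the work per emitted block does not grow with the length of the stream, which is the constant per-step cost property the abstract refers to.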

Paper Structure (28 sections, 7 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: AR forcing vs. full-sequence DiT vs. AvatarForcing. AvatarForcing enables real-time, long-form talking-avatar generation from a reference image and streaming audio. It performs one-step joint denoising in a fixed sliding window to introduce bounded local-future context at constant latency, reducing autoregressive drift without full-sequence diffusion (34 ms/frame).
  • Figure 2: Windowed denoising with dual anchors and two-stage distillation. AvatarForcing performs one-step denoising over a fixed local-future window with heterogeneous timesteps (cleaner on the left and noisier on the right). At each step, the window is jointly updated under bidirectional attention to emit the leftmost clean block, slide the window, and append fresh noise. Long-horizon stability is supported by a dual-anchor KV cache, including a RoPE re-indexed style anchor (with anchor-audio zero-padding) and a temporal anchor constructed from recent clean blocks. Real-time one-step inference is achieved by distilling a global bidirectional teacher into a streaming student via two-stage training with offline ODE backfill and DMD post-training on student rollouts.
  • Figure 3: Long-form qualitative comparison. We compare AvatarForcing with representative autoregressive and diffusion-based talking-avatar models on long-form generation, with an emphasis on temporal stability, identity preservation, and audio-visual synchronization.
  • Figure 4: Latency vs. window length. Inference latency increases with both window length and the number of denoising steps. Window length is shown in frames; with block size $B=4$, a window of $L$ blocks spans $BL$ frames.
  • Figure 5: Window length $L$ vs. denoising steps $N$. Ablation on window lengths $L$ and denoising steps $N$ for $\mathcal{B}_{L,N}$.
  • ...and 3 more figures
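
The figures quoted in the captions above lend themselves to a quick arithmetic check. The sketch below uses the block size $B=4$ from Figure 4 and the 34 ms/frame figure from Figure 1; the helper names and the example window of 5 blocks are hypothetical, and treating 34 ms/frame as an amortized per-frame cost (so that emitting one block takes roughly $B$ times that) is an assumption rather than something the captions state.

```python
def frames_per_window(block_size, window_blocks):
    """A window of L blocks with block size B spans B * L frames."""
    return block_size * window_blocks


def frames_per_second(ms_per_frame):
    """Throughput implied by a per-frame latency in milliseconds."""
    return 1000.0 / ms_per_frame


def block_budget_ms(block_size, ms_per_frame):
    """Time per emitted block, assuming the per-frame figure is amortized."""
    return block_size * ms_per_frame


B = 4              # block size from the Figure 4 caption
L = 5              # example window length in blocks (hypothetical value)
MS_PER_FRAME = 34  # latency reported in the Figure 1 caption

print(frames_per_window(B, L))                    # frames covered by the window
print(round(frames_per_second(MS_PER_FRAME), 2))  # implied frames per second
print(block_budget_ms(B, MS_PER_FRAME))           # ms per emitted block
```

Under these assumptions, 34 ms/frame corresponds to a little over 29 frames per second, which would keep pace with a 25 fps stream.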