Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim

TL;DR

Deep Forcing introduces a training-free solution to long-horizon autoregressive video generation by combining Deep Sink, which expands and temporally aligns attention sinks, with Participative Compression, which selectively preserves informative KV-cache tokens. By leveraging the pre-trained Self Forcing model’s inherent attention-sink behavior and applying RoPE-based temporal alignment, the approach stabilizes long rollouts and minimizes error accumulation without fine-tuning. Empirical results demonstrate state-of-the-art or competitive performance on long-video benchmarks, user studies, and VLM-based evaluations, with minute-long generation and strong dynamic quality. This work shows that training-free KV-cache management can rival or exceed training-based methods for streaming long-video synthesis and offers practical implications for real-time visual generation systems.

Abstract

Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address these failure modes. Specifically, (1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts, and (2) Participative Compression performs importance-aware KV-cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g., from 5 s training clips to generation of 60 s or more) with better imaging quality than LongLive, better aesthetic quality than Rolling Forcing, nearly unchanged overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressive streaming long-video generation.
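The importance-aware pruning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, scaled dot-product attention normalized over the candidate tokens only, and plain Python lists in place of tensors. The names `participative_compress` and `budget` (the number of tokens kept) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def participative_compress(recent_queries, candidate_keys, budget):
    """Pick the `budget` candidate tokens that recent queries attend to most.

    Assumption: importance is the attention weight a candidate receives,
    averaged over all recent queries (single head, illustrative only).
    Returns the kept candidate indices in their original (temporal) order.
    """
    dim = len(candidate_keys[0])
    scale = 1.0 / math.sqrt(dim)
    avg_score = [0.0] * len(candidate_keys)

    for q in recent_queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in candidate_keys]
        for i, weight in enumerate(softmax(logits)):
            avg_score[i] += weight / len(recent_queries)

    # Rank candidates by query-averaged attention and keep the top `budget`.
    ranked = sorted(range(len(avg_score)),
                    key=lambda i: avg_score[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: candidates 0 and 2 align with the queries, 1 and 3 do not.
queries = [[1.0, 0.0], [1.0, 0.0]]
keys = [[5.0, 0.0], [0.0, 5.0], [4.0, 0.0], [-5.0, 0.0]]
kept = participative_compress(queries, keys, budget=2)
```

In this toy case the two candidates whose keys point in the same direction as the recent queries are retained and the others are evicted. A real system would apply the same ranking per layer and head on cached key tensors.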

Paper Structure

This paper contains 44 sections, 12 equations, 17 figures, 6 tables, 1 algorithm.

Figures (17)

  • Figure 1: Our training-free approach, Deep Forcing, achieves comparable visual quality to training-based baselines, such as Rolling Forcing [liu2025rolling] and LongLive [yang2025longlive]. Notably, Deep Forcing enables minute-long video generation while maintaining visual quality and dynamics without requiring any additional training.
  • Figure 2: Comparison of KV Cache Management. (a) Self Forcing [huang2025self] adopts a FIFO policy that discards the earliest tokens regardless of their importance, often losing critical context and degrading generation quality. In contrast, our (b) Deep Forcing performs selective eviction by preserving Deep Sink tokens and applying KV-cache compression, effectively mitigating visual degradation during long-horizon generation.
  • Figure 3: Overview of Deep Forcing. (a) Deep Forcing maintains a substantially enlarged attention sink (Deep Sink) covering approximately half the context window, combined with Participative Compression for the remaining rolling portion. Temporal RoPE adjustment aligns the sink tokens' temporal indices with current frames to maintain temporal coherence. (b) Participative Compression computes query-averaged attention scores between recent tokens and candidate tokens, selecting the top-C most important tokens to retain in the compressed cache while evicting redundant tokens.
  • Figure 4: Attention weight distribution across earlier frames. Query-averaged attention showing how the last chunk (frames 19-21) attends to earlier KV cache entries (frames 0-18). We visualize two representative attention heads from different layers, L1H1 (layer 1, head 1) and L5H10 (layer 5, head 10), demonstrating that substantial attention is maintained across the entire context window, not just initial frames. See Appendix \ref{sec:additional_attn_vis} for analysis of additional heads.
  • Figure 5: Ablation study on Deep Sink depth. We evaluate the effect of sink depth on video quality using Aesthetic Drift ($\downarrow$) and Overall Consistency ($\uparrow$) metrics on 50-second videos from the first 21 prompts in MovieGen [polyak2024movie].
  • ...and 12 more figures
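
The cache layout in the Figure 3(a) caption, with a sink covering roughly half the window and sink temporal indices re-aligned to the current frames, can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the function name `deep_sink_layout` is hypothetical, the sink is taken to be exactly `window // 2` frames, and "alignment" is assumed to mean shifting the sink's temporal positions so they sit immediately before the rolling frames. The exact adjustment used by the authors is not given in this summary.

```python
def deep_sink_layout(num_frames, window, sink=None):
    """Return (kept_frames, temporal_positions) for a sink-plus-rolling cache.

    Assumptions (illustrative only):
      * the first `sink` frames are kept permanently (default: half the window);
      * the remaining slots hold the most recent frames;
      * sink frames get temporal positions shifted to be contiguous with
        the rolling frames, instead of their original indices 0..sink-1.
    """
    if sink is None:
        sink = window // 2

    # Before the window fills up, nothing is evicted or shifted.
    if num_frames <= window:
        frames = list(range(num_frames))
        return frames, list(frames)

    rolling_len = window - sink
    rolling = list(range(num_frames - rolling_len, num_frames))
    sink_frames = list(range(sink))

    # Shift sink positions so the last sink frame precedes the first
    # rolling frame by exactly one step.
    shift = rolling[0] - sink
    positions = [f + shift for f in sink_frames] + rolling
    return sink_frames + rolling, positions


frames, positions = deep_sink_layout(num_frames=100, window=20)
```

With 100 generated frames and a 20-frame window, the cache holds frames 0 to 9 (the sink) and 90 to 99 (the rolling part), while the positions fed to the temporal RoPE run contiguously from 80 to 99 under the assumed shift.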