Context Forcing: Consistent Autoregressive Video Generation with Long Context

Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen

TL;DR

The paper targets long-term consistency in autoregressive video generation by addressing a fundamental student-teacher mismatch: prior methods train long-horizon students with short-context teachers, which leads to forgetting and drift. It introduces Context Forcing, which pairs a long-context teacher with a two-stage Contextual Distribution Matching Distillation (CDMD) procedure and a Slow-Fast Memory KV cache that compresses history while preserving salient dynamics. The Context Teacher is trained with Error-Recycling Fine-Tuning so that it remains reliable when the student's history drifts, which enables effective supervision for very long sequences (context lengths exceeding 20 seconds, with videos lasting up to minutes). Across video continuation, text-to-video, and long-video generation tasks, the approach shows improved long-range coherence and reduced drift compared with state-of-the-art baselines. The paper also acknowledges the need for careful ethical consideration and notes room for further memory optimization.

Abstract

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
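
The idea of turning a linearly growing context into a bounded Slow-Fast Memory can be illustrated with a toy buffer. The sketch below is only a minimal illustration under stated assumptions: the class name `SlowFastMemory`, all sizes, and the stride-based rule for promoting old chunks into slow memory are hypothetical and are not taken from the paper, which does not specify its compression rule in this summary. Chunks are represented by integer IDs rather than real KV tensors.

```python
from collections import deque


class SlowFastMemory:
    """Toy bounded context: sink + subsampled slow memory + recent fast memory.

    Sizes and the stride-based compression rule are illustrative assumptions,
    not values or rules from the paper.
    """

    def __init__(self, sink_size=2, slow_size=6, fast_size=4, slow_stride=3):
        self.sink_size = sink_size      # earliest chunks, kept permanently
        self.fast_size = fast_size      # most recent chunks, kept at full rate
        self.slow_stride = slow_stride  # keep every n-th chunk leaving fast memory
        self.sink = []
        self.slow = deque(maxlen=slow_size)  # oldest slow entries fall off when full
        self.fast = deque()
        self._evicted = 0               # number of chunks that have left fast memory

    def append(self, chunk):
        # Fill the sink first; it is never evicted afterwards.
        if len(self.sink) < self.sink_size:
            self.sink.append(chunk)
            return
        self.fast.append(chunk)
        if len(self.fast) > self.fast_size:
            oldest = self.fast.popleft()
            # Compress history: only every slow_stride-th evicted chunk survives.
            if self._evicted % self.slow_stride == 0:
                self.slow.append(oldest)
            self._evicted += 1

    def context(self):
        # Order: sink, then compressed older history, then recent chunks.
        return self.sink + list(self.slow) + list(self.fast)


memory = SlowFastMemory()
for chunk_id in range(100):  # 100 generated chunks, history grows linearly
    memory.append(chunk_id)
context = memory.context()
print(len(context), context)
```

With these assumed settings, the context stays at 12 entries after 100 chunks: the two sink chunks, six strided older chunks, and the four most recent ones. A real system would store key/value tensors and would likely pick slow-memory entries by content rather than by a fixed stride.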

Paper Structure (18 sections, 9 equations, 8 figures, 3 tables, 1 algorithm)

Figures (8)

  • Figure 1: Context Forcing mitigates the forgetting-drifting dilemma. (1) State-of-the-art models are limited by short context windows (3.0 to 9.2 s), which leads to poor long-term consistency (Forgetting). (2) For streaming long-context tuning baselines (e.g., LongLive), enlarging the context window during inference (from 3.0 s to 5.25 s) causes error accumulation and distribution shift (Drifting). In contrast, Context Forcing supports 20s+ context while maintaining strong long-term consistency.
  • Figure 2: Training paradigms for AR video diffusion models. (a) Self-forcing: A student matches a teacher capable of generating only 5s video using a 5s self-rollout. (b) LongLive (Yang et al., 2025): The student performs long rollouts supervised by a memoryless 5s teacher on random chunks. The teacher's inability to see beyond its 5s window creates a student-teacher mismatch. (c) Context Forcing (Ours): The student is supervised by a long-context teacher aware of the full generation history, resolving the mismatch in (b).
  • Figure 3: Context Forcing and Context Management System. The KV cache serves as the context memory and is organized into three parts: sink, slow memory, and fast memory. During contextual DMD training, the long teacher supervises the long student using the same context memory mechanism.
  • Figure 4: Comparison on 1-min Video Generation. Our method keeps both the background and the subject consistent across a 1-min video, while the baselines exhibit varying degrees of drifting or identity shift.
  • Figure 5: Qualitative Results of Context Forcing. Our method enables minute-level video generation with minimal drifting and high consistency across diverse scenarios.
  • ...and 3 more figures