Mode Seeking meets Mean Seeking for Fast Long Video Generation

Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat

TL;DR

This paper proposes a training paradigm in which Mode Seeking meets Mean Seeking: a Decoupled Diffusion Transformer builds a unified representation and decouples local fidelity from long-term coherence. The method closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency.

Abstract

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the model learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a fast, few-step long-video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
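The contrast between "mean seeking" and "mode seeking" in the abstract can be illustrated with a toy 1-D experiment. This is a sketch under stated assumptions, not the paper's setup: the bimodal target, the grid, and all names (`p_data`, `best_mu`, `SIGMA`) are hypothetical. Fitting a fixed-variance Gaussian under forward KL is equivalent to least-squares regression, so it stands in here for a regression-style objective such as supervised flow matching; reverse KL stands in for the distribution-matching objective.

```python
import math

SIGMA = 0.5                                    # assumed shared std-dev
DX = 0.01                                      # integration step
XS = [-8.0 + DX * i for i in range(1601)]      # grid on [-8, 8]

def normal_pdf(x, mu, sigma=SIGMA):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def p_data(x):
    # Hypothetical bimodal "data" density with modes at -2 and +2.
    return 0.5 * normal_pdf(x, -2.0) + 0.5 * normal_pdf(x, 2.0)

def kl(f, g):
    """Riemann-sum estimate of KL(f || g) on the grid XS."""
    return sum(f(x) * math.log(f(x) / g(x)) for x in XS) * DX

def best_mu(direction):
    """Grid-search the mean of a single Gaussian q under one KL direction."""
    mus = [-3.0 + 0.5 * i for i in range(13)]  # candidates -3.0 ... 3.0

    def objective(mu):
        q = lambda x: normal_pdf(x, mu)
        if direction == "forward":
            return kl(p_data, q)               # mean-seeking direction
        return kl(q, p_data)                   # mode-seeking direction

    return min(mus, key=objective)

forward_mu = best_mu("forward")
reverse_mu = best_mu("reverse")
print("forward KL picks mu =", forward_mu)
print("reverse KL picks mu =", reverse_mu)
```

In this toy, the forward-KL fit settles between the two modes, where the data density is low (the 1-D analogue of a blurry average), while the reverse-KL fit commits to one of the two modes. That is the intuition behind pairing a regression-style objective for global structure with a reverse-KL objective for local sharpness.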

Paper Structure

This paper contains 21 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Video length is not analogous to image resolution. Top: For images, moving from low to high resolution is largely an interpolation of the same underlying local patch distribution. Bottom: For videos, moving from short clips to long sequences is temporal extrapolation. The model must introduce new events and causal structure beyond the short-clip horizon, which is fundamentally harder than multi-resolution image training.
  • Figure 2: Overview: mode seeking meets mean seeking. A shared long-context condition encoder $E_\phi$ maps a noisy long-video latent $x_t^{\text{long}}$ (with timestep $t$ and conditioning $c$) to a unified representation $h_t$. Two lightweight decoder heads read out velocities from $h_t$: the long-context Flow Matching head $D^{\text{FM}}_\theta$ is trained with supervised flow matching on real long videos (mean-seeking), while the segment-wise Distribution Matching head $D^{\text{DM}}_\psi$ is trained via on-policy sliding-window reverse-KL alignment to an expert short-video teacher using DMD [yin2024dmd, yin2024dmd2] / VSD [wang2023vsd]-style gradients (mode-seeking). Both objectives update the shared encoder, but each head receives only its corresponding signal.
  • Figure 3: Qualitative results. Our method generalizes well to various scenarios, producing long videos that maintain local fidelity and global coherence. All results are obtained using the Wan [wang2025wan] 1.3B model as both the student and the teacher, demonstrating how our decoupled training effectively extends short-video capabilities to long-horizon generation. We refer to our supplemental website for the videos and for results from the Wan [wang2025wan] 14B model.
  • Figure 4: Qualitative comparison. "LongSFT" and "MixSFT" refer to long-context supervised finetuning (SFT) and mixed-length SFT, respectively. SFT-only methods (LongSFT, MixSFT) capture long context and narrative reasonably well, but their outputs appear blurry. Teacher-only methods (CausVid [yin2025causvid], Self-Forcing [huang2025selfforcing]) suffer from quality degradation over long ranges. While InfinityRoPE [yesiltepe2026infinityrope] extends the temporal horizon, it cannot process long context and tends to generate static content. Our method is the best-performing model overall in terms of quality, motion, and long-horizon consistency. We refer to our supplementary website for video results.
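The gradient routing described in the Figure 2 caption (both objectives update the shared encoder, each head receives only its own signal) and the sliding-window segmentation can be sketched with a scalar toy model. Everything here is an illustrative assumption rather than the authors' implementation: the scalars `phi`, `theta`, `psi` stand in for $E_\phi$, $D^{\text{FM}}_\theta$, $D^{\text{DM}}_\psi$; `dm_signal` stands in for a DMD-style gradient on the DM head output (treated as a constant); and the window length, stride, and tail handling in `sliding_windows` are hypothetical choices.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) index pairs of fixed-length windows over a video."""
    if window > num_frames:
        raise ValueError("window is longer than the video")
    starts = list(range(0, num_frames - window + 1, stride))
    # Assumed tail handling: add a final window so the last frames are covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

def toy_gradients(x, fm_target, dm_signal, phi, theta, psi):
    """Hand-derived gradients for a scalar encoder with two scalar heads.

    h     = phi * x      shared representation
    v_fm  = theta * h    Flow Matching head output
    v_dm  = psi * h      Distribution Matching head output
    L_fm  = (v_fm - fm_target) ** 2          regression loss
    L_dm  = dm_signal * v_dm                 surrogate with constant dm_signal
    """
    h = phi * x
    v_fm = theta * h
    fm_err = 2.0 * (v_fm - fm_target)        # dL_fm / dv_fm
    grads_fm = {
        "phi": fm_err * theta * x,           # encoder is updated by FM loss
        "theta": fm_err * h,                 # FM head gets its own signal
        "psi": 0.0,                          # DM head is untouched by FM loss
    }
    grads_dm = {
        "phi": dm_signal * psi * x,          # encoder is also updated by DM loss
        "theta": 0.0,                        # FM head is untouched by DM loss
        "psi": dm_signal * h,                # DM head gets its own signal
    }
    return grads_fm, grads_dm

windows = sliding_windows(num_frames=10, window=4, stride=4)
grads_fm, grads_dm = toy_gradients(
    x=1.0, fm_target=0.5, dm_signal=0.2, phi=2.0, theta=3.0, psi=4.0
)
encoder_grad = grads_fm["phi"] + grads_dm["phi"]   # both objectives contribute
print(windows)
print(grads_fm, grads_dm, encoder_grad)
```

In a real model the same routing would follow automatically from automatic differentiation, since each loss depends only on the encoder and its own head; the manual gradients above just make the zero entries explicit.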