Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach
Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, Jean-Michel Morel
TL;DR
This work identifies a fundamental limitation in existing video diffusion models: a single scalar timestep, shared by every frame in a clip, constrains how temporal dynamics can be modeled. It introduces the Frame-Aware Video Diffusion Model (FVDM), which uses a vectorized timestep variable so that each frame can follow its own noise schedule, enabling finer temporal modeling and broad zero-shot capabilities. Key innovations include per-frame forward diffusion with independent noise scales, a score-based reverse process, and a probabilistic timestep sampling strategy to manage computational cost. Empirical results on multiple datasets show state-of-the-art or competitive video quality, with strong performance in standard video generation and in zero-shot tasks such as image-to-video generation, interpolation, and long-video synthesis. The authors present the approach as a new paradigm for temporally coherent video synthesis, with room for further extensions and applications in multimedia generation.
Abstract
Diffusion models have revolutionized image generation, and their extension to video generation has shown promise. However, current video diffusion models (VDMs) rely on a scalar timestep variable applied at the clip level, which limits their ability to model complex temporal dependencies needed for various tasks like image-to-video generation. To address this limitation, we propose a frame-aware video diffusion model (FVDM), which introduces a novel vectorized timestep variable (VTV). Unlike conventional VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies. FVDM's flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot methods. Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks. By addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm in video synthesis, offering a robust framework with significant implications for generative modeling and multimedia applications.
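To make the vectorized timestep idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows a per-frame forward noising step, a probabilistic choice between independent and shared timesteps, and one possible image-to-video configuration. The function names, the cosine noise schedule, the mixing probability `p_independent`, and the convention that timestep 0 means "clean" are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def sample_timestep_vector(num_frames, num_steps, p_independent, rng):
    """Draw one timestep per frame, each in 1..num_steps.

    With probability p_independent every frame gets its own timestep;
    otherwise all frames share one value, which is the usual scalar case.
    p_independent is a hypothetical knob, not a value from the paper.
    """
    if rng.random() < p_independent:
        return [rng.randint(1, num_steps) for _ in range(num_frames)]
    shared = rng.randint(1, num_steps)
    return [shared] * num_frames


def alpha_bar(t, num_steps):
    """Assumed cosine schedule: 1.0 at t = 0 (clean), near 0 at t = num_steps."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def add_noise_per_frame(frames, timesteps, num_steps, rng):
    """Noise each frame at its own level: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    noisy = []
    for frame, t in zip(frames, timesteps):
        a = alpha_bar(t, num_steps)
        noisy.append(
            [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in frame]
        )
    return noisy


def image_to_video_timesteps(num_frames, t):
    """Illustrative configuration: frame 0 stays clean, the rest sit at level t."""
    return [0] + [t] * (num_frames - 1)


# Toy clip: 4 frames, each flattened to 3 "pixels".
rng = random.Random(0)
clip = [[0.5, -0.25, 1.0] for _ in range(4)]

# Training-style draw: per-frame timesteps half of the time, shared otherwise.
train_t = sample_timestep_vector(len(clip), 1000, 0.5, rng)
train_noisy = add_noise_per_frame(clip, train_t, 1000, rng)

# Conditioning-style draw: the first frame is kept as given.
i2v_t = image_to_video_timesteps(len(clip), 1000)
i2v_noisy = add_noise_per_frame(clip, i2v_t, 1000, rng)
```

With a scalar timestep, only the shared branch of `sample_timestep_vector` exists, so a clean conditioning frame alongside noisy frames cannot be expressed. The vector form makes that configuration an ordinary input, which is the intuition behind the zero-shot tasks listed in the abstract.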
