Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach

Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, Jean-michel Morel

TL;DR

This work identifies a fundamental limitation in existing video diffusion models: a single scalar timestep constrains temporal dynamics across frames. It introduces the Frame-Aware Video Diffusion Model (FVDM), which uses a vectorized timestep variable to allow per-frame noise schedules, enabling finer temporal modeling and broad zero-shot capabilities. Key innovations include per-frame forward diffusion with independent noise scales, a score-based reverse process, and a probabilistic timestep sampling strategy to manage computational cost. Empirical results on multiple datasets show state-of-the-art or competitive video quality, with strong performance in standard video generation and in versatile zero-shot tasks such as image-to-video generation, interpolation, and long-video synthesis. The approach sets a new paradigm for temporally coherent video synthesis, with potential for further extensions and applications in multimedia generation.

Abstract

Diffusion models have revolutionized image generation, and their extension to video generation has shown promise. However, current video diffusion models (VDMs) rely on a scalar timestep variable applied at the clip level, which limits their ability to model complex temporal dependencies needed for various tasks like image-to-video generation. To address this limitation, we propose a frame-aware video diffusion model (FVDM), which introduces a novel vectorized timestep variable (VTV). Unlike conventional VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies. FVDM's flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot methods. Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks. By addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm in video synthesis, offering a robust framework with significant implications for generative modeling and multimedia applications.
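
To make the idea of per-frame noise levels concrete, here is a minimal sketch of a forward noising step driven by a vectorized timestep. It is not the authors' implementation. The variance-preserving update, the linear beta schedule, and the function names `alpha_bar`, `sample_vtv`, and `noise_frames` are all assumptions made for illustration. The sampling rule (with probability `p` draw an independent timestep per frame, otherwise share one timestep) is only one plausible reading of the probabilistic timestep sampling strategy mentioned above.

```python
import math
import random


def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal levels for an assumed linear beta schedule.

    Index 0 is defined as 1.0, so timestep 0 means "no noise".
    """
    levels, prod = [1.0], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        levels.append(prod)
    return levels


def sample_vtv(num_frames, num_steps, p, rng):
    """Assumed sampling rule: independent per-frame timesteps with probability p,
    otherwise one timestep shared by every frame."""
    if rng.random() < p:
        return [rng.randrange(1, num_steps + 1) for _ in range(num_frames)]
    return [rng.randrange(1, num_steps + 1)] * num_frames


def noise_frames(frames, vtv, abar, rng):
    """Noise frame i to its own level vtv[i]; each frame is a flat list of floats."""
    noisy = []
    for frame, t in zip(frames, vtv):
        signal = math.sqrt(abar[t])
        sigma = math.sqrt(1.0 - abar[t])
        noisy.append([signal * x + sigma * rng.gauss(0.0, 1.0) for x in frame])
    return noisy


rng = random.Random(0)
abar = alpha_bar()
frames = [[0.5] * 4 for _ in range(8)]  # 8 toy "frames" of 4 values each
vtv = sample_vtv(num_frames=8, num_steps=1000, p=0.5, rng=rng)
noisy = noise_frames(frames, vtv, abar, rng)
```

With a scalar timestep, `vtv` would always hold one repeated value. Allowing its entries to differ is what lets individual frames sit at different noise levels within the same clip.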

Paper Structure

This paper contains 16 sections, 12 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Conventional video diffusion models (b) directly extend image diffusion models (a) by applying a single scalar timestep to the whole video clip. This straightforward adaptation restricts the flexibility of VDMs in downstream tasks such as image-to-video generation and longer video generation. In this paper, we propose the Frame-Aware Video Diffusion Model (FVDM), which trains the denoiser with a vectorized timestep variable (c). Our method not only attains superior visual quality in standard video generation but also enables multiple downstream tasks in a zero-shot manner.
  • Figure 2: Diverse Applications of FVDM. (a) Standard Video Generation: Uses a uniform timestep across frames, $[t, t, \ldots, t]$. (b) Image-to-Video Generation: Transforms a static image into a video using a customized vectorized timestep, $[\tau^1, t, \ldots, t]$, $\tau^{1} \equiv 0$. (c) Video Interpolation: Smoothly interpolates frames between the start and end frames, using $[\tau^1, t, \ldots, t, \tau^N]$, $\tau^{1} = \tau^{N} \equiv 0$. (d) Long Video Generation: Extends sequences by conditioning on the final frames, applying $[\tau^1, \ldots, \tau^M, t, \ldots, t]$, $\tau^{1} = \cdots = \tau^{M} \equiv 0$. (e) Many More Zero-Shot Applications: Highlights the potential for tasks such as any-frame conditioning, video transition, and next-frame prediction.
  • Figure 3: Comprehensive ablation study on the FaceForensics dataset [rossler2018faceforensics] for video generation, using the FVD metric (lower is better) with training iterations from $50k$ to $200k$. The top, bottom-left, and bottom-right panels show the ablations on sampling probability ($p$), inference schedule, and model scale, respectively.
  • Figure 4: Qualitative comparison of real samples and generated video samples from FVDM (ours) and Latte [ma2024latte] on four datasets: FaceForensics [rossler2018faceforensics], SkyTimelapse [xiong2018learning], UCF101 [soomro2012ucf101], and Taichi-HD [Siarohin_2019_NeurIPS] (from top to bottom). For a fair comparison, we select samples of the same class for UCF101 [soomro2012ucf101] and samples with similar content for the other datasets. FVDM produces more coherent and realistic video sequences than the baseline.
  • Figure 5: Image-to-video generation
  • ...and 2 more figures
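
The timestep layouts listed in the Figure 2 caption can be sketched as plain lists. This is an illustrative reading of the caption rather than code from the paper. The helper names are hypothetical, and frame positions are 0-indexed here, whereas the caption counts from 1.

```python
from typing import Iterable, List


def standard_vtv(t: int, num_frames: int) -> List[int]:
    """(a) Standard generation: every frame shares timestep t."""
    return [t] * num_frames


def conditioned_vtv(t: int, num_frames: int, clean: Iterable[int]) -> List[int]:
    """Frames at the 0-indexed positions in `clean` are pinned to timestep 0
    (treated as given); all other frames use t."""
    clean_set = set(clean)
    return [0 if i in clean_set else t for i in range(num_frames)]


def image_to_video_vtv(t: int, num_frames: int) -> List[int]:
    """(b) The first frame is the conditioning image: [0, t, ..., t]."""
    return conditioned_vtv(t, num_frames, [0])


def interpolation_vtv(t: int, num_frames: int) -> List[int]:
    """(c) The first and last frames are given: [0, t, ..., t, 0]."""
    return conditioned_vtv(t, num_frames, [0, num_frames - 1])


def long_video_vtv(t: int, num_frames: int, num_cond: int) -> List[int]:
    """(d) The first num_cond frames of the window are given: [0, ..., 0, t, ..., t]."""
    return conditioned_vtv(t, num_frames, range(num_cond))
```

Under these assumptions, `image_to_video_vtv(500, 4)` gives `[0, 500, 500, 500]`. The generic `conditioned_vtv` also covers the "any-frame conditioning" case in (e), because any subset of positions can be pinned to 0.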