ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Shaoshu Yang; Yong Zhang; Xiaodong Cun; Ying Shan; Ran He

TL;DR

ZeroSmooth introduces a training-free approach to upscale the frame rate of pretrained video diffusion models by formulating a self-cascaded diffusion framework with transformer hidden-state corrections. The method preserves temporal consistency through back-projection-inspired corrections applied to temporal and spatial transformer states, and it controls correction strength to mitigate distribution mismatch. Empirical results across multiple base models and datasets show strong performance compared with tuning-free baselines and competitive results with training-based interpolators. The work broadens practical deployment of diffusion-based video generation by enabling plug-and-play, high-frame-rate synthesis without additional data or training.
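The back-projection idea mentioned above can be illustrated with a small sketch. It assumes the degradation operator is plain temporal subsampling (keeping every `stride`-th frame), treats each frame as a single float, and uses a hypothetical `strength` parameter to scale the correction. None of these names come from the paper, and the actual method applies its corrections to transformer hidden states rather than to raw frames.

```python
def subsample(frames, stride):
    """Degradation operator A (assumed): keep every `stride`-th frame."""
    return frames[::stride]


def back_project(high_rate, key_frames, stride, strength=1.0):
    """Pull the subsampled positions of `high_rate` toward `key_frames`.

    Computes x + strength * A^T (y - A x), where A is temporal subsampling.
    For a selection operator the pseudo-inverse equals the transpose,
    so only the key-frame positions are modified.
    """
    if len(subsample(high_rate, stride)) != len(key_frames):
        raise ValueError("key_frames does not match the subsampled length")
    corrected = list(high_rate)
    for k, target in enumerate(key_frames):
        i = k * stride
        corrected[i] = high_rate[i] + strength * (target - high_rate[i])
    return corrected


# Toy example: 5 high-rate "frames", 3 low-rate key frames (stride 2).
high_rate = [0.0, 1.0, 2.0, 3.0, 4.0]
key_frames = [10.0, 20.0, 30.0]
full = back_project(high_rate, key_frames, stride=2, strength=1.0)
half = back_project(high_rate, key_frames, stride=2, strength=0.5)
```

With `strength=1.0` the corrected sequence reproduces the key frames exactly after subsampling; smaller values enforce that consistency only partially, which is the kind of trade-off a correction-strength control has to manage.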

Abstract

Video generation has made remarkable progress in recent years, especially since the advent of video diffusion models. Many video generation models, e.g., Stable Video Diffusion (SVD), can produce plausible synthetic videos. However, most video models can only generate low frame rate videos because of limited GPU memory and the difficulty of modeling a large set of frames. Training videos are uniformly sampled at a specified interval for temporal compression. Previous methods increase the frame rate either by training a video interpolation model in pixel space as a postprocessing stage or by training an interpolation model in latent space for a specific base video model. In this paper, we propose a training-free video interpolation method for generative video diffusion models, which generalizes to different models in a plug-and-play manner. We investigate the non-linearity in the feature space of video diffusion models and transform a video model into a self-cascaded video diffusion model by incorporating the designed hidden state correction modules. The self-cascaded architecture and the correction module are proposed to retain temporal consistency between the key frames and the interpolated frames. Extensive evaluations are performed on multiple popular video models to demonstrate the effectiveness of the proposed method. In particular, our training-free method is comparable even to trained interpolation models supported by huge compute resources and large-scale datasets.

Paper Structure (35 sections, 14 equations, 10 figures, 4 tables)

Introduction
Related Works
  Video Diffusion Models
  Zero-shot Visual Restoration and Video Interpolation
Preliminary
  Video Diffusion Models
  Visual Restoration with Diffusion Models
Method
  Self-cascaded Video Diffusion Model
  Temporal Attention for High Frame Rate Generation
  Hidden State Correction for Transformers
  Controlling Correction Strength
Experiments
  Experiment Settings
  Testing Datasets and Metrics
...and 20 more sections

Figures (10)

Figure 1: Our method enables pretrained video diffusion models to generate at a high frame rate (4$\times$ that used during training) without extra training data or parameter updates.
Figure 2: An overview of our method. (a) We build a cascaded video diffusion model by adapting the base generator to generate at a higher frame rate. (b) A sketch of hidden state correction in transformers in ZeroSmooth.
Figure 3: (a) Examples of hidden state correction in the 2$\times$ higher frame rate generation case, showing how the queries, keys and values are calibrated in the temporal transformer (Temporal) and in spatial transformers using different interpolation operators (Spatial A1, Spatial A2). (b) We adapt temporal transformers to generate longer sequences in different ways. For temporal modules with relative positional embedding (RPE), we use windowed attention and apply RPE within each window. For modules with absolute positional embedding (APE), we interpolate the position index to obtain the APE before applying the attention operation.
Figure 4: Visual comparison between our method and other tuning-free baselines. The green rectangles capture the abrupt changes between adjacent frames.
Figure 5: An illustration of a three-stage self-cascaded model.
...and 5 more figures
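
The two temporal-attention adaptations described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a sinusoidal absolute positional embedding (a base model may use learned embeddings instead), fractional indices that keep key frames on integer positions, and non-overlapping fixed-size windows. The helper names `interpolated_positions`, `sinusoidal_embedding`, and `attention_windows` are hypothetical.

```python
import math


def interpolated_positions(num_train_frames, scale):
    """Fractional position indices covering the training range at `scale`x rate.

    For 4 training frames and scale=2 this yields 7 positions:
    0, 0.5, 1, ..., 3. Key frames land on integer indices.
    """
    count = (num_train_frames - 1) * scale + 1
    return [i / scale for i in range(count)]


def sinusoidal_embedding(position, dim):
    """Sinusoidal embedding evaluated at a (possibly fractional) position."""
    emb = []
    for j in range(0, dim, 2):
        freq = 1.0 / (10000 ** (j / dim))
        emb.append(math.sin(position * freq))
        emb.append(math.cos(position * freq))
    return emb[:dim]


def attention_windows(num_frames, window):
    """Split frame indices into consecutive windows; RPE would apply per window."""
    return [list(range(s, min(s + window, num_frames)))
            for s in range(0, num_frames, window)]


positions = interpolated_positions(num_train_frames=4, scale=2)
embeddings = [sinusoidal_embedding(p, dim=8) for p in positions]
windows = attention_windows(num_frames=len(positions), window=4)
```

Keeping the interpolated indices inside the range seen during training means the embedding is never evaluated at positions beyond the training length, and restricting attention to windows no longer than the training length keeps relative offsets within the trained range.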
