TVG: A Training-free Transition Video Generation Method with Diffusion Models

Rui Zhang, Yaosen Chen, Yuegen Liu, Wei Wang, Xuming Wen, Hongxia Wang

TL;DR

This work introduces TVG, a training-free transition video generation method that leverages video-level diffusion with latent-space Gaussian Process Regression, interpolation-based conditional controls, and a Frequency-aware Bidirectional Fusion architecture to produce smooth, coherent transitions between frames without retraining. The approach refines conditional inputs via SLERP-based text prompts and CLIP-based image conditioning, enforces temporal consistency through GPR in the latent space, and fuses forward and reverse generation in the frequency domain to stabilize transitions. Empirical results on MorphBench and TC-Bench-I2V show competitive or superior performance across qualitative and quantitative metrics, with strong human preferences for smoother transitions. The work demonstrates practical potential for high-dynamic transition generation using pre-trained diffusion models, while acknowledging limitations in generating long sequences and suggesting future work to extend beyond 16-frame clips.
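To make the SLERP-based conditioning mentioned above more concrete, here is a minimal NumPy sketch of spherical linear interpolation between two embedding vectors, evaluated once per frame of a 16-frame clip. It is an illustration only: the function names `slerp` and `interpolate_embeddings` are hypothetical, the embeddings are assumed to be flat 1-D vectors, and the summary does not specify how the authors actually apply SLERP inside their pipeline.

```python
import numpy as np


def slerp(v0, v1, t, eps=1e-7):
    """Spherical linear interpolation between two 1-D vectors at fraction t in [0, 1]."""
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    # The angle is measured between the normalized directions of the inputs.
    u0 = v0 / np.linalg.norm(v0)
    u1 = v1 / np.linalg.norm(v1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    s = np.sin(omega)
    if s < eps:
        # (Anti)parallel vectors: the SLERP weights are ill-defined,
        # so fall back to plain linear interpolation.
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * omega) / s) * v0 + (np.sin(t * omega) / s) * v1


def interpolate_embeddings(e_start, e_end, num_frames=16):
    """Return one interpolated embedding per frame (hypothetical helper).

    Assumption: e_start and e_end are flat embedding vectors of equal length.
    """
    ts = np.linspace(0.0, 1.0, num_frames)
    return np.stack([slerp(e_start, e_end, t) for t in ts])


if __name__ == "__main__":
    start = np.array([1.0, 0.0])
    end = np.array([0.0, 1.0])
    per_frame = interpolate_embeddings(start, end, num_frames=16)
    print(per_frame.shape)
```

Unlike linear interpolation, SLERP keeps the intermediate vectors on the arc between the two endpoints, so unit-norm inputs yield unit-norm per-frame conditions.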

Abstract

Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives. Traditional methods like morphing often lack artistic appeal and require specialized skills, limiting their effectiveness. Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes. We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training. Our method leverages Gaussian Process Regression ($\mathcal{GPR}$) to model latent representations, ensuring smooth and dynamic transitions between frames. Additionally, we introduce interpolation-based conditional controls and a Frequency-aware Bidirectional Fusion (FBiF) architecture to enhance temporal control and transition reliability. Evaluations on benchmark datasets and custom image pairs demonstrate the effectiveness of our approach in generating high-quality smooth transition videos. The code is provided at https://sobeymil.github.io/tvg.com.

Paper Structure

This paper contains 20 sections, 11 equations, 22 figures, and 4 tables.

Figures (22)

  • Figure 1: Failed results from some commercial products: (a) LUMA AI, (b) Jimeng AI, (c) Kling AI.
  • Figure 2: Illustration of Our Proposed Training-free TVG Method.
  • Figure 3: Samples of forward and reverse videos generated by DynamiCrafter, showing selected frames at 0, 5, 10, and 15. The top row shows the forward video, and the bottom row shows the reverse video.
  • Figure 4: Visualization of Our Generated Samples.
  • Figure 5: Comparison of Generated Sequences from TC-Bench Across Various Models.
  • ...and 17 more figures