Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

Wooseok Jeon, Seunghyun Shin, Dongmin Shin, Hae-Gon Jeon

TL;DR

Motion Prior Distillation (MPD) is a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path, yielding more temporally coherent inbetweening results guided by the forward motion prior.

Abstract

Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling methods, which either fuse the forward and backward paths in parallel or alternate between them sequentially, often suffer from temporal discontinuities and undesirable visual artifacts caused by misalignment between the two generated paths. This misalignment arises because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method deliberately avoids denoising the end-conditioned path, which is the source of the path ambiguity, and yields more temporally coherent inbetweening results guided by the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.
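
The motion conflict described in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: it treats each "frame" as a single number (an object position), and the function names `motion_residual`, `linear_fusion`, and `distill_forward_motion` are hypothetical. It assumes that a motion residual can be approximated by frame-to-frame differences, and it only shows why blending two paths with opposing motion cancels the motion, whereas re-anchoring the forward path's residuals at the end keyframe preserves it.

```python
import numpy as np


def motion_residual(frames):
    """Frame-to-frame differences along the time axis (axis 0)."""
    return np.diff(frames, axis=0)


def linear_fusion(forward, backward_flipped, weight=0.5):
    """Parallel-fusion baseline: blend two temporally aligned paths."""
    return weight * forward + (1.0 - weight) * backward_flipped


def distill_forward_motion(forward, end_frame):
    """Toy stand-in for distillation (an assumption, not the paper's method):
    rebuild an end-anchored path that reuses the forward path's residuals."""
    residual = motion_residual(forward)  # shape (T-1, ...)
    # remaining[t] = total forward motion from frame t to the last frame
    remaining = np.cumsum(residual[::-1], axis=0)[::-1]
    # Walk back from the end keyframe using the forward motion only
    return np.concatenate([end_frame[None] - remaining, end_frame[None]], axis=0)


# Forward path: conditioned on the start frame, the object moves right.
forward = np.array([0.0, 1.0, 2.0, 3.0, 3.5])
# End keyframe: the object should finish at position 4.
end_frame = np.array(4.0)
# Backward path: conditioned on the end frame, it also moves right in its own
# time (4, 5, 6, 7, 8), so after temporal flipping it appears to move left.
backward_flipped = np.array([8.0, 7.0, 6.0, 5.0, 4.0])

fused = linear_fusion(forward, backward_flipped)
distilled = distill_forward_motion(forward, end_frame)

print("fused motion:    ", motion_residual(fused))
print("distilled motion:", motion_residual(distilled))
```

In this toy setting the fused path barely moves because the two motions cancel, while the re-anchored path keeps the forward motion and lands on the end keyframe. It no longer starts exactly at the start keyframe, so the sketch illustrates motion agreement only, not how a real sampler satisfies both keyframe constraints.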

Paper Structure

This paper contains 21 sections, 21 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of the proposed motion prior distillation. (a) Ideal case of the generative inbetweening task. (b) Motion prior conflict in existing time reversal sampling methods. (c) Our proposed motion prior distillation method. (d) A video generated by the Stable Video Diffusion model conditioned on the start frame, and (e) conditioned on the end frame and temporally flipped. (f) A result from an existing time reversal sampling method, showing ghosting artifacts and reversed playback due to motion prior conflict. (g) A result from our proposed method, showing temporally coherent motion.
  • Figure 2: Denoising process of the proposed motion prior distillation. Existing time reversal sampling methods simply connect the two temporal paths either by (a) linearly fusing them or (b) alternately denoising each path. (c) Our MPD is applied within the time reversal sampling framework to distill the forward motion prior into the backward path, thereby achieving motion alignment.
  • Figure 3: Qualitative baseline comparisons. TRF and ViBiD suffer from back-and-forth motion and intermittent disappearance, while GI and FCVG exhibit noticeable artifacts and ghosting effects. Our method yields more temporally consistent motion than the comparison methods. Additional examples are provided on the project page.
  • Figure 4: Ablation study on the effect of distillation ratio $\gamma$. We vary the distillation step ratio $\gamma \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$, where $\gamma = 0.2$ corresponds to the default setting and $\gamma = 1$ applies our method at every denoising step.
  • Figure I: Analysis of denoised estimates. At the midpoint of time reversal sampling, we take the forward and backward denoised estimates, align the latter to the temporal order, and inspect their difference.
  • ...and 5 more figures