Explorative Inbetweening of Time and Space

Haiwen Feng; Zheng Ding; Zhihao Xia; Simon Niklaus; Victoria Abrevaya; Michael J. Black; Xuaner Zhang

TL;DR

The paper introduces bounded generation as a general task for image-to-video (I2V) models: synthesizing the intermediate frames between arbitrary start and end frames without retraining. It presents Time Reversal Fusion, a training-free sampling strategy that jointly denoises forward from the start frame and backward from the end frame, then fuses the two trajectories to produce end-constrained videos, with an optional noise re-injection step to preserve smooth transitions. On a dedicated evaluation set of 395 image pairs covering dynamic bounds, view bounds, and identical bounds, the method improves substantially over specialized baselines, a result supported by perceptual studies. The work highlights how bounded generation can reveal and leverage the latent dynamics learned by I2V models, offering a practical approach to controlled video generation and a lens for probing model understanding of motion and 3D structure.

Abstract

We introduce bounded generation as a generalized task to control video generation to synthesize arbitrary camera and subject motion based only on a given start and end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. This is achieved through the proposed new sampling strategy, which we call Time Reversal Fusion, that fuses the temporally forward and backward denoising paths conditioned on the start and end frame, respectively. The fused path results in a video that smoothly connects the two frames, generating inbetweening of faithful subject motion, novel views of static scenes, and seamless video looping when the two bounding frames are identical. We curate a diverse evaluation dataset of image pairs and compare against the closest existing methods. We find that Time Reversal Fusion outperforms related work on all subtasks, exhibiting the ability to generate complex motions and 3D-consistent views guided by bounded frames. See project page at https://time-reversal.github.io.

Paper Structure (24 sections, 3 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Related Works
Control-based Video Generation
Bounded Frame Generation
Frame Interpolation.
Sparse Novel View Synthesis.
Sampling-based Guided Image Generation
Method
Preliminaries
Stable Video Diffusion (SVD)
Condition manipulation.
Temporal Inpainting.
End-Frame Guidance using Time Reversal Fusion
Enhancing Fusion with Noise Re-Injection
Experiments
...and 9 more sections

Figures (7)

Figure 1: Bounded generation in three scenarios: 1) Generating subject motion with the two bound images capturing a moving subject. 2) Synthesizing camera motion using two images captured from different viewpoints of a static scene. 3) Achieving video looping by using the same image for both bounds. We propose a new sampling strategy, called Time Reversal Fusion, to preserve the inherent generalization of an image-to-video model while steering the video generation towards an exact ending frame.
Figure 2: The impact of conditioning on video generation. We experiment with different conditioning strategies and show their effects on the generated video. (Row 1) Using a linear interpolation of A and B as the image condition, the generated video does not end at B. (Row 2) Swapping B with random noise yields similar results, indicating that B has minimal influence on the generated content. (Row 3) With the proposed Time Reversal Fusion, our generated video starts with A and ends at B.
Figure 3: Image inpainting strategies do not apply to videos. We follow the standard diffusion inpainting method by replacing the last frame with the target frame at each denoising step. However, this results in a video that satisfies the end frame condition but with abrupt content changes, as indicated in the last frames in Row 1. Our method, on the other hand, generates a smooth video (Row 2) that ends at the given condition.
Figure 4: Pseudocode and illustration of Time Reversal Fusion. Initialized with identical noise and conditioned on the start and end frame, respectively, the two paths pass through the frozen SVD denoiser. The forward path is fused with the time-reversed backward path to produce the output for the subsequent step. Noise is re-injected into the fused output to add stochasticity to the sampling process.
Figure 5: The impact of noise re-injection on fusion. (Row 1) Without any stochasticity, the video suffers from random dynamics and unsmooth transitions. (Row 2) Tuning the churn term in SVD leads to blurry and low-quality frames. (Row 3) Using noise re-injection leads to smooth and natural frame transitions.
...and 2 more figures
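
The sampling loop described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: toy_denoiser is a hypothetical stand-in for the frozen SVD denoiser, the Euler update, the linear per-frame fusion weights, the noise schedule, and the re-injection variance are all choices made for this sketch, and every function and parameter name is invented here.

```python
import numpy as np

def toy_denoiser(x, sigma, cond):
    """Hypothetical stand-in for a frozen image-to-video denoiser.

    x: noisy video of shape (T, H, W); cond: conditioning image of shape (H, W).
    Frame 0 is pinned to the condition; later frames follow it less strictly.
    """
    w = np.linspace(1.0, 0.2, x.shape[0]).reshape(-1, 1, 1)
    shrink = 1.0 / (1.0 + sigma**2)  # trust the noisy input less at high noise
    return w * cond[None] + (1.0 - w) * shrink * x

def euler_step(x, denoised, sigma, sigma_next):
    """One Euler step from noise level sigma to sigma_next (assumed sampler)."""
    d = (x - denoised) / sigma
    return x + d * (sigma_next - sigma)

def time_reversal_fusion(start, end, denoiser, sigmas,
                         num_frames=8, reinject_iters=1, seed=0):
    rng = np.random.default_rng(seed)
    # Both paths share the same initial noise.
    x = rng.standard_normal((num_frames, *start.shape)) * sigmas[0]
    # Assumed fusion weights: forward path dominates early frames,
    # backward path dominates late frames.
    lam = np.linspace(1.0, 0.0, num_frames).reshape(-1, 1, 1)

    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        for k in range(reinject_iters + 1):
            # Forward path: conditioned on the start frame.
            fwd = euler_step(x, denoiser(x, sigma, start), sigma, sigma_next)
            # Backward path: flip time, condition on the end frame, flip back.
            x_rev = x[::-1]
            bwd = euler_step(x_rev, denoiser(x_rev, sigma, end),
                             sigma, sigma_next)[::-1]
            fused = lam * fwd + (1.0 - lam) * bwd
            if k < reinject_iters:
                # Noise re-injection: lift the fused result back to level
                # sigma and repeat the step (assumed variance matching).
                noise = rng.standard_normal(fused.shape)
                x = fused + np.sqrt(sigma**2 - sigma_next**2) * noise
            else:
                x = fused
    return x

# Usage: bound a toy 8-frame "video" between an all-zero and an all-one image.
start = np.zeros((4, 4))
end = np.ones((4, 4))
sigmas = np.append(np.geomspace(10.0, 0.05, 12), 0.0)  # assumed schedule
video = time_reversal_fusion(start, end, toy_denoiser, sigmas)
```

With the toy denoiser, the first and last frames of video match the two bounding images, which is the property the fusion is meant to provide. A real I2V denoiser would replace toy_denoiser and operate on latents rather than raw arrays.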
