Frame In-N-Out: Unbounded Controllable Image-to-Video Generation
Boyang Wang, Xuweiyi Chen, Matheus Gadelha, Zezhou Cheng
TL;DR
Frame In-N-Out introduces an unbounded-canvas approach to image-to-video generation, enabling Frame In and Frame Out effects by conditioning on an expanded canvas $B_{canvas}$, motion trajectories $c_{trajs}$, and optional identity references $f$ alongside the first frame $I_0$ and text $\boldsymbol{y}$. The authors present a semi-automatic dataset curation pipeline and a video Diffusion Transformer with a two-stage training scheme that unifies spatiotemporal motion, unaligned identity, and unbounded conditioning. Empirical results on a purpose-built evaluation suite show substantial gains over state-of-the-art baselines in both Frame Out and Frame In scenarios, including improved FID, FVD, LPIPS, trajectory alignment, and identity-consistency metrics. The work has practical implications for cinematic production and advertising, enabling more expressive and controllable video synthesis beyond traditional frame-bound constraints, while highlighting avenues for future improvements in 3D motion ambiguity and ID-size control.
Abstract
Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control objects in the image so that they naturally leave the scene, or provide brand-new identity references that enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a new dataset that is curated semi-automatically, an efficient identity-preserving motion-controllable video Diffusion Transformer architecture, and a comprehensive evaluation protocol targeting this task. Our evaluation shows that our proposed approach significantly outperforms existing baselines.
