Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Boyang Wang, Xuweiyi Chen, Matheus Gadelha, Zezhou Cheng

TL;DR

Frame In-N-Out introduces an unbounded-canvas approach to image-to-video generation, enabling Frame In and Frame Out effects by conditioning on an expanded canvas $B_{canvas}$, motion trajectories $c_{trajs}$, and optional identity references $f$ alongside the first frame $I_0$ and text $\boldsymbol{y}$. The authors present a semi-automatic dataset curation pipeline and a video Diffusion Transformer with a two-stage training scheme that unifies spatiotemporal motion, unaligned identity, and unbounded conditioning. Empirical results on a purpose-built evaluation suite show substantial gains over state-of-the-art baselines in both Frame Out and Frame In scenarios, including improved FID, FVD, LPIPS, trajectory alignment, and identity-consistency metrics. The work has practical implications for cinematic production and advertising, enabling more expressive and controllable video synthesis beyond traditional frame-bound constraints, while highlighting avenues for future improvements in 3D motion ambiguity and ID-size control.
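
As a rough illustration of the canvas idea described above, the sketch below pads a first frame into a larger canvas and labels a trajectory as Frame In or Frame Out depending on whether it starts or ends inside the original frame region. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`expand_canvas`, `inside_frame`, `classify_trajectory`), the padding parameters, and the (x, y) pixel-coordinate convention are all hypothetical.

```python
import numpy as np

def expand_canvas(first_frame, pad_top, pad_bottom, pad_left, pad_right, fill=0):
    """Place the first frame inside a larger canvas (hypothetical helper).

    Returns the canvas and the bounding box (x0, y0, x1, y1) of the
    original frame in canvas coordinates (x1 and y1 are exclusive).
    """
    h, w = first_frame.shape[:2]
    canvas_shape = (h + pad_top + pad_bottom, w + pad_left + pad_right) + first_frame.shape[2:]
    canvas = np.full(canvas_shape, fill, dtype=first_frame.dtype)
    canvas[pad_top:pad_top + h, pad_left:pad_left + w] = first_frame
    bbox = (pad_left, pad_top, pad_left + w, pad_top + h)
    return canvas, bbox

def inside_frame(traj, bbox):
    """Boolean mask of shape (T,): which trajectory points lie inside the frame region.

    traj is a (T, 2) array of (x, y) points in canvas coordinates.
    """
    x0, y0, x1, y1 = bbox
    x, y = traj[:, 0], traj[:, 1]
    return (x >= x0) & (x < x1) & (y >= y0) & (y < y1)

def classify_trajectory(traj, bbox):
    """Label a trajectory by where it starts and ends relative to the frame region."""
    mask = inside_frame(traj, bbox)
    if mask[0] and not mask[-1]:
        return "frame_out"  # starts visible, ends in the canvas-only region
    if not mask[0] and mask[-1]:
        return "frame_in"   # starts in the canvas-only region, ends visible
    return "other"

# Usage: a 4x6 RGB frame padded by 1 row above/below and 2 columns left/right.
frame = np.ones((4, 6, 3), dtype=np.uint8)
canvas, bbox = expand_canvas(frame, pad_top=1, pad_bottom=1, pad_left=2, pad_right=2)
exit_traj = np.stack([np.linspace(3, 9, 5), np.full(5, 2.0)], axis=1)  # moves right, past the frame edge
label_out = classify_trajectory(exit_traj, bbox)
label_in = classify_trajectory(exit_traj[::-1], bbox)
```

Keeping trajectories in canvas coordinates means that a point outside the visible frame is still a valid position, which is the property that lets a single trajectory describe an object leaving or entering the scene.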

Abstract

Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can direct objects in the image to leave the scene naturally, or supply brand-new identity references that enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a new dataset that is curated semi-automatically, an efficient identity-preserving motion-controllable video Diffusion Transformer architecture, and a comprehensive evaluation protocol targeting this task. Our evaluation shows that our proposed approach significantly outperforms existing baselines.

Paper Structure

This paper contains 26 sections, 9 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Frame In-N-Out presents a new task in image-to-video generation that extends the first frame into an unbounded canvas region, where the model can be conditioned on an identity reference with motion trajectory control to achieve the Frame In and Frame Out cinematic techniques.
  • Figure 2: Data Curation Pipeline. Our curation pipeline provides high-quality filtered videos, text prompts, tracking trajectories with semantic labels, and bounding boxes that serve as partitions between the first frame and the canvas region.
  • Figure 3: Main Architecture. Our video Diffusion Transformer takes the first frame with canvas expansion, motion trajectories, an identity reference, and a text prompt as conditions for video generation.
  • Figure 4: Qualitative comparison on our benchmark dataset. In (a), we compare our model on Frame Out cases against DragAnything [wu2024draganything] and ToRA [zhang2024tora]. Both baselines fail to fully move the person outside the image boundaries, while our model successfully handles a complete exit. In (b), we evaluate Frame In scenarios against Phantom [liu2025phantom] and SkyReels-A2 [fei2025skyreels]. Only our model achieves the Frame In effect with the designated identity.
  • Figure 5: Inference Pipeline.
  • ...and 6 more figures