MultiCOIN: Multi-Modal COntrollable Video INbetweening

Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao

TL;DR

MultiCOIN tackles controllable video inbetweening between distant keyframes by combining several control signals: motion trajectories, depth, target regions, and text prompts. The motion controls (trajectory and depth) are mapped to a sparse point-based input for a Diffusion Transformer (DiT) backbone. Control signals come from two dedicated pathways (the Sparse Motion/Depth Generators and the Augmented Frame Generator), motion and content are encoded in a dual-branch design, and the model is trained in stages to stabilize learning and improve alignment with user cues. The approach yields more accurate motion trajectories, richer content control, and robust long-video coherence, outperforming trajectory-only baselines in both qualitative and quantitative evaluations. The framework enables flexible, fine-grained video interpolation for creative editing and long-form synthesis.

Abstract

Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

Paper Structure

This paper contains 16 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Our model, MultiCOIN, takes a start frame and an end frame and generates the interpolated video between them. It supports multi-modal controls, including depth change and layering, motion trajectories, text prompts, and target regions for movement localization, to generate smooth and plausible transitions. The controls can be used individually (top four rows) to create diverse results even from the same input pair (e.g., different depth layering produces the different results in the top two rows). The controls can also be combined in a complementary way to ease user interaction. For example, target regions may be used for content control, while a trajectory provides motion information. Also, while specifying the general movement of the woman by text, the user can exert accurate spatial control over the bird with a target region.
  • Figure 2: Overview of our MultiCOIN pipeline. Given a video $X$, we extract multi-modal motion controls through two generators: the Sparse Motion Generator via optical flow and the Sparse Depth Generator for depth maps, both producing sparse RGB points for trajectory and depth control. An Augmented Frame Generator computes target regions and masks to enable fine-grained content control. All control signals are encoded via a dual-branch embedder architecture that separately captures motion and content features. In addition, a text prompt condition is processed by a text encoder to provide semantic guidance over the generated content. At inference, the model flexibly integrates these multi-modal controls for interpolation.
  • Figure 3: Sparse Motion and Depth Generator. Given video $X$, dense optical flow and depth maps are computed. Trajectories are selected from high-motion regions along which flow/depth points are sampled and expanded with 2D filters to get sparse RGB inputs.
  • Figure 4: Example of a witch moving the Jack-o’-Lantern along the same path, with motion inward (top) or outward (bottom), depending on midpoint depth (blue vs. red dot).
  • Figure 5: Our results illustrate several ways multi-modal controls can be applied to frame interpolation. In the top section, we show trajectory control on its own, followed by two depth variations that place the cat either in front of or behind the pumpkin. Combining trajectory with depth produces richer motion: the balloon recedes along the z-axis while the weights with the cat are pushed outward. Prompts can also be paired with trajectories, where the trajectory sets the overall movement and the prompt refines details. In the bottom section, we highlight target region control. The temporal placement of target regions determines content editing at that point: in the first case, they are inserted in the middle with both first and last frames given, while in the second they appear at the end serving as a soft replacement for the last frame.
  • ...and 4 more figures
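
The Figure 3 caption describes the Sparse Motion and Depth Generator only at a high level, so the following is a minimal NumPy sketch of that idea rather than the authors' implementation. It assumes the dense flow and depth maps are already computed. All function names, the top-k start-point selection, the Gaussian used as the "2D filter", and both RGB encodings are hypothetical choices made for illustration.

```python
import numpy as np

def select_trajectory_starts(flow, num_tracks=4):
    """Pick start pixels with the largest mean flow magnitude (assumed criterion).
    flow: (T, H, W, 2), flow[t] maps frame t to t+1, channels are (dx, dy)."""
    mag = np.linalg.norm(flow, axis=-1).mean(axis=0)          # (H, W)
    top = np.argsort(mag.ravel())[::-1][:num_tracks]
    ys, xs = np.unravel_index(top, mag.shape)
    return np.stack([xs, ys], axis=1).astype(np.float64)       # (N, 2) as (x, y)

def trace_trajectories(flow, starts):
    """Follow the dense flow forward from each start point. Returns (T+1, N, 2)."""
    T, H, W, _ = flow.shape
    pts = [starts.copy()]
    for t in range(T):
        cur = pts[-1]
        xi = np.clip(np.rint(cur[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.rint(cur[:, 1]).astype(int), 0, H - 1)
        nxt = cur + flow[t, yi, xi]                             # nearest-pixel lookup
        nxt[:, 0] = np.clip(nxt[:, 0], 0, W - 1)
        nxt[:, 1] = np.clip(nxt[:, 1], 0, H - 1)
        pts.append(nxt)
    return np.stack(pts)

def sample_along_tracks(maps, tracks):
    """Read per-frame map values at each track point. maps: (F, H, W)."""
    F, H, W = maps.shape[:3]
    xi = np.clip(np.rint(tracks[..., 0]).astype(int), 0, W - 1)
    yi = np.clip(np.rint(tracks[..., 1]).astype(int), 0, H - 1)
    return maps[np.arange(F)[:, None], yi, xi]                  # (F, N)

def render_sparse_points(tracks, values, height, width, sigma=2.0):
    """Expand each point with a 2D Gaussian (stand-in for the caption's 2D filter).
    tracks: (F, N, 2); values: (F, N, 3) in [0, 1]. Returns (F, H, W, 3)."""
    F, N, _ = tracks.shape
    yy, xx = np.mgrid[0:height, 0:width]
    out = np.zeros((F, height, width, 3))
    for f in range(F):
        for n in range(N):
            x, y = tracks[f, n]
            w = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
            out[f] = np.maximum(out[f], w[..., None] * values[f, n])
    return out

# Toy data: an 8x8 patch moving right by 2 px per frame, plus a depth ramp.
T, H, W = 4, 32, 32
flow = np.zeros((T, H, W, 2))
flow[:, 8:16, 8:16, 0] = 2.0
depth = np.tile(np.linspace(0.0, 1.0, W), (T + 1, H, 1))        # (T+1, H, W) in [0, 1]

starts = select_trajectory_starts(flow, num_tracks=4)
tracks = trace_trajectories(flow, starts)                       # (T+1, N, 2)

# Hypothetical motion encoding: per-step displacement in the R and G channels.
disp = np.diff(tracks, axis=0, append=tracks[-1:])
scale = max(np.abs(disp).max(), 1e-8)
motion_rgb = np.concatenate(
    [0.5 + 0.5 * disp / scale, np.full(disp.shape[:2] + (1,), 0.5)], axis=-1)

# Hypothetical depth encoding: a red-to-blue ramp over normalized depth.
d = sample_along_tracks(depth, tracks)
depth_rgb = np.stack([d, np.zeros_like(d), 1.0 - d], axis=-1)

motion_frames = render_sparse_points(tracks, motion_rgb, H, W)
depth_frames = render_sparse_points(tracks, depth_rgb, H, W)
```

Both outputs are per-frame RGB images that are near zero except around the tracked points, which is the kind of sparse point input the caption refers to. At inference a user would supply the points directly instead of deriving them from a video.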