Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak

TL;DR

Through-The-Mask introduces a two-stage image-to-video framework that uses mask-based motion trajectories as an intermediate, object-centric representation to capture both motion and semantics. The Motion-to-Video stage employs masked cross-attention and masked self-attention to softly enforce object-specific prompts and per-object temporal consistency, enabling robust multi-object motion and improved text faithfulness. Empirical results on SA-V-128 and Image-Animation-Bench show state-of-the-art temporal coherence, motion realism, and image fidelity, with comprehensive ablations validating the contribution of the motion representation and masking strategy. The work also provides a new SA-V-128 benchmark to facilitate robust evaluation of I2V systems in single-object and multi-object scenarios, and demonstrates compatibility with both U-Net and DiT diffusion backbones.

Abstract

We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked cross-attention objective, integrating object-specific prompts into corresponding latent space regions, and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce SA-V-128, a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. The project page is available at https://guyyariv.github.io/TTM/.
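
The per-object masked cross-attention described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name, the hard (rather than soft) mask, the use of `None` for background patches, and the zero update for patches without a matching prompt token are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(queries, keys, values, patch_obj, token_obj):
    """Let each latent patch attend only to prompt tokens of its own object.

    queries:   one vector per latent patch
    keys:      one vector per local-prompt token
    values:    one vector per local-prompt token
    patch_obj: object id per patch, read from the mask trajectory
               (None marks background; an assumption of this sketch)
    token_obj: object id per prompt token
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    out = []
    for q, obj in zip(queries, patch_obj):
        # Indices of prompt tokens that describe this patch's object.
        allowed = [j for j, t in enumerate(token_obj) if t == obj]
        if not allowed:
            # Assumption: patches with no matching token receive no update.
            out.append([0.0] * out_dim)
            continue
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        weights = softmax(scores)
        out.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed))
            for d in range(out_dim)
        ])
    return out


# Three patches: one on object 0 (e.g. a cat), one on object 1 (e.g. a dog),
# and one background patch. Each object has a single local-prompt token.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_obj = [0, 1, None]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 20.0]]
token_obj = [0, 1]

out = masked_cross_attention(queries, keys, values, patch_obj, token_obj)
```

In this toy example each object patch receives only the value of its own object's token, and the background patch is left unchanged, which is the routing behaviour the masking is meant to enforce.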

Paper Structure (23 sections, 2 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Through-The-Mask is an Image-to-Video method that animates an input image based on a provided text caption. The generated video (rows 2 and 4) leverages mask-based motion trajectories (rows 1 and 3), enabling accurate animation of multiple objects.
  • Figure 2: Overview of our I2V framework, which transforms a reference image $x^{(0)}$ and text prompt $c$ into a coherent video sequence $\hat{x}$. A pre-trained LLM derives the motion-specific prompt $c_{motion}$ and the object-specific prompts $c_{local} = \{c_{local}^{(1)}, \dots, c_{local}^{(L)}\}$, capturing each object's intended motion. We generate an initial segmentation mask $s^{(0)}$ from $x^{(0)}$ using SAM2. In Stage 1, Image-to-Motion uses $x^{(0)}$, $s^{(0)}$, and $c_{motion}$ to generate mask-based motion trajectories $\hat{s}$ that represent object-specific movement paths. In Stage 2, Motion-to-Video takes as input $x^{(0)}$, the generated trajectories $\hat{s}$, the text prompt $c$ as a global condition, and the object-specific prompts $c_{local}$ through masked attention blocks (Section \ref{sec:motion_to_video}), producing the final video $\hat{x}$.
  • Figure 3: Illustration of the masked attention block. Squares represent video latent patches, color-coded to indicate objects (e.g., cat or dog). Triangles denote prompt tokens: gray for global prompts and object-specific colors for local prompts. The pipeline features self-attention for all patches, masked self-attention restricted to each object, cross-attention integrating global prompts, and masked cross-attention aligning object-specific prompts.
  • Figure 4: Qualitative comparison: Visual examples of generated videos for Through-The-Mask compared to the TI2V baseline on examples from the SA-V-128 benchmark.
  • Figure 5: Qualitative comparison of generated videos using segmentation masks vs optical flow as an intermediate motion representation. The first row shows the input image and text, the second row displays the generated masks, and the third row presents the generated optical flow. The fourth and fifth rows show the generated videos, with the fourth row using our mask-based model and the fifth using our flow-based model.
  • ...and 6 more figures
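
The masked self-attention step in the Figure 3 caption restricts attention to patches of the same object. A minimal sketch of how such a mask could be derived from a mask-based trajectory is given below. The function name, the flat per-frame patch layout, and the treatment of background (id 0) as just another group are assumptions of this sketch, not details taken from the paper.

```python
def self_attention_mask(trajectory):
    """Build a boolean patch-to-patch mask from a mask-based trajectory.

    trajectory: list of frames; each frame is a flat list of object ids,
                one per latent patch (0 = background in this sketch).
    Returns allowed, where allowed[i][j] is True when patches i and j,
    taken from any pair of frames, carry the same object id. Attention is
    then confined to a single object across space and time.
    """
    # Flatten all frames into one sequence of spatio-temporal patches.
    ids = [obj for frame in trajectory for obj in frame]
    return [[a == b for b in ids] for a in ids]


# Two frames with three patches each. Object 1 moves one patch to the right
# between frames, while object 2 stays in place.
trajectory = [[1, 0, 2], [0, 1, 2]]
mask = self_attention_mask(trajectory)
```

Here patch 0 of the first frame and patch 1 of the second frame both belong to object 1, so they may attend to each other even though their spatial positions differ; this is how a trajectory of masks can tie an object's appearance together from frame to frame.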