Animate Your Motion: Turning Still Images into Dynamic Videos

Mingxiao Li, Bo Wan, Marie-Francine Moens, Tinne Tuytelaars

TL;DR

This work addresses the limitation of single-modality conditioning in diffusion-based video generation by introducing SMCD, a Scene and Motion Conditional Diffusion model that jointly conditions on an initial image, object trajectories, and text. SMCD adds two modules to a pretrained diffusion backbone: the Motion Integration Module, which encodes object trajectories, and the Dual Image Integration Module, which preserves the semantic content of the initial image. A two-stage training regime avoids interference between the modalities. Experiments on GOT10K and YTVIS2021 show that SMCD outperforms baselines in video quality and grounding accuracy, and that image integration works best when zero-initialized convolution (ZC) is combined with gated cross-attention (GCA). The work points to practical pathways for interactive, multimodal video generation while acknowledging limitations, such as confounding camera motion and the backbone's limitations in generating humans, and proposes future work on camera constraints and higher frame-rate training.

Abstract

In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly focus on employing either semantic cues, like images or depth maps, or motion-based conditions, like moving sketches or object bounding boxes. Semantic inputs offer a rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig 1. To this end, we introduce the Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions, promoting synergy between different modalities. For model training, we separate the conditions for the two modalities, introducing a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.

Paper Structure (34 sections, 9 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Scene motion customized video generation results of our proposed model. Our model accepts an initial frame image, a sequence of bounding boxes, and text as inputs to generate the desired videos that comply with the given constraints. The red arrows in the image indicate the moving directions of the objects.
  • Figure 1: Scene motion customized video generation results of our proposed SMCD model.
  • Figure 2: Model illustration. SMCD handles three control signals: images, bounding box sequences, and text. It builds on a pre-trained T2V model, enriched with an object-gated self-attention layer, an image-gated cross-attention layer, and a zero-initialized convolution layer. These enhancements allow it to adapt to bounding box and image conditions.
  • Figure 2: Qualitative comparison of videos generated by different models. The given caption is: A hippopotamus that is walking.
  • Figure 3: Qualitative comparison of videos generated by different models. The given caption is: A hippopotamus that is walking. Our dual image integration module, labeled ZC+GCA, excels in maintaining the complex semantics of the initial frame while precisely conforming to the motion dynamics defined by the sequence of provided bounding boxes. Additionally, it ensures the retention of high video quality.
  • ...and 2 more figures
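
The gated layers named in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes the gate is the tanh of a learnable scalar initialized to zero, a common choice in gated-attention designs, and it replaces learned query/key/value projections with identity maps for brevity. The function names and toy tokens are hypothetical. The point is that a zero-initialized gate leaves the pretrained backbone's features untouched at the start of training, so the new image condition is introduced gradually.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head dot-product attention.

    Simplification (assumption): queries, keys, and values use identity
    projections instead of learned linear layers.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        )
    return attended


def gated_cross_attention(video_tokens, image_tokens, gate):
    """Residual cross-attention scaled by tanh(gate).

    Assumption: `gate` stands in for a learnable scalar initialized to 0.0,
    so the layer is an identity map before any training.
    """
    attended = cross_attention(video_tokens, image_tokens)
    scale = math.tanh(gate)
    return [
        [v + scale * a for v, a in zip(v_tok, a_tok)]
        for v_tok, a_tok in zip(video_tokens, attended)
    ]


# Toy tokens (hypothetical values, two tokens of dimension 2 each).
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.5, 0.5], [1.0, -1.0]]

unchanged = gated_cross_attention(video_tokens, image_tokens, gate=0.0)
shifted = gated_cross_attention(video_tokens, image_tokens, gate=1.0)
```

With `gate=0.0` the output equals the input video tokens. A nonzero gate mixes in image information. A zero-initialized convolution follows the same idea, with zero-initialized weights contributing nothing to the residual path at initialization.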