Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance

Sicong Feng, Jielong Yang, Li Peng

TL;DR

The paper addresses the data and consistency challenges in text-to-video generation by proposing a mask-guided framework that uses foreground mask motion sequences to steer diffusion-based video synthesis. It introduces a mask-guided attention mechanism and a mask cross-attention module, along with first-frame conditioning and a shared-noise strategy, to enable autoregressive long-video generation from limited data. Training relies on a latent diffusion model with a first-frame conditional loss $L = \mathbb{E}_{x,\epsilon,t,c_p}\left[\lVert \epsilon_{2:n} - \epsilon^{\theta}_{2:n}(x_t, t, c_p) \rVert_2^2\right]$, in which the noise-prediction error is taken only over frames 2 to $n$. At inference, ControlNet produces the initial frame and mask-driven generation produces the subsequent frames. Across qualitative and quantitative evaluations, the approach outperforms baselines in text alignment and frame consistency, demonstrating practical gains for video editing and artistic generation in data-constrained settings.
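The first-frame conditional loss above can be illustrated with a minimal NumPy sketch. It assumes noise tensors shaped (batch, frames, channels, height, width) with the conditioning first frame at index 0. The function name and tensor layout are illustrative assumptions rather than the authors' code, and the expectation is approximated by a mean over the batch.

```python
import numpy as np


def first_frame_conditional_loss(eps_true, eps_pred):
    """Squared L2 noise-prediction error over frames 2..n, averaged over the batch.

    Both arrays are assumed to have shape (batch, frames, channels, height, width).
    Frame index 0 is the conditioning first frame and is excluded from the loss.
    """
    if eps_true.shape != eps_pred.shape:
        raise ValueError("noise tensors must have the same shape")
    if eps_true.ndim != 5 or eps_true.shape[1] < 2:
        raise ValueError("expected 5-D tensors with at least two frames")
    # Drop the first frame: only frames 2..n contribute to the loss.
    diff = eps_true[:, 1:] - eps_pred[:, 1:]
    # Squared L2 norm per sample, then a batch mean as the Monte Carlo expectation.
    per_sample = np.sum(diff ** 2, axis=(1, 2, 3, 4))
    return float(np.mean(per_sample))
```

Because the first frame is sliced away before the difference is taken, any prediction error on that frame leaves the loss unchanged, which is the behaviour the $2{:}n$ subscript describes.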

Abstract

Recent advances in diffusion models bring new vitality to visual content creation. However, current text-to-video generation models still face significant challenges such as high training costs, substantial data requirements, and difficulties in maintaining consistency between given text and motion of the foreground object. To address these challenges, we propose mask-guided video generation, which can control video generation through mask motion sequences, while requiring limited training data. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. Through mask motion sequences, we guide the video generation process to maintain consistent foreground objects throughout the sequence. Additionally, through a first-frame sharing strategy and autoregressive extension approach, we achieve more stable and longer video generation. Extensive qualitative and quantitative experiments demonstrate that this approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality. Our generated results can be viewed in the supplementary materials.
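The autoregressive extension mentioned above can be sketched as a simple chaining loop. This is a hedged illustration, not the authors' implementation: `extend_autoregressively` and the `generate_clip` callable are hypothetical names, and the sketch assumes each round returns one new frame per mask and that the last frame of a round conditions the next round. A real `generate_clip` would wrap the diffusion sampler.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")


def extend_autoregressively(
    first_frame: Frame,
    masks: Sequence[Mask],
    clip_len: int,
    generate_clip: Callable[[Frame, Sequence[Mask]], List[Frame]],
) -> List[Frame]:
    """Generate a long video by chaining fixed-length clips.

    generate_clip (hypothetical) takes a conditioning frame and a chunk of
    masks, and returns one newly generated frame per mask.
    """
    if clip_len < 1:
        raise ValueError("clip_len must be positive")
    video: List[Frame] = []
    anchor = first_frame
    for start in range(0, len(masks), clip_len):
        chunk = masks[start:start + clip_len]
        clip = generate_clip(anchor, chunk)
        if len(clip) != len(chunk):
            raise ValueError("generate_clip must return one frame per mask")
        video.extend(clip)
        # Assumption: the last generated frame conditions the next round.
        anchor = clip[-1]
    return video
```

With 24 masks and `clip_len=8`, the loop runs three rounds and returns 24 frames. The value 8 is an illustrative choice only; the source does not state a per-round clip length.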

Paper Structure

This paper contains 11 sections, 8 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our model generates various videos consistent with the foreground mask and text prompts, delivering satisfactory results.
  • Figure 2: In LAMP [wu2024lamp], the model fails to capture the horse's movement direction and to separate foreground from background properly. Our model addresses these issues by accurately tracking motion and maintaining clear foreground-background separation.
  • Figure 3: Overall framework of our mask-guided video generation method. We apply trainable temporal-spatial self-attention and mask cross-attention within the U-Net, enabling the model to focus more on the foreground.
  • Figure 4: Auto-Regressive Generation. Our model is capable of generating long videos. Given the text prompt "A horse runs across a flat desert plain under a midday sun in a pop art painting style," the video frames are generated using a first-frame-based method, producing a 24-frame video after three epochs.
  • Figure 5: Comparison between our method and baselines.
  • ...and 1 more figure