Training-free Motion Factorization for Compositional Video Generation

Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu, Changsheng Li, Yinjie Lei

TL;DR

A training-free motion factorization framework that decomposes complex motion into three primary categories (motionlessness, rigid motion, and non-rigid motion) and that alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions.

Abstract

Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, a task that is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics and neglect the diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic and can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.
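To make the planning stage more concrete, the following is a minimal sketch of how a prompt could be held as a motion graph and turned into per-instance bounding box sequences. It is not the authors' implementation: the names `Instance`, `MotionGraph`, `plan_boxes`, and `plan_layout` are hypothetical, the start and end boxes are assumed to be given, and simple linear interpolation stands in for the motion-law reasoning that the abstract does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Box as (x_min, y_min, x_max, y_max), normalized to [0, 1]. Assumed convention.
Box = Tuple[float, float, float, float]
MOTION_CATEGORIES = ("motionless", "rigid", "non_rigid")


@dataclass
class Instance:
    """Hypothetical graph node: one instance with its motion category."""
    name: str
    category: str
    start_box: Box
    end_box: Box


@dataclass
class MotionGraph:
    """Hypothetical structured representation of instances and interactions."""
    instances: Dict[str, Instance] = field(default_factory=dict)
    interactions: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_instance(self, inst: Instance) -> None:
        if inst.category not in MOTION_CATEGORIES:
            raise ValueError(f"unknown motion category: {inst.category}")
        self.instances[inst.name] = inst

    def add_interaction(self, subject: str, relation: str, obj: str) -> None:
        if subject not in self.instances or obj not in self.instances:
            raise KeyError("both endpoints must be added as instances first")
        self.interactions.append((subject, relation, obj))


def _lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t


def plan_boxes(inst: Instance, num_frames: int) -> List[Box]:
    """Return one box per frame (linear interpolation is an assumption)."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    boxes: List[Box] = []
    for frame in range(num_frames):
        t = frame / (num_frames - 1) if num_frames > 1 else 0.0
        if inst.category == "motionless":
            # Position and shape stay fixed in every frame.
            box = inst.start_box
        elif inst.category == "rigid":
            # Only the centre moves; width and height are preserved.
            x0, y0, x1, y1 = inst.start_box
            ex0, ey0, ex1, ey1 = inst.end_box
            w, h = x1 - x0, y1 - y0
            cx = _lerp((x0 + x1) / 2, (ex0 + ex1) / 2, t)
            cy = _lerp((y0 + y1) / 2, (ey0 + ey1) / 2, t)
            box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        else:
            # Non-rigid: position and shape may both change.
            box = tuple(_lerp(s, e, t) for s, e in zip(inst.start_box, inst.end_box))
        boxes.append(box)
    return boxes


def plan_layout(graph: MotionGraph, num_frames: int) -> Dict[str, List[Box]]:
    """Compose all per-instance box sequences into one layout."""
    return {name: plan_boxes(inst, num_frames) for name, inst in graph.instances.items()}


graph = MotionGraph()
graph.add_instance(Instance("tree", "motionless", (0.7, 0.1, 0.9, 0.8), (0.7, 0.1, 0.9, 0.8)))
graph.add_instance(Instance("car", "rigid", (0.0, 0.4, 0.2, 0.6), (0.8, 0.4, 1.0, 0.6)))
graph.add_instance(Instance("balloon", "non_rigid", (0.4, 0.1, 0.5, 0.2), (0.3, 0.0, 0.6, 0.3)))
graph.add_interaction("car", "drives past", "tree")
layout = plan_layout(graph, num_frames=5)
```

In this sketch the three categories differ only in what is allowed to change across frames: nothing for motionless instances, position only for rigid ones, and both position and shape for non-rigid ones. The resulting `layout` plays the role of the spatial-temporal layout that conditions generation.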

Paper Structure (13 sections, 16 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of our motion factorization framework. First, for each instance belonging to a particular motion category, our framework infers its per-frame changes in shape and position from a structured motion graph (Sec. \ref{cmpa}). Second, conditioned on the motion category, dedicated guidance branches synthesize per-instance motions, which are subsequently composed into a coherent scene (Sec. \ref{cdmg}).
  • Figure 2: Overview of our Structured Motion Reasoning (SMR) module (Sec. \ref{cmpa}). (a) Given a user prompt, we organize it into a motion graph describing instances and their interactions. (b) For each instance, conditioned on its motion category, we infer a bounding box sequence from graph-derived motion cues. All bounding box sequences are then composed into a coherent spatial-temporal layout.
  • Figure 3: Overview of our Disentangled Motion Guidance (DMG) module (Sec. \ref{cdmg}). (a) For motionless instances, we enforce that each frame interacts only with a designated anchor frame. (b) For rigidly moving instances, we restrict the cross-frame interactions of a foreground to shape-aligned regions. (c) For instances undergoing non-rigid movements, we minimize pixel-wise discrepancies between perceptual deformations and box-induced deformations.
  • Figure 4: Visual comparisons under diverse motion categories, including motionlessness (top row), rigid motion (middle row), and non-rigid motion (bottom row). For the 3D U-Net architecture, we compare our framework with the baseline VideoCrafter-v2.0 \cite{chen2024videocrafter2} and the compositional approach A&R \cite{phung2024grounded}. For the DiT architecture, we compare our framework with the baseline CogVideoX-2B \cite{yang2024cogvideox} and the compositional approach R&P \cite{chen2024training}. Our framework yields improved cross-frame consistency and motion fidelity across various scenarios.
  • Figure 5: Failure cases. Both the baseline model and our framework struggle to generate (a) rare semantics and (b) emotional cues.
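The anchor-frame constraint described for motionless instances in Figure 3(a) can be illustrated with a frame-level attention mask. This is only a sketch under stated assumptions: the function names `anchor_frame_mask` and `masked_softmax` are hypothetical, the mask is built per frame rather than per token, and how the actual module applies such a restriction inside a 3D U-Net or DiT backbone is not specified in the text above.

```python
import math
from typing import List


def anchor_frame_mask(num_frames: int, anchor: int = 0) -> List[List[bool]]:
    """mask[q][k] is True only when key frame k is the anchor frame.

    Every query frame q is therefore allowed to attend to the anchor alone.
    """
    if not 0 <= anchor < num_frames:
        raise ValueError("anchor must be a valid frame index")
    return [[key == anchor for key in range(num_frames)] for _ in range(num_frames)]


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    """Softmax over the allowed positions; masked positions get weight 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must allow at least one position")
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cross-frame attention scores for query frame 2 of a 4-frame clip.
mask = anchor_frame_mask(num_frames=4, anchor=1)
weights = masked_softmax([0.3, 1.2, -0.5, 2.0], mask[2])
```

With a single anchor, all attention weight for a motionless region collapses onto that frame, which is one simple way to read "each frame interacts only with a designated anchor frame": appearance is copied from one reference instead of drifting across frames.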