
Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics

Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Ying-cong Chen

TL;DR

Video generation must satisfy a fundamental trilemma: high visual quality, rigorous physical consistency, and precise controllability. This paper introduces Motion Forcing, a framework designed to keep that trilemma stable even in complex generative tasks.

Abstract

The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce Motion Forcing, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical "Point-Shape-Appearance" paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (Point), expanding them into dynamic depth maps that explicitly resolve 3D geometry (Shape), and finally rendering high-fidelity textures (Appearance). Furthermore, to foster robust physical understanding, we employ a Masked Point Recovery strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics further confirm our framework's generality.
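
To make the Masked Point Recovery idea concrete, the following is a minimal sketch of how sparse anchor trajectories might be masked before being fed to a model. It is not the authors' implementation: the function name `mask_anchors`, the two ratio parameters, the choice to hide the same frames for every anchor, and the choice to always keep frame 0 visible are all assumptions made for illustration. The abstract only states that input anchors are randomly masked during training, and the Figure 2 caption mentions spatial and temporal masking.

```python
import random


def mask_anchors(trajectories, spatial_ratio=0.5, temporal_ratio=0.3, seed=None):
    """Randomly hide anchor points (hypothetical helper, not from the paper).

    trajectories: list of tracks, one per anchor; each track is a list of
                  (x, y) positions, one per frame. All tracks share a length.
    Returns (masked, visible): `masked` mirrors the input with hidden points
    replaced by None; `visible` is the matching grid of booleans.
    """
    rng = random.Random(seed)
    num_tracks = len(trajectories)
    num_frames = len(trajectories[0]) if trajectories else 0

    # Spatial masking (assumed form): drop a fraction of anchors entirely.
    num_dropped = int(spatial_ratio * num_tracks)
    dropped_tracks = set(rng.sample(range(num_tracks), num_dropped))

    # Temporal masking (assumed form): hide a fraction of the later frames.
    # Frame 0 is never hidden here, on the assumption that the first frame
    # is always available as conditioning.
    later_frames = list(range(1, num_frames))
    num_hidden = int(temporal_ratio * len(later_frames))
    hidden_frames = set(rng.sample(later_frames, num_hidden))

    masked, visible = [], []
    for track_index, track in enumerate(trajectories):
        row_points, row_flags = [], []
        for frame_index, point in enumerate(track):
            keep = (track_index not in dropped_tracks
                    and frame_index not in hidden_frames)
            row_points.append(point if keep else None)
            row_flags.append(keep)
        masked.append(row_points)
        visible.append(row_flags)
    return masked, visible
```

In a training loop, the masked anchors would be the model input while the reconstruction target stays the complete dynamic depth sequence, so the model has to infer the hidden motion rather than copy it from the input.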

Paper Structure (26 sections, 6 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Given user-specified motion and the first frame, our framework, Motion Forcing, generates motion-coherent future frames. (a-d) showcase the model's ability to generate reactive ego-trajectories that respond dynamically to diverse dangerous scenarios initiated by other vehicles. (e) showcases a physics scene where different actions lead to different collision results. (f) illustrates embodied AI, where different directional inputs make a robotic hand move an object in the corresponding directions. Full videos are available at the project repository.
  • Figure 2: Overview of the Motion Forcing framework. (a) Preparation of Motion Representations: Input control signals are processed and subjected to spatial and temporal masking before being fed into the model. (b) Training: The model is trained by randomly sampling between two independent stages: "Point $\to$ Shape" (depth generation) and "Shape $\to$ Appearance" (RGB rendering). (c) Inference: Generation follows the complete two-stage hierarchical process, first mapping sparse points to structural depth (Shape) and then mapping that depth to the target RGB video (Appearance).
  • Figure 3: Comparison with state-of-the-art models in synthesizing cut-in and evasive driving scenarios. The leading closed-source models (Seed Dance2.0, Wan2.6) rely on text prompts, and trajectory-controllable baselines like MOFA-Video require object trajectories coupled with scene trajectories to simulate camera movement. In contrast, our method, Motion Forcing, captures these complex dynamics given only the initial ego and object motion trajectories.
  • Figure 4: Validation of physical capabilities. ✓ indicates a physically coherent result. MOFA-Video is fine-tuned on our exact dataset for fairness. As the control commands become more complex, only our method maintains stable and physically coherent generation.
  • Figure 5: Qualitative comparison of ego motion control. Compared to the baseline AdaLN approach, our Depth Warping-based method demonstrates significantly superior accuracy and flexibility in controlling both the direction and speed.
  • ...and 1 more figure