Motion Dreamer: Boundary Conditional Motion Reasoning for Physically Coherent Video Generation

Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Yuying Chen, Lihui Jiang, Bingbing Liu, Yingcong Chen

TL;DR

Motion Dreamer tackles boundary conditional motion reasoning by explicitly separating motion inference from visual synthesis. It introduces instance flow to translate partial user cues into dense, physically coherent motion fields and employs a motion inpainting strategy to infer missing dynamics, guided by a two-stage diffusion+decoder pipeline built on CogVideoX. Evaluations on Physion and a large driving dataset show superior motion coherence and realism compared with state-of-the-art methods, with ablations validating the value of intermediate motion representations and motion-enhancement losses. The approach advances practical boundary-conditioned video generation for autonomous driving and embodied AI, with code and data forthcoming.

Abstract

Recent advances in video generation have shown promise for generating future scenarios, which is critical for planning and control in autonomous driving and embodied intelligence. However, real-world applications demand more than visually plausible predictions; they require reasoning about object motions based on explicitly defined boundary conditions, such as an initial scene image and partial object motion. We term this capability Boundary Conditional Motion Reasoning. Current approaches either neglect explicit user-defined motion constraints, producing physically inconsistent motions, or demand complete motion inputs, which are rarely available in practice. Here we introduce Motion Dreamer, a two-stage framework that addresses these limitations by explicitly separating motion reasoning from visual synthesis. Our approach introduces instance flow, a sparse-to-dense motion representation that enables effective integration of partial user-defined motions, and a motion inpainting strategy that enables robust reasoning about the motions of other objects. Extensive experiments demonstrate that Motion Dreamer significantly outperforms existing methods, achieving superior motion plausibility and visual realism, thus bridging the gap towards practical boundary conditional motion reasoning. Our webpage is available at https://envision-research.github.io/MotionDreamer/.
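
To make the sparse-to-dense idea concrete, the toy sketch below spreads one user-given motion vector per object over that object's pixels and flags every other pixel as unknown. It is only an illustration under strong assumptions, not the authors' implementation: the function name `broadcast_instance_motion`, the integer instance map, and the `known` mask are all hypothetical, and in the actual method the unknown motions are inferred by a learned model rather than left at zero.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float]  # (dx, dy) displacement in pixels


def broadcast_instance_motion(
    instance_ids: List[List[int]],
    user_motion: Dict[int, Vec],
) -> Tuple[List[List[Vec]], List[List[bool]]]:
    """Spread one user-given vector per instance over that instance's pixels.

    Hypothetical helper for illustration only.
    Returns (flow, known):
      flow[y][x]  is the (dx, dy) vector assigned to the pixel,
      known[y][x] is True only where the vector came from a user cue.
    Pixels with known == False are the ones a motion-reasoning model
    would still have to fill in (the "inpainting" part).
    """
    flow: List[List[Vec]] = []
    known: List[List[bool]] = []
    for row in instance_ids:
        flow_row: List[Vec] = []
        known_row: List[bool] = []
        for inst in row:
            if inst in user_motion:
                flow_row.append(user_motion[inst])
                known_row.append(True)
            else:
                # Placeholder value; the real motion here is unknown.
                flow_row.append((0.0, 0.0))
                known_row.append(False)
        flow.append(flow_row)
        known.append(known_row)
    return flow, known


# Toy 3x4 scene: 0 = background, 1 and 2 = object instances.
instance_ids = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 0, 2],
]
# The user only specifies motion for instance 1 (push it to the right).
user_motion = {1: (2.0, 0.0)}
flow, known = broadcast_instance_motion(instance_ids, user_motion)
```

Here instance 2 and the background receive no user cue, so their entries in `known` stay `False`; this is the partial boundary condition that the abstract describes as the input to motion reasoning.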

Paper Structure

This paper contains 22 sections, 16 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Given the user's input motion and the first frame, our framework, Motion Dreamer, generates motion-coherent future frames. (a) and (b) show that different degrees of input motion result in different object contact times and different carried momentum. (c) demonstrates that assigning arrows with different directions to different blocks produces different domino collapse outcomes. (d) shows an autonomous driving case, where the yellow arrow indicates the camera motion. Given the rightward arrow, the white car gradually leans towards the right.
  • Figure 2: Overview of the Motion Dreamer pipeline. The "Instance Flows (Full)" block shown in the figure comprises our proposed instance flow along with several intermediate motion representations, such as segmentation maps. The symbol $\mathcal{L}$ denotes the loss function computed between the predicted instance flow and the ground-truth instance flow. The color of each flow arrow indicates the direction of the flow.
  • Figure 3: Comparisons with state-of-the-art video editing approaches on the Physion [bear2021physion] dataset. One-stage$^*$ refers to the simplified one-stage version of our method. Our model generates physically coherent results.
  • Figure 4: Illustration of reasoning-based motion generation in a driving scenario. We control the forward and backward movements of the lead car in various cases. Compared with MOFA-Video [niu2024mofa], our model produces more realistic and reasonable outcomes. Flow$^*$ in MOFA-Video represents the optical flow generated by the sparse-to-dense module, while in our work it denotes Instance Flow. (Zoom in for optimal viewing.)
  • Figure 5: An example of the interactive driving data we collected from YouTube.
  • ...and 2 more figures