Learning to Generate Rigid Body Interactions with Video Diffusion Models

David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev

TL;DR

This work addresses the challenge of generating physically plausible multi-object interactions in videos by introducing KineMask, a two-stage motion-control framework that learns object-level dynamics from initial conditions. It combines low-level velocity-mask conditioning via ControlNet with high-level textual prompts; the model is trained on synthetic data and evaluated on real scenes to demonstrate generalization. The approach yields strong improvements in object interactions, emergent causal effects, and motion fidelity over baselines of similar size, with ablations validating the complementary roles of data, two-stage training, and text conditioning. The work advances world-modeling capabilities for robotics and embodied decision making by providing controllable, physically aware video synthesis and a path toward richer multimodal scene understanding.

Abstract

Recent video generation models have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack object-level control mechanisms. To address these limitations, we introduce KineMask, an approach for video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy, we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements in object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predicted scene descriptions, which supports the synthesis of complex dynamical phenomena. Our experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available. Project Page: https://daromog.github.io/KineMask/

Paper Structure

This paper contains 56 sections, 7 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: KineMask results. We enable object-based control with a novel training strategy. Paired with synthetic data constructed for the task, KineMask enables pretrained diffusion models to synthesize realistic rigid body interactions in real-world input scenes.
  • Figure 2: KineMask pipeline. We represent our low-level control signal as a mask encoding the velocity of the moving objects, and use it to train a ControlNet (left) in two stages on Blender-generated videos of objects in motion. In the first stage, we train with all frames, whereas in the second stage, we randomly drop part of the final frames. We also provide a high-level textual control extracted by a VLM. At inference (right), we construct the low-level conditioning with SAM and use GPT to infer high-level outcomes of object motion from a single frame.
  • Figure 3: Qualitative comparison with CogVideoX. While CogVideoX often suffers from several failure modes, such as hallucinations and incorrect motions, KineMask follows the target motion and generates realistic object interactions. In detail, we improve object interactions in collisions (top row), show causal effects of object motion (bottom left), and move multiple objects (bottom right).
  • Figure 4: User study. We outperform baselines by a wide margin on motion fidelity, interaction quality, and overall physical consistency.
  • Figure 5: Degrees of freedom. We show control of different aspects of KineMask outputs. We can choose different directions (left), speed (middle), and objects to move (right), opening potential for world modeling.
  • ...and 12 more figures
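
The conditioning scheme described in the Figure 2 caption (a mask that encodes object velocity, with the final frames randomly dropped in the second training stage) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `velocity_mask` and `drop_final_frames`, the two-channel `(vx, vy)` per-pixel encoding, and the choice to zero out dropped frames are all assumptions made for illustration only.

```python
import random


def velocity_mask(masks, velocity):
    """Hypothetical encoding: write (vx, vy) at every object pixel.

    masks:    list of T frames, each a list of rows of 0/1 values.
    velocity: (vx, vy) tuple for the moving object (assumed units).
    Returns a T x H x W nested list of (vx, vy) tuples, zero off-object.
    """
    vx, vy = velocity
    return [
        [[(vx, vy) if px else (0.0, 0.0) for px in row] for row in frame]
        for frame in masks
    ]


def drop_final_frames(cond, rng, min_keep=1):
    """Assumed stage-two augmentation: blank a random tail of frames.

    Keeps the first `keep` frames (min_keep <= keep <= T) and replaces
    the remaining frames with all-zero conditioning.
    """
    total = len(cond)
    keep = rng.randint(min_keep, total)  # randint is inclusive on both ends
    dropped = [
        frame if t < keep else [[(0.0, 0.0) for _ in row] for row in frame]
        for t, frame in enumerate(cond)
    ]
    return dropped, keep


# Toy example: a one-pixel object moving right across a 2x3 frame.
masks = [
    [[1, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0]],
]
cond = velocity_mask(masks, (1.5, 0.0))
dropped, keep = drop_final_frames(cond, random.Random(0))
```

In this sketch, stage one would correspond to feeding `cond` unchanged, and stage two to feeding `dropped`; setting `min_keep=1` always preserves the first-frame mask, which matches the inference setting of a single input image.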