ContactGaussian-WM: Learning Physics-Grounded World Model from Videos

Meizhong Wang, Wanxin Jin, Kun Cao, Lihua Xie, Yiguang Hong

TL;DR

ContactGaussian-WM tackles learning physics-grounded world models from sparse, contact-rich video data to support planning and simulation in robotics. It introduces a unified Gaussian representation for both geometry and appearance and enables end-to-end differentiable learning by differentiating through a closed-form physics engine, using Stage I SG-GS initialization and Stage II Phys-Geo refinement. The paper shows strong generalization in simulation and real-world tests, outperforming data-driven and prior physics-based baselines, and demonstrates practical use in data synthesis and real-time MPC. This work advances robust sim-to-real transfer and long-horizon prediction in contact-rich environments.

Abstract

Developing world models that understand complex physical interactions is essential for advancing robotic planning and simulation. However, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamic motion. To address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich video sequences. Our framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual observations. Extensive simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization capabilities. Furthermore, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.
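To make the idea of inferring a physical property by differentiating through closed-form dynamics concrete, here is a minimal sketch. It is a toy illustration only, not the paper's physics engine or training code: it assumes a single block sliding on a flat surface under Coulomb friction, observes its position at three sparse times, and recovers the friction coefficient by gradient descent using a hand-derived derivative. All names (`slide_position`, `fit_friction`) and all numeric values are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def slide_position(mu, v0, t):
    """Closed-form position of a sliding block and its derivative w.r.t. mu.

    Toy model (an assumption, not the paper's engine): the block starts at
    x = 0 with speed v0 and decelerates at mu * G until it comes to rest.
    """
    t_stop = v0 / (mu * G)
    if t < t_stop:
        # Still sliding: x = v0*t - 0.5*mu*G*t^2
        return v0 * t - 0.5 * mu * G * t * t, -0.5 * G * t * t
    # At rest: x = v0^2 / (2*mu*G)
    return v0 * v0 / (2.0 * mu * G), -v0 * v0 / (2.0 * mu * mu * G)


def fit_friction(times, observed, v0, mu_init=0.1, lr=0.2, steps=100):
    """Gradient descent on the squared position error with respect to mu."""
    mu = mu_init
    for _ in range(steps):
        grad = 0.0
        for t, x_obs in zip(times, observed):
            x_pred, dx_dmu = slide_position(mu, v0, t)
            # Chain rule: d/dmu (x_pred - x_obs)^2 = 2 * residual * dx/dmu
            grad += 2.0 * (x_pred - x_obs) * dx_dmu
        mu -= lr * grad
    return mu


# Sparse synthetic observations generated with a known friction coefficient.
TIMES = [0.1, 0.3, 0.5]
MU_TRUE, V0 = 0.3, 2.0
OBSERVED = [slide_position(MU_TRUE, V0, t)[0] for t in TIMES]

mu_hat = fit_friction(TIMES, OBSERVED, V0)
print(f"estimated friction coefficient: {mu_hat:.4f}")
```

In this sketch the observations are positions rather than images. The framework described in the abstract additionally passes the predicted state through a differentiable renderer, so that the loss can be computed against video frames.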

Paper Structure

This paper contains 31 sections, 14 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Overview of ContactGaussian-WM. We first initialize a scene with a unified spherical Gaussian representation, then jointly refine physics and geometry: given the current physical state and action, the differentiable collision detector uses Gaussian geometry to compute contact points, which are fed into the complementarity-free contact dynamics model to compute the next physical state, and the 3DGS renderer generates the next image. The pipeline is fully differentiable for end-to-end learning.
  • Figure 2: PSNR curves during training for Fall-and-rebound (left) and Push-slide-settle (right). DreamerV3 reports closed-loop one-step prediction, while the other methods report open-loop cumulative error.
  • Figure 3: Our real-world experimental setup.
  • Figure 4: Visualization of long-horizon predictions in real-world experiments. Our method (right) matches the ground-truth observation (left) more closely than the variant without optimization (middle).
  • Figure 5: Time-series visualization of applying MPC on the LEAP Hand to reorient the rubber duck in MuJoCo.
  • ...and 5 more figures
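
The Figure 2 caption distinguishes closed-loop one-step prediction from open-loop cumulative error. The sketch below illustrates why the two are not directly comparable. It is a toy setup, not the paper's evaluation code: it assumes a one-dimensional state, a slightly biased model, and a hypothetical 16-pixel renderer that draws a Gaussian blob at the state position. In the closed-loop case the model restarts from the ground-truth state at every step, so its error stays bounded; in the open-loop case it feeds its own predictions back in, so the error accumulates and PSNR drops over the horizon.

```python
import math


def render(x, width=16, sigma=1.5):
    """Hypothetical renderer: a 1D image with a Gaussian blob centered at x."""
    return [math.exp(-0.5 * ((i - x) / sigma) ** 2) for i in range(width)]


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def true_step(x):
    """Ground-truth dynamics (toy): the state advances by 1.0 per step."""
    return x + 1.0


def model_step(x):
    """Learned model with a deliberate bias: it advances by only 0.8 per step."""
    return x + 0.8


def evaluate(x0=2.0, horizon=8):
    """Return per-step PSNR for closed-loop and open-loop prediction."""
    closed, open_loop = [], []
    x_true, x_open = x0, x0
    for _ in range(horizon):
        x_closed = model_step(x_true)  # closed loop: restart from ground truth
        x_open = model_step(x_open)    # open loop: feed back own prediction
        x_true = true_step(x_true)
        target = render(x_true)
        closed.append(psnr(render(x_closed), target))
        open_loop.append(psnr(render(x_open), target))
    return closed, open_loop


closed_psnr, open_psnr = evaluate()
for step, (c, o) in enumerate(zip(closed_psnr, open_psnr), start=1):
    print(f"step {step}: closed-loop {c:.1f} dB, open-loop {o:.1f} dB")
```

At the first step both protocols start from the same state and report the same PSNR; the gap opens only as the rollout continues, which is why a one-step closed-loop curve tends to look more favorable than an open-loop curve for the same model.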