gradSim: Differentiable simulation for system identification and visuomotor control

Krishna Murthy Jatavallabhula, Miles Macklin, Florian Golemo, Vikram Voleti, Linda Petrini, Martin Weiss, Breandan Considine, Jerome Parent-Levesque, Kevin Xie, Kenny Erleben, Liam Paull, Florian Shkurti, Derek Nowrouzezahrai, Sanja Fidler

TL;DR

gradSim tackles the ill-posed problem of inferring physical properties from video by jointly modeling scene dynamics and image formation with differentiable physics and rendering. By backpropagating from pixels through a unified simulator, it enables end-to-end estimation of mass, friction, and elasticity for rigid, deformable, and cloth objects without 3D supervision, and supports visuomotor control using image-space targets. The experiments show accurate parameter identification and effective image-based control, with performance competitive with 3D-supervised baselines, and they highlight smooth loss landscapes conducive to gradient-based optimization. This work offers a scalable, interpretable path toward physics-aware video understanding and vision-guided control, with potential impact on robotics and graphics.

Abstract

We consider the problem of estimating an object's physical properties such as mass, friction, and elasticity directly from video sequences. Such a system identification problem is fundamentally ill-posed due to the loss of information during image formation. Current solutions require precise 3D labels, which are labor-intensive to gather and infeasible to create for many systems such as deformable solids or cloth. We present gradSim, a framework that overcomes the dependence on 3D supervision by leveraging differentiable multiphysics simulation and differentiable rendering to jointly model the evolution of scene dynamics and image formation. This novel combination enables backpropagation from pixels in a video sequence through to the underlying physical attributes that generated them. Moreover, our unified computation graph -- spanning from the dynamics through the rendering process -- enables learning in challenging visuomotor control tasks, without relying on state-based (3D) supervision, while obtaining performance competitive with or better than techniques that rely on precise 3D labels.

Paper Structure

This paper contains 42 sections, 6 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: $\nabla$Sim is a unified differentiable rendering and multiphysics framework that allows solving a range of control and parameter estimation tasks (rigid bodies, deformable solids, and cloth) directly from images/video.
  • Figure 2: $\nabla$Sim: Given video observations of an evolving physical system (e), we randomly initialize scene object properties (a) and evolve them over time using a differentiable physics engine (b), which generates states. Our renderer (c) processes states, object vertices and global rendering parameters to produce image frames for computing our loss. We backprop through this computation graph to estimate physical attributes and controls. Existing methods rely solely on differentiable physics engines and require supervision in state-space (f), while $\nabla$Sim only needs image-space supervision (g).
  • Figure 3: Parameter Estimation: For deformable experiments, we optimize the material properties of a beam to match a video of a beam hanging under gravity. In the rigid experiments, we estimate contact parameters (elasticity/friction) and object density to match a video (GT). We visualize entire time sequences (t) with color-coded blends.
  • Figure 4: Loss landscapes when optimizing for physical attributes using $\nabla$Sim. (Left) When estimating the mass of a rigid body with known shape using $\nabla$Sim, the loss landscape is remarkably smooth for a range of initialization errors, despite images being formed by a highly nonlinear process (simulation). (Right) Loss landscape when optimizing for the elasticity parameters of a deformable FEM solid. Both Lamé parameters $\lambda$ and $\mu$ are set to $1000$, where the MSE loss has a unique, dominant minimum. Note that, for fair comparison, the ground truth for our PyBullet+REINFORCE baseline was generated using the PyBullet engine.
  • Figure 5: Visuomotor Control: $\nabla$Sim provides gradients suitable for diverse, complex visuomotor control tasks. For control-fem and control-walker experiments, we train a neural network to actuate a soft body towards a target image (GT). For control-cloth, we optimize the cloth's initial velocity to hit a target (GT) (specified as an image), under nonlinear lift/drag forces.
  • ...and 8 more figures
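
The pipeline described in the Figure 2 caption (initialize a physical parameter, simulate, render, compare against video frames, and follow the gradient back to the parameter) can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration and is not the gradSim API: a 1D point mass pushed by a known constant force stands in for the physics engine, a Gaussian blob sampled on a 1D pixel grid stands in for the renderer, and the gradient of the image-space MSE with respect to mass is carried forward by hand with the chain rule rather than by reverse-mode automatic differentiation. All names (`simulate`, `render`, `loss_and_grad`, `estimate_mass`) and constants are hypothetical.

```python
import math

# Hypothetical toy setup (not from the paper): known force, time step, and camera.
DT, FORCE, STEPS = 0.1, 2.0, 10       # time step, applied force, number of frames
WIDTH, EXTENT, SIGMA = 32, 2.0, 0.2   # pixels per frame, world extent, blob width
X0 = 0.5                              # known initial position
PIXELS = [i * EXTENT / (WIDTH - 1) for i in range(WIDTH)]


def simulate(mass):
    """Semi-implicit Euler; returns positions x_k and sensitivities dx_k/dmass."""
    x, v, dx, dv = X0, 0.0, 0.0, 0.0
    xs, dxs = [], []
    for _ in range(STEPS):
        v += DT * FORCE / mass
        dv -= DT * FORCE / mass ** 2  # derivative of the velocity update w.r.t. mass
        x += DT * v
        dx += DT * dv
        xs.append(x)
        dxs.append(dx)
    return xs, dxs


def render(x):
    """Soft 1D 'renderer': a Gaussian blob centred at x, sampled at pixel centres."""
    return [math.exp(-((p - x) ** 2) / (2 * SIGMA ** 2)) for p in PIXELS]


def loss_and_grad(mass, target_frames):
    """Image-space MSE over all frames and its derivative w.r.t. mass."""
    xs, dxs = simulate(mass)
    n = STEPS * WIDTH
    loss = grad = 0.0
    for x, dx, target in zip(xs, dxs, target_frames):
        for p, pixel, ref in zip(PIXELS, render(x), target):
            residual = pixel - ref
            dpixel_dx = pixel * (p - x) / SIGMA ** 2  # derivative of the blob w.r.t. x
            loss += residual ** 2 / n
            grad += 2.0 * residual * dpixel_dx * dx / n  # pixels -> position -> mass
    return loss, grad


def estimate_mass(target_frames, mass=1.0, lr=2.0, iters=200):
    """Plain gradient descent on mass using only image-space supervision."""
    for _ in range(iters):
        _, grad = loss_and_grad(mass, target_frames)
        mass -= lr * grad
    return mass


TRUE_MASS = 2.0
# The "video" is produced by the same toy simulator and renderer at the true mass.
target_frames = [render(x) for x in simulate(TRUE_MASS)[0]]
estimated = estimate_mass(target_frames)
```

In this toy the optimizer never sees the position of the object, only pixel intensities, which mirrors the image-space supervision the captions contrast with state-space supervision. A constant applied force is used instead of gravity because mass does not affect the trajectory of a free-falling point, so it would not be identifiable from the frames.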