3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators

Hsiao-Yu Fish Tung, Zhou Xian, Mihir Prabhudesai, Shamit Lal, Katerina Fragkiadaki

TL;DR

This work introduces 3D-OES, a viewpoint-invariant, object-factorized dynamics framework that predicts 3D scene changes from RGB-D inputs. It combines geometry-aware 2D-to-3D lifting (GRNNs), object-centric 3D feature maps, and a 3D graph neural network to forecast per-object motions. The object feature maps are then warped by the predicted motions to generate long-horizon scene predictions, which a neural renderer decodes into 2D images for visualization. The approach generalizes across varying numbers of objects, appearances, and camera viewpoints, outperforming 2D and 3D baselines and enabling sim-to-real transfer in pushing tasks with model predictive control (MPC). The work also demonstrates interpretable latent 3D simulations and counterfactual visualizations, highlighting practical potential for planning and robotics, while acknowledging limitations such as requiring ground-truth 3D poses during training and assuming rigid objects.

Abstract

We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space, inferred from RGB-D videos. In this 3D feature space, objects do not interfere with one another and their appearance persists over time and across viewpoints. This permits our model to predict scenes far into the future by simply "moving" 3D object features based on cumulative object motion predictions. Object motion predictions are computed by a graph neural network that operates over the object features extracted from the 3D neural scene representation. Our model's simulations can be decoded by a neural renderer into 2D image views from any desired viewpoint, which aids the interpretability of our latent 3D simulation space. We show that our model's predictions generalize well across varying numbers and appearances of interacting objects as well as across camera viewpoints, outperforming existing 2D and 3D dynamics models. We further demonstrate sim-to-real transfer of the learnt dynamics by applying our model, trained solely in simulation, to model-based control for pushing objects to desired locations under clutter on a real robotic setup.
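
The idea of "moving" object features by cumulative motion predictions can be illustrated with a minimal sketch. It assumes each predicted step is a rigid transform (rotation R, translation t) mapping x to R x + t, and it applies the accumulated transform to a single 3D point as a stand-in for warping a 3D feature map. The helper names (`rot_z`, `matmul`, `apply_rigid`, `accumulate`) are hypothetical and not taken from the paper.

```python
import math


def rot_z(theta):
    """Return a 3x3 rotation matrix about the z axis (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_rigid(rot, trans, point):
    """Apply x -> rot @ x + trans to a 3D point."""
    return [sum(rot[i][k] * point[k] for k in range(3)) + trans[i]
            for i in range(3)]


def accumulate(steps):
    """Compose per-step (R, t) predictions into one cumulative (R, t).

    Steps are applied in order, so after adding step (r, t) the cumulative
    map becomes x -> r @ (R_cum @ x + t_cum) + t.
    """
    rot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    trans = [0.0, 0.0, 0.0]
    for r, t in steps:
        rot = matmul(r, rot)
        trans = apply_rigid(r, t, trans)
    return rot, trans


# Two hypothetical per-step predictions: rotate 90 degrees, then translate.
steps = [
    (rot_z(math.pi / 2), [1.0, 0.0, 0.0]),
    (rot_z(math.pi / 2), [0.0, 1.0, 0.0]),
]
cum_rot, cum_trans = accumulate(steps)
moved = apply_rigid(cum_rot, cum_trans, [1.0, 0.0, 0.0])
print([round(v, 6) for v in moved])
```

Applying the single cumulative transform to the initial point gives the same result as applying the steps one after another. In this sketch, that equivalence is what lets a long-horizon prediction be formed by warping the initial object representation once per target time step.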

Paper Structure

This paper contains 20 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: 3D-OES predicts 3D object motion under agent-object and object-object interactions, using a graph neural network over 3D feature maps of detected objects. Node features capture the appearance of an object node and its immediate context, and edge features capture the relative 3D locations of two nodes, so the model is translation invariant. After message passing between nodes, the node and edge features are decoded into future 3D rotations and translations for each object.
  • Figure 2: Forward unrolling of our dynamics model and the graph-XYZ baseline. Left: pushing. Right: falling. The top row shows the (randomly sampled) camera views that we use as input to our model. The second row shows the ground-truth motion of the object from the front view. Rows 3 and 4 show the object motion predicted by our model and by the graph-XYZ baseline from the same front camera viewpoint. Our model matches the ground-truth object motion better than the graph-XYZ baseline, which does not capture object appearance in any way.
  • Figure 3: Neurally rendered simulation videos of counterfactual experiments. The first row shows the ground-truth simulation video from the dataset. Only the first frame of this video is used as input to our model to produce the predicted simulations. The second row shows the ground-truth simulation from a query view. The third row shows the future prediction from our model given the input image. The following rows show the simulation after manipulating an object (in the blue box) according to the instruction in the leftmost column.
  • Figure 4: Collision-free pushing on a real-world setup. The task is to push a mouse to a target location without colliding with any obstacles. Our robot successfully completes the task in 3 push attempts.
  • Figure 5: Neurally rendered simulation videos from three different views. Left: ground-truth simulation videos from the dataset, generated by the Bullet physics simulator. Right: neurally rendered simulation videos from the proposed model. Our model forecasts future latent features by explicitly warping the latent 3D feature maps, and we pass these warped feature maps through the learned 3D-to-2D image decoder to decode them into human-interpretable images. Images can be rendered from arbitrary views and are consistent across views.
  • ...and 3 more figures
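
The Figure 1 caption states that edge features capture relative 3D locations between nodes, which makes the model translation invariant. The property can be checked with a small sketch. It assumes each node is reduced to a 3D centroid and each edge feature is the plain offset between two centroids; the learned node features and message passing of the actual model are not reproduced here, and the function names are hypothetical.

```python
def edge_features(positions):
    """Return the relative 3D offset for every ordered pair of distinct nodes."""
    feats = {}
    for i, p_i in enumerate(positions):
        for j, p_j in enumerate(positions):
            if i != j:
                # Offset from node i to node j; absolute position cancels out.
                feats[(i, j)] = tuple(b - a for a, b in zip(p_i, p_j))
    return feats


def translate(positions, shift):
    """Shift every node position by the same 3D vector."""
    return [tuple(c + s for c, s in zip(p, shift)) for p in positions]


# Hypothetical object centroids and a global shift of the whole scene.
centroids = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (-2.0, 0.25, 1.0)]
shifted = translate(centroids, (10.0, -3.0, 2.0))

original = edge_features(centroids)
moved = edge_features(shifted)
same = all(
    abs(a - b) < 1e-9
    for key in original
    for a, b in zip(original[key], moved[key])
)
print(same)
```

Because only differences between positions enter the edge features, moving the entire scene by a constant offset leaves them unchanged, so a network that sees positions only through such features would produce the same motion predictions for the shifted scene.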