Scaling Face Interaction Graph Networks to Real World Scenes

Tatiana Lopez-Guevara, Yulia Rubanova, William F. Whitney, Tobias Pfaff, Kimberly Stachenfeld, Kelsey R. Allen

TL;DR

The paper tackles the difficulty of scaling learned graph-based rigid-body simulators to complex real-world scenes and perceptual inputs. It introduces FIGNet*, a memory-efficient variant of FIGNet that removes surface mesh edges, enabling training on more intricate geometries while preserving trajectory accuracy. A perception bridge using Neural Radiance Fields (NeRF) is built to convert real scenes into meshes and to render FIGNet* rollouts by editing NeRF volumes with predicted object transforms. On synthetic Kubric datasets, FIGNet* achieves major memory and speed gains with similar accuracy, and qualitative real-world demonstrations show plausible dynamics when driven by perception-derived geometry. This work widens the applicability of learned simulators to perception-only inference, with potential impact on robotics, graphics, and design workflows.

Abstract

Accurately simulating real-world object dynamics is essential for various applications such as robotics, engineering, graphics, and design. To better capture complex real dynamics such as contact and friction, learned simulators based on graph networks have recently shown great promise. However, applying these learned simulators to real scenes comes with two major challenges: first, scaling learned simulators to handle the complexity of real-world scenes, which can involve hundreds of objects, each with complicated 3D shapes, and second, handling inputs from perception rather than 3D state information. Here we introduce a method which substantially reduces the memory required to run graph-based learned simulators. Based on this memory-efficient simulation model, we then present a perceptual interface in the form of editable NeRFs which can convert real-world scenes into a structured representation that can be processed by a graph network simulator. We show that our method uses substantially less memory than previous graph-based simulators while retaining their accuracy, and that the simulators learned in synthetic environments can be applied to real-world scenes captured from multiple camera angles. This paves the way for expanding the application of learned simulators to settings where only perceptual information is available at inference time.

Paper Structure (28 sections, 7 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Architectural changes: FIGNet* with respect to FIGNet.
  • Figure 2: Perception Pipeline. We demonstrate a two-way coupling approach, integrating FIGNet* with real-world scenes through NeRF. Initially, a static NeRF scene is trained using a collection of images capturing a real-world scene, enabling the extraction of the necessary meshes for FIGNet*. Upon obtaining the rollout trajectory, we derive a set of rigid body transformations, which are then utilized to edit the original NeRF. See \ref{sec:figreal} for details.
  • Figure 3: Qualitative results for simulation. FIGNet* rollout for a complex MOVi-C simulation that could not be represented in memory for FIGNet.
  • Figure 4: Qualitative results for real-world scenes. Left: Initial NeRF rendering of the static real-world scene. The desired active object is outlined in red, with a red arrow indicating its intended starting position. Right: FIGNet* rollouts simulating the object's motion for $k=30$ time steps (rendered from a different viewpoint) after being dropped from the initial position. The complete trajectory is traced in yellow. Here we used $b_{duplicate}$ as the ray bending function, meaning the active object is copy-pasted into the starting position at the beginning of the rollout (see the website for videos and \ref{ap:real:segmentation} for details on the mesh extraction procedure described in \ref{sec:figreal}).
  • Figure 5: FIGNet and FIGNet* comparison for different levels of decimation: High-quality meshes lead to out-of-memory issues on FIGNet, while lower resolutions result in implausible trajectories (e.g., orange penetrating the basket). Notably, FIGNet*'s performance gracefully degrades with mesh quality, indicating enhanced robustness and memory efficiency. The gray mesh depicts the passive object, and the colored mesh corresponds to the active object.
  • ...and 7 more figures
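
The captions for Figures 2 and 4 describe editing a static NeRF with rigid body transformations through a ray bending function such as $b_{duplicate}$. The paper's exact definition is not reproduced on this page, so the following is only a minimal sketch of one plausible reading: a query point in the edited scene is mapped back through the inverse rigid transform, and if it lands inside the object's original region, the static field is sampled there instead. The function names, the axis-aligned box standing in for the object region, and the example transform are all assumptions made for illustration.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix about the z axis (illustrative helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_inverse_rigid(R, t, x):
    """Invert x = R p + t, returning p = R^T (x - t)."""
    d = [x[i] - t[i] for i in range(3)]
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]


def in_box(p, lo, hi):
    """True if p lies inside the axis-aligned box [lo, hi]."""
    return all(lo[i] <= p[i] <= hi[i] for i in range(3))


def bend_duplicate(x, R, t, box_lo, box_hi):
    """Hypothetical 'duplicate' bending: map a query point in the edited
    scene to the point where the static field should be sampled.

    Points that fall on the moved copy of the object are redirected to the
    object's original location. All other points are left unchanged, so the
    object also remains visible at its original location (hence 'duplicate').
    """
    p = apply_inverse_rigid(R, t, x)
    return p if in_box(p, box_lo, box_hi) else list(x)


# Assumed example: object originally inside a unit box at the origin,
# moved by a 90 degree rotation about z and a shift of 1 along x.
R = rotation_z(math.pi / 2)
t = [1.0, 0.0, 0.0]
lo, hi = [-0.5] * 3, [0.5] * 3

on_copy = bend_duplicate([1.0, 0.2, 0.0], R, t, lo, hi)   # redirected
far_away = bend_duplicate([5.0, 5.0, 5.0], R, t, lo, hi)  # unchanged
```

In this sketch a "move" rather than "duplicate" behaviour would additionally need the original region to be rendered as empty space; that variant is not shown because the page gives no details about it.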