Scaling Face Interaction Graph Networks to Real World Scenes
Tatiana Lopez-Guevara, Yulia Rubanova, William F. Whitney, Tobias Pfaff, Kimberly Stachenfeld, Kelsey R. Allen
TL;DR
The paper tackles the difficulty of scaling learned graph-based rigid-body simulators to complex real-world scenes and perceptual inputs. It introduces FIGNet*, a memory-efficient variant of FIGNet that removes surface mesh edges, enabling training on more intricate geometries while preserving trajectory accuracy. A perception bridge using Neural Radiance Fields (NeRF) is built to convert real scenes into meshes and to render FIGNet* rollouts by editing NeRF volumes with predicted object transforms. On synthetic Kubric datasets, FIGNet* achieves major memory and speed gains with similar accuracy, and qualitative real-world demonstrations show plausible dynamics when driven by perception-derived geometry. This work widens the applicability of learned simulators to perception-only inference, with potential impact on robotics, graphics, and design workflows.
Abstract
Accurately simulating real-world object dynamics is essential for various applications such as robotics, engineering, graphics, and design. To better capture complex real dynamics such as contact and friction, learned simulators based on graph networks have recently shown great promise. However, applying these learned simulators to real scenes comes with two major challenges: first, scaling learned simulators to handle the complexity of real-world scenes, which can involve hundreds of objects, each with complicated 3D shapes; and second, handling inputs from perception rather than 3D state information. Here we introduce a method that substantially reduces the memory required to run graph-based learned simulators. Based on this memory-efficient simulation model, we then present a perceptual interface in the form of editable NeRFs, which can convert real-world scenes into a structured representation that can be processed by a graph network simulator. We show that our method uses substantially less memory than previous graph-based simulators while retaining their accuracy, and that the simulators learned in synthetic environments can be applied to real-world scenes captured from multiple camera angles. This paves the way for expanding the application of learned simulators to settings where only perceptual information is available at inference time.
