Closing the Visual Sim-to-Real Gap with Object-Composable NeRFs

Nikhil Mishra, Maximilian Sieb, Pieter Abbeel, Xi Chen

TL;DR

COV-NeRF introduces an object-centric, generalizable neural renderer that treats a scene as a set of per-object volumes and a background, enabling real-to-sim data synthesis for photorealistic rendering and multi-modal supervision. It avoids per-scene optimization by leveraging cross-view attention and a Transformer-based decoder to produce consistent RGB, depth, and segmentation labels, and it supports scene generation via physics-based placement and Marching Cubes mesh extraction. The approach achieves competitive view synthesis quality without test-time optimization and substantially improves sim-to-real perception, demonstrated through a real-world bin-picking setup where limited real data combined with synthetic COV-NeRF data yields notable gains in grasping and segmentation metrics. This work provides a practical pipeline to rapidly generate diverse, labeled, photorealistic data targeted to real-world scenes and objects, with broad implications for robotics perception and sim-to-real robustness.

Abstract

Deep learning methods for perception are the cornerstone of many robotic systems. Despite their potential for impressive performance, obtaining real-world training data is expensive, and can be impractically difficult for some tasks. Sim-to-real transfer with domain randomization offers a potential workaround, but often requires extensive manual tuning and results in models that are brittle to distribution shift between sim and real. In this work, we introduce Composable Object Volume NeRF (COV-NeRF), an object-composable NeRF model that is the centerpiece of a real-to-sim pipeline for synthesizing training data targeted to scenes and objects from the real world. COV-NeRF extracts objects from real images and composes them into new scenes, generating photorealistic renderings and many types of 2D and 3D supervision, including depth maps, segmentation masks, and meshes. We show that COV-NeRF matches the rendering quality of modern NeRF methods, and can be used to rapidly close the sim-to-real gap across a variety of perceptual modalities.

Paper Structure (11 sections, 3 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: Our proposed object-centric neural renderer, COV-NeRF, can be used to generate targeted supervision for other models that are brittle to sim-to-real distribution shift. After learning explicit neural representations of real objects, COV-NeRF can compose those representations into photorealistic synthetic scenes and generate many modalities of downstream supervision, including depth maps, segmentation masks, and instance meshes.
  • Figure 2: An overview of COV-NeRF's object-centric rendering process. Visual features from the source views are projected into a feature volume for each object in the scene. For each pixel to be rendered, features are interpolated along the corresponding ray from each volume that the ray intersects with. A Transformer decodes the interpolated features into the NeRF density and radiance, which are composited into RGB colors, depths and segmentation masks.
  • Figure 3: Qualitative view synthesis results on two real scenes. For each method, an image is rendered from a novel viewpoint using 4 source views (not pictured). The ground-truth image from the novel viewpoint is shown in the top row. COV-NeRF matches the performance of object-centric methods that require expensive, per-scene test-time optimization (TTO), and outperforms other scene-generalizable methods.
  • Figure 4: Sample instance segmentation predictions from MaskDINO (top row) and stereo depth predictions from MVS-Former (bottom row) resulting from the sim-to-real methods evaluated in the paper's sim-to-real robot results table. COV-NeRF enables substantial improvement in both modalities.
  • Figure 5: (a) Representative real images from the Mixed-Clutter (top) and Hard-Specular (bottom) scenarios. (b) Sample simulated scenes from COB-3D-v2. (c) CyCADA adapts the scenes from (b) to more closely resemble samples from the real world, but its cycle consistency objectives result in only a mild re-styling of the sim scenes. (d) DDIB produces more visually realistic adaptations of (b), but violates the original scene semantics. (e) Instead of adapting simulated scenes, COV-NeRF composes new scenes using explicit object representations extracted from (a).
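
The compositing step described in the Figure 2 caption (per-object density and radiance combined into RGB, depth, and segmentation) can be illustrated with a minimal NumPy sketch. This is standard NeRF-style alpha compositing along a single ray, not the paper's exact formulation: combining objects by summing their densities and splitting each sample's weight by density share is an assumption, and the function name `composite_ray` and its argument layout are hypothetical.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Composite K object volumes along one ray (illustrative sketch).

    sigmas: (K, S) per-object densities at S samples along the ray
    colors: (K, S, 3) per-object radiance at those samples
    t_vals: (S + 1,) bin edges of the samples along the ray
    """
    deltas = np.diff(t_vals)                       # (S,) sample widths
    sigma_total = sigmas.sum(axis=0)               # assumption: densities add
    alpha = 1.0 - np.exp(-sigma_total * deltas)    # per-sample opacity
    # Transmittance: fraction of light that reaches each sample unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha                        # (S,) contribution per sample
    # Split each sample's weight among objects in proportion to density.
    share = sigmas / np.maximum(sigma_total, 1e-10)
    obj_weights = share * weights                  # (K, S)
    rgb = (obj_weights[..., None] * colors).sum(axis=(0, 1))
    t_mid = 0.5 * (t_vals[:-1] + t_vals[1:])
    depth = (weights * t_mid).sum()                # expected ray termination
    seg = obj_weights.sum(axis=1)                  # (K,) per-object opacity
    return rgb, depth, seg

# Toy ray: object 0 is an opaque red surface in the third sample bin,
# object 1 is empty along this ray.
sigmas = np.zeros((2, 4))
sigmas[0, 2] = 1e3
colors = np.zeros((2, 4, 3))
colors[0, :, 0] = 1.0   # object 0 is red
colors[1, :, 2] = 1.0   # object 1 is blue
t_vals = np.linspace(0.0, 4.0, 5)
rgb, depth, seg = composite_ray(sigmas, colors, t_vals)
```

In this sketch the same per-sample weights produce all three outputs, which is why the rendered color, depth, and segmentation label for a pixel stay mutually consistent; taking `seg.argmax()` gives the instance label for the pixel.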