NeuralLabeling: A versatile toolset for labeling vision datasets using Neural Radiance Fields

Floris Erich, Naoya Chiba, Yusuke Yoshiyasu, Noriaki Ando, Ryo Hanai, Yukiyasu Domae

TL;DR

NeuralLabeling presents a NeRF-based labeling toolkit that enables efficient, 3D-consistent annotation of vision datasets from image sequences. By supporting bounding-box and mesh-based pipelines and leveraging NeRF occlusions, it yields rich outputs such as depth maps, 6DOF poses, and segmentation masks, including for challenging transparent objects. The authors create Dishwasher30k, a large supervised depth-completion dataset, and show that models trained on NeuralLabeling-generated data outperform weakly supervised baselines, with a practical robot demonstration achieving 83.3% grasp success for transparent glasses. Overall, the work demonstrates how NeRF-based labeling can accelerate large-scale, geometry-aware dataset creation with tangible benefits for perception and robotic manipulation in complex scenes, while noting the labeling time as a current bottleneck for broader deployment.

Abstract

We present NeuralLabeling, a labeling approach and toolset for annotating 3D scenes using either bounding boxes or meshes and generating segmentation masks, affordance maps, 2D bounding boxes, 3D bounding boxes, 6DOF object poses, depth maps, and object meshes. NeuralLabeling uses Neural Radiance Fields (NeRF) as a renderer, allowing labeling to be performed using 3D spatial tools while incorporating geometric clues such as occlusions, relying only on images captured from multiple viewpoints as input. To demonstrate the applicability of NeuralLabeling to a practical problem in robotics, we added ground truth depth maps to 30000 frames of transparent object RGB and noisy depth maps of glasses placed in a dishwasher captured using an RGBD sensor, yielding the Dishwasher30k dataset. We show that training a simple deep neural network with supervision using the annotated depth maps yields a higher reconstruction performance than training with the previously applied weakly supervised approach. We also show how instance segmentation and depth completion datasets generated using NeuralLabeling can be incorporated into a robot application for grasping transparent objects placed in a dishwasher with an accuracy of 83.3%, compared to 16.3% without depth completion.

Paper Structure (17 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: NeuralLabeling supports two pipelines for labeling NeRFs: Bounding-box-based labeling for uncluttered scenes and mesh-based labeling for cluttered scenes.
  • Figure 2: A scene can be labeled using either bounding boxes or meshes. Bounding boxes can also be used to extract meshes from a scene.
  • Figure 3: NeuralLabeling supports a wide variety of outputs. Circled letters reference the scenes. (A) Mostly Lambertian objects placed upright for mesh extraction; the second row shows the annotated bounding boxes and the third row shows the geometry generated using the bounding boxes. (B) Most of the objects from (A) placed in a shopping basket and annotated using the meshes generated from (A); the towel was captured separately. The second row shows 3D bounding boxes based on the mesh annotations and the third row shows 6DOF poses based on the mesh annotations. The second column of (B) shows instance masks, category masks, and binary masks, each using NeRF-to-mesh occlusions rendered directly by NeuralLabeling to improve segmentation accuracy. (C) Lambertian objects placed on a lunch plate. We use YCB objects, for which we use openly available meshes based on 3D scans using the Google Scanner; the second row shows the meshes rendered directly in the scene and the third row shows 2D bounding boxes generated based on mesh geometry.
  • Figure 4: Opaque clones of glasses placed up- and down-facing, rendered using NeRF. Using the bounding-box labeling pipeline we extract meshes that are used for annotating the dishwasher scenes.
  • Figure 5: Non-Lambertian objects in a complicated environment, annotated using opaque clone NeRF meshes. The second column shows the sensor depth estimate from a RealSense D415, the third column shows the estimated object depth based on the mesh annotations, and the fourth column shows the combination of the generated depth with the noisy sensor depth, which can be used as ground truth data for training a deep neural network.
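
The Figure 5 caption describes combining mesh-derived object depth with noisy sensor depth to obtain ground truth depth maps. The source does not state the exact combination rule, so the following is only a minimal sketch under an assumed rule: wherever the rendered object depth is valid, it replaces the sensor reading, and elsewhere the sensor depth is kept. The function name `merge_depth`, the `invalid` sentinel value of 0.0, and the use of plain Python lists (rather than an array library) are all illustrative assumptions, not part of NeuralLabeling.

```python
def merge_depth(sensor_depth, object_depth, invalid=0.0):
    """Combine two depth maps of equal shape (lists of rows, depths in metres).

    Assumed rule (not taken from the paper): a pixel with a valid rendered
    object depth takes that value; every other pixel keeps the sensor depth.
    """
    if len(sensor_depth) != len(object_depth):
        raise ValueError("depth maps must have the same number of rows")

    merged = []
    for sensor_row, object_row in zip(sensor_depth, object_depth):
        if len(sensor_row) != len(object_row):
            raise ValueError("depth maps must have the same number of columns")
        merged.append([
            obj if obj > invalid else sen
            for sen, obj in zip(sensor_row, object_row)
        ])
    return merged


# Toy 2x3 example; 0.0 marks "no sensor reading" or "no object at this pixel".
sensor = [[0.80, 0.00, 0.82],
          [0.81, 1.50, 0.00]]
rendered = [[0.00, 0.60, 0.00],
            [0.00, 0.62, 0.00]]

merged = merge_depth(sensor, rendered)
```

In this toy input the middle column stands in for a transparent object: the sensor either misses it (0.00) or sees through it to the background (1.50), and the rendered depth overrides both. Pixels with neither a sensor reading nor an object stay invalid.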