Towards Learning a Generalizable 3D Scene Representation from 2D Observations

Martin Gromniak, Jan-Gerrit Habekost, Sebastian Kamp, Sven Magg, Stefan Wermter

TL;DR

This work tackles robust 3D scene understanding for robotic manipulation from 2D egocentric observations. It introduces a Generalizable NeRF that builds an occupancy representation in a global workspace frame, takes flexible multi-view inputs, and generalizes to unseen object arrangements without scene-specific finetuning. The approach is validated on the NICOL humanoid robot: comparisons against depth-sensor ground truth show accurate 3D geometry, including occluded regions, with a reported reconstruction error of 26 mm MAE for a model trained on 40 real scenes. Greater view diversity and more training scenes improve both depth accuracy and rendering quality. Together, these results indicate that the method infers complete 3D occupancy beyond what traditional stereo methods provide and could be used directly in manipulation tasks.
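To make the reported metric concrete, the following is a minimal sketch of how a depth mean absolute error (MAE) in millimetres could be computed against a depth sensor. It is an illustration only, not the authors' evaluation code: the function name `depth_mae_mm`, the flat per-pixel lists, and the convention that missing sensor readings are encoded as `None` or `0` are all assumptions.

```python
from typing import Optional, Sequence


def depth_mae_mm(
    predicted_mm: Sequence[float],
    sensor_mm: Sequence[Optional[float]],
) -> float:
    """Mean absolute depth error in millimetres over valid sensor pixels.

    Assumption: pixels without a sensor reading are marked None or 0
    and are excluded from the average.
    """
    if len(predicted_mm) != len(sensor_mm):
        raise ValueError("depth maps must have the same number of pixels")
    errors = [
        abs(pred - gt)
        for pred, gt in zip(predicted_mm, sensor_mm)
        if gt is not None and gt > 0
    ]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)


# Toy values for illustration only, not data from the paper.
predicted = [500.0, 620.0, 710.0, 800.0]
sensor = [510.0, 600.0, 0.0, 830.0]  # third pixel has no sensor reading
print(depth_mae_mm(predicted, sensor))  # prints 20.0
```

Masking out invalid pixels matters in practice because depth sensors commonly return no value at reflective surfaces or depth discontinuities; averaging over those pixels would distort the error.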

Abstract

We introduce a Generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from egocentric robot observations. Unlike prior methods operating in camera-centric coordinates, our model constructs occupancy representations in a global workspace frame, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning. We demonstrate the approach on a humanoid robot and evaluate predicted geometry against 3D sensor ground truth. Trained on 40 real scenes, our model achieves a reconstruction error of 26 mm, including occluded regions, validating its ability to infer complete 3D occupancy beyond traditional stereo vision methods.
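The distinction between camera-centric coordinates and a global workspace frame can be illustrated with a rigid transform. The sketch below maps a point expressed in a camera frame into a shared workspace frame using a 4x4 homogeneous matrix. It does not reproduce the paper's model; the function name `camera_to_workspace` and the example pose `T` are hypothetical, and the pose would in practice come from the robot's calibration.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat4 = Sequence[Sequence[float]]


def camera_to_workspace(point_cam: Vec3, T_workspace_camera: Mat4) -> List[float]:
    """Apply a 4x4 homogeneous transform to a 3D point given in the camera frame."""
    x, y, z = point_cam
    homogeneous = (x, y, z, 1.0)
    # Only the first three rows are needed; the fourth row is (0, 0, 0, 1).
    return [
        sum(T_workspace_camera[row][col] * homogeneous[col] for col in range(4))
        for row in range(3)
    ]


# Hypothetical pose: camera rotated 90 degrees about the z-axis and
# translated by (0.5, 0.0, 1.0) metres relative to the workspace origin.
T = [
    [0.0, -1.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(camera_to_workspace((1.0, 0.0, 0.0), T))  # prints [0.5, 1.0, 1.0]
```

Expressing geometry in one workspace frame means that observations from different viewpoints refer to the same 3D locations, which is what lets a manipulation planner use the predicted occupancy without knowing which camera produced it.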

Paper Structure

This paper contains 9 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of the proposed neural architecture and data flow.
  • Figure 2: NICOL robot and all involved workspace cameras.
  • Figure 3: Top row: Three source views (input to our model). Bottom left: RealSense image of the target view. Bottom middle: Neural rendering for that view. Bottom right: Depth predictions by our model. Note that the geometry of the lower wing is correctly reconstructed despite not being visible from the input views.
  • Figure 4: Prediction of the 3D occupancy for the same scene as in Figure 3.