UniFField: A Generalizable Unified Neural Feature Field for Visual, Semantic, and Spatial Uncertainties in Any Scene

Christian Maurer, Snehal Jauhri, Sophie Lueth, Georgia Chalvatzaki

TL;DR

UniFField introduces a generalizable, uncertainty-aware unified neural feature field that fuses visual, semantic, and geometric cues into a voxel-based 3D representation, enabling zero-shot deployment and incremental updates during scene exploration. It uses depth-guided fusion, MaskCLIP-based semantic distillation, and differentiable volume rendering with a heteroscedastic loss to predict both means and uncertainties for color, semantics, and geometry. Experiments on unseen ScanNet scenes demonstrate alignment with ground-truth properties and reliable uncertainty predictions, while a real-world active object search on a mobile manipulator showcases robust decision-making under partial observability. The work advances robust robotic perception by integrating multimodal priors and uncertainty into a single, queryable 3D field, though scaling and multiplicative uncertainty strategies warrant further refinement.
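A heteroscedastic loss of the kind mentioned above can be sketched as a Gaussian negative log-likelihood in which the network predicts a variance alongside each mean. This is a minimal illustration only: the function name, the log-variance parameterization, and the NumPy implementation are assumptions of this sketch, and the exact loss used in UniFField is not given in this section.

```python
import numpy as np

def heteroscedastic_nll(pred_mean, pred_log_var, target):
    """Per-element Gaussian negative log-likelihood with a predicted variance.

    Assumption of this sketch: the model outputs log(sigma^2), which keeps the
    variance positive without clamping. The constant 0.5*log(2*pi) is dropped.
    """
    pred_mean = np.asarray(pred_mean, dtype=np.float64)
    pred_log_var = np.asarray(pred_log_var, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    sq_err = (target - pred_mean) ** 2
    # 0.5 * (err^2 / sigma^2 + log sigma^2)
    return 0.5 * (sq_err * np.exp(-pred_log_var) + pred_log_var)

# Hypothetical values: the same error of 2.0 scored under two predicted variances.
confident = heteroscedastic_nll(0.0, 0.0, 2.0)          # sigma^2 = 1
uncertain = heteroscedastic_nll(0.0, np.log(4.0), 2.0)  # sigma^2 = 4
```

The first term lets the model discount large errors by predicting a large variance, while the second term penalizes inflating the variance everywhere, so the predicted variance is pushed toward the actual squared error.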

Abstract

Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, the robot must evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero-shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating its uncertainty estimates. We show that these uncertainty estimates accurately describe the model's prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage our feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.
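To make the idea of incremental integration concrete, the sketch below accumulates per-voxel features as a running weighted average, in the spirit of classic TSDF fusion. It is an illustration under stated assumptions, not the UniFField fusion module: the class `VoxelFeatureGrid`, its methods, and the use of the accumulated weight as a crude observation indicator are all hypothetical.

```python
import numpy as np

class VoxelFeatureGrid:
    """Hypothetical voxel grid that fuses features by weighted averaging."""

    def __init__(self, grid_shape, feat_dim):
        self.feat_sum = np.zeros((*grid_shape, feat_dim), dtype=np.float64)
        self.weight = np.zeros(grid_shape, dtype=np.float64)

    def integrate(self, voxel_idx, feats, weights=None):
        """Add one frame's observations.

        voxel_idx: (N, 3) integer voxel indices hit by the frame
        feats:     (N, C) features back-projected into those voxels
        weights:   (N,) optional per-observation confidence (defaults to 1)
        """
        voxel_idx = np.asarray(voxel_idx, dtype=np.int64)
        feats = np.asarray(feats, dtype=np.float64)
        if weights is None:
            weights = np.ones(len(voxel_idx), dtype=np.float64)
        else:
            weights = np.asarray(weights, dtype=np.float64)
        i, j, k = voxel_idx.T
        # np.add.at accumulates correctly even if a voxel appears twice in one frame.
        np.add.at(self.feat_sum, (i, j, k), feats * weights[:, None])
        np.add.at(self.weight, (i, j, k), weights)

    def mean(self):
        """Fused feature per voxel; unobserved voxels stay at zero."""
        w = self.weight[..., None]
        return np.divide(self.feat_sum, w,
                         out=np.zeros_like(self.feat_sum), where=w > 0)

# Two hypothetical frames observing the same voxel with different features.
grid = VoxelFeatureGrid((2, 2, 2), feat_dim=3)
grid.integrate([[0, 0, 0]], [[1.0, 1.0, 1.0]])
grid.integrate([[0, 0, 0]], [[3.0, 3.0, 3.0]])
fused = grid.mean()
```

Because each frame only adds to running sums, a new RGB-D image can be folded in without revisiting earlier frames, and voxels whose accumulated weight is still zero are exactly the ones that have never been observed.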

Paper Structure

This paper contains 11 sections, 7 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of UniFField. Given a sequence of RGB-D reference frames of a scene, we combine image features $\mathcal{V}^c$, an initial TSDF volume $\mathcal{V}^d$, and uncertainty indicators $\mathcal{V}^u$ to construct a unified feature volume $\mathcal{V}^{\Psi}$. We employ knowledge distillation of a teacher model $\mathcal{F}$, novel view synthesis, and geometric reconstruction as pre-training objectives to build the generalizable model. At test time, the model generates visual, spatial, and semantic scene properties, along with their associated uncertainty.
  • Figure 2: Novel view synthesis. Here, NeRF is trained on 1658 reference frames, while our approach merges the feature volumes from reference frames without any optimization.
  • Figure 5: 2D uncertainty. We compare different types and modalities of uncertainty against the prediction error. Visual uncertainty is most pronounced at the boundaries of objects, particularly in areas of high contrast differences. Semantic uncertainty is distributed across entire objects. Spatial uncertainty is most pronounced at object boundaries, where there is high depth contrast. The highest errors and uncertainties are colored yellow.
  • Figure 6: 3D spatial uncertainty. We show slices of the voxel volumes at a constant height of $z=1.25$ meters. Predicted uncertainty closely matches the TSDF error, while dropout-based uncertainty can detect errors caused by missing observations (red box). The highest errors and uncertainties are colored yellow.
  • Figure 7: 2D and 3D uncertainty. Our model preserves spatial consistency in the predicted uncertainty and allows for 2D and 3D uncertainty estimation. The visualization is obtained by predicting uncertainties at 3D positions and mapping onto the nearest surface extracted from the predicted TSDF.
  • ...and 2 more figures
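The constant-height slices described for Figure 6 amount to converting a metric height into a voxel index along the vertical axis. The sketch below shows one way to do that; the function name, the axis convention (z as the last axis), and the example origin and voxel size are assumptions of this illustration, since the grid parameters are not given in this section.

```python
import numpy as np

def height_slice(volume, z_metres, origin_z, voxel_size):
    """Return the horizontal slice of a (X, Y, Z) volume at a metric height.

    origin_z is the height of the lower face of the lowest voxel layer and
    voxel_size is the voxel edge length, both in metres (hypothetical values).
    """
    k = int(np.floor((z_metres - origin_z) / voxel_size))
    if not 0 <= k < volume.shape[2]:
        raise ValueError(f"height {z_metres} m lies outside the volume")
    return volume[:, :, k]

# Hypothetical 4 x 4 x 50 uncertainty volume with 4 cm voxels starting at z = 0.
uncertainty = np.arange(4 * 4 * 50, dtype=np.float64).reshape(4, 4, 50)
slice_125 = height_slice(uncertainty, z_metres=1.25, origin_z=0.0, voxel_size=0.04)
```

Taking the same slice from a predicted-uncertainty volume and from an error volume gives two 2D maps that can be compared side by side.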