
OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, Federico Tombari

TL;DR

OpenNeRF addresses open-set 3D semantic segmentation by distilling pixel-aligned CLIP features into a NeRF, enabling open-set queries directly in a 3D scene. It leverages NeRF's view synthesis to render novel viewpoints that produce additional open-set cues, guided by uncertainty-based view selection. The method delivers a simpler architecture without multi-scale CLIP or DINO regularization and achieves state-of-the-art results on Replica compared to OpenScene and LERF. This work advances flexible, pixel-accurate open-set 3D understanding with practical implications for robotics and AR/VR.

Abstract

Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, the first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are heavily influenced by closed-set 3D convolutional approaches that process point clouds or polygon meshes. However, these 3D scene representations do not align well with the image-based nature of the visual-language models. Indeed, point clouds and 3D meshes typically have a lower resolution than images, and the reconstructed 3D scene geometry might not project well to the underlying 2D image sequences used to compute pixel-aligned CLIP features. To address these challenges, we propose OpenNeRF, which naturally operates on posed images and directly encodes the VLM features within the NeRF. This is similar in spirit to LERF; however, our work shows that using pixel-wise VLM features (instead of global CLIP features) results in an overall less complex architecture without the need for additional DINO regularization. Our OpenNeRF further leverages NeRF's ability to render novel views and extract open-set VLM features from areas that are not well observed in the initial posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF outperforms recent open-vocabulary methods such as LERF and OpenScene by at least +4.9 mIoU.
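
To make the open-set query and the mIoU metric from the abstract concrete, here is a minimal sketch and not the authors' implementation. It assumes each 3D point already carries a feature vector in the same embedding space as the text queries, and that a point takes the label of the most similar text embedding under cosine similarity. The names `segment_points` and `mean_iou`, the 3-dimensional toy vectors, and the choice to average IoU over the classes present in the ground truth are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_points(point_features, text_embeddings):
    """Label each point with the query whose text embedding is most similar.

    point_features: list of per-point feature vectors (hypothetical stand-ins
        for features read out of the scene representation).
    text_embeddings: dict mapping a query string to its embedding vector.
    """
    labels = list(text_embeddings)
    predictions = []
    for feature in point_features:
        scores = [cosine(feature, text_embeddings[label]) for label in labels]
        predictions.append(labels[scores.index(max(scores))])
    return predictions


def mean_iou(predicted, ground_truth):
    """Mean intersection-over-union, averaged over classes present in ground_truth."""
    ious = []
    for cls in sorted(set(ground_truth)):
        pairs = list(zip(predicted, ground_truth))
        intersection = sum(1 for p, g in pairs if p == cls and g == cls)
        union = sum(1 for p, g in pairs if p == cls or g == cls)
        ious.append(intersection / union)  # union >= 1 because cls occurs in ground_truth
    return sum(ious) / len(ious)


# Toy data: 3-dimensional vectors instead of real VLM embeddings (assumption).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
point_features = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 2.0]]

predicted = segment_points(point_features, text_embeddings)
score = mean_iou(predicted, ["chair", "table", "table"])
print(predicted, score)
```

Because the label set is just a dictionary of text embeddings, changing the queried concepts needs no retraining in this sketch, which is the property that distinguishes the open-set setting from a fixed closed-set classifier.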

Paper Structure (26 sections, 4 equations, 7 figures, 2 tables)

This paper contains 26 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Open-vocabulary 3D semantic segmentation on point clouds. Compared to LERF [kerr2023lerf], the segmentation masks of OpenNeRF are more accurate and better localized, while achieving more fine-grained classification than OpenScene [peng2022openscene]. Zero-shot results on Replica [replica19arxiv].
  • Figure 2: We propose OpenNeRF, an approach for open-set 3D scene understanding based on neural radiance fields. Arbitrary concepts can be queried from our representation (left). As the original camera trajectory (blue, middle) might not capture all interesting scene details, we use NeRF's ability to render novel views (right) and propose a mechanism to obtain relevant novel camera poses (yellow, middle) that focus on scene details from which we can extract additional open-scene features, improving the overall open-set scene representation.
  • Figure 3: Confidence Estimation. The error $e_i$ (left) correlates well with the estimated uncertainty $u_i$ (center). Our mechanism for selecting novel viewpoints is based on the estimated uncertainty. The plot (right) shows the error-uncertainty correlation $r$ for room0 of the Replica [replica19arxiv] dataset.
  • Figure 4: Class Frequency Distribution of the Replica Dataset [replica19arxiv]. We show the number of point annotations for each category. The colors indicate the separation into head (blue), common (yellow) and tail (green) classes, from left to right in decreasing order. Note that the plot is shown on a log scale.
  • Figure 5: Qualitative 3D Segmentation Results and Comparison with OpenScene [peng2022openscene]. The white dashed circles indicate the most noticeable differences between both approaches. Color and ground truth are shown for reference only. Overall, our approach produces less noisy segmentation masks.
  • ...and 2 more figures
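
The caption of Figure 3 relates a per-sample error $e_i$ to an estimated uncertainty $u_i$ through a correlation $r$, and states that novel viewpoints are selected based on that uncertainty. The sketch below illustrates both ideas under explicit assumptions: it takes $r$ to be the Pearson correlation coefficient (the caption does not say which correlation is used), and it reduces view selection to ranking candidates by a single uncertainty score. The function names and toy numbers are hypothetical and do not come from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0
    return covariance / (spread_x * spread_y)


def most_uncertain(uncertainties, k):
    """Indices of the k candidates with the highest uncertainty, highest first."""
    order = sorted(range(len(uncertainties)),
                   key=lambda i: uncertainties[i], reverse=True)
    return order[:k]


# Toy values (assumption): uncertainty is exactly twice the error, so r is 1.
errors = [0.1, 0.2, 0.3, 0.4]
uncertainties = [0.2, 0.4, 0.6, 0.8]
r = pearson_r(errors, uncertainties)

# Hypothetical per-candidate uncertainty scores; pick the two most uncertain.
selected = most_uncertain([0.2, 0.8, 0.4, 0.6], k=2)
print(r, selected)
```

A high $r$ is what justifies using uncertainty as a stand-in for error when choosing where to render extra views, since the true error is not available without ground truth.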