NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes

Zhiwen Fan, Peihao Wang, Yifan Jiang, Xinyu Gong, Dejia Xu, Zhangyang Wang

TL;DR

NeRF-SOS targets object-level segmentation in complex real-world scenes without manual annotations. It couples a segmentation field to NeRF and enforces cross-view consistency through a collaborative appearance-geometry contrastive loss. The method uses a self-supervised 2D backbone (DINO-ViT) as the appearance cue and NeRF density as the geometry cue, distilling both into a compact segmentation field that yields view-consistent masks via clustering. It introduces two contrastive streams, an appearance-based loss $\boxed{\mathcal{L}_{app}}$ and a geometry-based loss $\boxed{\mathcal{L}_{geo}}$, which are combined with the standard photometric objective under a stride-ray sampling regime that trains on patch-scale views. Empirical results on LLFF, BlendedMVS, CO3Dv2, and Tanks and Temples show NeRF-SOS outperforming 2D self-supervised baselines and matching or exceeding the supervised Semantic-NeRF in many settings, with finer segmentation details and high-quality novel-view synthesis.
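To make the stride-ray sampling idea concrete, the sketch below picks a small grid of pixels spaced `stride` pixels apart, so that a fixed budget of rays spans a larger image region than a dense patch would. The function name `sample_strided_patch`, its parameters, and the default values are illustrative assumptions, not taken from the paper or its code release.

```python
import numpy as np

def sample_strided_patch(height, width, patch_size=8, stride=4, rng=None):
    """Return integer (row, col) pixel coordinates of shape (patch_size, patch_size, 2).

    Hypothetical helper: pixels are spaced `stride` apart, and the top-left
    corner is drawn uniformly so the whole strided patch stays in the image.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Side length (in pixels) of the region covered by the strided patch.
    extent = (patch_size - 1) * stride + 1
    if extent > height or extent > width:
        raise ValueError("strided patch does not fit inside the image")
    top = rng.integers(0, height - extent + 1)   # upper bound is exclusive
    left = rng.integers(0, width - extent + 1)
    rows = top + stride * np.arange(patch_size)
    cols = left + stride * np.arange(patch_size)
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([rr, cc], axis=-1)

# Example: 64 rays covering a 29x29 pixel region of a 100x120 image.
coords = sample_strided_patch(100, 120, patch_size=8, stride=4,
                              rng=np.random.default_rng(0))
```

Each coordinate pair would then be turned into a camera ray and rendered; with `stride=1` the sketch reduces to ordinary dense patch sampling.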

Abstract

Neural volumetric representations have shown that multi-layer perceptrons (MLPs) can be optimized with multi-view calibrated images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications based on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive, as it requires per-view annotation. This paper explores self-supervised learning for object segmentation using NeRF in complex real-world scenes. Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation and neural radiance fields to segment objects in any view within a scene. By proposing a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact geometry-aware segmentation clusters from their density fields and from self-supervised pre-trained 2D visual features. The framework can be applied to various NeRF models, yielding both photo-realistic renderings and convincing segmentation maps in indoor and outdoor scenarios. Extensive results on the LLFF, Tanks and Temples, and BlendedMVS datasets validate the effectiveness of NeRF-SOS. It consistently surpasses other 2D-based self-supervised baselines and predicts finer semantic masks than existing supervised counterparts. Please refer to the video on our project page for more details: https://zhiwenfan.github.io/NeRF-SOS.
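The collaborative contrastive loss can be read as two extra terms added to NeRF's photometric objective. One hedged way to write the combined objective is as a weighted sum; the symbol $\mathcal{L}_{photo}$ and the balancing weights $\lambda_{app}$ and $\lambda_{geo}$ are notation assumed here for illustration, and the paper's exact weighting may differ:

$$
\mathcal{L} = \mathcal{L}_{photo} + \lambda_{app}\,\mathcal{L}_{app} + \lambda_{geo}\,\mathcal{L}_{geo}
$$

Here $\mathcal{L}_{app}$ is the appearance-level contrastive term driven by the self-supervised 2D features, and $\mathcal{L}_{geo}$ is the geometry-level term driven by the NeRF density field.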

Paper Structure

This paper contains 39 sections, 8 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Visual examples. From left to right: ground-truth color images, annotated object masks, object masks rendered by NeRF-SOS, 2D image co-segmentation using DINO [amir2021deep], and object masks rendered by Semantic-NeRF [zhi2021place]. Compared to the previous methods, NeRF-SOS generates faithful object masks with finer local details.
  • Figure 2: The overall pipeline of the proposed NeRF-SOS. Given rays cast from multiple views, we render the corresponding color patch ($\boldsymbol{c}$), segmentation patch ($s$), and depth patch ($\sigma$). Appearance-segmentation and geometry-segmentation correlations are then used to formulate a collaborative contrastive loss, enabling NeRF-SOS to render object masks from any viewpoint using the distilled segmentation field.
  • Figure 3: Cosine similarity matrix computed on the Fortress scene.
  • Figure 4: Qualitative results on the Flower and Fortress scenes of the LLFF dataset. In the fourth column, DINO-CoSeg mistakenly matches several discrete patches because DINO has high activation on only a few tokens, which may lead to view-inconsistent and disconnected co-segmentation results. A $*$ superscript denotes a supervised method. DOCS and DINO-CoSeg cannot perform novel view synthesis, so we first render novel views with a vanilla NeRF and then segment them. Videos can be viewed in the supplementary materials.
  • Figure 5: Novel view object segmentation results on object-centric datasets: BlendedMVS (the 1st row) and CO3Dv2 (the 2nd row). NeRF-SOS (the 3rd column) still produces view-consistent masks with finer details. Videos can be viewed in the supplementary materials.
  • ...and 10 more figures
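The captions for Figures 2 and 3 refer to cosine similarity matrices and to correlations between a cue (appearance or geometry) and the segmentation output. The sketch below shows one plausible way such a correlation-based contrastive term could be computed with NumPy. The loss form is borrowed from correspondence-distillation-style objectives and is an assumption, not the paper's stated formulation; the function names and the `threshold` parameter are likewise hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a (N, D) and rows of b (M, D)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def correlation_contrastive_loss(cue_a, cue_b, seg_a, seg_b, threshold=0.5):
    """Illustrative loss: align segmentation similarity with cue similarity.

    cue_a, cue_b: per-pixel cue features of two patches, e.g. 2D backbone features.
    seg_a, seg_b: per-pixel segmentation-field outputs for the same pixels.
    Pairs whose cue similarity exceeds `threshold` are rewarded for having
    similar segmentation outputs; pairs below it are penalized for it.
    """
    cue_corr = cosine_similarity_matrix(cue_a, cue_b)
    seg_corr = cosine_similarity_matrix(seg_a, seg_b)
    return float(-np.mean((cue_corr - threshold) * seg_corr))

# Toy example: four pixels forming two groups according to the cue.
cue = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
seg_aligned = cue.copy()                      # segmentation agrees with the cue
seg_mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
loss_aligned = correlation_contrastive_loss(cue, cue, seg_aligned, seg_aligned)
loss_mixed = correlation_contrastive_loss(cue, cue, seg_mixed, seg_mixed)
```

In this toy case the segmentation that agrees with the cue receives the lower loss. A geometry stream would pass a geometry-derived affinity in place of the appearance correlation; how that affinity is built from the rendered depth is not specified in this section.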