NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes

Zhiwen Fan; Peihao Wang; Yifan Jiang; Xinyu Gong; Dejia Xu; Zhangyang Wang

TL;DR

NeRF-SOS addresses the problem of obtaining object-level segmentation in complex real-world scenes without manual annotations by coupling a segmentation field to NeRF and enforcing cross-view consistency through a collaborative appearance-geometry contrastive loss. The method leverages a self-supervised 2D backbone (DINO-ViT) for appearance while exploiting NeRF density as a geometry cue, distilling both into a compact segmentation field that enables view-consistent masks via clustering. It introduces two contrastive streams: an appearance-based loss $\boxed{\mathcal{L}_{app}}$ and a geometry-based loss $\boxed{\mathcal{L}_{geo}}$, combined with the standard photometric objective, under a stride-ray sampling regime to train on patch-scale views. Empirical results across LLFF, BlendedMVS, CO3Dv2, and Tank & Temples show NeRF-SOS beating 2D self-supervised baselines and matching or exceeding supervised Semantic-NeRF in many settings, delivering finer segmentation details while maintaining high-quality novel-view synthesis.
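The stride-ray sampling mentioned above can be illustrated with a minimal sketch. It assumes that "stride-ray sampling" means casting rays through a regular grid of pixels spaced a fixed number of pixels apart, so that a small patch of rays covers a larger image region. The function name `stride_ray_indices` and all of its parameters are hypothetical and are not taken from the NeRF-SOS code.

```python
import numpy as np


def stride_ray_indices(height, width, patch_size, stride, rng):
    """Return (patch_size, patch_size, 2) pixel coordinates (row, col).

    The coordinates form a regular grid whose neighbours are `stride`
    pixels apart, placed at a random location inside the image.
    Hypothetical helper: an illustration, not the authors' sampler.
    """
    # Extent in pixels covered by the strided patch along one axis.
    span = (patch_size - 1) * stride + 1
    if span > height or span > width:
        raise ValueError("strided patch does not fit inside the image")

    # Random top-left corner such that the whole grid stays in bounds.
    top = rng.integers(0, height - span + 1)
    left = rng.integers(0, width - span + 1)

    rows = top + stride * np.arange(patch_size)
    cols = left + stride * np.arange(patch_size)
    grid_r, grid_c = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([grid_r, grid_c], axis=-1)


# Usage: an 8x8 grid of rays covering a 29x29 pixel region.
rng = np.random.default_rng(0)
coords = stride_ray_indices(height=48, width=64, patch_size=8, stride=4, rng=rng)
```

With `stride=1` this reduces to sampling a dense patch; larger strides trade local detail for wider context at the same number of rays per patch.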

Abstract

Neural volumetric representations have shown that multi-layer perceptrons (MLPs) can be optimized with calibrated multi-view images to represent scene geometry and appearance without explicit 3D supervision. Object segmentation can enrich many downstream applications built on the learned radiance field. However, introducing hand-crafted segmentation to define regions of interest in a complex real-world scene is non-trivial and expensive, as it requires per-view annotation. This paper explores self-supervised learning for object segmentation using NeRF in complex real-world scenes. Our framework, called NeRF with Self-supervised Object Segmentation (NeRF-SOS), couples object segmentation with the neural radiance field to segment objects in any view within a scene. By proposing a novel collaborative contrastive loss at both the appearance and geometry levels, NeRF-SOS encourages NeRF models to distill compact geometry-aware segmentation clusters from their density fields and from self-supervised pre-trained 2D visual features. The self-supervised object segmentation framework can be applied to various NeRF models, yielding both photo-realistic rendering results and convincing segmentation maps for indoor and outdoor scenarios. Extensive results on the LLFF, Tank & Temple, and BlendedMVS datasets validate the effectiveness of NeRF-SOS. It consistently surpasses other 2D-based self-supervised baselines and predicts finer semantic masks than existing supervised counterparts. Please refer to the video on our project page for more details: https://zhiwenfan.github.io/NeRF-SOS.
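To make the idea of a correlation-driven contrastive loss concrete, here is a simplified sketch. It is a stand-in under stated assumptions, not the paper's exact formulation: a guidance signal (for example, 2D appearance features or a geometry cue) defines which pairs of patches count as similar, and the loss pulls the segmentation features of similar pairs together while pushing dissimilar pairs apart. The function names, the hard `threshold`, and the equal weighting of the two terms are all hypothetical choices for illustration.

```python
import numpy as np


def cosine_similarity_matrix(features, eps=1e-8):
    """Pairwise cosine similarity between the rows of an (N, D) array."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)
    return unit @ unit.T


def correlation_contrastive_loss(guide_features, seg_features, threshold=0.5):
    """Illustrative contrastive loss driven by a guidance correlation.

    Pairs whose guidance similarity exceeds `threshold` are treated as
    positives, all other pairs as negatives. This is a hypothetical
    simplification, not the NeRF-SOS loss itself.
    """
    guide_sim = cosine_similarity_matrix(guide_features)
    seg_sim = cosine_similarity_matrix(seg_features)

    positive = guide_sim > threshold
    negative = ~positive

    # Positives: penalize segmentation similarity below 1.
    pull = (1.0 - seg_sim)[positive].mean() if positive.any() else 0.0
    # Negatives: penalize any remaining positive segmentation similarity.
    push = np.clip(seg_sim, 0.0, None)[negative].mean() if negative.any() else 0.0
    return float(pull + push)


# Usage: four patches forming two groups according to the guidance signal.
guide = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
seg_aligned = guide.copy()
seg_misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
loss_aligned = correlation_contrastive_loss(guide, seg_aligned)
loss_misaligned = correlation_contrastive_loss(guide, seg_misaligned)
```

In this toy setup the segmentation features that agree with the guidance grouping receive a lower loss than those that disagree. The same function could be called once with appearance features and once with a geometry-derived signal as `guide_features`, which mirrors the two-stream structure described in the abstract only loosely.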

Paper Structure (39 sections, 8 equations, 15 figures, 5 tables)

Introduction
Related Work
Neural Radiance Fields
Object Co-segmentation without Explicit Learning
Method
Overview
Preliminaries
Neural Radiance Fields
Cross View Appearance Correspondence
Semantic Correspondence across Views
Distilling Semantic Correspondence into Segmentation Field
Discover Patch Relationships
Cross View Geometry Correspondence
Geometry Correspondence across Views
Injecting Geometry Coherence into Segmentation Field
...and 24 more sections

Figures (15)

Figure 1: Visual examples. From left to right: ground truth color images, annotated object masks, object masks rendered by NeRF-SOS, 2D image co-segmentation using DINO (Amir et al., 2021), and object masks rendered by Semantic-NeRF (Zhi et al., 2021), respectively. Compared to the previous methods, NeRF-SOS generates faithful object masks with finer local details.
Figure 2: The overall pipeline of the proposed NeRF-SOS. Given rays cast from multiple views, we render the corresponding color patch ($\boldsymbol{c}$), segmentation patch ($s$), and depth patch ($\sigma$). Then, appearance-segmentation correlations and geometry-segmentation correlations are used to formulate a collaborative contrastive loss, enabling NeRF-SOS to render object masks from any viewpoint using the distilled segmentation field.
Figure 3: Cosine similarity matrix calculated on scene Fortress.
Figure 4: Qualitative results on scenes Flower and Fortress of the LLFF dataset. In the fourth column, DINO-CoSeg mistakenly matches several discrete patches, as DINO has higher activation on just a few tokens, which may lead to view-inconsistent and disconnected co-segmentation results. The $*$ superscript denotes a supervised method. DOCS and DINO-CoSeg cannot perform novel view synthesis, so we first render novel views with a vanilla NeRF and then segment them. Videos can be viewed in the supplementary materials.
Figure 5: Novel view object segmentation results on object-centric datasets: BlendedMVS (the 1st row) and CO3Dv2 (the 2nd row). NeRF-SOS (the 3rd column) still produces view-consistent masks with finer details. Videos can be viewed in the supplementary materials.
...and 10 more figures
