Enforcing View-Consistency in Class-Agnostic 3D Segmentation Fields

Corentin Dumery, Aoxiang Fan, Ren Li, Nicolas Talabot, Pascal Fua

TL;DR

DiscoNeRF tackles panoptic 3D segmentation in radiance fields by learning a class-agnostic object field whose channels form a fixed set of competing object slots. It matches 2D masks to 3D object channels with a robust affinity-based Hungarian matching and enforces spatial coherence with a total-variation regularizer, which yields view-consistent segmentation from inconsistent supervision. The approach produces sharp 3D segmentations, supports downstream tasks such as 3D asset extraction and cross-scene object composition, and outperforms state-of-the-art baselines on the Mip-NeRF 360 dataset. Limitations include difficulty with thin or highly reflective objects and potential ambiguity from non-contiguous labels; future work aims to extend the method to dynamic scenes and other 3D representations.

Abstract

Radiance Fields have become a powerful tool for modeling 3D scenes from multiple images. However, they remain difficult to segment into semantically meaningful regions. Some methods work well using 2D semantic masks, but they generalize poorly to class-agnostic segmentations. More recent methods circumvent this issue by using contrastive learning to optimize a high-dimensional 3D feature field instead. However, recovering a segmentation then requires clustering and fine-tuning the associated hyperparameters. In contrast, we aim to identify the necessary changes in segmentation field methods to directly learn a segmentation field while being robust to inconsistent class-agnostic masks, successfully decomposing the scene into a set of objects of any class. By introducing an additional spatial regularization term and restricting the field to a limited number of competing object slots against which masks are matched, a meaningful object representation emerges that best explains the 2D supervision. Our experiments demonstrate the ability of our method to generate 3D panoptic segmentations on complex scenes, and extract high-quality 3D assets from radiance fields that can then be used in virtual 3D environments.
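
The spatial regularization term mentioned above is described in the TL;DR as a total-variation regularizer. The sketch below illustrates the general idea only: it computes an anisotropic total variation on a small 2D grid of object probabilities. The function name `total_variation`, the use of a 2D grid, and the absolute-difference form are assumptions made for illustration; this section does not give the paper's exact definition of $\mathcal{L}_{TV}$ or state whether it is applied to the 3D field or to rendered images.

```python
def total_variation(prob):
    """Anisotropic total variation of a 2D grid (hypothetical helper).

    Sums absolute differences between each cell and its right and
    bottom neighbours. Low values mean spatially smooth predictions.
    """
    height = len(prob)
    width = len(prob[0]) if height else 0
    tv = 0.0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:   # horizontal neighbour
                tv += abs(prob[y][x + 1] - prob[y][x])
            if y + 1 < height:  # vertical neighbour
                tv += abs(prob[y + 1][x] - prob[y][x])
    return tv


# A clean left/right split versus a checkerboard of the same size.
smooth = [[1, 1, 0],
          [1, 1, 0]]
noisy = [[1, 0, 1],
         [0, 1, 0]]

print(total_variation(smooth))  # 2.0
print(total_variation(noisy))   # 7.0
```

Penalizing this quantity favours the first grid over the second, which matches the stated goal of spatially coherent segmentations.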

Paper Structure (24 sections, 7 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Problem statement. (a) Given as input a set of class-agnostic 2D masks [Kirillov23] with little consistency across views, we aim to learn (b) a meaningful 3D object field that segments the different instances in the scene. The discovered objects can then be extracted and rendered independently.
  • Figure 2: DiscoNeRF pipeline. We add to a radiance field (green) an object network (violet) that predicts probabilities of belonging to each object. These are used to render 2D probability images, which are compared to the segmentation masks (blue) produced by a foundation model [Kirillov23]. We introduce three losses $\mathcal{L}_{TV}, \mathcal{L}_{\gamma}$ and $\mathcal{L}_{FP}$, designed to be robust to the inconsistency in the supervision signal.
  • Figure 3: Our affinity function $\alpha$ considers (a) a generated mask $M_m$ and (b) all object field channels $(O_n)_{1\leq n \leq N}$ independently. Bottom row: visualization of $1 - |M_m - O_n|$.
  • Figure 4: Comparing 3D radiance field decomposition models for the scenes of counter and room.
  • Figure 5: Ablation. Top row: the prediction confidence is defined as the maximum value across object slots. Without matching, objects are often missed or captured within the same slots. Without $\mathcal{L}_{TV}$, the segmentation is not spatially consistent and its confidence is noisy.
  • ...and 5 more figures
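
The mask-to-slot matching described in the TL;DR and the quantity visualized in Figure 3, $1 - |M_m - O_n|$, can be sketched as follows. This is a minimal illustration, not the authors' implementation: averaging $1 - |M_m - O_n|$ over pixels to obtain a scalar affinity is an assumption, the names `affinity` and `match_masks_to_slots` are hypothetical, and an exhaustive search over assignments stands in for the Hungarian algorithm (it finds the same optimal one-to-one assignment, but is only practical for a handful of masks and slots).

```python
from itertools import permutations


def affinity(mask, channel):
    """Mean of 1 - |M - O| over all pixels (assumed aggregation)."""
    total, count = 0.0, 0
    for mask_row, channel_row in zip(mask, channel):
        for m, o in zip(mask_row, channel_row):
            total += 1.0 - abs(m - o)
            count += 1
    return total / count


def match_masks_to_slots(masks, channels):
    """Assign each mask to a distinct slot, maximizing total affinity.

    Exhaustive search used here in place of the Hungarian algorithm.
    Assumes len(masks) <= len(channels).
    """
    scores = [[affinity(m, c) for c in channels] for m in masks]
    best, best_score = None, float("-inf")
    for perm in permutations(range(len(channels)), len(masks)):
        score = sum(scores[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return best, scores


# Two binary 2x2 masks (top row, bottom row) and three rendered
# object channels: slot 0 covers the bottom, slot 1 the top, and
# slot 2 is nearly empty.
masks = [
    [[1, 1], [0, 0]],
    [[0, 0], [1, 1]],
]
channels = [
    [[0.1, 0.0], [0.9, 0.8]],
    [[0.9, 0.8], [0.1, 0.1]],
    [[0.0, 0.2], [0.0, 0.1]],
]

assignment, scores = match_masks_to_slots(masks, channels)
print(assignment)  # (1, 0): mask 0 -> slot 1, mask 1 -> slot 0
```

For realistic numbers of masks and slots, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True` would replace the exhaustive loop.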