View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Haodi He, Colton Stearns, Adam W. Harley, Leonidas J. Guibas

TL;DR

This work converts noisy, view-inconsistent 2D segmentations from zero-shot models such as SAM into a 3D representation that is both view-consistent and hierarchical. It learns an ultrametric feature field within a Neural Radiance Field (NeRF): thresholding feature distance at different values reveals segmentations at different granularities, and the ultrametric property makes distance-based grouping transitive, so the segmentations nest into a hierarchy. The feature field is trained with a contrastive objective informed by SAM masks, together with hierarchical sampling and a depth-continuity regularization term. At inference, 2D and 3D segmentations are obtained by applying a watershed transform to the rendered feature maps or to the 3D feature field at a chosen threshold. Experiments on the PartNet and Blender-HS datasets report better accuracy and view consistency than open-vocabulary 3D segmentation baselines. Key contributions are the ultrametric feature field formulation, a contrastive learning strategy that induces hierarchy, a depth-continuity regularization that improves geometric plausibility, and the new Blender-HS dataset with benchmarks and metrics for evaluating hierarchical 3D segmentation.

Abstract

Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of "coarse" or "fine" granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which, unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint consistency. We additionally provide qualitative examples of our model's 3D hierarchical segmentations in real-world scenes. The code and dataset are available at https://github.com/hardyho/ultrametric_feature_fields
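
The transitivity claim can be made concrete with the standard definition of an ultrametric. The following is a short sketch; the symbols $d$, $x$, $y$, $z$, and $\sim_t$ are notation introduced here for illustration and are not taken from the paper. An ultrametric replaces the triangle inequality with a stronger condition:

$$d(x, z) \le \max\bigl(d(x, y),\, d(y, z)\bigr) \quad \text{for all } x, y, z.$$

Write $x \sim_t y$ whenever $d(x, y) \le t$. If $x \sim_t y$ and $y \sim_t z$, then

$$d(x, z) \le \max\bigl(d(x, y),\, d(y, z)\bigr) \le t,$$

so $x \sim_t z$. Grouping by threshold is therefore an equivalence relation, and each threshold $t$ yields a partition. For $t_1 \le t_2$, every group at $t_1$ lies inside a single group at $t_2$, which gives the nested hierarchy. A Euclidean distance offers no such guarantee: for the points $0$, $1$, $2$ on a line and $t = 1$, we have $0 \sim_t 1$ and $1 \sim_t 2$, yet $d(0, 2) = 2 > t$.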

Paper Structure (52 sections, 8 equations, 12 figures, 5 tables, 1 algorithm)

Figures (12)

  • Figure 1: Our method takes as input multi-view posed images, paired with segmentation masks from the recent "Segment Anything Model" (SAM), and merges these into a coherent 3D representation where segmentation is view-consistent and hierarchical.
  • Figure 2: Method Overview: We train a NeRF with an ultrametric feature field using images and view-inconsistent segmentation masks from SAM (Kirillov et al., 2023). After training, we use the depth estimation and feature maps from training views to construct a 3D point cloud. At inference, for a specified threshold $t$ representing the granularity level, we apply a 3D watershed transform to segment the 3D point cloud. Then, we can query the point cloud in novel views and obtain view-consistent segmentation results.
  • Figure 3: Ultrametric Segmentation: Left: We overlay a simple graph on the image, showing edge lengths corresponding to feature distances between points. Right: The hierarchical segmentation derived from the graph on the left. The numbers on the tree indicate the ultrametric distance between nodes on the two branches.
  • Figure 4: Hierarchical Segmentation: Our method can hierarchically segment real-world scenes at various levels of granularity.
  • Figure 5: Depth Continuity: Our depth continuity loss (labeled as DC above) leads to smoother and more plausible depth estimation, and can be seamlessly combined with additional depth cues such as COLMAP (Schönberger et al., 2016), resulting in even greater accuracy.
  • ...and 7 more figures
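
The graph-to-hierarchy construction described in the Figure 3 caption, and the thresholding by $t$ described in the Figure 2 caption, can be illustrated with a small sketch. This is not the authors' implementation: it builds an ultrametric from a toy graph as the minimax-path distance (the smallest achievable "largest edge" over all paths between two nodes) using Kruskal-style merging, and it uses plain thresholding as a simplified stand-in for the watershed transform. The function names `ultrametric_distances` and `segment`, the node indices, and all edge weights are hypothetical values chosen for illustration.

```python
def ultrametric_distances(num_nodes, edges):
    """Minimax-path distance between every pair of connected nodes.

    edges: list of (u, v, weight) tuples (hypothetical feature distances).
    Edges are processed in increasing weight order; the weight at which two
    nodes first fall into the same component is their ultrametric distance.
    """
    parent = list(range(num_nodes))
    members = {i: [i] for i in range(num_nodes)}
    dist = {}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already joined through lighter edges
        # Every cross-component pair is first connected at weight w.
        for a in members[ru]:
            for b in members[rv]:
                dist[frozenset((a, b))] = w
        parent[rv] = ru
        members[ru].extend(members.pop(rv))
    return dist


def segment(num_nodes, dist, threshold):
    """Label nodes so that pairs with distance <= threshold share a label.

    Because dist is an ultrametric, comparing each unlabeled node against
    a single representative of each group is sufficient.
    """
    labels = [-1] * num_nodes
    next_label = 0
    for i in range(num_nodes):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, num_nodes):
            if dist.get(frozenset((i, j)), float("inf")) <= threshold:
                labels[j] = next_label
        next_label += 1
    return labels


# Toy graph: 5 nodes with made-up edge lengths.
toy_edges = [(0, 1, 0.1), (1, 2, 0.3), (2, 3, 0.8), (3, 4, 0.2), (0, 2, 0.5)]
toy_dist = ultrametric_distances(5, toy_edges)

for t in (0.15, 0.25, 0.4, 1.0):
    print(t, segment(5, toy_dist, t))
# 0.15 [0, 0, 1, 2, 3]
# 0.25 [0, 0, 1, 2, 2]
# 0.4 [0, 0, 0, 1, 1]
# 1.0 [0, 0, 0, 0, 0]
```

Raising the threshold only ever merges groups and never splits them, so the four labelings above form one hierarchy. Note that nodes 0 and 2 end up at distance 0.3 rather than the 0.5 of their direct edge, because the path through node 1 has a smaller maximum edge.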