View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields
Haodi He, Colton Stearns, Adam W. Harley, Leonidas J. Guibas
TL;DR
This work addresses the problem of lifting noisy, view-inconsistent 2D segmentations from open-vocabulary models into a 3D representation that is both view-consistent and hierarchically organized. It introduces an ultrametric feature field within a Neural Radiance Field (NeRF): a 3D feature space whose segmentation structure emerges at different granularities by thresholding feature distance, with the ultrametric distance ensuring that the resulting grouping is transitive. The features are learned with a contrastive objective informed by SAM masks, combined with hierarchical sampling and a depth-regularization term that improves depth continuity. At inference, 2D and 3D segmentations are obtained by applying watershed to the rendered feature maps or to the 3D feature field, which allows segmentation at arbitrary granularity while preserving view consistency. Experiments on the PartNet and Blender-HS datasets show that the approach, which yields a hierarchy by construction, outperforms open-vocabulary 3D segmentation baselines both quantitatively and qualitatively. The work also provides new benchmarks and metrics for evaluating hierarchical 3D segmentation. Key contributions include the ultrametric feature field formulation, a contrastive learning strategy that induces hierarchy, a depth-continuity regularization that enhances geometric plausibility, and the new Blender-HS dataset for evaluating hierarchical 3D segmentation.
Abstract
Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of "coarse" or "fine" granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which, unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint consistency. We additionally provide qualitative examples of our model's 3D hierarchical segmentations in real-world scenes. The code and dataset are available at https://github.com/hardyho/ultrametric_feature_fields
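The claim that thresholding an ultrametric distance gives transitive, nested groupings can be illustrated with a small toy sketch. This is not the method described above: there are no learned features or NeRF here. As a stand-in, the sketch assumes a handful of 1D "feature" points and converts their Euclidean distances into an ultrametric using the minimax-path (subdominant ultrametric) construction. All function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def pairwise_euclidean(features):
    """Euclidean distance matrix for an (n, d) array of toy feature vectors."""
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def minimax_ultrametric(dist):
    """Subdominant ultrametric: for each pair, the smallest achievable
    'largest hop' over all paths between them (Floyd-Warshall style)."""
    d = dist.copy()
    for k in range(len(d)):
        d = np.minimum(d, np.maximum(d[:, k:k + 1], d[k:k + 1, :]))
    return d


def is_ultrametric(d, tol=1e-9):
    """Check the strong triangle inequality d(i,j) <= max(d(i,k), d(k,j))."""
    bound = np.maximum(d[:, :, None], d[None, :, :])  # indexed [i, k, j]
    return bool(np.all(d[:, None, :] <= bound + tol))


def threshold_groups(d, t):
    """Label each point with the index of the first point within distance t.
    This is a valid partition only when 'within t' is an equivalence
    relation, which an ultrametric guarantees."""
    return np.argmax(d <= t, axis=1)


# Toy 1D features (an assumption for illustration, not data from the paper).
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [30.0]])
euclid = pairwise_euclidean(points)
ultra = minimax_ultrametric(euclid)

# Euclidean thresholding at 1.5 is not transitive: 0~1 and 1~2, but not 0~2.
print("Euclidean is ultrametric:", is_ultrametric(euclid))
print("Minimax distance is ultrametric:", is_ultrametric(ultra))

# Raising the threshold only merges groups, so the partitions are nested.
for t in (0.5, 1.5, 9.0, 20.0):
    print(f"t={t:>4}: labels={threshold_groups(ultra, t).tolist()}")
```

In this toy setting, each threshold yields a partition, and every coarser partition is a union of groups from the finer ones, which is the hierarchy-by-thresholding behavior the abstract describes. With plain Euclidean distances, the same thresholding can link points in chains without producing consistent groups.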
