
GARField: Group Anything with Radiance Fields

Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, Angjoo Kanazawa

TL;DR

GARField addresses the ambiguity of multi-level scene grouping by distilling 2D masks into a scale-conditioned 3D affinity field $F_g(x,s)$, enabling hierarchical 3D decomposition via coarse-to-fine clustering. The method combines SAM-derived masks with a contrastive and containment-based training objective, yielding view-consistent groupings across scales and improved 3D completeness over single-view masks. It demonstrates strong qualitative and quantitative results on diverse scenes, enabling automatic or interactive 3D asset extraction and downstream tasks in robotics and dynamic scene understanding. The approach advances 3D scene understanding by providing a principled, multi-scale, and view-consistent hierarchical representation grounded in radiance fields.

Abstract

Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects, objects, and various subparts. GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. See the project website at https://www.garfield.studio/
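The coarse-to-fine derivation of a grouping hierarchy described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_affinity_features` is a hypothetical stand-in for the trained scale-conditioned field (it only quantizes positions to a grid whose cell size equals the scale), and grouping points with identical features replaces whatever clustering procedure the authors actually use. The point is only the recursion: cluster at a coarse scale, then re-cluster each group at the next finer scale.

```python
import numpy as np

def toy_affinity_features(points, scale):
    """Hypothetical stand-in for a trained field F_g(x, s).

    Quantizes positions to a grid of cell size `scale`, so points in the
    same cell share a feature at that scale. A real field is learned.
    """
    return np.floor(points / scale)

def build_hierarchy(points, indices, scales):
    """Recursively split `indices` by clustering features at descending scales."""
    node = {"indices": indices.tolist(), "children": []}
    if len(scales) == 0 or len(indices) <= 1:
        return node
    feats = toy_affinity_features(points[indices], scales[0])
    # Points with identical features form one cluster (toy clustering rule).
    _, labels = np.unique(feats, axis=0, return_inverse=True)
    labels = labels.reshape(-1)
    n_clusters = int(labels.max()) + 1
    if n_clusters == 1:
        # No split at this scale: try the next finer scale on the same group.
        return build_hierarchy(points, indices, scales[1:])
    for c in range(n_clusters):
        child = build_hierarchy(points, indices[labels == c], scales[1:])
        node["children"].append(child)
    return node

# Five toy 2D points and three scales, coarse to fine (all values made up).
points = np.array([[0.1, 0.1], [0.3, 0.2], [0.7, 0.8], [2.1, 2.2], [2.4, 2.3]])
scales = [2.0, 1.0, 0.5]
tree = build_hierarchy(points, np.arange(len(points)), scales)
```

In this toy run the root splits into two groups at the coarsest scale, and only the first group splits again at the finest scale, which mirrors the idea that one point can belong to groups of different sizes depending on the scale queried.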

Paper Structure

This paper contains 34 sections, 25 figures, and 1 table.

Figures (25)

  • Figure 1: Group Anything with Radiance Fields (GARField): We present GARField, which distills multi-level groups represented as masks into NeRF to create a scale-conditioned 3D affinity field (top left). Once trained, this affinity field can be clustered at a variety of scales to decompose the scene at different levels of granularity, like breaking apart the excavator into its subparts (bottom). 3D assets can be extracted from this hierarchy by extracting every group in the scene automatically or via user clicks, as visualized here (top right).
  • Figure 2: Importance of Scale When Grouping: A single point may belong to multiple groups. GARField uses scale-conditioning to reconcile these conflicting signals into one affinity field.
  • Figure 3: GARField Method: (Left) given an input image set, we extract a set of candidate groups by densely querying SAM, and assign each a physical scale by deprojecting depth from the NeRF. These scales are used to train a scale-conditioned affinity field (Right). During training, pairs of sampled rays are pushed apart if they reside in different masks, and pulled together if they land in the same mask. Affinity is supervised only at the scale of each mask, which helps resolve conflicts between them.
  • Figure 4: Densified Scale Supervision: Consider two grapes within a cluster. Naively using scale for contrastive loss supervises affinities only at the grape and grape trio levels, leaving entire intervals unsupervised. In GARField, we densify the supervision by 1) augmenting scale between mask Euclidean scales and 2) imposing an auxiliary loss on containment of larger scales.
  • Figure 5: 3D Asset Extraction with Interactive Selection: Users can interactively select view-consistent 3D groups with GARField using a click point and a scale.
  • ...and 20 more figures
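
The pull/push supervision described in the Figure 3 caption can be sketched as a pairwise contrastive loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the hinge with a `margin` of 1.0, and the plain sum over pairs are all choices made here, and `features` is assumed to hold affinity features already queried at the scale of the masks in question.

```python
import numpy as np

def pairwise_contrastive_loss(features, mask_ids, margin=1.0):
    """Toy contrastive loss over all ray pairs.

    features: (N, D) affinity features, assumed queried at the masks' scale.
    mask_ids: (N,) id of the mask each sampled ray lands in.
    margin:   assumed hinge margin for pairs in different masks.
    """
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)           # (N, N) pairwise distances
    same = mask_ids[:, None] == mask_ids[None, :]    # True where masks match
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # count each pair once
    pull = dists[same & upper]                       # same mask: pull together
    push = np.maximum(0.0, margin - dists[~same & upper])  # different: push apart
    return float(pull.sum() + push.sum())

mask_ids = np.array([0, 0, 1])

# Features that already agree with the masks incur no loss.
good = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
loss_good = pairwise_contrastive_loss(good, mask_ids)

# Features that contradict the masks are penalized.
bad = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
loss_bad = pairwise_contrastive_loss(bad, mask_ids)
```

In the second case the two rays in the same mask sit 5 units apart (pull term 5), and one cross-mask pair coincides exactly (push term equal to the margin, 1), so the loss is 6. The first case, whose features respect the mask assignment, gives 0.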