GARField: Group Anything with Radiance Fields
Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, Angjoo Kanazawa
TL;DR
GARField addresses the ambiguity of multi-level scene grouping by distilling 2D masks into a scale-conditioned 3D affinity field $F_g(x,s)$, which supports hierarchical 3D decomposition via coarse-to-fine clustering. The field is trained on SAM-derived masks with a contrastive and containment-based objective, yielding groupings that are view-consistent across scales and more complete in 3D than single-view masks. Qualitative and quantitative evaluation on diverse scenes supports these claims, and the resulting hierarchy enables automatic or interactive 3D asset extraction, with downstream uses in robotics and dynamic scene understanding.
Abstract
Grouping is inherently ambiguous due to the multiple levels of granularity at which one can decompose a scene: should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this, we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field, we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find that it effectively extracts groups at many levels: clusters of objects, objects, and various subparts. GARField inherently represents multi-view consistent groupings and produces higher-fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. See the project website at https://www.garfield.studio/
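To make the idea of scale-conditioned grouping concrete, the following is a minimal toy sketch, not the GARField implementation. It assumes a hypothetical stand-in for the learned field, `toy_affinity_features`, which simply divides 3D-free toy coordinates by the scale, and a hypothetical helper, `group_at_scale`, which links points whose feature distance falls below a fixed threshold. The point coordinates, the threshold, and the single-linkage clustering rule are all illustrative assumptions; the real field is optimized from SAM masks.

```python
import numpy as np


def toy_affinity_features(points, scale):
    # Hypothetical stand-in for a learned field F_g(x, s): dividing by the
    # scale makes points look "closer" in feature space at larger scales.
    return np.asarray(points, dtype=float) / scale


def group_at_scale(points, scale, threshold=1.0):
    # Assumed grouping rule: link two points when their feature distance is
    # at most `threshold`, then take connected components (union-find).
    feats = toy_affinity_features(points, scale)
    n = len(feats)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(feats[i] - feats[j]) <= threshold:
                parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    label_of = {}
    return [label_of.setdefault(find(i), len(label_of)) for i in range(n)]


# Toy scene: two "objects", each made of two "subparts" of two points.
points = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0],
          [5.0, 0.0], [5.1, 0.0], [6.0, 0.0], [6.1, 0.0]]

for scale in (10.0, 2.0, 0.5):  # coarse to fine
    labels = group_at_scale(points, scale)
    print(f"scale={scale}: {len(set(labels))} groups, labels={labels}")
```

In this toy scene the same points form one group at the coarsest scale, two groups at the intermediate scale, and four groups at the finest scale, and each finer group is contained in a coarser one. This mirrors the clusters-of-objects, objects, and subparts hierarchy described above, under the stated assumptions only.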
