RelationField: Relate Anything in Radiance Fields

Sebastian Koch, Johanna Wald, Mirco Colosi, Narunas Vaskevicius, Pedro Hermosilla, Federico Tombari, Timo Ropinski

TL;DR

RelationField introduces the first open-vocabulary relational reasoning framework for neural radiance fields by adding a dedicated relationship field and distilling inter-object knowledge from multimodal LLMs. It supports open-vocabulary object and relationship queries, achieves state-of-the-art results in 3D scene graph generation, and enables the task of relationship-guided 3D instance segmentation. The approach combines two-step querying, SoM prompting, and cross-modal supervision to render relationship features aligned with 3D geometry. Because volumetric rendering aggregates information from multiple views, 3D-consistent relational reasoning in radiance fields is more robust than 2D-only inference and opens avenues for richer, text-driven scene understanding without explicit 3D meshes.

Abstract

Neural radiance fields are an emerging 3D scene representation and have recently been extended to learn features for scene understanding by distilling open-vocabulary features from vision-language models. However, current methods primarily focus on object-centric representations, supporting object segmentation or detection, while understanding semantic relationships between objects remains largely unexplored. To address this gap, we propose RelationField, the first method to extract inter-object relationships directly from neural radiance fields. RelationField represents relationships between objects as pairs of rays within a neural radiance field, effectively extending its formulation to include implicit relationship queries. To teach RelationField complex, open-vocabulary relationships, relationship knowledge is distilled from multi-modal LLMs. To evaluate RelationField, we solve open-vocabulary 3D scene graph generation tasks and relationship-guided instance segmentation, achieving state-of-the-art performance in both tasks. See the project website at https://relationfield.github.io.

Paper Structure

This paper contains 19 sections, 4 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Open-Vocabulary Relationship Understanding. We propose RelationField, the first framework to extract open-vocabulary inter-object relationships directly from neural radiance fields. RelationField can answer a wide variety of relationship queries, such as "composition", "compare", "spatial", "affordance" and "support" relationships.
  • Figure 2: RelationField Training. Left: RelationField learns a 3D feature field (a) that can be queried with a relationship query location (b), which changes the relationship field of the 3D volume depending on the selected position. The relationship feature is sampled and rendered along a ray according to NeRF's rendering weights. The language loss maximizes the cosine similarity between the sparse features extracted from the 2D views and the rendered 3D relationship features. Right: We estimate 2D relationship proposals from a multi-modal LLM prompted with SoM (e) for each training view and encode the extracted textual relationship descriptions into the image plane (d). A pair pixel sampler samples subject and object pixels (c) for which the relationship feature is distilled into the 3D volume.
  • Figure 3: Results with RelationField in 4 in-the-wild scenes. Each image shows a rendering from RelationField, along with the relationship response for each query relationship. The relevancy score describes the answer of the model to the question: What is standing on/attached to/similar to etc.? For demonstration purposes, we highlight the click as well as the outline of the clicked object, which is not needed when querying the model. Our model is able to understand complex relationships, such as the functionality of light switches or uncommon support structures, such as "knives hanging on a magnetic mount".
  • Figure 4: 3D Scene Graph Prediction. Our open-vocabulary approach is able to predict complete 3D scene graph edges containing a subject-predicate-object relationship.
  • Figure 5: 3D Consistency Ablation. Left: Extracted SoM marks per image with query. Center: Existing relationship in GPT-4 caption. Right: Relationship response from RelationField rendered into image space. While GPT-4 struggles with partially visible objects, RelationField produces more robust results, independent of the view, because our volumetric rendering incorporates information from multiple views and models the underlying 3D relationship representation.
  • ...and 8 more figures
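
The Figure 2 caption describes rendering a relationship feature along a ray with NeRF's rendering weights and supervising it with a cosine-similarity language loss. The sketch below illustrates that idea in NumPy for a single ray. It is not the authors' implementation: the function names, the array shapes, and the assumption that the per-sample features have already been conditioned on the relationship query location are all hypothetical, and the weights follow the standard NeRF compositing formula rather than anything stated on this page.

```python
import numpy as np


def rendering_weights(sigmas, deltas):
    """Standard NeRF compositing weights for one ray.

    sigmas: (S,) densities at the S samples along the ray.
    deltas: (S,) distances between consecutive samples.
    """
    alpha = 1.0 - np.exp(-sigmas * deltas)  # opacity of each sample
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha


def render_relationship_feature(weights, sample_features):
    """Composite per-sample features into one feature for the ray.

    sample_features: (S, D) features, assumed (hypothetically) to be
    already conditioned on the relationship query location.
    """
    return (weights[:, None] * sample_features).sum(axis=0)


def cosine_language_loss(rendered, target, eps=1e-8):
    """1 - cosine similarity; minimizing it maximizes the similarity."""
    r = rendered / (np.linalg.norm(rendered) + eps)
    t = target / (np.linalg.norm(target) + eps)
    return 1.0 - float(r @ t)


# Toy ray with three samples; the second sample is fully opaque.
sigmas = np.array([0.0, 1e9, 1.0])
deltas = np.array([0.1, 0.1, 0.1])
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

weights = rendering_weights(sigmas, deltas)
rendered = render_relationship_feature(weights, features)
loss_match = cosine_language_loss(rendered, np.array([0.0, 2.0]))
loss_orthogonal = cosine_language_loss(rendered, np.array([1.0, 0.0]))
```

In this toy ray the opaque second sample absorbs all of the weight, so the rendered feature equals that sample's feature and samples behind it contribute nothing. The loss is near zero for a target embedding pointing in the same direction and near one for an orthogonal target.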