Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search

Imen Mahdi; Matteo Cassinelli; Fabien Despinoy; Tim Welschehold; Abhinav Valada

TL;DR

We propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference, and we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks.

Abstract

Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods rely either on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or on large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.

Paper Structure (37 sections, 16 equations, 14 figures, 11 tables)

Introduction
Related Work
Problem Formulation
SCOUT: Scene Graph-Based Exploration with Learned Utility
3DSG Construction
Estimating Utility via Exploration Heuristics
Rooms
Objects and Containers
Frontiers
Learning Relational Semantics via Structured Knowledge Distillation
Selecting and Grounding High Level Actions
Symbolic Object Search on 3D Scene Graphs
3D Scene-Graph Construction from Real Indoor Scans
Environment Roll-out
Experimental Evaluation
...and 22 more sections

Figures (14)

Figure 1: Overview of SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search. SCOUT procedurally distills structured, relational semantic knowledge between scene elements from large language models into lightweight models (1–2). During exploration, the agent assigns utility scores to scene graph nodes based on exploration heuristics previously learned (3) and grounds high-level actions through low-level navigation and manipulation policies (4). SymSearch, our symbolic benchmark (5), enables scalable evaluation of relational semantic reasoning over scene graphs with no simulation overhead.
Figure 2: Illustration of the full SCOUT pipeline. From left to right: the 3DSG is constructed online from raw observations. Scene graph nodes are scored based on their utility in finding the query. Once the node to explore is selected, its affordances determine which low-level policies to run.
Figure 3: Illustration of SymSearch's roll-out process. At each step, the agent receives a scene-graph observation and a target object query. (a) At initialization, the agent spawns in a random region and room. Unexplored regions appear as frontier nodes, and rooms connected via doors are seen but unexplored. (b) Exploring a new room reveals the closest region to the agent in that room, along with its objects. The remaining regions are considered unexplored frontiers. (c) Exploring a frontier reveals the objects within that region. (d) Exploring an object reveals its nested objects (e.g., objects inside or on top of it). The episode terminates when the target object becomes visible.
Figure 4: Comparison of embedding similarity distributions with our learned relational scoring models. Synonym pairs are separable across all models, whereas co-occurrence and containment relationships are not. Our learned models produce substantially stronger separation.
Figure 5: Success rate over an increasing number of steps on SymSearch. Our method closely follows the performance of LLM-based baselines throughout the roll-out.
...and 9 more figures
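
The roll-out described in the Figure 3 caption, combined with the utility-based node selection from Figure 2, can be sketched as a small loop over a symbolic scene graph. This is only a minimal illustration and not the authors' implementation: the `Node` class, the `rollout` function, the toy scene, the one-step-per-exploration accounting, and the hand-written `TOY_PRIOR` scores (standing in for SCOUT's learned relational models) are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scene-graph node: a room, a frontier, or an object."""
    name: str
    kind: str  # "room", "frontier", or "object"
    children: list["Node"] = field(default_factory=list)  # revealed on exploration
    explored: bool = False


def rollout(visible, target, score, max_steps=20):
    """Greedily explore the highest-scoring unexplored node until the
    target becomes visible. Returns (found, number_of_exploration_steps)."""
    visible = list(visible)
    steps = 0
    while True:
        # The episode terminates when the target object becomes visible.
        if any(n.name == target for n in visible):
            return True, steps
        candidates = [n for n in visible if not n.explored]
        if not candidates or steps >= max_steps:
            return False, steps
        best = max(candidates, key=lambda n: score(n, target))
        best.explored = True
        visible.extend(best.children)  # exploring a node reveals its contents
        steps += 1


def make_toy_scene():
    """Initial observation: one object, one frontier, one room seen via a door."""
    fridge = Node("fridge", "object", [Node("milk", "object")])
    kitchen_frontier = Node("kitchen_frontier", "frontier", [Node("sink", "object")])
    kitchen = Node("kitchen", "room",
                   [Node("counter", "object"), fridge, kitchen_frontier])
    return [Node("sofa", "object"), Node("living_frontier", "frontier"), kitchen]


# Assumed toy utilities for the query "milk"; a real system would use learned models.
TOY_PRIOR = {"kitchen": 0.9, "fridge": 0.95, "counter": 0.4,
             "kitchen_frontier": 0.3, "living_frontier": 0.1, "sofa": 0.05}


def toy_score(node, target):
    return TOY_PRIOR.get(node.name, 0.0) if target == "milk" else 0.0


print(rollout(make_toy_scene(), "milk", toy_score))  # (True, 2)
```

In this toy scene the agent first explores the kitchen, then the fridge, at which point the target becomes visible and the episode ends after two steps. Swapping `toy_score` for a different scoring function changes only the exploration order, which is what makes a symbolic setup of this kind convenient for comparing scoring strategies.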
