
Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

Roman Kueble, Marco Hueller, Mrunmai Phatak, Rainer Lienhart, Joerg Haehner

Abstract

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations that maximise model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study both a compact discrete motion set and a finer-grained, larger one, and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth input mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness-efficiency trade-off.
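To make the action-representation comparison concrete, the following minimal PyTorch sketch contrasts a single-head policy over atomic actions with a factorised multi-head policy over action components. The module names, the translation/rotation split, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SingleHeadPolicy(nn.Module):
    """One categorical distribution over all atomic actions."""
    def __init__(self, feat_dim: int, n_actions: int):
        super().__init__()
        self.logits = nn.Linear(feat_dim, n_actions)

    def sample(self, feat):
        dist = Categorical(logits=self.logits(feat))
        a = dist.sample()
        return a, dist.log_prob(a)

class FactorisedPolicy(nn.Module):
    """One head per action component; the joint log-probability is the
    sum of the per-component log-probabilities."""
    def __init__(self, feat_dim: int, n_translations: int, n_rotations: int):
        super().__init__()
        self.trans_head = nn.Linear(feat_dim, n_translations)
        self.rot_head = nn.Linear(feat_dim, n_rotations)

    def sample(self, feat):
        t_dist = Categorical(logits=self.trans_head(feat))
        r_dist = Categorical(logits=self.rot_head(feat))
        t, r = t_dist.sample(), r_dist.sample()
        return (t, r), t_dist.log_prob(t) + r_dist.log_prob(r)

# Hypothetical sizes: 5 translations x 7 rotations.
feat = torch.randn(1, 256)
single = SingleHeadPolicy(256, n_actions=5 * 7)
multi = FactorisedPolicy(256, n_translations=5, n_rotations=7)
```

Under these assumed sizes, the factorised policy needs only 5 + 7 output logits where the single head needs 5 × 7, so it scales more gracefully as the discrete motion set becomes finer-grained.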


Paper Structure

This paper contains 47 sections, 15 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: Overview of the navigation model architecture used in our learning setup: multimodal encoders → LSTM policy core → action heads (single-head or multi-head) and optional collision auxiliary head (a minimal code sketch of this pipeline follows the figure list).
  • Figure 2: Learning curves over training blocks. Node Recall and Episodic Return are evaluated on the held-out FloorPlans 28-30 every 50 blocks, while Move Success Rate is computed on the training scenes after each block; shaded areas denote the 95% confidence interval across seeds. SH/MH: single-head/multi-head policy; D: depth input; CL: curriculum learning; IL: imitation learning.
  • Figure 3: Evaluation scatter plot relating traversed path length to Node Recall and episode length; each point represents one trained run (one seed) at the final evaluation checkpoint.
  • Figure 4: Representative trajectories on evaluation scenes (FloorPlans 28-30). Legends report episode length and Node Recall for the shown episodes. All plotted scenarios except the Baseline use depth input.
  • Figure 5: FloorPlan 28
  • ...and 3 more figures
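As a companion to the Figure 1 caption, the sketch below traces the described pipeline end to end: multimodal encoders feed an LSTM policy core, which drives an action head and an optional collision auxiliary head. The small CNN backbones, the 84×84 input size, the hidden dimension, and the binary collision target are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NavModel(nn.Module):
    """Sketch of the Figure 1 pipeline: encoders -> LSTM core -> heads."""
    def __init__(self, hidden: int = 256, n_actions: int = 12):
        super().__init__()
        # Lightweight CNN encoders for RGB and (optional) depth input.
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        feat = 32 * 9 * 9 * 2  # both modalities concatenated, 84x84 inputs
        self.core = nn.LSTM(feat, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)  # or factorised heads
        self.collision_head = nn.Linear(hidden, 1)  # auxiliary: P(move collides)

    def forward(self, rgb, depth, state=None):
        x = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=-1)
        h, state = self.core(x.unsqueeze(1), state)  # one timestep per call
        h = h.squeeze(1)
        return self.action_head(h), torch.sigmoid(self.collision_head(h)), state

# Usage with assumed 84x84 observations; `state` carries the LSTM memory
# across timesteps within an episode.
model = NavModel()
logits, p_collide, state = model(torch.randn(1, 3, 84, 84),
                                 torch.randn(1, 1, 84, 84))
```

Keeping the collision prediction as an auxiliary head on the shared LSTM features, rather than a separate network, is one plausible way to realise the "optional depth-based collision supervision" named in the abstract: the auxiliary loss shapes the shared representation without changing the action interface.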