Active Semantic Perception

Huayi Tang, Pratik Chaudhari

TL;DR

The paper addresses active, semantics-aware exploration for robots by introducing a hierarchical multi-layer scene graph that captures rooms, structures, objects, and negative space. A three-module pipeline combines mapping, LLM-based sampling of plausible completions of the unobserved scene, and planning that maximizes semantic information gain during navigation. The planner has two levels, coupling graph-based guidance with local optimization. Evaluations in realistic HM3D indoor environments show faster, more accurate semantic grounding and more efficient room discovery than geometry- or semantic-based baselines. The authors acknowledge LLM hallucinations and latency as limitations and point to future work on specialized reasoning models and broader semantic coverage.

Abstract

We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, hierarchical multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows, etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. These samples are used to compute the information gain of a potential waypoint, which supports sophisticated spatial reasoning, e.g., the two doors in the living room can lead to either a kitchen or a bedroom. We evaluate this approach in complex, realistic 3D indoor environments in simulation. We show using qualitative and quantitative experiments that our approach can pin down the semantics of the environment more quickly and more accurately than baseline approaches.
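To make the idea of scoring a waypoint from sampled scene graphs concrete, here is a minimal sketch. It is not the paper's estimator: it assumes each sampled completion can be reduced to a mapping from a candidate waypoint (for example, a door) to the room label predicted behind it, and it uses the Shannon entropy of those labels as a stand-in for information gain. The names `entropy`, `rank_waypoints`, and the sample format are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_waypoints(samples):
    """Rank waypoints by how much the sampled completions disagree there.

    samples: list of dicts, one per sampled scene graph, mapping a waypoint
    id to the room label that sample predicts behind it (assumed format).
    Returns (waypoint, score) pairs, highest score first.
    """
    waypoints = sorted({w for s in samples for w in s})
    scores = {
        w: entropy([s.get(w, "unknown") for s in samples]) for w in waypoints
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Four hypothetical completions of a living room with two doors.
samples = [
    {"door_a": "kitchen", "door_b": "bedroom"},
    {"door_a": "bedroom", "door_b": "bedroom"},
    {"door_a": "kitchen", "door_b": "bedroom"},
    {"door_a": "bathroom", "door_b": "bedroom"},
]
ranking = rank_waypoints(samples)
```

In this toy case the samples agree on what lies behind `door_b` and disagree about `door_a`, so `door_a` ranks first: observing it would resolve more uncertainty about the scene's semantics.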

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Top (left): A robot exploring an indoor environment using our proposed method. The first 15 exploration steps are highlighted in red, while the remaining steps are shown in gray. The robot quickly covers most semantically salient regions in spite of visual occlusions and narrow corridors/doors, after which it explores the semantic details of the scene. Top (right): The final scene graph and mesh constructed after the robot fully explores the environment (visualization using Clio [maggio2024clio]). Bottom: Our approach consists of three main components. The scene graph construction/mapping module (A) takes RGB-D and pose from the Habitat simulator, object segmentation from YoloE, and wall segmentation from YOSO as input, and constructs a multi-layer scene graph with four elements: rooms, structures, "nothing", and objects. The reasoning and information gain estimation module (B) provides the scene graph and a user prompt as input to an LLM to generate putative scene graphs that are consistent with the current scene graph. The panel shows three such samples, with the most informative region highlighted by the purple rectangle. The planning module (C) includes a graph planner that computes a high-level path over the scene graph, which is executed by an occupancy grid-based local planner.
  • Figure 2: The LLM produces a semantically plausible completion of a given scene graph that is also roughly consistent in terms of the geometry, e.g., the bed and nightstand are within the bedroom and next to the wardrobe. There is also some degree of semantic inconsistency in the LLM completions, e.g., the bed is unlikely to be at this angle right next to the wall. The red part of the scene graph indicates what the camera expects to observe from a particular viewpoint.
  • Figure 3: Average F1 score and GED as a function of path length. The wider spacing of the dots along the curve for the Semantic Exploration method indicates that our approach usually travels farther per step: the mean travel distance per step is 2 m for Frontier, 4.6 m for Semantic Exploration, and 4 m for SSMI.
  • Figure 4: Visualization of the paths taken to locate a kitchen: semantic exploration vs. baseline. Semantic exploration follows a shorter trajectory and successfully predicts the kitchen’s location before it is observed.
  • Figure 5: Ablation on scene-graph node types. Constraints are enabled progressively: (A) Baseline; (B) +Nothing (free-space); (C) +Nothing+Structure; (D) +Nothing+Structure+Door.
  • ...and 2 more figures
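
The Figure 1 caption describes a multi-layer scene graph with four element types: rooms, structures, "nothing", and objects. The sketch below shows one way such a graph could be held in memory. It is an illustration only: the class names, fields, and the parent/child layout are assumptions and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    ROOM = "room"
    STRUCTURE = "structure"  # e.g. walls, windows
    NOTHING = "nothing"      # observed free ("negative") space
    OBJECT = "object"


@dataclass
class Node:
    node_id: str
    layer: Layer
    label: str
    position: tuple  # (x, y, z) in metres; hypothetical field


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)     # node_id -> Node
    children: dict = field(default_factory=dict)  # node_id -> list of child ids

    def add(self, node, parent_id=None):
        """Insert a node, optionally attaching it under an existing parent."""
        if parent_id is not None and parent_id not in self.nodes:
            raise KeyError(f"unknown parent: {parent_id}")
        self.nodes[node.node_id] = node
        self.children.setdefault(node.node_id, [])
        if parent_id is not None:
            self.children[parent_id].append(node.node_id)

    def in_layer(self, layer):
        """Return all nodes that belong to the given layer."""
        return [n for n in self.nodes.values() if n.layer is layer]


# A bedroom containing a bed and bounded by a wall, as in the Figure 2 example.
graph = SceneGraph()
graph.add(Node("r0", Layer.ROOM, "bedroom", (0.0, 0.0, 0.0)))
graph.add(Node("o0", Layer.OBJECT, "bed", (1.0, 0.5, 0.0)), parent_id="r0")
graph.add(Node("s0", Layer.STRUCTURE, "wall", (2.0, 0.0, 0.0)), parent_id="r0")
```

Keeping the layer as an explicit field makes it straightforward to enable or disable node types when serializing the graph for an LLM prompt, which is the kind of comparison the Figure 5 ablation reports.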