Active Semantic Perception
Huayi Tang, Pratik Chaudhari
TL;DR
The paper addresses semantically informed active exploration for robots by introducing a hierarchical multi-layer scene graph that captures rooms, structures, objects, and negative space. A three-module pipeline combines mapping, LLM-based sampling of plausible completions of unobserved regions, and planning that maximizes semantic information gain during navigation. Evaluations in realistic HM3D indoor environments show faster, more accurate semantic grounding and more efficient room discovery than geometry-based or semantics-based baselines; the planner has two levels, coupling graph-based guidance with local optimization. The authors acknowledge LLM hallucinations and latency as limitations and point to future work on specialized reasoning models and broader semantic coverage.
Abstract
We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, hierarchical multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows, etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. These samples are used to compute the information gain of a potential waypoint, which enables sophisticated spatial reasoning, e.g., the two doors in the living room can lead to either a kitchen or a bedroom. We evaluate this approach in complex, realistic 3D indoor environments in simulation. We show using qualitative and quantitative experiments that our approach can pin down the semantics of the environment more quickly and more accurately than baseline approaches.
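The idea of scoring a waypoint by the information gain computed from sampled scene completions can be illustrated with a minimal sketch. This is not the paper's formulation: it assumes that the sampled completions are equally likely and that observations are noise-free, in which case the mutual information between the scene hypothesis and the observation at a waypoint reduces to the entropy of the predicted observations. The function names, the waypoint names, and the representation of an observation as a set of semantic labels are all hypothetical.

```python
import math
from collections import Counter


def expected_information_gain(predicted_observations):
    """Entropy, in bits, of the predicted observation at one waypoint.

    predicted_observations holds one hashable item per sampled scene
    completion, e.g. a frozenset of the semantic labels the robot would
    expect to see from that waypoint under that sample.

    Assumption (not from the paper): samples are equally weighted and
    observations are deterministic given the sample.
    """
    n = len(predicted_observations)
    if n == 0:
        return 0.0
    counts = Counter(predicted_observations)
    # H = sum_i p_i * log2(1 / p_i), with p_i = count_i / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def best_waypoint(samples_by_waypoint):
    """Return the waypoint whose predicted observations are most uncertain."""
    return max(
        samples_by_waypoint,
        key=lambda w: expected_information_gain(samples_by_waypoint[w]),
    )


# Hypothetical data: four sampled completions per candidate waypoint.
samples_by_waypoint = {
    # The samples disagree about what lies behind door_A.
    "door_A": [
        frozenset({"kitchen"}),
        frozenset({"bedroom"}),
        frozenset({"kitchen"}),
        frozenset({"bedroom"}),
    ],
    # Every sample agrees about door_B, so visiting it resolves nothing.
    "door_B": [frozenset({"hallway"})] * 4,
}

gain_a = expected_information_gain(samples_by_waypoint["door_A"])
gain_b = expected_information_gain(samples_by_waypoint["door_B"])
chosen = best_waypoint(samples_by_waypoint)
```

In this toy setup the waypoint where the samples disagree (kitchen versus bedroom) receives one bit of expected gain, while the waypoint on which all samples agree receives zero, so the planner would prefer the former. A real system would also need to weigh travel cost and handle noisy observations, which this sketch omits.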
