Open Scene Graphs for Open World Object-Goal Navigation

Joel Loo, Zhanxin Wu, David Hsu

TL;DR

The paper introduces Open Scene Graphs (OSGs) as a configurable topo-semantic memory for open-world object-goal navigation and presents OpenSearch, a system that composes foundation models to perform zero-shot, open-vocabulary searches across diverse environments and robot embodiments. OSGs structure open-set scene information into a layered, hierarchical graph that can be instantiated per environment type, enabling LLM-based reasoning to plan, explore, and navigate toward target objects specified in natural language. The approach is validated through simulation and real-world experiments, showing improved reasoning and generalisation over existing LLM-based methods and demonstrating zero-shot robustness to new environments, robots, and instructions. Limitations include computational cost and lack of explicit uncertainty handling, with future work pointing to online OSG spec inference and more efficient inference for broader real-world applicability.

Abstract

How can we build robots for open-world semantic navigation tasks, like searching for target objects in novel scenes? While foundation models have the rich knowledge and generalisation needed for these tasks, a suitable scene representation is needed to connect them into a complete robot system. We address this with Open Scene Graphs (OSGs), a topo-semantic representation that retains and organises open-set scene information for these models, and has a structure that can be configured for different environment types. We integrate foundation models and OSGs into the OpenSearch system for Open World Object-Goal Navigation, which is capable of searching for open-set objects specified in natural language, while generalising zero-shot across diverse environments and embodiments. Our OSGs enhance reasoning with Large Language Models (LLMs), enabling robust object-goal navigation that outperforms existing LLM approaches. Through simulation and real-world experiments, we validate OpenSearch's generalisation across varied environments, robots and novel instructions.
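To make the idea of a configurable, layered topo-semantic graph more concrete, here is a minimal Python sketch. It is not the authors' implementation: the spec format (a dict mapping each layer to the layer allowed to contain it), the layer names, and the `SceneGraph` class are all hypothetical and chosen only for illustration. The point it shows is that the same graph code can be instantiated with a different spec per environment type.

```python
from dataclasses import dataclass, field

# Hypothetical specs (not the paper's format): each layer maps to the layer
# that is allowed to contain it; None marks the top layer.
HOME_SPEC = {"floor": None, "room": "floor", "object": "room"}
SUPERMARKET_SPEC = {"floor": None, "aisle": "floor", "object": "aisle"}


@dataclass
class Node:
    name: str
    layer: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    neighbours: set = field(default_factory=set)  # names of connected same-layer nodes


class SceneGraph:
    """Layered graph whose allowed structure is set by a spec."""

    def __init__(self, spec):
        self.spec = spec
        self.nodes = {}

    def add(self, name, layer, parent=None):
        # Reject layers the spec does not define for this environment type.
        if layer not in self.spec:
            raise ValueError(f"layer {layer!r} is not in the spec")
        parent_node = self.nodes[parent] if parent is not None else None
        parent_layer = parent_node.layer if parent_node is not None else None
        # Enforce the containment hierarchy declared in the spec.
        if parent_layer != self.spec[layer]:
            raise ValueError(f"{layer!r} must be contained in {self.spec[layer]!r}")
        node = Node(name, layer, parent_node)
        if parent_node is not None:
            parent_node.children.append(node)
        self.nodes[name] = node
        return node

    def connect(self, a, b):
        # Topological edge between two nodes of the same layer.
        na, nb = self.nodes[a], self.nodes[b]
        if na.layer != nb.layer:
            raise ValueError("only nodes in the same layer can be connected")
        na.neighbours.add(b)
        nb.neighbours.add(a)

    def path_to_root(self, name):
        # Walk up the hierarchy, e.g. object -> room -> floor.
        node, path = self.nodes[name], []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


g = SceneGraph(HOME_SPEC)
g.add("floor_1", "floor")
g.add("kitchen", "room", parent="floor_1")
g.add("hallway", "room", parent="floor_1")
g.connect("kitchen", "hallway")
g.add("mug", "object", parent="kitchen")
print(g.path_to_root("mug"))  # ['mug', 'kitchen', 'floor_1']
```

Swapping `HOME_SPEC` for `SUPERMARKET_SPEC` changes which layers are valid without touching the graph code, which is the sense of "configurable" this sketch assumes.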

Paper Structure (47 sections, 8 figures, 11 tables, 5 algorithms)

This paper contains 47 sections, 8 figures, 11 tables, 5 algorithms.

Figures (8)

  • Figure 1: Open Scene Graphs enable Open World ObjectNav systems. We design the OpenSearch system for open-vocabulary ObjectNav across environments and embodiments. Our OSG acts as a suitable semantic scene memory for the foundation models that provide OpenSearch's semantic understanding and generalisation. Notably, OSGs facilitate generalisation over diverse environments since their structures can be dynamically configured to best represent the current environment type.
  • Figure 2: OpenSearch system overview. The OSG Mapper and Reasoner build and reason with the OSG using templated prompts. These prompts are grounded with concepts specified in an OSG spec.
  • Figure 3: Value of topo-semantic information on scene structure to LLM-based ObjectNav. In both cases, there are few objects near the robot's starting position and hence sparse object cues. (a): LFG defaults to metric search, while OpenSearch (OS) is guided by its region-level spatial understanding to efficiently sweep through the rooms. (b): OS recognises it is starting in a bedroom and searches nearby Connectors, while LFG lacks region/structure understanding and searches in the wrong direction.
  • Figure 4: OSG construction results in 4 HM3D scenes. The built OSGs are largely accurate and representative of the scenes, though there are some errors highlighted with red labels.
  • Figure 5: OSG of a simulated supermarket from Gibson (Gratz). The OSG Mapper is able to use the spatial concepts described in the supermarket OSG spec (listing osg_spec_supermarket) to recognise and produce semantically meaningful labels for different aisles (Places) in the supermarket, and can localise and build a coherent OSG of this different scene.
  • ...and 3 more figures
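
The Figure 2 caption mentions templated prompts that are grounded with concepts from an OSG spec. A minimal sketch of that idea follows, using Python's standard `string.Template`. The spec fields (`environment`, `place_types`), the prompt wording, and the function name `render_place_prompt` are assumptions made for illustration; the paper's actual prompts and spec format are not reproduced here.

```python
from string import Template

# Hypothetical spec fragment; field names and values are illustrative only.
SUPERMARKET_SPEC = {
    "environment": "supermarket",
    "place_types": ["aisle", "checkout area", "entrance"],
}

# One fixed template; the environment-specific concepts come from the spec.
PLACE_PROMPT = Template(
    "You are a robot in a $environment. "
    "Which of these place types best describes the current view: $place_types? "
    "Answer with one label."
)


def render_place_prompt(spec):
    """Fill the template with the concepts named in the given spec."""
    return PLACE_PROMPT.substitute(
        environment=spec["environment"],
        place_types=", ".join(spec["place_types"]),
    )


print(render_place_prompt(SUPERMARKET_SPEC))
```

Under these assumptions, moving to a new environment type means supplying a different spec, while the template text stays the same.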