SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

Hang Yin, Xiuwei Xu, Zhenyu Wu, Jie Zhou, Jiwen Lu

TL;DR

To the best of our knowledge, SG-Nav is the first zero-shot method to outperform supervised object navigation methods on the challenging MP3D benchmark. It also introduces a re-perception mechanism that lets the object navigation framework correct perception errors.

Abstract

In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with text describing spatially close objects, which lacks sufficient scene context for in-depth reasoning. To better preserve information about the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms in an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt that helps the LLM reason about the goal location from scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we design a re-perception mechanism that gives the object navigation framework the ability to correct perception errors. We conduct extensive experiments in the MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% success rate (SR) on all benchmarks, while its decision process remains explainable. To the best of our knowledge, SG-Nav is the first zero-shot method to outperform supervised object navigation methods on the challenging MP3D benchmark.
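
To make the idea of an LLM-friendly graph concrete, the following is a minimal sketch of a three-level scene graph (objects, groups, rooms) whose edges are flattened into plain text for a prompt. It is an illustration only, not the authors' implementation: the `SceneGraph` class, its method names, the relation strings and the example labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy hierarchical scene graph (hypothetical structure, not SG-Nav's code)."""

    # node id -> (level, label); level is "object", "group" or "room"
    nodes: dict = field(default_factory=dict)
    # (source id, relation, target id) triples
    edges: list = field(default_factory=list)

    def add_node(self, node_id, level, label):
        self.nodes[node_id] = (level, label)

    def add_edge(self, src, relation, dst):
        # Both endpoints must already be registered as nodes.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be registered first")
        self.edges.append((src, relation, dst))

    def describe(self):
        """Serialize every edge as one line of text suitable for a prompt."""
        lines = []
        for src, rel, dst in self.edges:
            s_level, s_label = self.nodes[src]
            d_level, d_label = self.nodes[dst]
            lines.append(f"{s_level} '{s_label}' {rel} {d_level} '{d_label}'")
        return "\n".join(lines)


# Assumed example scene: labels and relations are made up for illustration.
graph = SceneGraph()
graph.add_node("o1", "object", "sofa")
graph.add_node("o2", "object", "tv")
graph.add_node("g1", "group", "seating area")
graph.add_node("r1", "room", "living room")
graph.add_edge("o1", "faces", "o2")
graph.add_edge("o1", "belongs to", "g1")
graph.add_edge("g1", "is in", "r1")
prompt_text = graph.describe()
print(prompt_text)
```

A text block like `prompt_text` could then be embedded in a chain-of-thought prompt that asks where the goal object is likely to be; how the paper actually words its prompt is not specified in this section.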

Paper Structure

This paper contains 18 sections, 10 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Unlike previous zero-shot object navigation methods (yu2023l3mvn; zhou2023esc) that directly prompt the LLM with the text of nearby object categories, we construct a hierarchical 3D scene graph to represent the observed environment and prompt the LLM to fully exploit the structural information in the graph. SG-Nav preserves fine-grained scene context and makes reasonable, explainable decisions.
  • Figure 2: Pipeline of SG-Nav. We construct a hierarchical 3D scene graph and an occupancy map online. At each step, we divide the scene graph into several subgraphs, each of which is passed to the LLM with a hierarchical chain-of-thought prompt for structural understanding of the scene context. We interpolate the probability score of each subgraph to the frontiers and select the frontier with the highest score for exploration. The decision is also explainable, since the reasoning process of the LLM can be summarized. With the scene graph representation, we further design a re-perception mechanism, which helps the agent abandon false-positive goal objects through continuous credibility judgment.
  • Figure 3: The incremental generation of edges. We densely connect newly registered nodes (purple) to all other nodes by efficiently prompting the LLM. We divide the edges into long edges and short edges and prune less informative ones with different strategies.
  • Figure 4: Per-category success rate (SR) on MP3D.
  • Figure 5: Time cost of connecting $n$ edges.
  • ...and 7 more figures
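
The frontier-selection step described in the Figure 2 caption can be sketched as follows. This is a hedged illustration: the caption does not say how scores are interpolated, so the inverse-distance weighting below is an assumption, and `score_frontiers`, `select_frontier`, the 2D coordinates and the example probabilities are all hypothetical.

```python
import math


def score_frontiers(frontiers, subgraphs, eps=1e-6):
    """Interpolate subgraph probabilities onto frontiers.

    frontiers: list of (x, y) positions on the occupancy map.
    subgraphs: list of ((x, y), probability) pairs, one per subgraph.
    Assumption: inverse-distance weighting, so nearby subgraphs count more.
    """
    scores = []
    for fx, fy in frontiers:
        total = 0.0
        weight_sum = 0.0
        for (sx, sy), prob in subgraphs:
            weight = 1.0 / (math.hypot(fx - sx, fy - sy) + eps)
            total += weight * prob
            weight_sum += weight
        scores.append(total / weight_sum if weight_sum else 0.0)
    return scores


def select_frontier(frontiers, subgraphs):
    """Return the frontier with the highest interpolated score, and that score."""
    scores = score_frontiers(frontiers, subgraphs)
    best = max(range(len(frontiers)), key=scores.__getitem__)
    return frontiers[best], scores[best]


# Made-up example: the subgraph near the first frontier has the higher
# LLM-assigned probability, so the first frontier should be chosen.
frontiers = [(0.0, 0.0), (10.0, 0.0)]
subgraphs = [((1.0, 0.0), 0.9), ((9.0, 0.0), 0.2)]
best_frontier, best_score = select_frontier(frontiers, subgraphs)
print(best_frontier, round(best_score, 2))
```

In this sketch a frontier's score is a weighted average of subgraph probabilities, so a frontier inherits mostly the score of the subgraphs closest to it. `select_frontier` raises `ValueError` when the frontier list is empty, which a real agent would need to handle as an exploration-finished case.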