ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation

Kaiwen Zhou; Kaizhi Zheng; Connor Pryor; Yilin Shen; Hongxia Jin; Lise Getoor; Xin Eric Wang

TL;DR

The paper addresses zero-shot object navigation by transferring commonsense knowledge from pre-trained vision-language models and large language models to open-world environments. It introduces ESC, which grounds scenes with GLIP, reasons about object-room relations with an LLM, and translates this knowledge into exploration actions via Probabilistic Soft Logic in a frontier-based planner, all without navigation training. ESC achieves state-of-the-art zero-shot results on MP3D, HM3D, and RoboTHOR, significantly outperforming prior zero-shot baselines and narrowing gaps to supervised methods. The approach demonstrates the value of explicitly leveraging pre-trained commonsense for embodied AI tasks and points to future work in expanding relational knowledge and selective fine-tuning. Overall, ESC offers a training-free, generalizable framework for integrating perception, reasoning, and structured exploration in embodied agents.

Abstract

The ability to accurately locate and navigate to a specific object is a crucial capability for embodied agents that operate in the real world and interact with objects to complete tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects, which generalizes poorly to novel objects in unknown environments. In this work, we present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), that transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience or any other training on the visual environments. First, ESC leverages a pre-trained vision and language model for open-world prompt-based grounding and a pre-trained commonsense language model for room and object reasoning. Then ESC converts commonsense knowledge into navigation actions by modeling it as soft logic predicates for efficient exploration. Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks show that ESC improves significantly over baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 288% relative Success Rate improvement over CoW on MP3D).
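To make the idea of turning commonsense into exploration actions more concrete, the following is a minimal sketch of frontier selection under soft commonsense constraints. It is not the authors' implementation: the `Frontier` class, the prior dictionaries, the weights, and the scoring rule are all hypothetical, and the full Probabilistic Soft Logic inference is replaced here by a simple weighted sum of Lukasiewicz conjunctions minus a distance penalty.

```python
from dataclasses import dataclass, field


@dataclass
class Frontier:
    """Hypothetical frontier record; all fields are illustrative assumptions."""
    name: str
    distance: float                                    # path length from the agent
    near_rooms: dict = field(default_factory=dict)     # room label -> grounding confidence in [0, 1]
    near_objects: dict = field(default_factory=dict)   # object label -> grounding confidence in [0, 1]


def lukasiewicz_and(a: float, b: float) -> float:
    """Lukasiewicz relaxation of logical AND for truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def frontier_score(frontier, room_prior, object_prior,
                   w_room=1.0, w_obj=1.0, w_dist=0.1):
    """Score a frontier from soft evidence (simplified stand-in for PSL inference).

    room_prior / object_prior map labels to an assumed LLM-derived likelihood
    that the goal object is found in that room or near that object.
    """
    room_support = max(
        (lukasiewicz_and(conf, room_prior.get(room, 0.0))
         for room, conf in frontier.near_rooms.items()),
        default=0.0,
    )
    object_support = max(
        (lukasiewicz_and(conf, object_prior.get(obj, 0.0))
         for obj, conf in frontier.near_objects.items()),
        default=0.0,
    )
    # Closer frontiers are preferred, all else being equal.
    return w_room * room_support + w_obj * object_support - w_dist * frontier.distance


def select_frontier(frontiers, room_prior, object_prior):
    """Pick the frontier with the highest soft-constraint score."""
    return max(frontiers, key=lambda f: frontier_score(f, room_prior, object_prior))


if __name__ == "__main__":
    # Goal: fireplace. The priors below are made-up numbers for illustration only.
    room_prior = {"living room": 0.9, "bathroom": 0.05}
    object_prior = {"sofa": 0.8, "toilet": 0.0}
    frontiers = [
        Frontier("A", 2.0, {"bathroom": 0.9}, {"toilet": 0.9}),
        Frontier("B", 4.0, {"living room": 0.8}, {"sofa": 0.7}),
    ]
    print(select_frontier(frontiers, room_prior, object_prior).name)
```

In this toy setup the farther frontier next to the living room is preferred over the nearer one next to the bathroom, because the commonsense support outweighs the distance penalty. The constraints are soft in the sense that weak or conflicting evidence lowers a frontier's score rather than ruling it out.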

Paper Structure (26 sections, 17 equations, 5 figures, 5 tables, 1 algorithm)

Introduction
Problem Definition
Our ESC Approach
Open-World Semantic Scene Understanding
Commonsense Reasoning for ObjNav via LLM
Commonsense Guided Exploration
Frontier-based Exploration
Soft Commonsense Constraints
Experimental Setup
Benchmarks and Metrics
Baselines
Implementation Details
Results and Analysis
Result Comparison with SOTA Methods
Ablation Study
...and 11 more sections

Figures (5)

Figure 1: Commonsense reasoning in object navigation. In object navigation, our agent first does a semantic understanding of the current scene (red text in the figure) and then performs commonsense reasoning (blue text in the figure). The agent reasons that a fireplace is likely to be in a living room, so it decides to explore the unobserved part of the living room (the frontier adjacent to the observed part of the living room).
Figure 2: The ESC framework. During navigation, the agent performs scene understanding based on RGB observations and prompts. Meanwhile, the Mapping module constructs a semantic map containing room, object, and frontier information. Conditioned on the goal object and semantic scene information, the agent then performs commonsense reasoning via an LLM to infer the probable location of the goal object, and selects a frontier to explore using PSL.
Figure 3: Comparison of the success rate of each goal category on MP3D between CoW (Gadre et al., 2022) and ESC.
Figure 4: Success rate of each goal category on the HM3D (left) and RoboTHOR (right) datasets for ESC and CoW.
Figure 5: An example showing how commonsense reasoning helps the agent choose better frontiers that lead it to the goal 'toilet'. 'U' means scene understanding and 'R' means commonsense reasoning.
