AERR-Nav: Adaptive Exploration-Recovery-Reminiscing Strategy for Zero-Shot Object Navigation

Jingzhi Huang, Junkai Huang, Haoyang Yang, Haoang Li, Yi Wang

Abstract

Zero-Shot Object Navigation (ZSON) in unknown multi-floor environments remains a significant challenge. Recent methods, mostly built on greedy waypoint selection by semantic value, spatial topology-enhanced memory, and Multimodal Large Language Models (MLLMs) as decision-making frameworks, have led to improvements. However, these architectures struggle to balance exploration and exploitation in unseen environments, especially multi-floor ones, where robots get stuck at narrow intersections, wander endlessly, or fail to find stair entrances. To overcome these challenges, we propose AERR-Nav, a Zero-Shot Object Navigation framework that dynamically adjusts its state based on the robot's environment. Specifically, AERR-Nav has two key advantages: (1) an Adaptive Exploration-Recovery-Reminiscing Strategy that enables robots to transition dynamically between three states, facilitating specialized responses to diverse navigation scenarios; and (2) an Adaptive Exploration State featuring Fast-Thinking and Slow-Thinking modes, which helps robots better balance exploration, exploitation, and higher-level reasoning as environmental information evolves. Extensive experiments on the HM3D and MP3D benchmarks demonstrate that AERR-Nav achieves state-of-the-art performance among zero-shot methods. Comprehensive ablation studies further validate the efficacy of the proposed strategy and modules.

Paper Structure (17 sections, 6 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: AERR-Nav pipeline: The robot adopts different strategies in different states. During exploration, the optimization function acts as the cerebellum and performs "fast thinking", while the MLLM acts as the brain and performs "slow thinking". When the robot becomes trapped, one of two recovery methods is used, depending on the distance to the target frontier point. After exploration of a single floor is complete, the MLLM uses the keypoint map to guide the robot in autonomously revisiting a potential target or stairway entrance.
  • Figure 2: Example analysis of the exploration strategy: In Example 1, (a) the robot judges that the region behind it has higher value scores and greater uncertainty; (b) after seeing a door, the MLLM reasons that a bed is unlikely to be in a bathroom and therefore chooses the hallway; (c) after a period of rapid exploration, the robot encounters another door and the MLLM infers that the bed is more likely to be in a bedroom than in the dining area; and (d) the robot eventually finds the target bed. In Example 2, (e) upon seeing a door, the MLLM judges that the right-side door appears to lead to a bathroom and is thus more likely to contain a toilet; (f) the robot does not observe a toilet in the area near the right-side door, so fast thinking is triggered and, based on uncertainty, the robot goes to the left-side region; (g) after a period of fast thinking, the robot originally intended to return to the initial right-side area, but upon seeing a door, the MLLM infers that the narrow room may be a bathroom; and (h) the robot finds the target toilet.
  • Figure 3: Explanation of the recovery state workflow: (a) and (d) illustrate abnormal behaviors caused by the distance to the frontier point. (b)$\rightarrow$(c) demonstrate how overly distant frontier points are handled using a segmented method. (e)$\rightarrow$(f) show how the MLLM progressively approaches the target frontier point via fine-grained action adjustments.
  • Figure 4: Explanation of how the reminiscing state facilitates task progress: (a) visualizes the representation of the KeyPoint Map. (b) illustrates how the MLLM chooses keypoints and locates the staircase entrance. (c) shows the robot descending the stairs after locating the stairway entrance and then continuing to search for the target.
  • Figure 5: Case Study 1: (a)–(f) and (g)–(l) illustrate the complete process of AERR-Nav and ASCENTStairway, respectively, searching for the sofa.
  • ...and 1 more figure
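
The three-state behavior described in the Figure 1 and Figure 3 captions can be pictured as a small state machine. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `Observation` fields, the priority of entrapment over floor completion, the 3.0 m threshold, and the choice of fine-grained adjustment whenever the frontier is not "overly distant" are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    EXPLORATION = auto()
    RECOVERY = auto()
    REMINISCING = auto()


@dataclass
class Observation:
    """Hypothetical per-step signals; field names are assumptions."""
    is_trapped: bool          # robot is stuck and not progressing
    floor_explored: bool      # exploration of the current floor is complete
    frontier_distance: float  # distance (m) to the target frontier point


def next_state(obs: Observation) -> State:
    """Choose a state from the current signals.

    Assumed priority: entrapment first, then floor completion,
    otherwise keep exploring.
    """
    if obs.is_trapped:
        return State.RECOVERY
    if obs.floor_explored:
        return State.REMINISCING
    return State.EXPLORATION


def recovery_method(obs: Observation, far_threshold: float = 3.0) -> str:
    """Pick one of the two recovery methods by frontier distance.

    The threshold value is a placeholder, not a figure from the paper.
    """
    if obs.frontier_distance > far_threshold:
        return "segmented"      # split the path to an overly distant frontier
    return "fine_grained"       # MLLM-guided small action adjustments


if __name__ == "__main__":
    obs = Observation(is_trapped=True, floor_explored=False, frontier_distance=5.0)
    print(next_state(obs), recovery_method(obs))
```

Keeping the transition rule in one pure function makes it easy to test each scenario in isolation; a real system would derive these signals from the map and the MLLM rather than from hand-set flags.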