Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill

Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, Hao Dong

TL;DR

This work introduces PixNav, a pixel-guided navigation skill that enables zero-shot object navigation using only RGB inputs. By treating a pixel in the initial frame as the navigation goal, PixNav replaces traditional map-based planning with a transformer-based policy, augmented by foundation-model perception and an LLM planner for long-horizon exploration. The approach is trained on large-scale RGB trajectories and validated in HM3D and real-world settings, achieving competitive zero-shot performance and demonstrating robustness to camera variations. The study highlights the practical potential of integrating pixel-level goals, vision-language perception, and structured LLM planning to bridge foundation models with embodied navigation.

Abstract

Zero-shot object navigation is a challenging task for home-assistance robots. The task requires visual grounding, commonsense inference, and locomotion; the first two are inherent in foundation models, but for locomotion most works still depend on map-based planning. The gap between RGB space and map space makes it difficult to transfer knowledge from foundation models directly to navigation tasks. In this work, we propose a Pixel-guided Navigation skill (PixNav) that bridges the gap between foundation models and the embodied navigation task. Recent foundation models can readily indicate an object by pixels, and with pixels as the goal specification, our method becomes a versatile navigation policy for all kinds of objects. PixNav is also a pure RGB-based policy, which can reduce the cost of home-assistance robots. Experiments demonstrate the robustness of PixNav, which achieves a success rate above 80% in the local path-planning task. To perform long-horizon object navigation, we design an LLM-based planner that uses commonsense knowledge about objects and rooms to select the best waypoint. Evaluations in both photorealistic indoor simulators and real-world environments validate the effectiveness of the proposed navigation strategy. Code and video demos are available at https://github.com/wzcai99/Pixel-Navigator.
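
To make the idea of a pixel as goal specification concrete, the following is a minimal sketch of how a goal pixel could be drawn from an object mask. It is an illustration only, not the authors' implementation: the mask format, the function names `object_pixels` and `sample_goal_pixel`, and the uniform sampling rule are all assumptions, and the abstract does not say how the pixel is actually chosen.

```python
import random
from typing import List, Tuple

# Hypothetical mask format: mask[row][col] is True where the object is visible.
Mask = List[List[bool]]
Pixel = Tuple[int, int]


def object_pixels(mask: Mask) -> List[Pixel]:
    """Return the (row, col) coordinates of every pixel covered by the object."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, covered in enumerate(row)
        if covered
    ]


def sample_goal_pixel(mask: Mask, rng: random.Random) -> Pixel:
    """Pick one object pixel uniformly at random (an assumed sampling rule)."""
    pixels = object_pixels(mask)
    if not pixels:
        raise ValueError("object is not visible in this frame")
    return rng.choice(pixels)
```

Any pixel returned this way lies on the object, so the same mask can yield many distinct goal specifications for a pixel-conditioned policy.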

Paper Structure (17 sections, 2 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Users may ask a home-assistance robot to search for and interact with uncommon objects such as 'a rocking chair' or 'an infant bed'. It is challenging for the robot to develop a multi-modal navigation policy that can accommodate various types of objects using both text and image inputs. However, since foundation models are capable of zero-shot image understanding, any object can be indicated by a single pixel when such foundation models are used. Therefore, we train a navigation policy with pixels as goal specifications, enabling the robot to navigate towards arbitrary objects. Since each object typically comprises hundreds of pixels, this approach offers a wide range of possible navigation trajectories, which greatly enlarges the scale of the navigation dataset.
  • Figure 2: The pipeline of our RGB-centric strategy for zero-shot object navigation. In each cycle, the agent first captures panoramic images of its surroundings, either by rotating the robot in place or by installing multiple RGB cameras. A vision-language model then translates this visual data into a textual description. Using a carefully crafted step-by-step prompt template, the LLM plans the best next step toward the target location. The target location is then indicated as a pixel and passed to the navigation policy. The PixNav policy continuously receives observation images and performs actions until it arrives at the goal area.
  • Figure 3: An example of the VLM translation. The VLM is prompted with room-level and object-level queries. The robot performs a series of rotations, capturing six distinct images to ensure a comprehensive panoramic view.
  • Figure 4: The step-by-step prompt template that lets the LLM make a reasonable navigation plan. After translating the panoramic images into text, we first use the LLM to summarize the captions into highly structured data and ask the LLM to estimate the robot's location according to commonsense knowledge of indoor environments. With the robot's location, we can cluster the data into room-level instances, so the layout can be described as a graph. Based on this graph, we then ask the LLM to provide a plan that considers both exploitation and exploration.
  • Figure 5: Trajectory visualization of both PixNav policy and long-horizon object navigation. The top part represents the first-person view of PixNav trajectory. Even though the robot's view changes constantly, our model can still predict the original goal and walk towards it. The bottom part represents the top-down view of the entire trajectory. Our LLM-based planner helps the robot achieve consistent and semantic exploration towards the target.
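
The cycle described in the Figure 2 caption can be sketched as a simple control loop. This is a hedged illustration under stated assumptions, not the released code: the `Waypoint` fields, the `navigate` function, the four callables it takes, the `found_target` stopping flag, and the `max_cycles` budget are all hypothetical names introduced here to show the order of the steps.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Image = Any  # placeholder for whatever image type the camera driver returns
Pixel = Tuple[int, int]


@dataclass
class Waypoint:
    view_index: int     # which panoramic image contains the waypoint
    pixel: Pixel        # goal pixel inside that image
    found_target: bool  # hypothetical flag: the planner believes this is the target


def navigate(
    target: str,
    capture_panorama: Callable[[], List[Image]],
    describe: Callable[[Image], str],
    plan: Callable[[str, List[str]], Waypoint],
    follow_pixel: Callable[[Image, Pixel], None],
    max_cycles: int = 10,
) -> bool:
    """Run perception, planning, and pixel-guided navigation cycles."""
    for _ in range(max_cycles):
        views = capture_panorama()                # 1. panoramic RGB images
        captions = [describe(v) for v in views]   # 2. VLM turns images into text
        waypoint = plan(target, captions)         # 3. LLM picks the next waypoint
        # 4. The pixel-guided policy walks to the chosen pixel.
        follow_pixel(views[waypoint.view_index], waypoint.pixel)
        if waypoint.found_target:
            return True
    return False  # cycle budget exhausted without reaching the target
```

Passing the perception, planning, and control components in as callables keeps the loop independent of any particular VLM, LLM, or policy, which is convenient for testing with stubs.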