Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

Jiaqi Chen; Bingqian Lin; Xinmin Liu; Lin Ma; Xiaodan Liang; Kwan-Yee K. Wong

TL;DR

AO-Planner introduces a zero-shot affordances-oriented framework for continuous vision-language navigation (VLN-CE) that links LLM-driven low-level motion planning with high-level decision making. It uses Grounded SAM to segment the navigable ground as affordances, applies Visual Affordances Prompting (VAP) to produce waypoint candidates in RGB space, and employs a high-level PathAgent to select among candidate paths, finally mapping 2D pixel paths to 3D world coordinates via depth and camera intrinsics. The approach achieves state-of-the-art zero-shot performance on R2R-CE and RxR-CE, with notable SPL gains, and reaches performance competitive with supervised methods through waypoint distillation. This work demonstrates the viability of integrating foundation models for both low-level control and high-level reasoning in 3D navigation, offering a practical path toward zero-shot low-level motion planning and data-efficient learning from pseudo-labels.

Abstract

LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, existing LLM-based methods often focus only on high-level task planning, selecting nodes in predefined navigation graphs for movement and overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for continuous VLN tasks. AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach, where the visible ground is segmented by SAM to provide navigational affordances, based on which the LLM selects potential candidate waypoints and plans low-level paths towards the selected waypoints. We further propose a high-level PathAgent, which marks planned paths in the image input and reasons about the most probable path by comprehending all environmental information. Finally, we convert the selected path into 3D coordinates using camera intrinsic parameters and depth information, avoiding challenging 3D predictions for LLMs. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (8.8% improvement on SPL). Our method can also serve as a data annotator to obtain pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor does not require any waypoint data from the simulator and achieves 47% SR, competitive with supervised methods. We establish an effective connection between LLMs and the 3D world, presenting novel prospects for employing foundation models in low-level motion control.
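
The final step in the abstract, converting a selected pixel path into 3D coordinates from depth and camera intrinsics, can be illustrated with standard pinhole back-projection. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the intrinsics `fx`, `fy`, `cx`, `cy`, the camera-frame axis convention (x right, y down, z forward), and the metric depth map are all hypothetical choices made for illustration.

```python
import numpy as np

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project one pixel (u, v) with metric depth into camera coordinates.

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy); lens distortion is ignored.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth], dtype=float)

def path_to_camera(pixels, depth_map, fx, fy, cx, cy):
    """Convert a 2D pixel path [(u, v), ...] into an (N, 3) array of 3D points.

    depth_map is indexed as depth_map[row, col], i.e. depth_map[v, u].
    """
    points = []
    for u, v in pixels:
        z = float(depth_map[v, u])
        points.append(pixel_to_camera(u, v, z, fx, fy, cx, cy))
    return np.stack(points)

# Hypothetical example: a flat depth map 2 m away and a two-pixel path.
depth_map = np.full((4, 4), 2.0)
path_3d = path_to_camera([(2, 2), (3, 2)], depth_map, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
```

The result is expressed in the camera frame; moving to world coordinates would additionally require the agent's pose, which this sketch does not cover.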

Paper Structure (36 sections, 2 equations, 6 figures, 4 tables)

This paper contains 36 sections, 2 equations, 6 figures, 4 tables.

Introduction
Related Work
Vision-and-Language Navigation (VLN)
Navigation with Large Language Models
Visual Prompting
Method
VLN-CE Task Definition
Framework Overview
Visual Affordances Prompting (VAP)
Navigational Affordances
Visual Prompting
High-level PathAgent
3D Mapping and Motion Control
Waypoint Distillation
Experiments
...and 21 more sections

Figures (6)

Figure 1: In discrete VLN, LLMs only need to perform high-level planning by selecting a view as the forward direction (left). For continuous environments, previous agents rely on collecting simulator data to train low-level policies. In this paper, we utilize multimodal foundation models and propose visual affordances prompting to predict low-level candidate waypoints and paths in a zero-shot setting (right).
Figure 2: Our proposed low-level affordances-oriented planning framework with visual affordances prompting. First, we utilize Grounded SAM to segment the visible ground as affordances. We then introduce visual affordances prompting (VAP), where we uniformly scatter points with numeric labels within the affordances. After querying the LLM by combining the visualized new image with task definition, instruction, waypoint definition, and output requirements, we finally obtain potential waypoints and paths in this view.
Figure 3: Our proposed high-level PathAgent. Different from previous zero-shot VLN agents, we utilize visual prompting by marking candidate waypoints and their corresponding paths (i.e., Path 0-5) in all four observation directions. This allows the PathAgent to make action decisions in the proficient RGB space and then map pixel-based paths to 3D coordinates using depth information and camera intrinsic parameters.
Figure 4: An example of successful navigation in a continuous environment. We present visualizations of the motion planning results from the low-level agent (upper) and the thinking process of the high-level PathAgent based on visualized candidate paths (only the selected directions are shown in this figure) and the selection of a path ID as an action (bottom). The agent ultimately decides to stop after observing the target "closet".
Figure 5: Task prompts for the low-level Visual Affordances Prompting (VAP).
...and 1 more figure
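
The point-scattering step described in the Figure 2 caption (uniformly placing numerically labeled points inside the segmented ground) can be sketched as follows. This is only an illustration under assumptions: the caption does not specify how the points are scattered, so the regular grid, the `stride` parameter, and the function name `scatter_labeled_points` are hypothetical, and the boolean mask stands in for a Grounded SAM output.

```python
import numpy as np

def scatter_labeled_points(mask, stride):
    """Place points on a regular grid and keep those inside the ground mask.

    mask   : 2D boolean array, True where the ground is visible (assumed input).
    stride : grid spacing in pixels (hypothetical parameter).
    Returns a dict mapping numeric label -> (u, v) pixel coordinate.
    """
    height, width = mask.shape
    labeled = {}
    next_id = 0
    # Start half a stride in so that points are centred in their grid cells.
    for v in range(stride // 2, height, stride):
        for u in range(stride // 2, width, stride):
            if mask[v, u]:
                labeled[next_id] = (u, v)
                next_id += 1
    return labeled

# Hypothetical example: the lower half of a 4x4 image is ground.
mask = np.zeros((4, 4), dtype=bool)
mask[2:, :] = True
points = scatter_labeled_points(mask, stride=2)
```

The labeled points would then be drawn on the image so that the LLM can refer to candidate waypoints and paths by label rather than by raw pixel coordinates.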
