Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval

Davide Buoso, Luke Robinson, Giuseppe Averta, Philip Torr, Tim Franzmeyer, Daniele De Martini

TL;DR

This work introduces Select2Plan (S2P), a novel training-free framework for high-level robot planning in autonomous navigation. It leverages off-the-shelf Vision-Language Models (VLMs) and avoids fine-tuning by adapting inputs to align with the VLM's pretraining data.

Abstract

This study explores the potential of off-the-shelf Vision-Language Models (VLMs) for high-level robot planning in the context of autonomous navigation. While most existing learning-based approaches to path planning require extensive task-specific training or fine-tuning, we demonstrate how such training can be avoided in most practical cases. To do this, we introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning that completely eliminates the need for fine-tuning or specialised training. By leveraging structured Visual Question-Answering (VQA) and In-Context Learning (ICL), our approach drastically reduces the need for data collection, requiring a fraction of the task-specific data typically used by trained models, or even relying only on online data. Our method enables the effective use of a generally trained VLM in a flexible and cost-efficient way, and requires no sensing beyond a simple monocular camera. We demonstrate its adaptability across various scene types, context sources, and sensing setups. We evaluate our approach in two distinct scenarios: traditional First-Person View (FPV) navigation and infrastructure-driven Third-Person View (TPV) navigation, demonstrating the flexibility and simplicity of our method. Our technique improves the navigational capabilities of a baseline VLM by approximately 50% in the TPV scenario, and is comparable to trained models in the FPV scenario with as few as 20 demonstrations.

Paper Structure

This paper contains 18 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: High-level demonstration of S2P in a TPV scenario. The robot must reach the red mark from its current location, controlled solely via the external camera shown in the figure. S2P proposes candidate keypoints (yellow) and draws them onto the original image before requesting a feasible trajectory from an off-the-shelf VLM. The VLM outputs a trajectory (green) as a sequence of keypoints, ideally one that avoids obstacles, e.g. 3 and 9.
  • Figure 2: Overview of the proposed approach in (a) and (b). The two settings are designed to fit two specific scenarios but share their components. The framework takes a live image from the onboard camera or a CCTV camera and retrieves similar images from the experiential memory. The live image is then annotated and passed, together with the sampled images and an optional episodic memory, to the VLM, which returns the next commands to send to the platform along with explanations. The main difference is the absence of an Episodic Memory in the TPV setting, where the off-board sensing setup empirically limits its benefits. Alongside the overview, response examples are presented for both setups.
  • Figure 3: A scenario in which the agent uses the compass. The compass keeps track of the scene content as the robot rotates, retaining useful information about the room's layout. For instance, if the agent is looking for a chair, it will likely rotate towards where it last saw a table, even though the table is now out of sight.
  • Figure 4: Example rooms in the scenario. Random obstacles, e.g. the blue chair, are placed to challenge the planner.
  • Figure 5: Experiential Memories: D includes experiences from the same environment, excluding the inference room; O includes experiences from online videos; and H includes experiences from the same environment but with a human navigator instead of a robot.
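
The retrieve-then-prompt flow described in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Experience` structure, the use of cosine similarity over precomputed image embeddings, the prompt wording, and the function names `retrieve` and `build_prompt` are all hypothetical. Image payloads and the actual VLM call are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored demonstration (hypothetical structure, not from the paper)."""
    embedding: list[float]  # image feature vector, assumed to be precomputed
    answer: str             # keypoint sequence chosen in that demonstration


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory, query, k=2):
    """Return the k stored experiences most similar to the query embedding."""
    ranked = sorted(memory, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:k]


def build_prompt(examples, keypoint_ids, goal):
    """Assemble a text-only in-context prompt from the retrieved examples."""
    lines = [f"Example {i + 1}: {e.answer}" for i, e in enumerate(examples)]
    lines.append("Candidate keypoints: " + ", ".join(map(str, keypoint_ids)))
    lines.append(
        f"Question: which sequence of keypoints reaches {goal} while avoiding obstacles?"
    )
    return "\n".join(lines)


# Toy experiential memory with 2-D embeddings (illustrative values only).
memory = [
    Experience([1.0, 0.0], "trajectory 1 -> 4 -> 7"),
    Experience([0.0, 1.0], "trajectory 2 -> 5"),
    Experience([0.9, 0.1], "trajectory 1 -> 6 -> 7"),
]
examples = retrieve(memory, query=[1.0, 0.05], k=2)
prompt = build_prompt(examples, keypoint_ids=[1, 2, 3], goal="the red mark")
print(prompt)
```

In this sketch the two memories closest to the live-image embedding become the in-context examples, and the numbered candidate keypoints turn the planning step into a selection question for the VLM. A real system would attach the annotated image and the retrieved images to the request rather than text alone.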