Evaluating Vision-Language Models as Evaluators in Path Planning

Mohamed Aghzal, Xiang Yue, Erion Plaku, Ziyu Yao

TL;DR

This work interrogates the use of Vision-Language Models (VLMs) as evaluators in path planning via the PathEval benchmark. PathEval presents pairs of paths in diverse environments and requires a VLM to abstract task-relevant descriptors, perceptually compare low-level path details, and integrate information to select the better path under natural-language scenarios. Across nine state-of-the-art (SOTA) VLMs, zero-shot performance is near random, but providing explicit descriptor values significantly improves accuracy, exposing a vision bottleneck in low-level perception that is not solved by end-to-end fine-tuning alone. The findings highlight the need for task-specific discriminative adaptation of vision encoders and present PathEval as a flexible, scalable benchmark to drive future advances in integrating foundation models with planning in complex environments.

Abstract

Despite their promise to perform complex reasoning, large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. This has inspired an intriguing question: if these models cannot plan well, can they still contribute to the planning framework as a helpful plan evaluator? In this work, we generalize this question to consider LLMs augmented with visual understanding, i.e., Vision-Language Models (VLMs). We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Succeeding in the benchmark requires a VLM to be able to abstract traits of optimal paths from the scenario description, demonstrate precise low-level perception on each path, and integrate this information to decide the better path. Our analysis of state-of-the-art VLMs reveals that these models face significant challenges on the benchmark. We observe that the VLMs can precisely abstract given scenarios to identify the desired traits and exhibit mixed performance in integrating the provided information. Yet, their vision component presents a critical bottleneck, with models struggling to perceive low-level details about a path. Our experimental results show that this issue cannot be trivially addressed via end-to-end fine-tuning; rather, task-specific discriminative adaptation of these vision encoders is needed for these VLMs to become effective path evaluators.
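To make the integration step concrete, the following is a minimal sketch of a pairwise path comparison once descriptor values are already known. It is not the benchmark's implementation: the descriptor names (`length`, `min_clearance`), the `Trait` type, and the priority-ordered comparison rule are all hypothetical and chosen only for illustration. The `accuracy` helper shows why a two-way choice has a chance level of 0.5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trait:
    """A desired trait of a path (hypothetical representation)."""
    name: str        # descriptor key, e.g. "length" (illustrative name)
    minimize: bool   # True if smaller values are better for the scenario


def better_path(desc_a, desc_b, traits):
    """Return "A", "B", or "tie" by comparing descriptors in priority order.

    Assumption: traits are ranked, and the first trait on which the two
    paths differ decides the outcome. This rule is illustrative only.
    """
    for trait in traits:
        a, b = desc_a[trait.name], desc_b[trait.name]
        if a == b:
            continue  # no difference on this trait; check the next one
        a_wins = a < b if trait.minimize else a > b
        return "A" if a_wins else "B"
    return "tie"


def accuracy(predictions, labels):
    """Fraction of pairwise choices that match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical scenario: prefer shorter paths, then larger obstacle clearance.
traits = [Trait("length", minimize=True), Trait("min_clearance", minimize=False)]
path_a = {"length": 12.0, "min_clearance": 0.5}
path_b = {"length": 12.0, "min_clearance": 1.5}
choice = better_path(path_a, path_b, traits)
```

In this sketch the comparison itself is trivial once the descriptor values are available, which matches the abstract's observation that the difficult part for VLMs is perceiving those low-level values from the image rather than deciding what to do with them.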

Paper Structure

This paper contains 34 sections, 7 figures, and 13 tables.

Figures (7)

  • Figure 2: Example segment complexity test cases in simplified environments and performance across the various settings.
  • Figure 3: GPT-4o performance per scenario (2D)
  • Figure 4: Examples of model failure on PathEval.
  • Figure 5: Examples of model failure on PathEval when prompted with descriptor values.
  • Figure : 2D
  • ...and 2 more figures