
Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences

Wenxi Wu, Jingjing Zhang, Martim Brandão

Abstract

Understanding user instructions and the spatial relations between objects in the surrounding environment is crucial for intelligent robot systems that assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to improve how well robot planners generalize to new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it remains unclear to what degree they possess the spatial reasoning required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, while GPT-4o performs worse. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost, measured in number of tokens. This work shows some promise for integrating VLMs with robot motion planning pipelines.

Paper Structure (17 sections, 9 figures)

Figures (9)

  • Figure 1: Examples of language-constrained robot motion planning problems, and the solutions scored highest by Qwen2.5-VL (dotted trajectories).
  • Figure 2: Examples of single-image trajectory trails in 4 scenes: a) move towards the door; b) move to the table in the kitchen; c) move to the table; d) move to the shelf behind the columns.
  • Figure 3: Example of galleries of screenshots showing the robot moving along each trajectory, with each row presenting a sequence of screenshots from one trajectory. (a) The robot goes around the table in 2 topologically different ways; (b) The robot goes around the table and column in 2 topologically different ways.
  • Figure 4: Accuracy of VLMs in selecting candidate paths in images with different query methods, averaged across two types of preferences (object proximity and path style).
  • Figure 5: Accuracy of VLMs in selecting candidate paths in images, on navigation tasks.
  • ...and 4 more figures