Vision Language Models Can Parse Floor Plan Maps

David DeFazio, Hrudayangam Mehta, Meng Wang, Ping Yang, Jeremy Blackburn, Shiqi Zhang

TL;DR

This work introduces map parsing as a novel task for vision-language models, enabling navigation plan generation directly from floor-plan images. By preprocessing floor plans to remove clutter and densely labeling spaces, and by prompting VLMs (GPT-4o and Claude-3.5 Sonnet) with explicit start/goal information, the authors produce actionable navigation sequences in JSON that a robot can execute. They report high success rates on multi-step tasks (up to $0.96$ for sequences of $9$ actions) and show that map size, task difficulty, and label density significantly affect performance, with dense labeling and map processing yielding substantial gains. A hardware demonstration on a quadruped robot confirms real-world viability, while the study outlines automation paths for map enhancement and discusses limitations and future work to extend the approach to larger and outdoor environments.
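To make the idea of an executable JSON navigation sequence concrete, the following is a minimal sketch of how such a plan could be checked before it is sent to a robot. The plan schema (a list of steps with `action` and `door` fields), the action names `approach_door` and `go_through_door`, the `DOORS` connectivity table, and the `validate_plan` helper are all illustrative assumptions; the summary above does not specify the actual output format used by the authors.

```python
import json

# Hypothetical door table: each door label connects two labeled areas.
# These labels are made up for illustration and do not come from the paper.
DOORS = {
    "D1": ("Room A", "Hallway"),
    "D2": ("Hallway", "Room B"),
}

# Assumed action vocabulary, loosely based on "approaching and going through doors".
ALLOWED_ACTIONS = {"approach_door", "go_through_door"}


def validate_plan(plan_json, start, goal, doors=DOORS):
    """Check a JSON plan (assumed schema) against the door table.

    Returns a (ok, message) tuple.
    """
    try:
        steps = json.loads(plan_json)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(steps, list):
        return False, "plan must be a JSON list of steps"

    location = start
    approached = None  # door the robot most recently approached
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            return False, f"step {i}: expected an object"
        action = step.get("action")
        door = step.get("door")
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            return False, f"step {i}: unknown action {action!r}"
        if not isinstance(door, str) or door not in doors:
            return False, f"step {i}: unknown door {door!r}"
        if location not in doors[door]:
            return False, f"step {i}: door {door} is not adjacent to {location}"
        if action == "approach_door":
            approached = door
        else:  # go_through_door
            if approached != door:
                return False, f"step {i}: {door} was not approached first"
            side_a, side_b = doors[door]
            location = side_b if location == side_a else side_a
            approached = None

    if location != goal:
        return False, f"plan ends in {location}, not {goal}"
    return True, "ok"


example_plan = json.dumps([
    {"action": "approach_door", "door": "D1"},
    {"action": "go_through_door", "door": "D1"},
    {"action": "approach_door", "door": "D2"},
    {"action": "go_through_door", "door": "D2"},
])
print(validate_plan(example_plan, "Room A", "Room B"))  # (True, 'ok')
```

A check of this kind only tests consistency with a hand-written connectivity table; it says nothing about whether the VLM read the floor plan correctly, which is what the reported success rates measure.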

Abstract

Vision language models (VLMs) can simultaneously reason about images and text to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored within the VLM context and particularly useful to mobile robots. Map parsing requires understanding not only the labels but also the geometric configurations of a map, i.e., what areas are like and how they are connected. To evaluate the performance of VLMs on map parsing, we prompt VLMs with floor plan maps to generate task plans for complex indoor navigation. Our results demonstrate the remarkable capability of VLMs in map parsing, with a success rate of 0.96 in tasks requiring a sequence of nine navigation actions, e.g., approaching and going through doors. Beyond intuitive observations, e.g., that VLMs do better on smaller maps and simpler navigation tasks, we also observe that their performance drops in large open areas. We provide practical suggestions to address such challenges, as validated by our experimental results. Webpage: https://sites.google.com/view/vlm-floorplan/

Paper Structure (21 sections, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Quadruped robot executing a VLM-generated plan (current action highlighted in red) to complete a navigation task while localizing directly on a floor plan image.
  • Figure 2: Overview of our method. A robot takes a raw image of a floor plan, which is then enhanced with labels and door indicators. The enhanced floor plan, along with a text prompt specifying the start and goal locations, is given to a VLM. The VLM generates a navigation plan to reach the goal location, and the plan is executed on a mobile robot.
  • Figure 3: Text prompt given to the VLM to generate navigation plans. We define the starting and ending locations and the action types, and ask for explicit room and door connections to gain insight into how the VLM understands the map.
  • Figure 4: Three raw floor plan maps.
  • Figure 5: Three maps used in our experiments for evaluating the performance of VLMs in map parsing and plan generation.
  • ...and 5 more figures