SVG Decomposition for Enhancing Large Multimodal Models Visualization Comprehension: A Study with Floor Plans
Jeongah Lee, Ali Sarvghad
TL;DR
This study investigates whether decomposing floor-plan visualizations into scalable vector graphics (SVG) improves large multimodal models' (LMMs') spatial understanding. The authors evaluate GPT-4o, Claude 3.7 Sonnet, and Llama 3.2 11B Vision Instruct on 75 floor plans under three input conditions: PNG, SVG, and SVG+PNG. They find that SVG+PNG generally improves subspace counting and labeling but can hinder pathfinding and holistic spatial reasoning, with effects that depend on the model and on floor-plan complexity. The work uses CubiCasa5K and CubiGraph5K representations to ground subspaces as graph nodes and examine connectivity, offering a nuanced view of how structured, vector-based inputs interact with LMMs' reasoning. These findings highlight the potential and limitations of vector decomposition for spatial visualization comprehension, and they motivate future integration of vector inputs with semantic annotations or additional modalities to support accessibility and navigation tasks in real-world settings.
Abstract
Large multimodal models (LMMs) are increasingly capable of interpreting visualizations, yet they continue to struggle with spatial reasoning. One proposed strategy is decomposition, which breaks down complex visualizations into structured components. In this work, we examine the efficacy of scalable vector graphics (SVGs) as a decomposition strategy for improving LMMs' performance on floor plan comprehension. Floor plans serve as a valuable testbed because they combine geometry, topology, and semantics, and their reliable comprehension has real-world applications, such as accessibility for blind and low-vision individuals. We conducted an exploratory study with three LMMs (GPT-4o, Claude 3.7 Sonnet, and Llama 3.2 11B Vision Instruct) across 75 floor plans. Results show that combining SVG with raster input (SVG+PNG) improves performance on spatial understanding tasks but often hinders spatial reasoning, particularly in pathfinding. These findings highlight both the promise and limitations of decomposition as a strategy for advancing spatial visualization comprehension.
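
To make the spatial understanding and spatial reasoning tasks concrete, the following is a minimal sketch of how a floor plan might be modeled as a graph, with subspaces as nodes and direct connections as edges. The room names, the `floor_plan` dictionary, and the helper functions `count_subspaces` and `shortest_path` are all hypothetical and chosen for illustration; they are not taken from the study, its datasets, or its evaluation code. Breadth-first search is used here only as one plausible way to obtain a reference answer for a pathfinding question.

```python
from collections import deque

# Hypothetical floor plan (illustrative only, not from the study's data).
# Each key is a subspace; each value lists the subspaces directly
# reachable from it, for example through a door.
floor_plan = {
    "Entry": ["Hall"],
    "Hall": ["Entry", "Kitchen", "Bedroom", "Bath"],
    "Kitchen": ["Hall", "Dining"],
    "Dining": ["Kitchen"],
    "Bedroom": ["Hall"],
    "Bath": ["Hall"],
}


def count_subspaces(graph):
    """Return the number of subspaces (nodes) in the floor-plan graph."""
    return len(graph)


def shortest_path(graph, start, goal):
    """Return one shortest sequence of subspaces from start to goal.

    Uses breadth-first search over the adjacency lists. Returns None
    when either endpoint is unknown or no route exists.
    """
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    print(count_subspaces(floor_plan))
    print(shortest_path(floor_plan, "Entry", "Dining"))
```

Under this assumed representation, a counting question reduces to the number of nodes, while a pathfinding question requires chaining several connectivity facts, which is one way to see why the two kinds of task can respond differently to the same input format.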
