FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

TL;DR

FloorplanQA addresses unassisted spatial reasoning in LLMs by using symbolic 2D floorplans encoded in JSON/XML to probe geometric inference. It introduces a dataset of $2{,}000$ layouts ($1{,}800$ synthetic + $200$ from HSSD) and $16{,}000$ questions spanning metric, placement, visibility, and path planning, evaluated with automated, deterministic scoring. The study finds that while LLMs handle simple queries, they struggle to maintain geometric coherence under overlaps and constraints, with reasoning-focused models offering only partial improvements. The work motivates hybrid spatial solvers and geometry-aware training to advance robust, design-oriented spatial reasoning in language models, providing a fine-grained benchmark for future geometry-conscious AI.

Abstract

We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, and bathrooms), encoded symbolically as JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed on shallow queries, they often fail to respect physical constraints and preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
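
To make the task types concrete, the following is a minimal sketch of how a symbolic layout might be queried deterministically. The JSON schema (a rectangular `room` with `width` and `depth`, and objects as axis-aligned boxes with `x`, `y`, `w`, `h` in meters) and the helper names `center_distance`, `boxes_overlap`, and `placement_is_valid` are assumptions made for illustration; they are not the benchmark's actual format or scoring code.

```python
import json
import math

# Hypothetical layout schema (an assumption, not the benchmark's real format):
# a rectangular room plus axis-aligned boxes given by min corner and size.
LAYOUT_JSON = """
{
  "room": {"width": 4.0, "depth": 3.0},
  "objects": [
    {"name": "bed",  "x": 0.0, "y": 0.0, "w": 2.0, "h": 1.6},
    {"name": "desk", "x": 3.0, "y": 0.0, "w": 1.0, "h": 0.6}
  ]
}
"""

def center(obj):
    """Return the center point of an axis-aligned box."""
    return (obj["x"] + obj["w"] / 2, obj["y"] + obj["h"] / 2)

def center_distance(a, b):
    """Metric query: Euclidean distance between two box centers."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by)

def boxes_overlap(a, b):
    """True if two boxes share positive area (touching edges do not count)."""
    return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"])

def placement_is_valid(layout, new_obj):
    """Placement query: the box must lie inside the room and overlap nothing."""
    room = layout["room"]
    inside = (new_obj["x"] >= 0 and new_obj["y"] >= 0
              and new_obj["x"] + new_obj["w"] <= room["width"]
              and new_obj["y"] + new_obj["h"] <= room["depth"])
    return inside and not any(boxes_overlap(new_obj, o) for o in layout["objects"])

layout = json.loads(LAYOUT_JSON)
objects = {o["name"]: o for o in layout["objects"]}

bed_desk_distance = center_distance(objects["bed"], objects["desk"])
chair_ok = placement_is_valid(layout, {"name": "chair", "x": 2.0, "y": 0.0, "w": 0.5, "h": 0.5})
chair_on_bed = placement_is_valid(layout, {"name": "chair", "x": 1.5, "y": 0.0, "w": 0.5, "h": 0.5})
chair_outside = placement_is_valid(layout, {"name": "chair", "x": 3.8, "y": 2.0, "w": 0.5, "h": 0.5})
```

Because every answer here follows from the coordinates alone, a model's response to such a query can be checked automatically against the computed value, which is the kind of deterministic scoring the benchmark describes.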

Paper Structure

This paper contains 13 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Representative layouts from FloorplanQA. Generated: kitchen, living room, bedroom. HSSD: last image. Generated objects are axis-aligned boxes; HSSD uses arbitrary polygons.
  • Figure 2: Accuracy of general (top) and reasoning (bottom) models. Left: by model. Right: by question. Each column corresponds to a specific room type represented in our dataset: Kitchens, Living Rooms, Bedrooms (synthetic subsets), plus HSSD layouts.