Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Sizhong Qin; Ramon Elias Weber; Xinzheng Lu

Abstract

Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

Paper Structure (35 sections, 23 equations, 9 figures, 8 tables)

Introduction
Related Work
Problem Formulation
Method
Room-Instance Tokenization
Multimodal Alignment and Instruction Tuning
Experiments
Benchmark Construction and Data Processing
Evaluation Protocols and Metrics
Quantitative Analysis
Qualitative Analysis
Ablations
Discussion
Conclusion
Implementation Details
...and 20 more sections

Figures (9)

Figure 1: HouseMind learns the language of space by modeling outlines and rooms as spatial tokens. Through hierarchical tokenization and multimodal reasoning, it can understand, generate, and edit architectural floor plans from natural language prompts.
Figure 2: Understanding: given a prompt, an outline, and an existing floor plan, the model outputs a textual description, a bubble diagram, and structured JSON capturing spatial semantics. Generation: given a prompt and an outline, the model produces a complete, coherent floor plan. Editing: given a prompt, an outline, and a reference floor plan, the model outputs an updated plan aligned with the editing intent.
Figure 3: Overall framework of HouseMind. The model is trained through a three-stage multimodal alignment and instruction tuning pipeline: (S1) Embedding Initialization establishes cross-modal compatibility between geometric and linguistic tokens; (S2) Multimodal Pre-training aligns text and spatial representations; and (S3) Instruction Tuning (SFT) enables task-aware spatial reasoning.
Figure 4: Qualitative comparison results of understanding, generation, and editing. For understanding tasks (U), HouseMind accurately identifies the number of rooms and their connections. For generation tasks (G), HouseMind preserves both the room layout and the overall outline consistency; the generation prompts are provided in the supplementary materials. For editing tasks (E), HouseMind accurately executes the specified modifications when the instructions are explicit.
Figure A1: VQ-VAE tokenization framework. The outline branch encodes the global boundary, while the conditional room branch encodes each room with outline context to capture spatial relations.
...and 4 more figures
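
The abstract and the Figure A1 caption describe turning outlines and rooms into discrete tokens that share a vocabulary with text. As a rough illustration only, the sketch below shows the generic nearest-codebook lookup that VQ-VAE-style tokenizers use, followed by an index offset that places the resulting codes after a text vocabulary. The codebook, the embeddings, the vocabulary size of 32000, and the function names `quantize` and `to_vocab_ids` are all hypothetical; the paper's actual encoder, codebook, and ID layout are not given in this section.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def to_vocab_ids(indices: Sequence[int], text_vocab_size: int) -> List[int]:
    """Shift codebook indices past the text vocabulary so both share one ID space."""
    return [text_vocab_size + i for i in indices]


# Toy values chosen for illustration; a real codebook is learned, not hand-written.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
room_embeddings = [(0.9, 0.1), (0.1, 0.8), (0.6, 0.7)]  # stand-ins for encoder outputs

codes = quantize(room_embeddings, codebook)
token_ids = to_vocab_ids(codes, text_vocab_size=32000)  # assumed vocabulary size

print(codes)      # [1, 2, 3]
print(token_ids)  # [32001, 32002, 32003]
```

Offsetting the indices is one simple way to let a language model treat spatial codes as ordinary tokens in the same sequence as text; whether HouseMind uses this exact scheme is an assumption here.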
