Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

Hao Phung, Hadar Averbuch-Elor

TL;DR

Raster2Seq tackles the challenge of reconstructing vectorized floorplans from raster images by framing the task as generating labeled polygon sequences. It introduces an anchor-based autoregressive decoder that conditions on image features and previously produced corners, using learnable anchors to direct attention and a FeatFusion strategy to merge visual and token information. The method jointly optimizes coordinate regression, token-type, and semantic predictions, enabling variable-length outputs and per-token semantic supervision. Empirically, Raster2Seq achieves state-of-the-art results on Structured3D, CubiCasa5K, and Raster2Graph, with strong generalization to WAFFLE, and demonstrates robustness to increasing floorplan complexity and real-world data distributions. This approach opens pathways for accurate vector floorplans to drive downstream CAD workflows and controllable 3D generation, while highlighting opportunities for open-vocabulary semantics and broader architectural applications.

Abstract

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements (such as rooms, windows, and doors) are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners, using guidance from learnable anchors. These anchors represent spatial coordinates in image space, which allows them to direct the attention mechanism toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling it to efficiently handle complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structured3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.
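To make the labeled polygon sequence representation concrete, the following is a minimal sketch of how floorplan elements might be flattened into one token sequence and recovered from it. The `LabeledPolygon` and `Token` types, the `"CORNER"`/`"SEP"`/`"EOS"` token kinds, and the choice to attach the semantic label to every corner token are illustrative assumptions, not the authors' actual format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class LabeledPolygon:
    label: str            # semantic class, e.g. "Bedroom" or "Door"
    corners: List[Point]  # polygon corners in image coordinates


@dataclass
class Token:
    kind: str                     # hypothetical token types: "CORNER", "SEP", "EOS"
    xy: Optional[Point] = None    # set only for corner tokens
    label: Optional[str] = None   # per-token semantic label (assumed layout)


def to_sequence(polygons: List[LabeledPolygon]) -> List[Token]:
    """Flatten polygons into one sequence, separating them with SEP tokens."""
    seq: List[Token] = []
    for i, poly in enumerate(polygons):
        if i > 0:
            seq.append(Token("SEP"))
        for xy in poly.corners:
            seq.append(Token("CORNER", xy, poly.label))
    seq.append(Token("EOS"))
    return seq


def from_sequence(seq: List[Token]) -> List[LabeledPolygon]:
    """Rebuild polygons by splitting the sequence at SEP and stopping at EOS."""
    polygons: List[LabeledPolygon] = []
    corners: List[Point] = []
    label: Optional[str] = None
    for tok in seq:
        if tok.kind == "CORNER":
            corners.append(tok.xy)
            label = tok.label
            continue
        if corners:
            polygons.append(LabeledPolygon(label, corners))
        corners, label = [], None
        if tok.kind == "EOS":
            break
    return polygons


rooms = [
    LabeledPolygon("Bedroom", [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]),
    LabeledPolygon("Door", [(4.0, 1.0), (4.2, 1.0), (4.2, 2.0)]),
]
sequence = to_sequence(rooms)
```

Because the sequence length simply grows with the number of polygons and corners, a decoder that emits one token at a time does not need a fixed budget of rooms or corners per room.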

Paper Structure (29 sections, 6 equations, 17 figures, 18 tables, 1 algorithm)

This paper contains 29 sections, 6 equations, 17 figures, 18 tables, 1 algorithm.

Figures (17)

  • Figure 1: Our approach transforms rasterized floorplan images into a vectorized format, reconstructing both their structure and semantics. We illustrate$^*$ results on held-out CubiCasa5K (Kalervo et al., 2019) test samples (left). The colors denote unique semantic categories (e.g., Outdoor, Bedroom, Bath, and Entry). Additionally, we highlight our model's generalization capabilities on complicated real-world floorplan images from WAFFLE (Ganon et al., 2025) (right). $^*$3D visualizations are constructed by extending the 2D boundaries vertically.
  • Figure 2: Method Overview. Given a rasterized floorplan image (left), our approach converts it into a vectorized format, represented as a sequence of labeled polygons separated by special <SEP> tokens. The main architectural component of our framework is an anchor-based autoregressive decoder, which predicts the next token given image features ($f_{img}$), learnable anchors ($v_{anc}$), and the previously generated tokens; see the architecture section for additional details. Above, we visualize the first two labeled polygons predicted (colored in orange and pink, respectively).
  • Figure 3: Illustration of our anchor-based autoregressive decoder.
  • Figure 4: Given an input rasterized image, our method performs sequential corner prediction. We visualize earlier corners in cooler colors (predictions are enumerated per room). As illustrated above, within each room, corners are predicted in counterclockwise order.
  • Figure 5: Performance vs. floorplan complexity, as approximated by the total number of polygons (left) and the total number of corners (right). As illustrated above over Structured3D-B (top) and CubiCasa5K (bottom), our approach yields larger gains as the floorplan complexity increases.
  • ...and 12 more figures
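
The Figure 4 caption notes that corners within each room are predicted in counterclockwise order. As a hedged illustration of how ground-truth polygons could be normalized to such an ordering, the sketch below uses the shoelace formula to test orientation and reverses clockwise polygons. The function names are hypothetical. The sign convention assumes a y-up coordinate system; with image coordinates where y points down, the same test identifies the opposite visual orientation.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def signed_area(corners: List[Point]) -> float:
    """Shoelace formula: positive for counterclockwise order (y-up axes)."""
    total = 0.0
    n = len(corners)
    for i in range(n):
        x0, y0 = corners[i]
        x1, y1 = corners[(i + 1) % n]  # wrap around to close the polygon
        total += x0 * y1 - x1 * y0
    return total / 2.0


def ensure_ccw(corners: List[Point]) -> List[Point]:
    """Return the corners in counterclockwise order, reversing if needed."""
    if signed_area(corners) >= 0:
        return list(corners)
    return list(reversed(corners))


square_ccw = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
square_cw = list(reversed(square_ccw))
normalized = ensure_ccw(square_cw)
```

A consistent corner ordering gives each polygon a single canonical direction of traversal, which is one plausible reason a sequence model would be trained on ordered corners; the choice of starting corner is a separate convention that this sketch does not address.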