Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction
Hao Phung, Hadar Averbuch-Elor
TL;DR
Raster2Seq tackles the challenge of reconstructing vectorized floorplans from raster images by framing the task as generating labeled polygon sequences. It introduces an anchor-based autoregressive decoder that conditions on image features and previously produced corners, using learnable anchors to direct attention and a FeatFusion strategy to merge visual and token information. The method jointly optimizes coordinate regression, token-type prediction, and semantic prediction, enabling variable-length outputs and per-token semantic supervision. Empirically, Raster2Seq achieves state-of-the-art results on Structure3D, CubiCasa5K, and Raster2Graph, generalizes well to WAFFLE, and remains robust as floorplan complexity increases and as data shifts toward real-world distributions. Accurate vector floorplans of this kind can drive downstream CAD workflows and controllable 3D generation, and the work points to open-vocabulary semantics and broader architectural applications as future opportunities.
Abstract
Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To this end, we propose Raster2Seq, which frames floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners, using guidance from learnable anchors. These anchors represent spatial coordinates in image space and thereby direct the attention mechanism toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling it to efficiently handle complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contains diverse room structures and complex geometric variations.
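
To make the labeled polygon sequence idea concrete, the following is a minimal sketch of how a floorplan might be flattened into a single token sequence and recovered from it. It is an illustration only, not the authors' implementation: the token types (corner, sep, eos), the Token fields, the normalized coordinates, and the encode and decode helpers are all assumptions introduced here. The decoder network itself is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical token types; the vocabulary used in the paper may differ.
CORNER, SEP, EOS = "corner", "sep", "eos"

Point = Tuple[float, float]
Polygon = Tuple[str, List[Point]]  # (semantic label, ordered corners)


@dataclass
class Token:
    kind: str                # CORNER, SEP, or EOS
    xy: Point = (0.0, 0.0)   # normalized image coordinates, meaningful only for CORNER
    label: str = ""          # semantic class carried by every token of a polygon


def encode(polygons: List[Polygon]) -> List[Token]:
    """Flatten labeled polygons into one variable-length token sequence."""
    tokens: List[Token] = []
    for label, corners in polygons:
        tokens.extend(Token(CORNER, xy, label) for xy in corners)
        tokens.append(Token(SEP, label=label))  # closes the current polygon
    tokens.append(Token(EOS))                   # closes the whole floorplan
    return tokens


def decode(tokens: List[Token]) -> List[Polygon]:
    """Rebuild labeled polygons from a token sequence, stopping at EOS."""
    polygons: List[Polygon] = []
    corners: List[Point] = []
    for tok in tokens:
        if tok.kind == CORNER:
            corners.append(tok.xy)
        elif tok.kind == SEP:
            if corners:
                polygons.append((tok.label, corners))
            corners = []
        elif tok.kind == EOS:
            break
    return polygons


floorplan: List[Polygon] = [
    ("bedroom", [(0.1, 0.1), (0.5, 0.1), (0.5, 0.4), (0.1, 0.4)]),
    ("door", [(0.5, 0.2), (0.5, 0.3)]),
]
sequence = encode(floorplan)
```

Because each polygon ends with its own separator and the sequence ends with a single end token, the same format covers a two-corner door and a many-cornered room without fixing the number of polygons or corners in advance, which is the flexibility the abstract attributes to the autoregressive formulation.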
