Enhancing Floor Plan Recognition: A Hybrid Mix-Transformer and U-Net Approach for Precise Wall Segmentation

Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin

TL;DR

This work addresses the challenge of precise wall segmentation in floor-plan images to enable reliable 3D reconstruction. It introduces MitUNet, a hybrid architecture that fuses a Mix-Transformer encoder with a U-Net decoder and scSE attention, optimized using asymmetric Tversky loss to balance boundary precision and recall. Across CubiCasa5k and a regional dataset, MitUNet achieves state-of-the-art boundary accuracy and efficient memory usage, with a two-stage transfer learning strategy enabling domain adaptation to complex regional hatchings. The authors provide public code and a regional dataset to promote reproducibility and further development in Scan-to-BIM pipelines.

Abstract

Automatic 3D reconstruction of indoor spaces from 2D floor plans necessitates high-precision semantic segmentation of structural elements, particularly walls. However, existing methods often struggle with detecting thin structures and maintaining geometric precision. This study introduces MitUNet, a hybrid neural network combining a Mix-Transformer encoder and a U-Net decoder enhanced with spatial and channel attention blocks. Our approach, optimized with the Tversky loss function, achieves a balance between precision and recall, ensuring accurate boundary recovery. Experiments on the CubiCasa5k dataset and a proprietary regional dataset demonstrate MitUNet's superiority in generating structurally correct masks with high boundary accuracy, outperforming standard models. This tool provides a robust foundation for automated 3D reconstruction pipelines. To ensure reproducibility and facilitate future research, the source code and the proprietary regional dataset are publicly available at https://github.com/aliasstudio/mitunet and https://doi.org/10.5281/zenodo.17871079, respectively.
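
To make the "spatial and channel attention blocks" concrete, here is a minimal NumPy sketch of an scSE-style gate (concurrent spatial and channel squeeze-and-excitation). It is an illustration under stated assumptions, not the authors' implementation: the function name `scse`, the weight shapes, the reduction ratio, and the choice to sum the two branches are all assumptions, and the excerpt does not say how MitUNet aggregates them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(x, w1, w2, w_spatial):
    """Recalibrate a feature map x of shape (C, H, W).

    w1: (C // r, C) and w2: (C, C // r) are the channel-branch weights
    (r is a hypothetical reduction ratio); w_spatial: (C,) acts as a 1x1
    convolution that collapses the channels into one map.
    """
    # Channel branch: average over space, then gate each channel.
    z = x.mean(axis=(1, 2))                    # (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # ReLU, (C // r,)
    channel_gate = sigmoid(w2 @ hidden)        # (C,)

    # Spatial branch: project the channels to one value per pixel.
    projected = np.tensordot(w_spatial, x, axes=([0], [0]))  # (H, W)
    spatial_gate = sigmoid(projected)

    # Assumed aggregation: sum of the two recalibrated maps.
    return x * channel_gate[:, None, None] + x * spatial_gate[None, :, :]

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))          # toy decoder feature map
out = scse(
    features,
    w1=rng.normal(size=(2, 8)),
    w2=rng.normal(size=(8, 2)),
    w_spatial=rng.normal(size=8),
)
```

The output keeps the input shape, so a gate of this kind can sit after a decoder stage without changing the rest of the network. In a trained model the three weight arrays would be learned rather than random.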

Paper Structure

This paper contains 15 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Representative samples from our proprietary Regional Dataset demonstrating key segmentation challenges: complex wall hatching patterns (differentiating partitions from load-bearing walls), non-Manhattan geometry, and dense semantic clutter (furniture, text, and dimension lines).
  • Figure 2: Qualitative comparison of segmentation results on the Regional Dataset. (a) Original input; (b) Ground Truth; (c) UNet (scSE); (d) SegFormer; (e) MitUNet (Ours). The bottom row displays zoomed-in details corresponding to models c, d, and e. Note that UNet introduces noise artifacts (center crop), and SegFormer suffers from dilated or blurred boundaries, whereas MitUNet successfully suppresses noise while maintaining sharp structural edges.
  • Figure 3: Visual comparison of loss functions during the ablation phase. (a) Original input; (b) Tversky ($\alpha=0.7, \beta=0.3$) yields the sharpest, thinnest boundaries; (c) Dice Loss results in dilated wall thickness; (d) Focal Loss exhibits internal noise artifacts; (e) Lovasz-Softmax preserves structure but lacks boundary crispness. This comparison highlights the "cleaning" effect of the asymmetric Tversky loss.
  • Figure 4: Qualitative comparison of fine-tuned models. (a) Original input; (b) MitUNet (Ours) trained with Tversky $\alpha=0.6, \beta=0.4$ demonstrates the optimal balance of connectivity and sharpness; (c) Dice Loss; (d) Focal Loss; (e) Lovasz-Softmax. The comparison reveals that our method (b) produces fewer "staircase" artifacts along edges than the standard losses.
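
The captions for Figures 3 and 4 quote Tversky settings of $\alpha=0.7, \beta=0.3$ and $\alpha=0.6, \beta=0.4$. The sketch below shows how such an asymmetric loss can be computed with NumPy. It assumes the common convention in which $\alpha$ weights false positives and $\beta$ weights false negatives; the excerpt does not state the convention, and the function name `tversky_loss` and the toy masks are illustrative only.

```python
import numpy as np

def tversky_loss(probs, target, alpha=0.7, beta=0.3, eps=1e-7):
    """Return 1 - Tversky index for a binary mask.

    Assumed convention: alpha weights false positives, beta weights
    false negatives. alpha = beta = 0.5 reduces to the Dice loss.
    """
    probs = np.asarray(probs, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    tp = np.sum(probs * target)            # soft true positives
    fp = np.sum(probs * (1.0 - target))    # soft false positives
    fn = np.sum((1.0 - probs) * target)    # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index

# Toy 1D "wall" of width 2, with one over-thick and one over-thin prediction.
wall = np.array([0, 1, 1, 0])
dilated = np.array([1, 1, 1, 0])   # one extra wall pixel (false positive)
eroded = np.array([0, 1, 0, 0])    # one missing wall pixel (false negative)

asym_dilated = tversky_loss(dilated, wall, alpha=0.7, beta=0.3)
asym_eroded = tversky_loss(eroded, wall, alpha=0.7, beta=0.3)
dice_dilated = tversky_loss(dilated, wall, alpha=0.5, beta=0.5)
dice_eroded = tversky_loss(eroded, wall, alpha=0.5, beta=0.5)
```

In this toy case the symmetric (Dice) setting scores the dilated mask lower than the eroded one, while the asymmetric setting reverses that ordering. Under the assumed convention, this is consistent with the thicker walls the captions attribute to Dice Loss and the thinner boundaries attributed to Tversky.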