uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images

Jonathan Lee, Bolivar Solarte, Chin-Hsuan Wu, Jin-Cheng Jhang, Fu-En Wang, Yi-Hsuan Tsai, Min Sun

TL;DR

uLayout introduces a unified, end-to-end model for room layout estimation that handles both perspective and panoramic images by projecting inputs into a shared equirectangular space and aligning perspective horizons via a vertical-shift allocation. It employs a dual-branch, shared ResNet-50 feature extractor with domain-specific 1D convolutions and a SWG-Transformer to capture local and global geometry, followed by a joint loss that combines image-domain boundaries and horizon-depth terms. Joint training on panoramic and perspective data yields competitive results across standard benchmarks and notably improves perspective-boundary accuracy when paired with LSUN data, while significantly reducing computation through efficient feature extraction. The approach bridges modality gaps, enables robust cross-domain generalization, and delivers practical benefits for real-world room-layout tasks. Code availability further supports reproducibility and adaptation in downstream applications.

Abstract

We present uLayout, a unified model for estimating room layout geometries from both perspective and panoramic images, whereas traditional solutions require different model designs for each image type. The key idea of our solution is to unify both domains in the equirectangular projection, in particular by allocating each perspective image to the most suitable latitude coordinate so that both domains can be exploited together. To address the Field-of-View (FoV) difference between the input domains, we design uLayout with a shared feature extractor and an extra 1D-Convolution layer that conditions each domain input differently. This conditioning allows us to efficiently formulate a column-wise feature regression problem regardless of the input FoV. This simple yet effective approach achieves competitive performance with current state-of-the-art solutions and shows, for the first time, a single end-to-end model for both domains. Extensive experiments on the real-world datasets LSUN, Matterport3D, PanoContext, and Stanford 2D-3D demonstrate the contribution of our approach. Code is available at https://github.com/JonathanLee112/uLayout.
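
To make the idea of placing a perspective image in equirectangular coordinates concrete, the following is a minimal sketch of the standard pinhole-to-sphere mapping. It is not taken from the uLayout code: the function name `perspective_pixel_to_equirect`, its parameters, and the conventions (principal point at the image center, square pixels, positive pitch looking up) are assumptions made for illustration.

```python
import math


def perspective_pixel_to_equirect(u, v, width, height, hfov_deg,
                                  pano_width, pano_height, pitch_deg=0.0):
    """Map a perspective pixel (u, v) to equirectangular (col, row).

    Assumed conventions (hypothetical, for this sketch only): pinhole
    camera, square pixels, principal point at the image center, and a
    camera pitched up by pitch_deg degrees.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Ray in the camera frame: x to the right, y down, z forward.
    x = u - width / 2.0
    y = v - height / 2.0
    z = focal

    # Rotate the ray about the x-axis to account for the camera pitch.
    pitch = math.radians(pitch_deg)
    y_w = y * math.cos(pitch) - z * math.sin(pitch)
    z_w = y * math.sin(pitch) + z * math.cos(pitch)

    # Longitude in [-pi, pi], latitude in [-pi/2, pi/2] (positive is up).
    lon = math.atan2(x, z_w)
    lat = math.atan2(-y_w, math.hypot(x, z_w))

    # Equirectangular pixel coordinates: the horizon sits at mid-height.
    col = (lon / (2.0 * math.pi) + 0.5) * pano_width
    row = (0.5 - lat / math.pi) * pano_height
    return col, row
```

Under these assumptions, a perspective image with a 90-degree horizontal FoV covers only a quarter of the panorama width, which illustrates why the two domains differ so much in how many columns they occupy.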

Paper Structure

This paper contains 21 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Method Overview. uLayout is a unified layout estimation model jointly trained with panoramic and perspective images for predicting ceiling and floor boundaries. uLayout is designed to efficiently handle images with different Fields-of-View (FoV) in both training and inference (see the comparison in equirectangular coordinates in the middle). uLayout achieves highly competitive performance on the LSUN and Matterport3D datasets.
  • Figure 2: Vertical shift. Starting with a perspective image (a), we project the image into panoramic coordinates (b) where the yellow line represents the current horizon of the perspective image. We then adjust the pitch orientation by shifting the image downwards to align the yellow line (horizon in perspective) with the red line (desired horizon in the panorama) as shown in (c).
  • Figure 3: Architecture of uLayout. First, a perspective image is mapped into the same equirectangular coordinates as a panoramic image, and the reduction method proposed in the section on efficient feature extraction is applied. As a result, the image width of the perspective image is much smaller than the width of the panoramic image ($w_{image}^{pano} > w_{image}^{pp}$), and the feature maps after the CNN and 1D convolution layers have different sizes in the horizontal dimension ($w_{feature}^{pano} > w_{feature}^{pp}$). Hence, the FLOPs and memory usage of the CNN and 1D convolution computation are much smaller for perspective images than for panoramic images. Additionally, a dual-branch design allows for seamless feature extraction regardless of the different Fields-of-View (FoV) in the perspective and panoramic domains. Finally, we utilize the SWG-Transformer as our framework and estimate the boundary for both domains, as described in the sections on the SWG-Transformer and on prediction and loss.
  • Figure 4: Qualitative Results for Panoramic Images. Red lines denote the ground truth layout. Cyan lines denote the predicted layout.
  • Figure 5: Qualitative Results for Perspective Images. Red lines denote the ground truth layout. Cyan lines denote the predicted layout.
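
The vertical shift described in the Figure 2 caption can be sketched as a row offset that moves the current horizon onto the desired one. This is only an illustration under stated assumptions: the helper `shift_to_horizon`, the list-of-rows image representation, and the constant `fill` padding are hypothetical, and the actual uLayout implementation may choose the target row and handle exposed rows differently.

```python
def shift_to_horizon(image, current_horizon_row, target_horizon_row, fill=0):
    """Shift image rows so the current horizon lands on the target row.

    `image` is a list of rows (each row a list of pixel values) already
    expressed in equirectangular coordinates. Rows that move outside the
    image are dropped; rows exposed by the shift are padded with `fill`.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    # Positive offset moves content downwards, negative moves it upwards.
    offset = target_horizon_row - current_horizon_row

    shifted = [[fill] * width for _ in range(height)]
    for row in range(height):
        dst = row + offset
        if 0 <= dst < height:
            shifted[dst] = list(image[row])
    return shifted


# Usage: the horizon is at row 1 and should sit at row 2.
image = [[1, 1], [2, 2], [3, 3], [4, 4]]
aligned = shift_to_horizon(image, current_horizon_row=1, target_horizon_row=2)
```

Here the content moves down by one row, so the row that previously held the horizon ends up at the target row and the newly exposed top row is padded.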