No More Ambiguity in 360° Room Layout via Bi-Layout Estimation

Yu-Ju Tsai, Jin-Cheng Jhang, Jingjing Zheng, Wei Wang, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo, Ming-Hsuan Yang

TL;DR

This work tackles the inherent ambiguity in 360° room layout annotations by predicting two distinct layouts per image: enclosed and extended. It introduces a Bi-Layout architecture that uses two global context embeddings and a shared feature guidance module to generate both predictions efficiently, paired with a disambiguate metric for evaluation under ambiguous ground truth. Results on MatterportLayout and ZInD show state-of-the-art performance; on MatterportLayout, 3DIoU rises from 81.70% to 82.57% on the full test set and from 54.80% to 59.97% on the subset with significant ambiguity. Comparing the two predictions also lets the model detect ambiguous regions. The approach offers a compact solution for multi-layout reasoning, with practical implications for indoor scene understanding and downstream applications.

Abstract

Inherent ambiguity in layout annotations poses significant challenges to developing accurate 360° room layout estimation models. To address this issue, we propose a novel Bi-Layout model capable of predicting two distinct layout types. One stops at ambiguous regions, while the other extends to encompass all visible areas. Our model employs two global context embeddings, where each embedding is designed to capture specific contextual information for each layout type. With our novel feature guidance module, the image feature retrieves relevant context from these embeddings, generating layout-aware features for precise bi-layout predictions. A unique property of our Bi-Layout model is its ability to inherently detect ambiguous regions by comparing the two predictions. To circumvent the need for manual correction of ambiguous annotations during testing, we also introduce a new metric for disambiguating ground truth layouts. Our method demonstrates superior performance on benchmark datasets, notably outperforming leading approaches. Specifically, on the MatterportLayout dataset, it improves 3DIoU from 81.70% to 82.57% across the full test set and notably from 54.80% to 59.97% in subsets with significant ambiguity. Project page: https://liagm.github.io/Bi_Layout/
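The abstract mentions two ideas that a small example may make more concrete: detecting ambiguous regions by comparing the two predictions, and scoring against ground truth whose layout type is unknown. The sketch below is only an illustration under stated assumptions, not the paper's implementation. It assumes each layout is given as one floor-boundary depth per image column (all names such as `polar_iou`, `disambiguated_iou`, `ambiguous_columns`, and the `rel_tol` threshold are hypothetical), and it assumes that the disambiguate metric keeps the better of the two scores, which the abstract does not spell out.

```python
def polar_iou(depth_a, depth_b):
    """Approximate 2D IoU of two floor plans seen from the same camera.

    Each layout is a list of per-column depths (one radius per viewing
    angle). The area of such a star-shaped region is the sum of
    0.5 * r**2 * dtheta, so the constant 0.5 * dtheta cancels in the ratio.
    """
    if len(depth_a) != len(depth_b):
        raise ValueError("layouts must have the same number of columns")
    inter = sum(min(a, b) ** 2 for a, b in zip(depth_a, depth_b))
    union = sum(max(a, b) ** 2 for a, b in zip(depth_a, depth_b))
    return inter / union


def disambiguated_iou(pred_enclosed, pred_extended, gt):
    """ASSUMPTION: score the prediction whose type best matches the label."""
    return max(polar_iou(pred_enclosed, gt), polar_iou(pred_extended, gt))


def ambiguous_columns(pred_enclosed, pred_extended, rel_tol=0.1):
    """Flag columns where the two predictions disagree by more than rel_tol.

    rel_tol is a hypothetical threshold chosen for illustration only.
    """
    return [
        abs(e - x) > rel_tol * max(e, x)
        for e, x in zip(pred_enclosed, pred_extended)
    ]
```

With an enclosed prediction `[2, 2, 2, 2]`, an extended prediction `[2, 2, 4, 4]`, and ground truth annotated in the extended style, the enclosed prediction alone would be penalized even though it is a valid reading of the room. Taking the better of the two scores removes that penalty, and the columns where the predictions differ mark the ambiguous region.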

Paper Structure (31 sections, 5 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Inherent ambiguity in the MatterportLayout dataset [zou2021manhattan]. Blue and green represent ground truth annotations and predictions from the SoTA models, respectively. The layout boundaries are shown on the left, and their bird's-eye view projections are on the right. We define two types of layout annotation: (a) the enclosed type encloses the room; (b) the extended type extends to all visible areas. The dashed lines highlight the ambiguity in the dataset labels.
  • Figure 2: Comparison of our Bi-Layout model and the SoTA models. Blue and green indicate ground truth labels and predictions, respectively. For each image, the layout boundaries are shown on the left, while their bird's-eye view projections are shown on the right. Our Bi-Layout model can predict two very different types of layouts (enclosed and extended), addressing the ambiguity issue that the SoTA methods struggle with.
  • Figure 3: Our Bi-Layout network architecture. (a) Feature extractor: It processes a panoramic image $I$ using ResNet-50 to extract multi-scale features $F_{l}$ and then feeds those features into the Simplified Height Compression Module (SHCM) to produce the final compressed feature $F_{c}$. (b) Global Context Embedding: It consists of two learnable embeddings $E_{k}$, each designed to capture and encode the contextual information inherent in the corresponding type of layout labels. (c) Shared Feature Guidance Module: It consists of two components: Guided Cross-Attention and SWG Self-Attention. It guides the fusion of the compressed feature $F_{c}$ with the global context embedding $E_{k}$ to generate a feature $F_{g}^{k}$ ($k \in \{\textit{extended},~\textit{enclosed}\}$) that is better aligned with the corresponding layout type. Finally, we use fully connected (FC) layers to map $F_{g}^{k}$ to horizon-depth and room height, which are further converted to boundary layouts ($P_\text{extended}$ and $P_\text{enclosed}$).
  • Figure 4: Our Shared Feature Guidance Module (SFGM) architecture. It consists of two blocks: Guided Cross-Attention and SWG Self-Attention. The module has $M=8$ layers, and the structure of each layer is identical. Given the compressed image feature $F_{c}$ and the global context embedding $E_{k}$, we first apply sinusoidal and learnable positional encodings, respectively. With the compressed feature $F_{c}$ as the query $\mathbf{Q}$ and our global context embedding $E_{k}$ as both the key $\mathbf{K}$ and value $\mathbf{V}$, our guided cross-attention generates the feature $F_{gca}^{k}$, which serves as the $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ inputs to the SWG self-attention. This process repeats in each layer, further refining the output feature with our global context embedding to generate the final guided feature $F_{g}^{k}$.
  • Figure 5: Qualitative comparison on the MatterportLayout [zou2021manhattan] (top) and ZInD [cruz2021zillow] (bottom) datasets. Blue and green represent ground truth labels and predictions, respectively. The boundaries of the room layout are on the left, and their bird's-eye view projections are on the right. We show our disambiguated results, which effectively address the ambiguity issue, while the SoTA methods struggle with the ambiguity, as highlighted by the dashed lines.
  • ...and 9 more figures
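
The guided cross-attention described in the Figure 3 and Figure 4 captions can be illustrated with a minimal sketch. It assumes plain single-head scaled dot-product attention in pure Python, with the compressed feature $F_{c}$ as the query and a global context embedding $E_{k}$ as both key and value. It deliberately omits the learned projections, positional encodings, the SWG self-attention block, and the $M=8$ stacked layers, so it is not the paper's module; the function names (`cross_attention`, `guided_features`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this image column to every embedding token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, channel by channel.
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def guided_features(f_c, embeddings):
    """Apply the same attention once per layout type.

    f_c: compressed image feature, one vector per image column (query).
    embeddings: dict mapping a layout type to its context embedding,
    used as both key and value.
    """
    return {
        name: cross_attention(f_c, e_k, e_k)
        for name, e_k in embeddings.items()
    }
```

Because the attention function is shared and only the embedding changes, the same image feature yields one guided feature per layout type, which is the property the captions attribute to the shared module.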