Self-training Room Layout Estimation via Geometry-aware Ray-casting

Bolivar Solarte, Chin-Hsuan Wu, Jin-Cheng Jhang, Jonathan Lee, Yi-Hsuan Tsai, Min Sun

TL;DR

This work tackles unsupervised adaptation for room layout estimation by introducing a geometry-aware self-training framework that uses ray-casting data aggregation to produce pseudo-labels from multiple noisy estimates. The method defines a multi-view consistency objective and a multi-cycle ray-casting procedure to handle occlusions, paired with a Weighted Distance Loss that emphasizes distant geometry. Empirical results on synthetic and real datasets (e.g., HM3D-MVL, MP3D-FPE, ZInD) show substantial improvements over the prior 360-MLC approach across HorizonNet and LGTNet backbones, including challenging occlusion scenarios, and approach supervised baselines in some cases. The approach enables robust room-layout learning from unlabeled panoramic data and provides a pathway toward scalable, annotation-free 3D room understanding in diverse environments.

Abstract

In this paper, we introduce a novel geometry-aware self-training framework for room layout estimation models on unseen scenes with unlabeled data. Our approach utilizes a ray-casting formulation to aggregate multiple estimates from different viewing positions, enabling the computation of reliable pseudo-labels for self-training. In particular, our ray-casting approach enforces multi-view consistency along all ray directions and prioritizes spatial proximity to the camera view for geometry reasoning. As a result, our geometry-aware pseudo-labels effectively handle complex room geometries and occluded walls without relying on assumptions such as Manhattan World or planar room walls. Evaluation on publicly available datasets, including synthetic and real-world scenarios, demonstrates significant improvements in current state-of-the-art layout models without using any human annotation.

Paper Structure (26 sections, 10 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Given multiple estimates from a pre-trained model, as presented in panel (a), our solution uses a ray-casting data aggregation process to estimate geometry-aware pseudo-labels for self-training, as depicted in panel (b), i.e., pseudo-labels that encompass a comprehensive representation of the room geometry. Compared with previous solutions, as presented in panel (c), where multiple estimates are processed in the image domain without geometry reasoning, our approach defines better pseudo-labels, especially for occluded geometries.
  • Figure 2: Self-training Pipeline. We use a pre-trained model $f_{\Theta}$ to estimate multiple layouts $\mathbf{y}_i$ from multiple views $I_i$ in an unseen scene. We aggregate all noisy estimates $\mathbf{Y}^{(0)}=\mathrm{concat}(\{\mathbf{y}_i\}_{i:n})$ using our proposed multi-cycle ray-casting process. Then, we sample our pseudo-label $\mathbf{\bar{y}}_i$ at the camera position $\mathbf{T}_i$ from the filtered set of layouts $\mathbf{Y}_i^{(m)}$. Finally, we constrain our self-training optimization using our proposed Weighted-distance loss $\mathcal{L}_{WD}$.
  • Figure 3: Ray-Casting: In panel (a), different ray directions from different camera views are shown. Note that due to occluded geometries and different camera positions, the probability distribution along a ray may vary significantly. In panel (b), one of our constraints to handle occluded geometries is depicted, i.e., sampling a nearby region along the ray to define $P_{\Omega_r}$. In panel (c), we sample a pseudo-label (magenta contour) from a filtered layout boundary $\mathbf{Y}^{(m)}_j$ at the camera $\mathbf{T}_j$, using the $\mathrm{min}(\cdot)$ function to sample the non-occluded points on the rays (see the ray-casting section).
  • Figure 4: Weighted-distance function: In panel (a), we illustrate our proposed weighted-distance function $\omega_i$ that prioritizes the farthest geometries in the scene for self-training. In panel (b), under the same scale as (a), we show the $L1$ loss between our proposed pseudo-label and the model estimation. Note that the $L1$ loss spans a small range relative to $\omega_i$ and does not target any particular region in the scene. In panel (c), we present our pseudo-label (magenta line) and the model estimation (green line).
  • Figure 5: Qualitative comparisons of estimated pseudo-labels. We show a BEV projection of all pseudo-labels for the scene: (a) pseudo-labels from 360-MLC, (b) pseudo-labels from our proposed multi-cycle ray-casting, and (c) a point cloud for reference purposes.
  • ...and 3 more figures
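
The $\mathrm{min}(\cdot)$ sampling described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the final sampling step, not the multi-cycle filtering, and it assumes the aggregated boundary is given as 2D bird's-eye-view points and that rays are discretized into equal angular bins. The function name `sample_pseudo_label` and the parameter `n_rays` are hypothetical.

```python
import numpy as np

def sample_pseudo_label(points_xy, cam_xy, n_rays=360):
    """For each ray direction around the camera, keep the closest boundary point.

    points_xy: (N, 2) aggregated boundary points in bird's-eye-view coordinates.
    cam_xy:    (2,) camera position in the same coordinates.
    Returns an (n_rays,) array of ranges; rays hit by no point stay at np.inf.
    """
    rel = np.asarray(points_xy, dtype=float) - np.asarray(cam_xy, dtype=float)
    dist = np.hypot(rel[:, 0], rel[:, 1])          # range of each point from the camera
    theta = np.arctan2(rel[:, 1], rel[:, 0])       # bearing of each point, in [-pi, pi]
    # Assumption: rays are equal-width angular bins covering the full circle.
    bins = np.floor((theta + np.pi) / (2 * np.pi) * n_rays).astype(int) % n_rays
    ranges = np.full(n_rays, np.inf)
    # Unbuffered per-bin minimum: the nearest point on a ray is treated as non-occluded.
    np.minimum.at(ranges, bins, dist)
    return ranges
```

Taking the per-ray minimum means that a wall seen from another view but hidden behind closer geometry from this camera is not selected, which is the intuition the caption attaches to sampling non-occluded points.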