Semantic Scene Completion with Multi-Feature Data Balancing Network

Mona Alawadh, Mahesan Niranjan, Hansung Kim

TL;DR

MDBNet tackles Semantic Scene Completion from a single RGB-D input by fusing 2D RGB semantics with 3D geometry through a dual-head architecture. The 2D branch uses a Segformer encoder to extract RGB features, which are projected into 3D and fused with F-TSDF in a 3D CNN branch. That branch applies an identity transformation within full pre-activation residual modules (ITRM), using a Tanh activation on the identity paths. A combined, reweighted loss balances 2D and 3D supervision, using K-means-derived voxel weights to address intra-class diversity and inter-class ambiguity, with uncertainty quantified via k-fold cross-validation. MDBNet achieves state-of-the-art mIoU on NYUv2 and NYUCAD, showing improved occlusion handling and robustness across fusion strategies, and offers a competitive balance between accuracy and computational cost compared with methods that rely on heavier priors, such as SPAwN.
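To make the ITRM idea concrete, the following is a toy sketch, not the authors' implementation. It assumes the block computes `tanh(x) + F(x)`, where `F` is a residual branch in pre-activation order (activation before each weight layer). Dense layers stand in for 3D convolutions, normalisation layers are omitted, and all function and parameter names are hypothetical.

```python
import math

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def linear(v, weight, bias):
    """Dense layer standing in for a 3D convolution (weight is a list of rows)."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)]

def itrm_block(v, w1, b1, w2, b2):
    """Toy residual block with a Tanh-transformed identity path (assumed form)."""
    # Residual branch in pre-activation order: activation precedes each weight layer.
    # Normalisation is left out of this sketch.
    h = linear(relu(v), w1, b1)
    h = linear(relu(h), w2, b2)
    # The shortcut is passed through tanh instead of being added unchanged.
    shortcut = [math.tanh(x) for x in v]
    return [s + r for s, r in zip(shortcut, h)]

# Usage: with an all-zero residual branch the block reduces to tanh(x),
# so signed inputs keep their sign and stay bounded in (-1, 1).
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = itrm_block([-1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```

In a plain residual block the shortcut would be `v` itself; here only the shortcut changes, so the residual branch can be swapped for real convolution and normalisation layers without touching the rest of the sketch.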

Abstract

Semantic Scene Completion (SSC) is a critical task in computer vision, used in applications such as virtual reality (VR). SSC aims to construct detailed 3D models from partial views by transforming a single 2D image into a 3D representation, assigning each voxel a semantic label. The main challenge lies in completing 3D volumes with limited information, compounded by data imbalance, inter-class ambiguity, and intra-class diversity in indoor scenes. To address this, we propose the Multi-Feature Data Balancing Network (MDBNet), a dual-head model for RGB and depth data (F-TSDF) inputs. Our hybrid encoder-decoder architecture with identity transformation in a pre-activation residual module (ITRM) effectively manages diverse signals within F-TSDF. We evaluate RGB feature fusion strategies and use a combined loss function: cross-entropy for 2D RGB features and weighted cross-entropy for 3D SSC predictions. MDBNet surpasses comparable state-of-the-art (SOTA) methods on the NYU datasets, demonstrating the effectiveness of our approach.
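The combined loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the mixing coefficient `lam`, the per-class weight vector, and the choice to normalise the weighted term by the sum of target weights are all assumptions, and every function name is hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, labels, class_weights=None):
    """Mean cross-entropy, optionally weighted per class.

    logits: one list of class scores per element (pixel or voxel).
    labels: one integer class id per element.
    With weights, the sum is normalised by the total weight of the targets.
    """
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        w = 1.0 if class_weights is None else class_weights[y]
        total += -w * log_softmax(row)[y]
        norm += w
    return total / norm

def combined_loss(logits_3d, labels_3d, logits_2d, labels_2d,
                  class_weights, lam=0.5):
    """Weighted 3D SSC loss plus lam times the unweighted 2D semantic loss."""
    loss_3d = cross_entropy(logits_3d, labels_3d, class_weights)
    loss_2d = cross_entropy(logits_2d, labels_2d)
    return loss_3d + lam * loss_2d

# Usage: two voxels and one pixel, two classes, the rarer class weighted 3x.
loss = combined_loss(
    logits_3d=[[2.0, 0.0], [2.0, 0.0]], labels_3d=[0, 1],
    logits_2d=[[0.0, 0.0]], labels_2d=[1],
    class_weights=[1.0, 3.0],
)
```

Raising the weight of a class makes errors on that class contribute more to the 3D term, which is the general mechanism re-weighting relies on to counter class imbalance.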

Paper Structure

This paper contains 21 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: MDBNet is a dual-head network that processes 2D RGB semantics via a pre-trained Segformer with 2D-3D projection (including PCR blocks) and geometric data via a 3D CNN with ITRM blocks. The network optimises a combined loss, which is a weighted sum of the 3D loss and the 2D semantics loss.
  • Figure 2: Comparison of SSC results on the NYUv2 dataset: SSCNet (depth maps) vs. MDBNet (RGB-D). Objects are color-coded, with circles marking key differences between GT and predictions.
  • Figure 3: SSC results with different components on NYUCAD dataset. From left to right: (1) RGB-D input; (2) GT; (3) combined loss with re-sampling; (4) combined loss with re-weighting; (5) combined loss (using re-weighting) with ITRM blocks. Objects are color-coded, with circles highlighting key differences between GT and predictions.