Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion

Anith Selvakumar, Manasa Bharadwaj

TL;DR

This work tackles monocular indoor Semantic Scene Completion (SSC), addressing depth, scale, and occlusion ambiguities that arise when inferring a 3D semantic occupancy map from a single RGB image. GenFuSE introduces a Virtual Multiview Generation and Fusion framework that employs scene-constrained virtual cameras and a Multiview Fusion Adaptor (MVFA) to fuse predictions from multiple synthetic views into a unified 3D occupancy map $P \in \mathbb{R}^{C \times H \times W \times D}$. The MVFA embeds per-view predictions into an $E$-dimensional space, augments them with Spatial and View Position Encodings, and processes them with a Transformer to achieve global context fusion. On NYUv2, GenFuSE yields IoU gains up to $2.8\%$ for Scene Completion and $4.9\%$ for Semantic Scene Completion when paired with existing SSC networks, and reveals a Novelty-Consistency tradeoff that guides the design of synthesized views for robust 3D completion.
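The fusion step described above (embed per-view predictions, add Spatial and View Position Encodings, then attend across tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `fuse_views`, the single attention layer, the randomly initialised weights standing in for learned parameters, and the final averaging over views are all assumptions made for illustration. Only the tensor shapes ($V$ views of size $C \times H \times W \times D$ fused into one $P \in \mathbb{R}^{C \times H \times W \times D}$) and the $E$-dimensional embedding come from the summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_views(preds, embed_dim=8, seed=0):
    """Fuse per-view predictions (V, C, H, W, D) into one map (C, H, W, D).

    Hypothetical stand-in for an MVFA-style module: all weights are random
    placeholders for parameters that would normally be learned.
    """
    rng = np.random.default_rng(seed)
    V, C, H, W, D = preds.shape
    N = H * W * D

    # One token per (view, voxel); its features are the C class scores.
    tokens = preds.reshape(V, C, N).transpose(0, 2, 1)        # (V, N, C)

    # Placeholder parameters (assumed shapes, not from the paper).
    w_embed = rng.normal(scale=C ** -0.5, size=(C, embed_dim))
    spatial_pe = rng.normal(scale=0.02, size=(1, N, embed_dim))  # per voxel
    view_pe = rng.normal(scale=0.02, size=(V, 1, embed_dim))     # per view
    w_q, w_k, w_v = (
        rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, embed_dim))
        for _ in range(3)
    )
    w_out = rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, C))

    # Embed into E dimensions and add both position encodings.
    x = tokens @ w_embed + spatial_pe + view_pe               # (V, N, E)

    # Single-head self-attention over all V*N tokens (global context).
    x = x.reshape(V * N, embed_dim)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(embed_dim), axis=-1)     # (VN, VN)
    x = x + attn @ v                                          # residual

    # Project back to class scores and average over views (an assumption).
    logits = (x @ w_out).reshape(V, N, C).mean(axis=0)        # (N, C)
    return logits.T.reshape(C, H, W, D)


# Toy usage: 3 views, 4 classes, a 2 x 2 x 2 voxel grid.
preds = np.random.default_rng(1).normal(size=(3, 4, 2, 2, 2))
fused = fuse_views(preds)
print(fused.shape)
```

Full attention over all $V \cdot H \cdot W \cdot D$ tokens is quadratic in the token count, so a sketch like this is only practical on a very coarse grid; a real system would need a more economical tokenisation.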

Abstract

Monocular Indoor Semantic Scene Completion (SSC) aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene, inferring spatial layout and object categories from 2D image cues. The challenge of this task arises from the depth, scale, and shape ambiguities that emerge when transforming a 2D image into 3D space, particularly within the complex and often heavily occluded environments of indoor scenes. Current SSC methods often struggle with these ambiguities, resulting in distorted or missing object representations. To overcome these limitations, we introduce an innovative approach that leverages novel view synthesis and multiview fusion. Specifically, we demonstrate how virtual cameras can be placed around the scene to emulate multiview inputs that enhance contextual scene information. We also introduce a Multiview Fusion Adaptor (MVFA) to effectively combine the multiview 3D scene predictions into a unified 3D semantic occupancy map. Finally, we identify and study the inherent limitation of generative techniques when applied to SSC, specifically the Novelty-Consistency tradeoff. Our system, GenFuSE, demonstrates IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset. This work introduces GenFuSE as a standard framework for advancing monocular SSC with synthesized inputs.
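The two reported metrics can be made concrete with a short sketch. It assumes the convention commonly used in SSC evaluation: Scene Completion IoU is computed on binary occupancy (occupied vs. empty), and Semantic Scene Completion IoU is the mean of per-class IoUs over the non-empty classes. The function names, the choice of label `0` for empty, and the toy labels are hypothetical, and the masking of unobserved or unknown voxels used in real benchmarks is omitted.

```python
import numpy as np


def sc_iou(pred, gt, empty=0):
    """Binary occupancy IoU: any non-empty label counts as occupied."""
    p, g = pred != empty, gt != empty
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else float("nan")


def ssc_miou(pred, gt, num_classes, empty=0):
    """Mean IoU over semantic classes, skipping the empty class."""
    ious = []
    for c in range(num_classes):
        if c == empty:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union:  # ignore classes absent from both prediction and ground truth
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy example: six voxels, labels 0 = empty, 1 and 2 = semantic classes.
gt = np.array([0, 1, 1, 2, 2, 0])
pred = np.array([0, 1, 2, 2, 2, 1])
print(sc_iou(pred, gt))         # 4 of 5 occupied-in-either voxels agree: 0.8
print(ssc_miou(pred, gt, 3))    # mean of 1/3 (class 1) and 2/3 (class 2): 0.5
```

The toy case shows why the two numbers differ: a voxel predicted as the wrong object class still counts as correctly occupied for Scene Completion, but it lowers the per-class IoU for Semantic Scene Completion.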

Paper Structure

This paper contains 18 sections, 12 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: High-level diagram of the proposed GenFuSE system. Novel View Synthesis techniques are used to generate context-relevant views that improve predictions and reduce uncertainty in occluded regions.
  • Figure 2: Impact of occlusion on SSC from a single view (left) and the mitigation through novel view synthesis (right). Novel views provide line-of-sight access to occluded regions, enhancing object discovery and refining 3D shape representations.
  • Figure 3: System-level diagram of GenFuSE. The Multiview Generation pipeline performs novel view synthesis to generate additional views that are passed into the SSC pipeline for prediction. These predictions are then passed into the Multiview Fusion Adaptor (MVFA), which fuses the multiview predictions into a refined 3D representation.