Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion
Anith Selvakumar, Manasa Bharadwaj
TL;DR
This work tackles monocular indoor Semantic Scene Completion (SSC), addressing depth, scale, and occlusion ambiguities that arise when inferring a 3D semantic occupancy map from a single RGB image. GenFuSE introduces a Virtual Multiview Generation and Fusion framework that employs scene-constrained virtual cameras and a Multiview Fusion Adaptor (MVFA) to fuse predictions from multiple synthetic views into a unified 3D occupancy map $P \in \mathbb{R}^{C \times H \times W \times D}$. The MVFA embeds per-view predictions into an $E$-dimensional space, augments them with Spatial and View Position Encodings, and processes them with a Transformer to achieve global context fusion. On NYUv2, GenFuSE yields IoU gains up to $2.8\%$ for Scene Completion and $4.9\%$ for Semantic Scene Completion when paired with existing SSC networks, and reveals a Novelty-Consistency tradeoff that guides the design of synthesized views for robust 3D completion.
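The fusion step described above (embed per-view predictions, add Spatial and View Position Encodings, apply Transformer-style attention, output a single map $P$) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `mvfa_sketch`, the random stand-ins for learned weights and encodings, the single attention head, and the final average over views are all assumptions made for illustration.

```python
import numpy as np

def mvfa_sketch(view_preds, embed_dim=8, seed=0):
    """Fuse per-view predictions of shape (V, C, H, W, D) into one (C, H, W, D) map.

    Hypothetical sketch: all weights are random placeholders for learned parameters.
    """
    rng = np.random.default_rng(seed)
    V, C, H, W, D = view_preds.shape
    N = H * W * D  # number of voxels

    # One token per (view, voxel); its features are the C class scores.
    tokens = view_preds.reshape(V, C, N).transpose(0, 2, 1)  # (V, N, C)

    # Embed each token into an E-dimensional space.
    w_embed = rng.normal(scale=C ** -0.5, size=(C, embed_dim))
    x = tokens @ w_embed  # (V, N, E)

    # Spatial encoding (shared across views) and view encoding (shared across voxels).
    spatial_pe = rng.normal(scale=0.02, size=(1, N, embed_dim))
    view_pe = rng.normal(scale=0.02, size=(V, 1, embed_dim))
    x = (x + spatial_pe + view_pe).reshape(V * N, embed_dim)

    # Single-head self-attention over all tokens (assumed stand-in for the Transformer).
    wq, wk, wv = (rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, embed_dim))
                  for _ in range(3))
    scores = (x @ wq) @ (x @ wk).T / np.sqrt(embed_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    x = x + attn @ (x @ wv)  # residual connection

    # Project back to class scores and average over views (assumed reduction).
    w_out = rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, C))
    fused = (x @ w_out).reshape(V, N, C).mean(axis=0)  # (N, C)
    return fused.T.reshape(C, H, W, D)

# Illustrative sizes only: 3 views, 12 classes, a 4x4x4 voxel grid.
preds = np.random.default_rng(1).normal(size=(3, 12, 4, 4, 4))
P = mvfa_sketch(preds)
```

Attending over every (view, voxel) token at once is what gives global context in this sketch, but its cost grows quadratically with $V \cdot H \cdot W \cdot D$, so it is only practical at toy resolutions.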
Abstract
Monocular Indoor Semantic Scene Completion (SSC) aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene, inferring spatial layout and object categories from 2D image cues. The challenge of this task arises from the depth, scale, and shape ambiguities that emerge when transforming a 2D image into 3D space, particularly within the complex and often heavily occluded environments of indoor scenes. Current SSC methods often struggle with these ambiguities, resulting in distorted or missing object representations. To overcome these limitations, we introduce an approach that combines novel view synthesis with multiview fusion. Specifically, we demonstrate how virtual cameras can be placed around the scene to emulate multiview inputs that enhance contextual scene information. We also introduce a Multiview Fusion Adaptor (MVFA) to effectively combine the multiview 3D scene predictions into a unified 3D semantic occupancy map. Finally, we identify and study an inherent limitation of generative techniques when applied to SSC, namely the Novelty-Consistency tradeoff. Our system, GenFuSE, demonstrates IoU improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset. This work introduces GenFuSE as a standard framework for advancing monocular SSC with synthesized inputs.
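For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how voxel-wise IoU is commonly computed for Scene Completion (binary occupancy) and Semantic Scene Completion (mean over semantic classes). The function names, the convention that label 0 means empty, and the omission of any evaluation mask (benchmarks typically score only a subset of voxels) are assumptions, not details taken from this paper.

```python
import numpy as np

def sc_iou(pred, gt, empty=0):
    """Scene Completion IoU: occupied vs. empty, ignoring semantic class."""
    p, g = pred != empty, gt != empty
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else 1.0

def ssc_miou(pred, gt, num_classes, empty=0):
    """Semantic Scene Completion mIoU: mean per-class IoU over non-empty classes."""
    ious = []
    for c in range(num_classes):
        if c == empty:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 1.0

# Toy example: four voxels with labels 0 = empty, 1 and 2 = semantic classes.
pred = np.array([0, 1, 1, 2])
gt = np.array([0, 1, 2, 2])
sc = sc_iou(pred, gt)                     # occupancy agrees everywhere
ssc = ssc_miou(pred, gt, num_classes=3)   # one voxel is mislabeled
```

In the toy example the occupancy prediction is perfect while one voxel carries the wrong class, which shows how SSC IoU can be lower than SC IoU for the same prediction.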
