Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation
Keonhee Han, Dominik Muhle, Felix Wimbauer, Daniel Cremers
TL;DR
This work tackles single-view scene completion by first training a fully self-supervised multi-view density field fusion model (MVBTS) on multiple posed images to recover geometry in occluded regions. It then transfers this multi-view knowledge to a lightweight single-view model (KDBTS) via knowledge distillation, enabling accurate single-image scene completion without requiring pose data at inference. The approach achieves state-of-the-art occupancy prediction, particularly behind occluders, while maintaining competitive depth estimates, as demonstrated on KITTI and KITTI-360. By combining self-supervised multi-view reconstruction with distillation into a compact single-view model, the method offers improved 3D reasoning with practical deployment potential in robotics and autonomous driving. Limitations include the static-scene assumption and the increased inference cost of multi-view operation, which suggest modeling dynamics and reducing runtime as directions for future work.
Abstract
Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions focus only on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g. voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained fully self-supervised only from image data. Using knowledge distillation, we then use MVBTS to train a single-view scene completion network, called KDBTS, via direct supervision. KDBTS achieves state-of-the-art performance on occupancy prediction, especially in occluded regions.
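To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. `render_depth` applies the standard NeRF-style volume rendering weights to densities sampled along one ray to obtain an expected depth, and `distillation_loss` compares student and teacher densities at the same 3D points. The function names, the handling of the last sample, the far-plane fallback, and the choice of a mean absolute error are all assumptions made for illustration; the paper's actual losses and sampling scheme may differ.

```python
import math


def render_depth(densities, distances):
    """Expected depth along one ray from sampled densities (front to back).

    densities: non-negative density value at each sample.
    distances: distance of each sample from the camera, strictly increasing.
    """
    transmittance = 1.0  # probability that the ray reaches the current sample
    depth = 0.0
    for i, (sigma, d) in enumerate(zip(densities, distances)):
        # Segment length: gap to the next sample. For the last sample we
        # reuse the previous gap (assumption; 1.0 if there is only one sample).
        if i + 1 < len(distances):
            delta = distances[i + 1] - d
        elif i > 0:
            delta = d - distances[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        depth += transmittance * alpha * d
        transmittance *= 1.0 - alpha
    # Assumption: leftover transmittance is assigned to the farthest sample.
    return depth + transmittance * distances[-1]


def distillation_loss(student_densities, teacher_densities):
    """Mean absolute error between student and teacher densities.

    Hypothetical stand-in for direct supervision of a single-view student
    by a multi-view teacher evaluated at the same 3D points.
    """
    pairs = list(zip(student_densities, teacher_densities))
    return sum(abs(s - t) for s, t in pairs) / len(pairs)
```

In this sketch, a ray that hits a very dense sample returns roughly that sample's distance, while a ray through empty space returns the distance of the farthest sample. In a distillation setup of this kind, the teacher densities would be treated as fixed targets and only the student would be updated.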
