Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation

Keonhee Han, Dominik Muhle, Felix Wimbauer, Daniel Cremers

TL;DR

This work tackles single-view scene completion by first learning a fully self-supervised multi-view density field fusion (MVBTS) from multiple posed images to recover geometry in occluded regions. It then transfers this rich multi-view knowledge to a lightweight single-view model (KDBTS) via knowledge distillation, enabling accurate single-image scene completion without requiring pose data at inference. The approach achieves state-of-the-art occupancy prediction, particularly behind occluders, while maintaining competitive depth estimates, as demonstrated on KITTI and KITTI-360. By combining self-supervised multi-view reconstruction with distillation into a compact single-view model, the method offers improved 3D reasoning with practical deployment potential in robotics and autonomous driving. Limitations include the static-scene assumption and increased inference cost for multi-view operation, suggesting directions for modeling dynamics and reducing runtime.

Abstract

Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions only focus on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g., voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained fully self-supervised only from image data. Using knowledge distillation, we use MVBTS to train a single-view scene completion network, called KDBTS, via direct supervision. It achieves state-of-the-art performance on occupancy prediction, especially in occluded regions.
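
To illustrate how a density field can yield a depth estimate, the following is a minimal sketch of the standard volume-rendering quadrature used by NeRF-style methods. It is not taken from the paper: the function name `expected_depth`, the handling of the last sample spacing, and the normalization by the accumulated weight are assumptions made for illustration only.

```python
import math

def expected_depth(sigmas, depths):
    """Expected ray termination depth from densities sampled along one ray.

    sigmas: non-negative densities at each sample (hypothetical inputs).
    depths: increasing sample distances along the ray, same length.
    """
    assert len(sigmas) == len(depths) and len(depths) > 0
    transmittance = 1.0   # probability that the ray has not been stopped yet
    depth = 0.0
    weight_sum = 0.0
    for i, (sigma, d) in enumerate(zip(sigmas, depths)):
        # Spacing to the next sample; the last sample reuses the previous spacing
        # (an assumption of this sketch).
        if i + 1 < len(depths):
            delta = depths[i + 1] - d
        elif i > 0:
            delta = d - depths[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # chance the ray ends here
        depth += weight * d
        weight_sum += weight
        transmittance *= 1.0 - alpha
    # Normalizing by the accumulated weight is a choice of this sketch;
    # a ray through empty space has no termination depth.
    return depth / weight_sum if weight_sum > 0.0 else float("inf")

# A ray that passes through free space and hits a dense surface at 3 m.
print(expected_depth([0.0, 0.0, 50.0, 0.0], [1.0, 2.0, 3.0, 4.0]))
```

The same per-sample weights could also blend colors sampled from another image, which is the idea behind image-based rendering as described above.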

Paper Structure (23 sections, 14 equations, 15 figures, 9 tables)


Figures (15)

  • Figure 1: Knowledge Distillation from Multi-View to Single-View. We propose to boost single-view scene completion by exploiting additional information from multiple images. a) We first train a novel multi-view scene reconstruction algorithm that is able to fuse density fields from multiple images in a fully self-supervised manner. b) We then employ knowledge distillation to directly supervise a state-of-the-art single-view reconstruction model in 3D to boost its performance. https://keonhee-han.github.io/publications/kdbts/
  • Figure 2: Overview. Given multiple input images $\textbf{I}_k$ ($k \in I_D$), an encoder-decoder backbone predicts a pixel-aligned feature map $\textbf{F}_k$ per image (top left). The feature $f_{\textbf{u}}$ of pixel $\textbf{u}$ encodes the occupancy and confidence distribution of a ray cast through pixel $\textbf{u}$. Given a 3D point $\textbf{x}$ and its projections $\textbf{u}^\prime_k$ into the different camera images, we extract the corresponding feature vectors and positional embeddings $\gamma(d, \textbf{u})$. A multi-view network $\phi_\text{MV}$ decodes all feature vectors into a density prediction $\sigma_\textbf{x}$ (middle). Together with color samples from another image ($\textbf{I}_R$), this can be used to render novel views in an image-based rendering pipeline. $\textbf{I}_R$ is not required to be close to the input images, as our method can predict density in occluded regions. See Figure 3 (Training Setup) for more details about the importance of covering the whole scene. We train our networks by using a photometric consistency loss of an image $\textbf{I}_L$ close to $\textbf{I}_R$ (right).
  • Figure 3: Training Setup. Given an input view ($\textbf{I}_\text{1}$, blue), we want to reconstruct the scene, including partially occluded objects such as the green car. To learn to reconstruct both the free space behind the blue car and the surface of the green car, we try to render a pixel of the green view ($\textbf{I}_\text{L}$) by casting a ray and sampling points on it (dotted line). We project the points into the blue view to estimate the density and into the red view ($\textbf{I}_\text{R}$) to sample colors. The pixel of $\textbf{I}_\text{L}$ is only reconstructed correctly if $\textbf{I}_\text{1}$ estimates the correct density of the scene, although the green car and the free space next to it are occluded. In our multi-view method, we propose to use more views, e.g., the yellow view $\textbf{I}_\text{2}$, to aggregate more information about the scene and obtain better density predictions. In this example, the yellow view has slightly better visibility of the green car.
  • Figure 4: Knowledge Distillation. To improve the single-view (SV) density field reconstruction, we propose leveraging knowledge distillation from the multi-view (MV) predictions. Both $\phi_\text{SV}$ and $\phi_\text{MV}$ make use of the same encoder-decoder architecture and, therefore, the same feature vectors. The knowledge distillation loss $\mathcal{L}_\text{kd}$ pushes the $\phi_\text{SV}$ MLP to predict the same density as $\phi_\text{MV}$ while relying only upon a single feature vector. The stop-gradient operator prevents $\mathcal{L}_\text{kd}$ from influencing $\phi_\text{MV}$.
  • Figure 5: Density Fields. Top-down rendering of the density fields for an area of $x = \left[-9\,\text{m}, 9\,\text{m}\right]$, $y = \left[0\,\text{m}, 1\,\text{m}\right]$, $z = \left[3\,\text{m}, 23\,\text{m}\right]$. Images are taken from KITTI-360 (top half) and KITTI (bottom half) with profiles coming from models trained on KITTI-360. Every model except for MVBTS $(S, T)$ and IBRNet [wang2021ibrnet] gets the same input data. Our MVBTS can predict accurate geometry even in distant regions for both a single image and multiple images. KDBTS learns to recreate the accurate density structure from MVBTS. Both models reduce the amount of shadows produced by BTS [wimbauer2023behind], especially in distant regions. They also produce cleaner boundaries for close-by objects. Note that KDBTS has a smaller model capacity than MVBTS $(mono)$. $*$: changed sensitivity for visualization purposes, $\dagger$: retrained on KITTI-360.
  • ...and 10 more figures
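
The distillation setup described in the Figure 4 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the linear heads, the mean aggregation of view features in `mv_density`, the L1 form of the loss, and all names (`mv_density`, `sv_density`, `kd_step`) are hypothetical. The point it shows is the stop-gradient: the multi-view prediction is treated as a constant target, so only the single-view weights are updated.

```python
def dot(weights, feature):
    return sum(w * f for w, f in zip(weights, feature))

def mv_density(w_mv, view_features):
    # Hypothetical teacher: average the per-view feature vectors, then apply
    # a linear head. ReLU keeps the density non-negative.
    n = len(view_features)
    mean_feature = [sum(column) / n for column in zip(*view_features)]
    return max(0.0, dot(w_mv, mean_feature))

def sv_density(w_sv, feature):
    # Hypothetical student: sees only a single feature vector.
    return max(0.0, dot(w_sv, feature))

def kd_step(w_sv, w_mv, view_features, lr=0.1):
    """One gradient step on an assumed L1 distillation loss.

    The teacher output is used as a fixed number (stop-gradient), so w_mv is
    never modified here; only the student weights w_sv change.
    """
    target = mv_density(w_mv, view_features)   # constant target
    feature = view_features[0]                 # student uses one view only
    pred = sv_density(w_sv, feature)
    loss = abs(pred - target)
    # Subgradient of |pred - target| with respect to w_sv; it is zero where
    # the student's ReLU is inactive.
    sign = (pred > target) - (pred < target)
    active = 1.0 if dot(w_sv, feature) > 0.0 else 0.0
    grad = [sign * active * f for f in feature]
    new_w_sv = [w - lr * g for w, g in zip(w_sv, grad)]
    return new_w_sv, loss

# Toy usage: two views of the same 3D point, fixed teacher weights.
views = [[1.0, 0.0], [0.0, 1.0]]
w_mv = [2.0, 2.0]
w_sv = [0.5, 0.5]
for _ in range(3):
    w_sv, loss = kd_step(w_sv, w_mv, views)
    print(loss)
```

In an autodiff framework the same effect is usually obtained by detaching the teacher output from the computation graph before computing the loss.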