Unsupervised Continual Semantic Adaptation through Neural Rendering

Zhizheng Liu, Francesco Milano, Jonas Frey, Roland Siegwart, Hermann Blum, Cesar Cadena

TL;DR

The paper tackles unsupervised continual semantic adaptation of a segmentation model across a sequence of real-world scenes. For each scene, it trains a Semantic-NeRF that fuses the model's predictions and renders view-consistent pseudo-labels, which are then used to adapt the model. The Semantic-NeRF is trained jointly with the segmentation model, enabling 2D-3D knowledge transfer, and is compact enough to be kept in a long-term memory and rendered from arbitrary viewpoints to reduce forgetting on previous scenes. On ScanNet, the approach outperforms both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method, with improved per-scene adaptation and stronger knowledge retention across scenes.

Abstract

An increasing number of applications rely on data-driven models that are deployed for perception tasks across a sequence of scenes. Due to the mismatch between training and deployment data, adapting the model to the new scenes is often crucial to obtain good performance. In this work, we study continual multi-scene adaptation for the task of semantic segmentation, assuming that no ground-truth labels are available during deployment and that performance on the previous scenes should be maintained. We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model and then using the view-consistent rendered semantic labels as pseudo-labels to adapt the model. Through joint training with the segmentation model, the Semantic-NeRF model effectively enables 2D-3D knowledge transfer. Furthermore, due to its compact size, it can be stored in a long-term memory and subsequently used to render data from arbitrary viewpoints to reduce forgetting. We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
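The control flow described in the abstract (render pseudo-labels for the current scene, mix in data replayed from stored per-scene models, then store the new model) can be illustrated with a minimal sketch. This is not the authors' implementation: `SceneRenderer`, `LongTermMemory`, `adaptation_batch`, the pose names, and the number of replayed samples are all hypothetical placeholders, and no actual NeRF or segmentation network is trained here.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

# Stand-in for an (image, pseudo-label) pair: (scene id, camera pose name).
Sample = Tuple[int, str]


@dataclass
class SceneRenderer:
    """Hypothetical placeholder for a trained per-scene Semantic-NeRF."""
    scene_id: int

    def render(self, pose: str) -> Sample:
        # A real model would render color and semantic labels at `pose`.
        return (self.scene_id, pose)


@dataclass
class LongTermMemory:
    """Keeps one compact renderer per previously visited scene."""
    renderers: List[SceneRenderer] = field(default_factory=list)

    def replay(self, n: int, rng: random.Random) -> List[Sample]:
        # Render n samples from previous scenes at freshly drawn poses.
        if not self.renderers:
            return []
        return [
            rng.choice(self.renderers).render(f"novel_{rng.randrange(1000)}")
            for _ in range(n)
        ]


def adaptation_batch(memory: LongTermMemory, scene_id: int,
                     train_poses: List[str], n_replay: int,
                     rng: random.Random) -> List[Sample]:
    """Build the adaptation data for one scene, then store its renderer."""
    current = SceneRenderer(scene_id)  # assumed already fitted to this scene
    batch = [current.render(p) for p in train_poses]  # current-scene pseudo-labels
    batch += memory.replay(n_replay, rng)             # replay against forgetting
    memory.renderers.append(current)
    return batch


rng = random.Random(0)
memory = LongTermMemory()
batches = [adaptation_batch(memory, s, ["view_0", "view_1"], 2, rng)
           for s in range(3)]
```

In this sketch the first scene is adapted on its own views only, while every later scene also receives samples rendered from the stored models of earlier scenes; storing a renderer rather than raw images is what allows replay from viewpoints that were never captured.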

Paper Structure (26 sections, 5 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Effect of joint training over the pseudo-labels and the predictions of the segmentation network (DeepLab). Color-coded labels are overlaid on the corresponding color images. Black pixels in the ground-truth labels denote missing annotation. First scene: The noisy predictions of DeepLab are corrected and the segmentation results conform much better to the geometry of the scene. Second scene: The geometric details can be better recovered even for the legs of the table. Third scene: By enforcing multi-view consistency, the initial wrong predictions on the wall are corrected through the predictions from other views. Note that the obtained labels adhere accurately to the scene geometry, often even better than in the ground-truth annotations.
  • Figure 2: Effect of depth supervision and of the modification to the semantic loss on the rendered depth and semantics. Black pixels in the ground-truth depth and ground-truth semantics denote missing depth measurements and missing semantic annotations, respectively.
  • Figure 3: Visualization of the novel viewpoints used for adaptation in Sec. \ref{sec:appendix_replay_novel_viewpoints} for two example scenes (Scene $5$, left side, and Scene $6$, right side). The viewpoints $\mathbf{P}_j$ used for training and the novel viewpoints $\hat{\mathbf{P}}_j$ used for "replay" are shown in green and red, respectively.
  • Figure 4: Memory footprint of the different methods as a function of the number of previous scenes. Please refer to the text and to Tab. \ref{tab:memory_footprint} for a detailed explanation. We use solid lines for the number of scenes used in our experiments.
  • Figure 5: Comparison of example pseudo-labels obtained on the training views by the different methods.
  • ...and 2 more figures