ProtoDepth: Unsupervised Continual Depth Completion with Prototypes

Patrick Rim, Hyoungseob Park, S. Gangopadhyay, Ziyao Zeng, Younjoon Chung, Alex Wong

TL;DR

ProtoDepth addresses unsupervised depth completion under non-stationary data by freezing a pretrained backbone and learning domain-specific prototypes that adapt its latent features through a global multiplicative term and a local additive bias: $\hat{X} = A \odot X + B$. The additive bias is constructed via attention over a learned prototype bank $P$, with keys obtained from the prototypes by a projection and a stop-gradient operation, so that $B$ is computed from $Q$, $K$, and $P$ as $b = \text{softmax}(QK^{T}/\sqrt{c})P$ with $K = \text{StopGrad}(P)W$. This yields a compact, per-domain adaptation mechanism. To handle domain-agnostic inference, the method learns domain descriptors $r_k$ and uses cosine similarity with the input sample descriptor $s$ to select the appropriate prototype set, optimizing an additional term $\ell_{dr}$ that promotes discriminability between domains. Empirically, ProtoDepth and its agnostic variant ProtoDepth-A reduce forgetting by large margins across indoor and outdoor sequences and achieve state-of-the-art performance in unsupervised continual depth completion while adding only a small fraction of parameters, with applicability to both CNNs and transformers. The approach offers a practical, architecture-agnostic solution for continual learning in multimodal 3D reconstruction tasks. Training uses the weighted unsupervised objective $\mathcal{L} = w_{ph}\ell_{ph}+w_{sz}\ell_{sz}+w_{sm}\ell_{sm}$.
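The feature adaptation summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the list-of-lists layout (one row per spatial location), and the modeling of $A$ as a per-channel scale vector are assumptions made for illustration. The stop-gradient only affects training, so it appears here only as a comment.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def adapt_features(X, Q, P, W, A):
    """Compute X_hat = A * X + B with B = softmax(Q K^T / sqrt(c)) P.

    X: (n x c) latent features from the frozen backbone (assumed layout).
    Q: (n x c) frozen queries.
    P: (m x c) prototypes of the selected domain.
    W: (c x c) key projection.
    A: length-c per-channel scale (assumed form of the global term).
    """
    c = len(P[0])
    K = matmul(P, W)  # K = StopGrad(P) W; the stop-gradient matters only in training
    scores = matmul(Q, transpose(K))  # (n x m) query-key similarities
    attn = [softmax([s / math.sqrt(c) for s in row]) for row in scores]
    B = matmul(attn, P)  # (n x c) local additive bias
    return [[A[j] * X[i][j] + B[i][j] for j in range(c)] for i in range(len(X))]


# Hypothetical toy inputs: one spatial location, two channels, two prototypes.
identity = [[1.0, 0.0], [0.0, 1.0]]
X_hat = adapt_features(
    X=[[2.0, 4.0]],
    Q=[[0.0, 0.0]],
    P=[[1.0, 0.0], [0.0, 1.0]],
    W=identity,
    A=[1.0, 0.5],
)
```

In this toy case the zero query weights both prototypes equally, so the additive bias is their mean, and the backbone features are only rescaled and shifted, never overwritten.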

Abstract

We present ProtoDepth, a novel prototype-based approach for continual learning of unsupervised depth completion, the multimodal 3D reconstruction task of predicting dense depth maps from RGB images and sparse point clouds. The unsupervised learning paradigm is well-suited for continual learning, as ground truth is not needed. However, when training on new non-stationary distributions, depth completion models will catastrophically forget previously learned information. We address forgetting by learning prototype sets that adapt the latent features of a frozen pretrained model to new domains. Since the original weights are not modified, ProtoDepth does not forget when test-time domain identity is known. To extend ProtoDepth to the challenging setting where the test-time domain identity is withheld, we propose to learn domain descriptors that enable the model to select the appropriate prototype set for inference. We evaluate ProtoDepth on benchmark dataset sequences, where we reduce forgetting compared to baselines by 52.2% for indoor and 53.2% for outdoor to achieve the state of the art.

Paper Structure

This paper contains 21 sections, 14 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Results on KITTI validation set during continual training on outdoor dataset sequence (KITTI $\rightarrow$ Waymo $\rightarrow$ VKITTI).
  • Figure 2: Overview of ProtoDepth. (a) In the agnostic setting, a prototype set is selected by maximizing the cosine similarity between an input sample descriptor and the learned domain descriptors. In the incremental setting, the domain identity is known. (b) At inference, the similarity between the frozen queries and the keys of the selected prototype set determines how the learned prototypes contribute as local (additive) biases to the latent features. Additionally, a global (multiplicative) bias is applied using a $1\times1$ depthwise convolution.
  • Figure 3: Qualitative comparison of ProtoDepth and baseline methods using VOICED on KITTI after continual training on Waymo. (a) Input sample from KITTI. (b) Baseline methods exhibit significant forgetting, particularly for small-surface-area objects (e.g., street signs and lamp posts), where sparse depth is limited and photometric priors from KITTI are critical. In contrast, ProtoDepth produces high-fidelity depth predictions, effectively mitigating forgetting despite the large domain gap between KITTI and Waymo.
  • Figure 4: t-SNE plot of KBNet sample descriptors for indoor validation datasets (NYUv2, ScanNet, VOID) and their respective domain descriptors learned during training in the agnostic setting. While most sample descriptors align closely with their respective domain descriptors, some overlap enables cross-domain generalization, improving performance in challenging scenarios.
  • Figure 5: Qualitative comparison (1 of 2) of ProtoDepth and baseline methods using FusionNet on NYUv2 after continual training on ScanNet. Top row: Input sample from NYUv2. Following rows: Output depth and error maps (relative to ground-truth) of same sample from NYUv2 after continual training on ScanNet using each continual learning method.
  • ...and 1 more figure
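
The prototype-set selection described in the Figure 2 caption (agnostic setting) can be sketched as an argmax over cosine similarities. This is a minimal illustration, not the authors' code: the function names and the dictionary of domain descriptors are hypothetical, the descriptor values are made up, and how the sample descriptor is computed from the input is not specified here, so it is simply taken as given.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_domain(sample_descriptor, domain_descriptors):
    """Return the domain whose learned descriptor is most similar to the sample.

    sample_descriptor: list of floats (assumed to be already computed).
    domain_descriptors: dict mapping a domain name to its descriptor.
    """
    return max(
        domain_descriptors,
        key=lambda name: cosine_similarity(sample_descriptor, domain_descriptors[name]),
    )


# Hypothetical two-dimensional descriptors for two outdoor domains.
descriptors = {"KITTI": [1.0, 0.0], "Waymo": [0.0, 1.0]}
chosen = select_domain([0.9, 0.1], descriptors)  # selects "KITTI"
```

The chosen name would then index the prototype set used for inference. In the incremental setting this step is unnecessary because the domain identity is known.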