
Depth Completion as Parameter-Efficient Test-Time Adaptation

Bingxin Ke, Qunjie Zhou, Jiahui Huang, Xuanchi Ren, Tianchang Shen, Konrad Schindler, Laura Leal-Taixé, Shengyu Huang

TL;DR

CAPA reframes depth completion as test-time, parameter-efficient adaptation of frozen 3D foundation models to sparse depth cues, grounding their strong geometric priors with per-sample gradients. It implements two PEFT strategies, LoRA and Visual Prompt Tuning (VPT), that update only a tiny fraction of parameters, enabling efficient per-sample (and, for videos, sequence-level) fine-tuning while leaving the backbone untouched. Across indoor and outdoor datasets, CAPA achieves state-of-the-art accuracy and superior temporal consistency, significantly outperforming baselines and reducing the base model's error by 2–3x. This enables robust, scene-specific depth reconstruction on standard hardware and paves the way for practical high-fidelity 3D mapping and regeneration tasks with minimal computation.

Abstract

We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.
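The core idea in the abstract (freeze the backbone, fit a small set of extra parameters to the sparse depth available at inference time) can be illustrated with a toy sketch. This is not the authors' implementation: the two small matrices `W_enc` and `W_dec` are hypothetical stand-ins for a frozen foundation model, the adapter shapes, learning rate, and step count are arbitrary assumptions, and the loss is a plain mean squared error over the observed pixels.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def predict(W_enc, W_dec, A, B, x):
    """Toy model: frozen decoder applied to (frozen encoder + low-rank update B A)."""
    z = [e + d for e, d in zip(matvec(W_enc, x), matvec(B, matvec(A, x)))]
    return matvec(W_dec, z)

def adapt(W_enc, W_dec, x, sparse_depth, rank=2, steps=1000, lr=0.02, seed=0):
    """Fit only the adapters A, B to the sparse observations; W_enc, W_dec stay frozen.

    sparse_depth maps an output (pixel) index to its observed depth.
    Returns (A, B, loss_history).
    """
    rng = random.Random(seed)
    d_hid, d_in = len(W_enc), len(W_enc[0])
    A = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_hid)]  # zero init: start exactly at the prior
    n = len(sparse_depth)
    history = []
    for _ in range(steps):
        pred = predict(W_enc, W_dec, A, B, x)
        g_pred = [0.0] * len(W_dec)  # gradient is non-zero only at observed pixels
        loss = 0.0
        for i, c in sparse_depth.items():
            r = pred[i] - c
            loss += r * r / n
            g_pred[i] = 2.0 * r / n
        history.append(loss)
        # Backpropagate through the frozen decoder to the hidden features.
        g_z = [sum(W_dec[i][j] * g_pred[i] for i in range(len(W_dec)))
               for j in range(d_hid)]
        h = matvec(A, x)
        bt_g = [sum(B[j][k] * g_z[j] for j in range(d_hid)) for k in range(rank)]
        for j in range(d_hid):          # dL/dB = g_z h^T
            for k in range(rank):
                B[j][k] -= lr * g_z[j] * h[k]
        for k in range(rank):           # dL/dA = (B^T g_z) x^T
            for j in range(d_in):
                A[k][j] -= lr * bt_g[k] * x[j]
    return A, B, history

# Hypothetical frozen weights, input features, and two sparse depth observations.
W_enc = [[1.0, 0.0, 0.5], [0.5, 1.0, 1.0], [2.0, 0.0, 1.5]]
W_dec = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
x = [1.0, -0.5, 2.0]
observed = {0: 2.5, 3: 4.0}

base = matvec(W_dec, matvec(W_enc, x))
A, B, history = adapt(W_enc, W_dec, x, observed)
adapted = predict(W_enc, W_dec, A, B, x)
```

Because the adapters sit before the frozen decoder, fitting the two observed pixels also shifts the predictions at unobserved pixels, which is the toy analogue of grounding the prior in sparse measurements. Under the same assumptions, the sequence-level sharing described for videos would correspond to summing this loss over all frames while keeping a single pair `A`, `B`.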

Paper Structure (57 sections, 5 equations, 15 figures, 16 tables)


Figures (15)

  • Figure 1: CAPA performs depth completion by adapting geometric foundation models at test-time. By aligning the strong geometric prior of a base model with the sparse depth information of test samples, one obtains accurate reconstructions of scene layout and fine details, overcoming limitations of the base model such as distorted surfaces and misplaced objects, even under challenging conditions.
  • Figure 2: Method overview of CAPA. CAPA adapts 3D foundation models given sparse conditional depth ($\mathbf{C}$) by efficiently tuning its image encoder while keeping all pre-trained weights frozen. This is achieved by manipulating the attention layers via two methods: 1) CAPA$_\text{LoRA}$, which adds low-rank adapters to the projection weights, or 2) CAPA$_\text{VPT}$, which prepends tunable prompt tokens to the image token sequence before each attention layer.
  • Figure 3: Qualitative comparison on iBims, ScanNet, and Metropolis datasets. CAPA reliably recovers the full scene: in the first sample, despite 3D points being available only at depths $<5$ m, the geometric prior is calibrated sufficiently to reconstruct the full depth range, whereas most baselines focus on the well-constrained near-field; in the second sample, with only two observed points in the far-field, CAPA corrects the global geometry while preserving local structures; in the third sample, geometric structure is correctly recovered both near and far from the camera. Depth is color-coded from near to far, and errors from low to high. Colored arrows mark corresponding locations across images.
  • Figure 4: Qualitative comparison of temporal consistency. Three consecutive frames are overlaid using ground truth poses. On the left, CAPA reconstructs coherent building shapes, while the baselines are visibly distorted.
  • Figure 5: CAPA results when applied to other base models.
  • ...and 10 more figures
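
The Figure 2 caption describes CAPA$_\text{VPT}$ as prepending tunable prompt tokens to the image token sequence before each attention layer. The forward pass of that idea can be sketched as follows. This is a simplified illustration, not the paper's code: the attention here is single-head with identity query/key/value projections, the function names are hypothetical, and discarding the outputs at the prompt positions is an assumption made so the image token count stays unchanged.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * t[j] for wi, t in zip(w, tokens)) for j in range(d)])
    return out

def attention_with_prompts(image_tokens, prompt_tokens):
    """Prepend prompt tokens, run attention, and return only the image positions."""
    seq = prompt_tokens + image_tokens
    out = self_attention(seq)
    return out[len(prompt_tokens):]  # assumption: prompt outputs are dropped

image_tokens = [[1.0, 0.0], [0.0, 1.0]]
prompts = [[5.0, 5.0]]  # in prompt tuning these would be the only trainable values

plain = attention_with_prompts(image_tokens, [])
prompted = attention_with_prompts(image_tokens, prompts)
```

The image tokens and all attention weights are untouched; only the extra tokens change what each image token attends to, which is why the prompts alone can steer the frozen layer's output.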