Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models

Osher Rafaeli, Tal Svoray, Ariel Nahlieli

Abstract

Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches rely primarily on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, which limits their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation (MDE) models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE predictions, and further enables DSM updating and coupled RGB-DSM generation.
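The "linear fitting of MDE" baseline mentioned in the abstract can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: a single global scale and shift are fitted by ordinary least squares between the relative prediction and the pixels where a metric prior exists, and the fitted line is then applied to the missing pixels. The function names, the flat-list data layout, and the use of `None` to mark missing priors are all hypothetical choices made for this example. Prior2DSM, by contrast, predicts spatially varying scale and shift with an MLP, which this sketch does not attempt.

```python
def fit_global_affine(rel_values, prior_heights):
    """Least-squares scale s and shift t such that s * r + t ~= h.

    Hypothetical layout: both inputs are flat lists of equal length;
    prior_heights uses None where the metric prior is missing.
    """
    pairs = [(r, h) for r, h in zip(rel_values, prior_heights) if h is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pixels with a known prior")
    mean_r = sum(r for r, _ in pairs) / n
    mean_h = sum(h for _, h in pairs) / n
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    if var_r == 0:
        raise ValueError("relative values are constant over the known pixels")
    cov = sum((r - mean_r) * (h - mean_h) for r, h in pairs)
    scale = cov / var_r
    shift = mean_h - scale * mean_r
    return scale, shift


def complete_heights(rel_values, prior_heights):
    """Keep known priors; fill missing pixels with the fitted affine map."""
    scale, shift = fit_global_affine(rel_values, prior_heights)
    return [h if h is not None else scale * r + shift
            for r, h in zip(rel_values, prior_heights)]


# Toy example: five pixels, two of them without a metric prior.
rel_values = [0.0, 0.25, 0.5, 0.75, 1.0]
prior_heights = [2.0, None, 12.0, None, 22.0]
scale, shift = fit_global_affine(rel_values, prior_heights)
completed = complete_heights(rel_values, prior_heights)
print(scale, shift, completed)
```

Because one scale and one shift are shared by every pixel, a global fit of this kind cannot correct errors that vary across the scene, which is the limitation that spatially varying parameters are meant to address.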

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of the proposed height completion framework. Given an RGB image, a monocular depth estimation (MDE) model produces a relative height map, while a self-supervised DINOv3 encoder extracts dense semantic features. Incomplete metric priors are used together with a change mask to identify regions that require completion. A lightweight MLP predicts spatially varying scale and shift parameters that transform relative height predictions into metric heights. During test-time adaptation (TTA), LoRA is applied to the DINOv3 attention layers and the MLP is optimized using the available metric priors, enabling local calibration of the relative height prediction. The final output is a refined metric DSM where missing regions are reconstructed while preserving structural consistency.
  • Figure 2: Comparison of semantic feature representations extracted from DINOv3. Left: input RGB aerial image. Middle: native DINOv3 patch tokens obtained using the standard patch grid. Right: dense strided ViT tokens obtained using overlapping patch extraction, significantly increasing spatial sampling density and yielding smoother, higher-resolution semantic feature maps that better preserve fine spatial structures.
  • Figure 3: The experimental dataset consists of two domains: (1) the Denver NAIP dataset and (2) the WorldView-3 Fresno dataset. Both datasets include high-resolution RGB imagery, ground-truth normalized DSM (nDSM), and three levels of prior incompleteness (25%, 50%, and 75%) for evaluation.
  • Figure 4: Comparison of DSM completion methods. From left to right: RGB image, bilinear interpolation, global affine fitting, Locally Weighted Linear Regression (LWLR), spatial kNN propagation, DINOv3 feature-space kNN, and the proposed Prior2DSM. Each block corresponds to a different level of missing prior data (25%, 50%, and 75%). For each case, the ground truth (GT) DSM and the degraded prior input are shown on the left, followed by the reconstruction results and the corresponding absolute error maps. While simple interpolation and global fitting produce oversmoothed surfaces and spatial artifacts, semantic feature-based approaches better preserve structural details. The proposed framework consistently achieves more accurate reconstructions with lower errors, particularly under severe missing-data conditions. Height: black = low, yellow = high. Error: blue = low, red = high.
  • Figure 5: Prior-based DSM completion results. From left to right: RGB image, Marigold-DC, PriorDA, and the proposed framework. Each block corresponds to an increasing level of missing data (25%, 50%, and 75%). For each case, the ground truth (GT) DSM and the degraded prior input are shown on the left. The second row in each block shows the corresponding absolute error maps. Prior2DSM produces sharper building structures and lower reconstruction errors than existing prior-based approaches, particularly under severe missing-data conditions. Height: black = low, yellow = high. Error: blue = low, red = high.
  • ...and 2 more figures
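
The feature-space kNN baseline named in the Figure 4 caption can be sketched as follows. This is a toy illustration, not the paper's code: each pixel with a missing prior takes the mean height of the k known pixels whose feature vectors are most similar under cosine similarity. The 2-D feature vectors stand in for dense DINOv3 features, and the function names, the value of k, and the `None` convention for missing priors are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_knn_fill(features, prior_heights, k=2):
    """Fill missing heights from the k most feature-similar known pixels.

    Hypothetical layout: features[i] is the feature vector of pixel i and
    prior_heights[i] is its metric height, or None if it is missing.
    """
    known = [i for i, h in enumerate(prior_heights) if h is not None]
    if not known:
        raise ValueError("at least one known prior pixel is required")
    filled = list(prior_heights)
    for i, h in enumerate(prior_heights):
        if h is not None:
            continue
        # Rank known pixels by similarity to pixel i, most similar first.
        ranked = sorted(known,
                        key=lambda j: cosine(features[i], features[j]),
                        reverse=True)
        nearest = ranked[:k]
        filled[i] = sum(prior_heights[j] for j in nearest) / len(nearest)
    return filled


# Toy features: the first axis is "roof-like", the second "ground-like".
features = [
    [1.0, 0.1], [0.9, 0.2],    # known roof pixels
    [0.1, 1.0], [0.0, 0.9],    # known ground pixels
    [0.95, 0.15],              # missing pixel that looks like a roof
    [0.05, 1.0],               # missing pixel that looks like ground
]
prior_heights = [10.0, 12.0, 0.0, 1.0, None, None]
filled = feature_knn_fill(features, prior_heights, k=2)
print(filled)
```

Matching in feature space rather than image space lets a missing roof pixel borrow heights from other roofs even when its spatial neighbours are ground, which is consistent with the caption's observation that semantic feature-based approaches better preserve structural details.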