Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization

Yan Huang, Yongyi Su, Xin Lin, Le Zhang, Xun Xu

TL;DR

WeSTAR tackles the generalization challenge of monocular depth estimation foundation models by combining dense self-training with semantically-aware hierarchical depth normalization, cost-efficient weak ordinal supervision, and regularized LoRA-based adaptation. The framework mitigates confirmation bias from pseudo-labels and preserves pre-trained knowledge through a weight-regularization term, while using instance segmentation from SAM2 to make normalization more robust. Empirical results on NYU, KITTI, Sintel, DIODE, NuScenes, and corrupted variants demonstrate state-of-the-art generalization, especially under severe distribution shifts and across diverse backbones. The approach offers a practical way to adapt depth foundation models with minimal target-domain data, enabling more reliable depth perception in real-world conditions.

Abstract

The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable, multi-scale structural normalization. Beyond dense supervision, we introduce cost-efficient weak supervision in the form of pairwise ordinal depth annotations, which enforce informative ordinal constraints that mitigate local topological errors during adaptation. Finally, a weight regularization loss anchors the LoRA updates, ensuring training stability and preserving the model's generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
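
The last two components of the abstract, weak ordinal supervision and weight regularization, can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation: the exact loss forms are not given here. The sketch assumes a logistic ranking loss for ordered pairs, a squared penalty for pairs annotated as equal, and a regularizer equal to the squared Frobenius norm of the LoRA update $BA$. The names `softplus`, `ordinal_ranking_loss`, and `lora_update_penalty` are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordinal_ranking_loss(depth, pairs):
    """Mean ranking loss over weak pairwise ordinal annotations.

    depth: sequence of predicted depths, indexed by pixel id.
    pairs: list of (i, j, r) where r = +1 means pixel i is farther than j,
           r = -1 means it is closer, and r = 0 means roughly equal depth.
    """
    total = 0.0
    for i, j, r in pairs:
        diff = depth[i] - depth[j]
        if r == 0:
            total += diff * diff          # pull equal-depth pairs together
        else:
            total += softplus(-r * diff)  # penalize a violated ordering
    return total / len(pairs)


def lora_update_penalty(lora_b, lora_a):
    """Squared Frobenius norm of the low-rank update delta_W = B @ A.

    lora_b is a (rows x rank) matrix and lora_a a (rank x cols) matrix,
    both given as nested lists. A small penalty keeps the adapted weights
    close to the pre-trained ones (assumed form of the regularizer).
    """
    rank, cols = len(lora_a), len(lora_a[0])
    penalty = 0.0
    for row in lora_b:
        for n in range(cols):
            delta = sum(row[k] * lora_a[k][n] for k in range(rank))
            penalty += delta * delta
    return penalty
```

Under these assumptions, a prediction that respects an annotated ordering receives a lower loss than one that violates it, and the penalty is zero when the LoRA update vanishes, that is, when the model equals the pre-trained one.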

Paper Structure

This paper contains 15 sections, 8 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: MDE on unseen test samples, comparing the zero-shot predictions of the pre-trained model with the predictions of the model after adaptation.
  • Figure 2: Overview of the framework. Two augmentations are applied, and the teacher model $\tilde{\Theta}$ generates pseudo-labels. Self-training is regularized by model-weight consistency and, optionally, by weak labels to adapt the pre-trained foundation model.
  • Figure 3: Qualitative results on selected examples.
  • Figure 4: Comparative performance over training epochs on Sintel.
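
The teacher-student self-training described for Figure 2 might look roughly like the plain-Python sketch below. Several parts are assumptions rather than details confirmed by the text above: the normalization uses the median and mean absolute deviation, the dense loss is an L1 gap computed per region (the whole image plus one index list per instance mask, for example from SAM2), and the teacher is refreshed as an exponential moving average of the student. All function names are hypothetical.

```python
from statistics import median


def normalize(values):
    """Shift by the median and scale by the mean absolute deviation."""
    center = median(values)
    scale = sum(abs(v - center) for v in values) / len(values)
    scale = max(scale, 1e-6)  # guard against constant regions
    return [(v - center) / scale for v in values]


def region_normalized_l1(student, pseudo, regions):
    """Mean L1 gap between student depth and teacher pseudo-labels.

    student, pseudo: flat sequences of per-pixel depth.
    regions: list of pixel-index lists. Normalizing inside each region
    separately gives a coarse-to-fine (hierarchical) comparison.
    """
    total, count = 0.0, 0
    for idx in regions:
        s = normalize([student[i] for i in idx])
        t = normalize([pseudo[i] for i in idx])
        total += sum(abs(a - b) for a, b in zip(s, t))
        count += len(idx)
    return total / count


def ema_update(teacher, student, momentum=0.99):
    """Move teacher parameters slowly toward the student (assumed EMA rule)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]
```

Because each region is normalized independently, a student prediction that differs from the pseudo-label only by a positive scale and a shift incurs (near) zero loss, which is the usual motivation for scale-and-shift-invariant objectives in relative depth estimation.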