Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation

Jing Cao, Kui Jiang, Shenyi Li, Xiaocheng Feng, Yong Huang

TL;DR

SEC-Depth addresses the fragility of self-supervised monocular depth estimation in adverse weather by introducing a latency-model-based self-evolution contrastive learning framework. It builds a dynamic queue of historical models to generate negative samples and couples a novel interval-based depth distribution constraint with a self-evolution loss $L_c$, leveraging $P_A$, $P_P$, and $P_N$ distributions and Jensen-Shannon divergence. The approach is plug-and-play, compatible with existing baselines like MonoViT and PlaneDepth, and delivers strong zero-shot generalization across WeatherKITTI, DrivingStereo, Cityscapes variants, and more, achieving notable improvements over both standard baselines and prior robust methods. Overall, SEC-Depth provides a practical, dataset-agnostic path to robust depth perception in autonomous systems without requiring architectural changes or annotated data.
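The interval-based depth distribution and the Jensen-Shannon divergence mentioned above can be illustrated with a small sketch. It is not the paper's implementation: the equal-width binning, the bin count, the 0 to 80 m depth range, and the function names `interval_distribution` and `js_divergence` are all assumptions made for illustration.

```python
import math

def interval_distribution(depths, num_bins=8, d_min=0.0, d_max=80.0, eps=1e-8):
    """Bin depth values into equal-width intervals and normalize to a distribution.

    The binning scheme and default range are illustrative assumptions.
    """
    counts = [0.0] * num_bins
    width = (d_max - d_min) / num_bins
    for d in depths:
        clipped = min(max(d, d_min), d_max)
        idx = min(int((clipped - d_min) / width), num_bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    # Additive smoothing keeps every interval strictly positive.
    return [(c + eps) / (total + eps * num_bins) for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Two depth maps whose values fall into the same intervals give a divergence near zero, while maps concentrated in disjoint intervals approach the upper bound of ln 2. That bounded, symmetric behaviour is one plausible reason to prefer this divergence when comparing an anchor distribution against positive and negative ones.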

Abstract

Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for self-supervised robust depth estimation tasks. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy for latency models in the depth estimation task to capture optimization states across training stages. To effectively leverage latency models, we introduce a self-evolution contrastive loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the need for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.
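The idea of keeping intermediate parameters as latency models can be sketched as a fixed-size queue of snapshots. The paper's actual update strategy (its Algorithm 1) is not reproduced in this summary, so the fixed snapshot interval, the class name `LatencyModelQueue`, and its methods are hypothetical stand-ins rather than the authors' procedure.

```python
import copy
from collections import deque

class LatencyModelQueue:
    """Fixed-size FIFO queue of historical parameter snapshots (hypothetical)."""

    def __init__(self, max_size=3, update_interval=100):
        # deque(maxlen=...) drops the oldest snapshot automatically when full.
        self.snapshots = deque(maxlen=max_size)
        # Assumption: snapshot every `update_interval` steps; the paper's
        # dynamic update rule may differ.
        self.update_interval = update_interval

    def maybe_update(self, step, params):
        """Store a deep copy of the current parameters at snapshot steps."""
        if step % self.update_interval == 0:
            self.snapshots.append(copy.deepcopy(params))
            return True
        return False

    def negatives(self):
        """Return stored snapshots, oldest first, for producing negative samples."""
        return list(self.snapshots)
```

In a training loop, `maybe_update(step, params)` would be called after each optimizer step, and the snapshots returned by `negatives()` would be run on the augmented input to produce negative depth predictions. Deep-copying matters here: without it, every queue entry would alias the live parameters and the historical states would be lost.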

Paper Structure

This paper contains 32 sections, 12 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of latency model evolution. The left figure shows the relationship between training step $t$, training loss, and parameter update ratio, where a decreasing update ratio indicates model convergence. Models at different optimization steps $t$ within the parameter space are defined as latency models $F_t$. The right figure presents the depth outputs of latency models under adverse weather conditions. We leverage these evolving latency models to construct negative samples for contrastive learning, which encourages the depth model to learn robust representations from its own historical information.
  • Figure 2: Illustration of our proposed pipeline. (a) Self-supervised learning is conducted on clean images. When augmented samples are introduced, the loss is computed using Equation (4). (b) During training, we maintain a model queue of size $j$, with parameters updated according to Algorithm 1. (c) As the self-supervised model continues to train, the parameters stored in the model queue gradually converge toward suboptimal states. Our self-evolution contrastive loss is designed to effectively leverage this parametric evolution.
  • Figure 3: Illustration of the advantage of our interval-based depth modeling strategy. (1) The model can reliably distinguish samples with significant overall depth differences. (2) Our strategy can better distinguish samples with local depth differences.
  • Figure 4: (a) Qualitative results on the DrivingStereo and Cityscapes datasets with the MonoViT baseline. (b) Qualitative results on the DrivingStereo and Cityscapes datasets with the PlaneDepth baseline.
  • Figure 5: The relationship between GPU memory usage, training time, and the number of negative models.
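The caption of Figure 1 refers to a parameter update ratio whose decrease indicates convergence. The summary does not define this quantity, so the sketch below assumes one common choice: the norm of the parameter change between two steps divided by the norm of the earlier parameters. The function name `update_ratio` and the flat-list representation of parameters are illustrative only.

```python
import math

def update_ratio(prev_params, curr_params, eps=1e-12):
    """Relative size of a parameter update: ||curr - prev|| / ||prev||.

    This definition is an assumption; the paper may define the ratio differently.
    """
    diff_sq = sum((c - p) ** 2 for p, c in zip(prev_params, curr_params))
    norm_sq = sum(p ** 2 for p in prev_params)
    # eps guards against division by zero for all-zero parameters.
    return math.sqrt(diff_sq) / (math.sqrt(norm_sq) + eps)
```

Under this definition, a ratio that shrinks over training steps means successive latency models $F_t$ differ less and less, which matches the caption's reading of a decreasing ratio as convergence.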