Multi-Modality Driven LoRA for Adverse Condition Depth Estimation

Guanglei Yang, Rui Tian, Yongqiang Zhang, Zhun Zhong, Yongqiang Li, Wangmeng Zuo

TL;DR

This work tackles adverse condition depth estimation (ACDE), where labeled data is scarce and cross-modal alignment is weak. It introduces Multi-Modality Driven LoRA (MMD-LoRA), which combines Prompt Driven Domain Alignment (PDDA) with Visual-Text Consistent Contrastive Learning (VTCCL) and uses low-rank adapters in the image encoder to bridge domain gaps with a small parameter budget. The adapted weight is formalized as $W = W_0 + BA$ with $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$ and $r \ll \min(d,k)$. A two-stage training protocol first learns target-domain visual features and multimodal alignment, then injects LoRA blocks into the image encoder’s self-attention layers and fine-tunes depth estimation using ground-truth depth and captions. Experiments on nuScenes and Oxford RobotCar demonstrate state-of-the-art depth accuracy under night and rain, supporting the method's robustness and data efficiency without the need for additional target-domain images.
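The low-rank update $W = W_0 + BA$ can be illustrated with a minimal, dependency-free sketch. The function names (`lora_weight`, `lora_param_counts`), the toy dimensions, and the initialization (zero `B`, small random `A`, the convention of the original LoRA paper) are assumptions made for illustration; this summary does not state how MMD-LoRA initializes or sizes its adapters.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    inner = len(Y)
    assert all(len(row) == inner for row in X), "inner dimensions must match"
    cols = len(Y[0])
    return [
        [sum(X[i][t] * Y[t][j] for t in range(inner)) for j in range(cols)]
        for i in range(len(X))
    ]


def lora_weight(W0, B, A):
    """Return the adapted weight W = W0 + B A (W0 itself is left untouched)."""
    BA = matmul(B, A)
    return [[w + u for w, u in zip(row_w, row_u)] for row_w, row_u in zip(W0, BA)]


def lora_param_counts(d, k, r):
    """Parameters of a full d x k update versus the low-rank pair (B, A)."""
    return d * k, r * (d + k)


# Toy sizes, chosen only for illustration (assumption, not from the paper).
d, k, r = 8, 6, 2
rng = random.Random(0)

W0 = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]  # frozen weight
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init
A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]   # r x k

W = lora_weight(W0, B, A)
full, low = lora_param_counts(d, k, r)
print(f"full update: {full} parameters, low-rank update: {low} parameters")
```

With `B` initialized to zero, the adapted weight equals `W0` before any training, so the adapter starts as a no-op. For a hypothetical 768 x 768 projection with $r = 8$, the trainable parameter count drops from 589,824 to 12,288.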

Abstract

The autonomous driving community is increasingly focused on addressing corner case problems, particularly those related to ensuring driving safety under adverse conditions (e.g., nighttime, fog, rain). To this end, the task of Adverse Condition Depth Estimation (ACDE) has gained significant attention. Previous approaches in ACDE have primarily relied on generative models, which require additional target images to convert sunny-condition images into adverse-weather ones, or on learnable parameters for feature augmentation to bridge domain gaps, resulting in increased model complexity and tuning effort. Furthermore, unlike CLIP-based methods where textual and visual features have been pre-aligned, depth estimation models lack sufficient alignment between multimodal features, hindering coherent understanding under adverse conditions. To address these limitations, we propose Multi-Modality Driven LoRA (MMD-LoRA), which leverages low-rank adaptation matrices for efficient fine-tuning from the source domain to the target domain. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). During PDDA, the image encoder with MMD-LoRA generates target-domain visual representations, supervised by an alignment loss that requires the source-to-target difference to be equal in the language and image feature spaces. Meanwhile, VTCCL bridges the gap between textual features from CLIP and visual features from the diffusion model, pushing apart representations (visual and textual) of different weather conditions and pulling together those of similar ones. Through extensive experiments, the proposed method achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets, underscoring its robustness and efficiency in adapting to varied adverse environments.
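The two training signals described in the abstract can be sketched as follows. This is an illustrative reading only: the exact loss forms are not given in this summary, so the L1 distance in `alignment_loss`, the InfoNCE-style form of `contrastive_loss`, the `temperature` value, and all function names are assumptions rather than the authors' definitions.

```python
import math


def sub(u, v):
    """Element-wise difference u - v of two feature vectors."""
    return [a - b for a, b in zip(u, v)]


def alignment_loss(v_src, v_tgt, t_src, t_tgt):
    """PDDA-style idea: the source-to-target shift of the visual features
    should match the source-to-target shift of the text features.
    Assumption: mean absolute (L1) distance between the two shifts."""
    dv = sub(v_tgt, v_src)
    dt = sub(t_tgt, t_src)
    return sum(abs(a - b) for a, b in zip(dv, dt)) / len(dv)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(visual, text, temperature=0.1):
    """VTCCL-style idea: visual[i] and text[i] describe the same weather
    condition (positive pair); text[j], j != i, are negatives.
    Assumption: InfoNCE form over cosine similarities."""
    total = 0.0
    for i, v in enumerate(visual):
        logits = [cosine(v, t) / temperature for t in text]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(visual)


# Hypothetical 2-D features for two weather conditions (e.g. day, night).
visual = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(visual, text_matched)
swapped = contrastive_loss(visual, text_swapped)
print(f"matched pairs: {aligned:.5f}, swapped pairs: {swapped:.5f}")
```

In this toy setup the contrastive term is near zero when each visual feature matches the text feature of its own weather condition and large when the pairing is swapped, while the alignment term vanishes whenever the visual and textual domain shifts coincide.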

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Comparison of the baseline depth estimator EVP with learned augmentation and with learned LoRA on the nuScenes validation set. The y-axis is the cosine similarity between the estimated unseen target-domain visual representation and the real target-domain visual representation, and the x-axis denotes the depth estimation performance in $d_1$ for the night scene. The bubble size denotes the number of parameters (M). Best viewed in color and zoomed in.
  • Figure 2: Overview of the MMD-LoRA pipeline, including the pre-training step and the training step. In the pre-training step, MMD-LoRA captures accurate target-domain visual features and achieves robust multimodal alignment based on the multi-modal learning and contrastive learning paradigms. In the training step, we freeze the trained LoRA, inject its low-rank decomposition matrices into the 'q', 'k', 'v', and 'proj' layers of the image encoder's self-attention, and further optimize the depth estimator using the ground-truth depth map.
  • Figure 3: Qualitative results of our proposed MMD-LoRA and previous SOTA depth estimation methods on the nuScenes validation set. The first column shows the original image. The second, third, and fourth columns show the depth estimation results of Monodepth2, md4all-DD, and our MMD-LoRA, respectively. The final column shows the ground-truth depth maps.
  • Figure 4: Qualitative results of MMD-LoRA and previous SOTA depth estimation methods on the RobotCar test set. The first column shows the original image. The second, third, and fourth columns show the depth estimation results of Monodepth2, md4all-DD, and our MMD-LoRA, respectively. The final column shows the ground-truth depth maps.
  • Figure 5: Ablation visualization of our proposed MMD-LoRA with PDDA and VTCCL. The first column shows the original image. The second, third, and fourth columns show the baseline (i.e., EVP), MMD-LoRA with PDDA, and MMD-LoRA with PDDA and VTCCL, respectively. The final column shows the ground-truth depth maps.
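
The injection step described in the Figure 2 caption can be sketched as wrapping selected attention projections with a low-rank branch. The class `LoRALinear`, the helper `inject_lora`, the `freeze_lora` flag, the toy layer sizes, and the extra `mlp` layer are all hypothetical names introduced for illustration; the flag is plain bookkeeping, since no real optimizer or autograd is involved here.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """Linear map y = W0 x + B (A x); the base weight W0 is never updated."""

    def __init__(self, W0, rank, rng):
        d, k = len(W0), len(W0[0])
        self.W0 = W0                                    # frozen base weight
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d)]       # zero init: no-op at start
        self.lora_trainable = True                      # pre-training step

    def freeze_lora(self):
        """Mark the low-rank branch as frozen (training step in Figure 2)."""
        self.lora_trainable = False

    def __call__(self, x):
        base = matvec(self.W0, x)
        low_rank = matvec(self.B, matvec(self.A, x))    # never forms B A explicitly
        return [b + l for b, l in zip(base, low_rank)]


def inject_lora(layers, targets=("q", "k", "v", "proj"), rank=2, seed=0):
    """Wrap only the targeted layers; every other layer is returned unchanged."""
    rng = random.Random(seed)
    return {
        name: LoRALinear(W, rank, rng) if name in targets else W
        for name, W in layers.items()
    }


# Hypothetical 4 x 4 weights for one attention block plus an untouched MLP layer.
rng = random.Random(1)
names = ["q", "k", "v", "proj", "mlp"]
layers = {n: [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)] for n in names}

wrapped = inject_lora(layers)
for name in ("q", "k", "v", "proj"):
    wrapped[name].freeze_lora()

x = [1.0, -2.0, 0.5, 3.0]
print(wrapped["q"](x))
```

Computing `B (A x)` rather than forming `B A` keeps the extra cost proportional to the rank, which is the usual reason for keeping the low-rank branch separate from the base weight during training.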