MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation

Xingxing Zuo, Nikhil Ranganathan, Connor Lee, Georgia Gkioxari, Soon-Jo Chung

TL;DR

MonoTher-Depth addresses thermal monocular depth estimation (MDE) under scarce labeled data by transferring priors from a large RGB MDE model through confidence-aware distillation. The method introduces a confidence predictor that weights RGB-to-thermal depth guidance using cross-modal and depth-consistency metadata, and uses sub-pixel warping to align depths across modalities. On the MS$^2$ and ViViD++ datasets it reduces the zero-shot absolute relative error (AbsRel) by $22.88\%$ in scenarios without ground-truth depth, relative to a baseline without distillation. It also supports self-supervised fine-tuning and remains robust to imperfect RGB–thermal alignment, so it can be deployed on real robots without tightly co-registered RGB–T data.

Abstract

Monocular depth estimation (MDE) from thermal images is a crucial technology for robotic systems operating in challenging conditions such as fog, smoke, and low light. The limited availability of labeled thermal data constrains the generalization capabilities of thermal MDE models compared to foundational RGB MDE models, which benefit from datasets of millions of images across diverse scenarios. To address this challenge, we introduce a novel pipeline that enhances thermal MDE through knowledge distillation from a versatile RGB MDE model. Our approach features a confidence-aware distillation method that utilizes the predicted confidence of the RGB MDE to selectively strengthen the thermal MDE model, capitalizing on the strengths of the RGB model while mitigating its weaknesses. Our method significantly improves the accuracy of the thermal MDE, independent of the availability of labeled depth supervision, and greatly expands its applicability to new scenarios. In our experiments on new scenarios without labeled depth, the proposed confidence-aware distillation method reduces the absolute relative error of thermal MDE by 22.88\% compared to the baseline without distillation.
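The headline number above is a relative reduction in absolute relative error (AbsRel). As a minimal sketch of how such a figure is typically computed, the snippet below uses the standard AbsRel definition, the mean of $|d_{\text{pred}} - d_{\text{gt}}| / d_{\text{gt}}$ over pixels with valid ground truth. The function names and the toy depth values are hypothetical and are not taken from the paper; depths are passed as flat lists to keep the example free of dependencies.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    # Pixels with non-positive ground truth are treated as missing and skipped.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_reduction(baseline_err, new_err):
    """Fraction by which new_err is lower than baseline_err."""
    return (baseline_err - new_err) / baseline_err


# Hypothetical per-pixel depths in metres (illustrative values only).
gt = [2.0, 4.0, 10.0, 0.0]          # last pixel has no valid ground truth
baseline = [2.5, 5.0, 12.0, 3.0]    # stand-in for a model without distillation
distilled = [2.2, 4.4, 11.0, 3.0]   # stand-in for a model with distillation

err_base = abs_rel(baseline, gt)
err_dist = abs_rel(distilled, gt)
print(f"AbsRel baseline:  {err_base:.4f}")
print(f"AbsRel distilled: {err_dist:.4f}")
print(f"Relative reduction: {100 * relative_reduction(err_base, err_dist):.2f}%")
```

A reported reduction of 22.88\% therefore means the distilled model's AbsRel is about 0.77 times the baseline's, not that AbsRel fell by 22.88 percentage points.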

Paper Structure

This paper contains 17 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: System architecture of MonoTher-Depth. Our framework enhances the thermal MDE model by leveraging learned priors from an RGB model. To harness the strengths and mitigate the weaknesses of the RGB teacher model, we predict the confidence of its depth output $\hat{\mathbf{W}}_r$ using curated metadata that includes both thermal and RGB information. Whether or not ground-truth (GT) depth is available, our system improves thermal MDE through confidence-aware distillation by minimizing the confidence-weighted depth discrepancy between the predicted RGB depth $\hat{\mathbf{D}}_r$ and the warped thermal depth $\breve{\mathbf{D}}_{tr}$.
  • Figure 2: Pipeline of the confidence-aware distillation. The predicted confidence $\hat{\mathbf{W}}_r$ of the RGB depth $\hat{\mathbf{D}}_r$ plays a key role in both the negative log-likelihood loss $L_{\text{nll}}$ (\ref{eq:nll}) and the consistency loss $L_{\text{con}}$ (\ref{eq:con}). The $L_{\text{nll}}$ loss propagates gradients back to the confidence network, while the $L_{\text{con}}$ loss propagates gradients to the warped thermal depth $\breve{\mathbf{D}}_{tr}$. Gradient flow is stopped along all other paths.
  • Figure 3: Monocular depth estimation on the MS$^2$ dataset shin2023deep. Top to bottom: normalized thermal image, predicted thermal depth, RGB image, and predicted RGB depth. Red boxes highlight significant differences between the thermal and RGB depth predictions. Left to right: each pair of columns shows rainy, day, and night conditions, respectively.
  • Figure 4: Predicted confidence and depth error on MS$^2$ dataset shin2023deep. Left to right: depth error overlaid on the image, confidence overlaid on the image, and predicted RGB depth.
  • Figure 5: Monocular Depth Estimation on ViViD++ dataset lee2022vivid++. From left to right: the RGB image, our predicted RGB depth, the normalized thermal image, our predicted thermal depth with zero-shot (Thermal-Depth-ZS), our predicted thermal depth after self-supervised fine-tuning (Thermal-Depth-SSFT), the depth error of Thermal-Depth-ZS, and the depth error of Thermal-Depth-SSFT. The red boxes highlight areas with a significant decrease in error after self-supervised fine-tuning.
  • ...and 1 more figures
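The gradient routing described in the Figure 2 caption can be illustrated with a small sketch. This section does not give the forms of $L_{\text{nll}}$ and $L_{\text{con}}$, so the code below assumes a common Laplacian-style negative log-likelihood ($w \cdot e - \log w$ per pixel) and a confidence-weighted L1 discrepancy. These forms, the function names, and the toy values are assumptions for illustration, not the paper's equations. Plain Python has no automatic differentiation, so the comments only mark which inputs would be detached in a framework that does.

```python
import math


def nll_loss(depth_err, conf):
    """Assumed Laplacian-style NLL, averaged over pixels: conf * err - log(conf).

    depth_err: per-pixel error of the RGB depth against some reference
               (would be detached, so no gradient reaches the depth networks).
    conf:      per-pixel confidence in (0, 1]; the only trainable input here.
    """
    return sum(c * e - math.log(c) for e, c in zip(depth_err, conf)) / len(conf)


def consistency_loss(rgb_depth, warped_thermal_depth, conf):
    """Assumed confidence-weighted L1 discrepancy, averaged over pixels.

    rgb_depth and conf would be detached, so gradients would reach only the
    warped thermal depth (and through it, the thermal model).
    """
    return sum(
        c * abs(r - t) for r, t, c in zip(rgb_depth, warped_thermal_depth, conf)
    ) / len(conf)


# Hypothetical values: the RGB teacher is unreliable at the second pixel.
rgb_depth = [2.0, 9.0]
warped_thermal_depth = [2.5, 4.0]
conf = [1.0, 0.2]

print(consistency_loss(rgb_depth, warped_thermal_depth, conf))
print(nll_loss([0.1, 5.0], conf))
```

Under the assumed NLL form, the per-pixel term $w e - \log w$ is minimized at $w = 1/e$ (clipped to the valid range), so pixels where the RGB depth error is large are pushed toward low confidence and contribute less to the consistency term.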