Thermal Image Refinement with Depth Estimation using Recurrent Networks for Monocular ORB-SLAM3

Hürkan Şahin, Huy Xuan Pham, Van Huyen Dang, Alper Yegenoglu, Erdal Kayacan

Abstract

Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.
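The absolute relative error quoted above is the usual monocular-depth metric: the mean of |predicted - true| / true over pixels with valid ground truth. The following is a minimal sketch of that definition, not the authors' evaluation code; the function name `abs_rel`, the `min_depth` and `max_depth` validity thresholds, and the sample values are all assumptions made for illustration.

```python
def abs_rel(pred, gt, min_depth=1e-3, max_depth=None):
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels.

    `pred` and `gt` are flat sequences of depths in the same unit.
    A pixel is treated as valid when its ground-truth depth is at least
    `min_depth` and, if `max_depth` is given, at most `max_depth`.
    These thresholds are illustrative assumptions, not values from the paper.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    total = 0.0
    count = 0
    for p, g in zip(pred, gt):
        # Skip pixels without usable ground truth (e.g. sensor dropouts at 0).
        if g < min_depth:
            continue
        if max_depth is not None and g > max_depth:
            continue
        total += abs(p - g) / g
        count += 1

    if count == 0:
        raise ValueError("no valid ground-truth depths")
    return total / count


# Hypothetical depths in metres; the last pixel has no ground truth.
predicted = [1.0, 2.2, 2.7, 5.0]
ground_truth = [1.0, 2.0, 3.0, 0.0]
score = abs_rel(predicted, ground_truth)  # (0 + 0.1 + 0.1) / 3, about 0.067
```

Because each pixel's error is divided by its true depth, a 0.2 m error at 2 m counts the same as a 0.3 m error at 3 m, which makes scores comparable across near and far regions of a scene.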

Paper Structure (11 sections, 3 equations, 8 figures, 2 tables, 1 algorithm)

Figures (8)

  • Figure 1: Overview of the proposed thermal depth estimation pipeline. A raw 16-bit long-wave infrared (LWIR) image is first enhanced by the T-RefNet module, producing both an enhanced input for depth prediction and a color-mapped image for robust ORB-SLAM3 feature extraction. The encoder backbone extracts multi-scale features, which are processed by the RB (ConvGRU [convgru] or RC [jaeger2001echo]) to enforce temporal consistency. Finally, the decoder outputs dense depth maps and enhanced thermal images, which are integrated into ORB-SLAM3 [orbslam3] for robust feature extraction and metric-scale, temporally consistent tracking.
  • Figure 2: Recent thermal navigation frameworks found in the literature span ground, handheld, and indoor platforms, across datasets from urban driving to parking-lot and outdoor road scenes. Representative approaches include feature/semantics-aware tracking and point–line SLAM [11], self-supervised depth–ego-motion [12], NUC handling [11,13,15], LWIR-based trajectory prediction with MPC [14], and road-segmentation–based scale recovery [15].
  • Figure 3: Comparison of thermal image preprocessing methods. (a) Radiometric vs. non-radiometric thermal cameras at different T_BB values. Solid lines represent radiometric outputs, while dashed lines indicate non-radiometric behavior. (b) Thermal image enhancement techniques: i) Raw input suffers from noise that disrupts gradients; ii) Gaussian smoothing reduces noise but blurs edges; iii) CLAHE boosts local contrast but introduces spurious keypoints; iv) T-RefNet preserves edges while denoising, yielding stable features for SLAM.
  • Figure 4: Qualitative comparison across two datasets. Top: VIVID++; bottom: our dataset. Each row shows two temporally adjacent frames. Columns: (a) RGB, (b) thermal, (c) thermal-aligned ground-truth depth, (d) Shin et al. [ShinMaximizing], (e) DepthAnything-V2 [yang2024depthv2] (RGB-only), (f) our representative proposed model with RC.
  • Figure 5: Feature tracking results of ORB-SLAM3 using different image inputs: RGB images (top row), raw 8-bit thermal images (middle row), and T-RefNet enhanced thermal images (bottom row). While RGB features degrade under low-light indoor conditions, raw thermal inputs suffer from noise and low contrast. In contrast, T-RefNet outputs provide more stable and repeatable features, leading to improved tracking robustness.
  • ...and 3 more figures
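
For readers unfamiliar with the ConvGRU option named in the Figure 1 caption, the block below writes out one common textbook form of the convolutional GRU update. It is a sketch under stated assumptions, not the paper's exact formulation: here x_t stands for the encoder feature map at frame t, h_t for the recurrent state, * for 2-D convolution, and \odot for element-wise multiplication, and the kernels W, U and biases b are generic symbols introduced for illustration. Some references swap the roles of z_t and 1 - z_t in the last line.

```latex
\begin{aligned}
z_t         &= \sigma\left(W_z * x_t + U_z * h_{t-1} + b_z\right) \\
r_t         &= \sigma\left(W_r * x_t + U_r * h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h * x_t + U_h * (r_t \odot h_{t-1}) + b_h\right) \\
h_t         &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The update gate z_t decides, per pixel and channel, how much of the previous state h_{t-1} carries over to the current frame, which is the mechanism by which a block of this kind can smooth predictions across temporally adjacent frames.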