SC3EF: A Joint Self-Correlation and Cross-Correspondence Estimation Framework for Visible and Thermal Image Registration

Xi Tong, Xing Luo, Jiangxin Yang, Yanpeng Cao

TL;DR

This paper tackles the registration of visible and thermal (RGB-T) images, which is difficult because the two modalities differ significantly in appearance. It introduces SC$^{3}$EF, a joint framework that combines local high-frequency feature extraction with global low-frequency self-correlations to estimate dense cross-modal correspondences, which are then refined by a hierarchical optical flow decoder. The approach integrates four purpose-built modules (LFE, GSCE, LCCE, and GCCE) alongside a differentiable flow refinement process, achieving state-of-the-art results on KAIST, RoadScene, MFNet, and M3FD while generalizing well to RGB-N and RGB-D data. The work demonstrates robustness to large parallax, occlusions, and adverse weather, with practical relevance for ADAS and traffic monitoring applications.

Abstract

Multispectral imaging plays a critical role in a range of intelligent transportation applications, including advanced driver assistance systems (ADAS), traffic monitoring, and night vision. However, accurate visible and thermal (RGB-T) image registration poses a significant challenge due to the considerable modality differences. In this paper, we present a novel joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF), leveraging both local representative features and global contextual cues to effectively generate RGB-T correspondences. For this purpose, we design a convolution-transformer-based pipeline that extracts local representative features and encodes global intra-modality correlations, which are then used for inter-modality correspondence estimation between unaligned visible and thermal images. After merging the local and global correspondence estimation results, we further employ a hierarchical optical flow estimation decoder to progressively refine the estimated dense correspondence maps. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms current state-of-the-art (SOTA) methods on representative RGB-T datasets. It also shows competitive generalization across challenging scenarios, including large parallax, severe occlusions, and adverse weather, as well as on other cross-modal datasets (e.g., RGB-N and RGB-D).
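
To make the idea of global correspondence estimation concrete, the following is a minimal NumPy sketch of a generic technique: build an all-pairs correlation between two feature maps, apply a softmax over target positions, and read off a dense flow field as the expected displacement. This is not the SC3EF implementation. The function name `global_correlation_flow`, the `temperature` parameter, and the scaled dot-product similarity are all assumptions chosen for illustration, and the feature maps are assumed to come from some upstream extractor.

```python
import numpy as np

def global_correlation_flow(feat_src, feat_tgt, temperature=1.0):
    """Hypothetical helper: dense flow from all-pairs feature correlation.

    feat_src, feat_tgt: arrays of shape (H, W, C).
    Returns an array of shape (H, W, 2) holding (dx, dy) per source pixel.
    """
    h, w, c = feat_src.shape
    src = feat_src.reshape(h * w, c)
    tgt = feat_tgt.reshape(h * w, c)

    # All-pairs scaled dot-product similarity: (H*W, H*W).
    corr = src @ tgt.T / (np.sqrt(c) * temperature)

    # Numerically stable softmax over target positions.
    corr = corr - corr.max(axis=1, keepdims=True)
    prob = np.exp(corr)
    prob /= prob.sum(axis=1, keepdims=True)

    # Pixel coordinate grid in (x, y) order.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(h * w, 2).astype(np.float64)

    # Expected matching coordinate minus own coordinate gives the flow.
    matched = prob @ grid
    return (matched - grid).reshape(h, w, 2)
```

A lower `temperature` sharpens the softmax toward a hard argmax match. A coarse estimate of this kind is what a hierarchical decoder would then refine at finer scales.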

Paper Structure

This paper contains 25 sections, 9 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: An illustration of the use of both local representative features and global contextual cues for human observers to accurately identify correspondences between unaligned visible and thermal images.
  • Figure 2: Comparison of several SOTA cross-modality registration methods, including NeMAR [arar2020unsupervised], UMF-CMGR [wang2022unsupervised], CMF [zhou2022promoting], and our proposed SC$^{3}$EF method. SC$^{3}$EF achieves more accurate and robust registration of visible and thermal images with significant misalignment. Please zoom in to check the details highlighted in the yellow bounding boxes.
  • Figure 3: Overview of our proposed SC$^{3}$EF for RGB-T image registration, which contains four main stages: image decomposition, local feature and self-correlation extraction, local and global correspondence estimation, and hierarchical optical flow estimation. Key steps and registration performance (before and after) are zoomed in for clarity.
  • Figure 4: Detailed structures for the proposed (a) LFE and (b) GSCE modules.
  • Figure 5: Details of the proposed PPM that mainly consists of two stages: pyramid average-pooling and linear projection.
  • ...and 13 more figures
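
The Figure 5 caption describes the PPM as two stages: pyramid average-pooling and linear projection. The sketch below shows that general pattern in NumPy under stated assumptions. The function name `pyramid_pool_project`, the grid sizes `(1, 2, 4)`, the token concatenation, and the weight shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def pyramid_pool_project(feat, weight, bias, grid_sizes=(1, 2, 4)):
    """Hypothetical helper: pyramid average-pooling, then linear projection.

    feat:   (H, W, C) feature map; H and W must be divisible by each grid size.
    weight: (C, D) projection matrix; bias: (D,).
    Returns (N, D) tokens with N = sum(g * g for g in grid_sizes).
    """
    h, w, c = feat.shape
    tokens = []
    for g in grid_sizes:
        if h % g or w % g:
            raise ValueError(f"feature size {(h, w)} not divisible by grid {g}")
        # Average over non-overlapping cells to get a g x g pooled map.
        pooled = feat.reshape(g, h // g, g, w // g, c).mean(axis=(1, 3))
        tokens.append(pooled.reshape(g * g, c))
    # Concatenate tokens from all pyramid levels, then project each one.
    tokens = np.concatenate(tokens, axis=0)
    return tokens @ weight + bias
```

Pooling to a few coarse grids shrinks the token count from H*W to a small fixed number, which is the usual motivation for placing such a module ahead of attention-style global correlation.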