SC3EF: A Joint Self-Correlation and Cross-Correspondence Estimation Framework for Visible and Thermal Image Registration
Xi Tong, Xing Luo, Jiangxin Yang, Yanpeng Cao
TL;DR
This paper tackles the registration of visible and thermal (RGB-T) images, which exhibit a significant modality gap. It introduces SC$^{3}$EF, a joint framework that combines local high-frequency feature extraction with global low-frequency self-correlations to estimate dense cross-modal correspondences, which a hierarchical optical flow decoder then refines. The approach integrates four purpose-built modules (LFE, GSCE, LCCE, and GCCE) alongside a differentiable flow refinement process, achieving state-of-the-art results on KAIST, RoadScene, MFNet, and M3FD while generalizing well to RGB-N and RGB-D data. The work demonstrates robustness to large parallax, occlusions, and adverse weather, with practical relevance for ADAS and traffic monitoring applications.
Abstract
Multispectral imaging plays a critical role in a range of intelligent transportation applications, including advanced driver assistance systems (ADAS), traffic monitoring, and night vision. However, accurate visible and thermal (RGB-T) image registration remains a significant challenge because of the considerable differences between the two modalities. In this paper, we present a novel joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF), which leverages both local representative features and global contextual cues to generate RGB-T correspondences. To this end, we design a convolution-transformer-based pipeline that extracts local representative features and encodes global intra-modality correlations, and uses both for inter-modality correspondence estimation between unaligned visible and thermal images. After merging the local and global correspondence estimation results, we employ a hierarchical optical flow estimation decoder to progressively refine the estimated dense correspondence maps. Extensive experiments demonstrate the effectiveness of the proposed method, which outperforms current state-of-the-art (SOTA) methods on representative RGB-T datasets. It also shows competitive generalization across challenging scenarios, including large parallax, severe occlusions, and adverse weather, as well as on other cross-modal datasets (e.g., RGB-N and RGB-D).
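
To make "dense correspondence estimation" concrete, the following is a minimal, generic sketch and not the SC3EF architecture: it builds an all-pairs cosine-similarity (correlation) volume between two small feature grids and converts it to a flow field with a soft-argmax. The function names (`normalize`, `soft_argmax_flow`), the `temperature` parameter, and the toy one-hot features are assumptions introduced here for illustration; in the paper the features come from learned convolution-transformer modules, and the initial estimate is refined by a hierarchical decoder.

```python
import math

def normalize(v):
    """Scale a feature vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def soft_argmax_flow(feat_a, feat_b, temperature=0.05):
    """Estimate a dense flow (dx, dy) from grid A to grid B.

    feat_a, feat_b: H x W nested lists of C-dimensional feature vectors.
    For each position in A, cosine similarities to all positions in B are
    turned into softmax weights; the flow is the weighted mean target
    coordinate minus the source coordinate.
    """
    h, w = len(feat_a), len(feat_a[0])
    targets = [(y, x, normalize(feat_b[y][x]))
               for y in range(len(feat_b)) for x in range(len(feat_b[0]))]
    flow = [[None] * w for _ in range(h)]
    for ya in range(h):
        for xa in range(w):
            fa = normalize(feat_a[ya][xa])
            # One row of the correlation volume, scaled by the temperature.
            scores = [sum(p * q for p, q in zip(fa, fb)) / temperature
                      for _, _, fb in targets]
            m = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            ey = sum(wt * yb for wt, (yb, _, _) in zip(weights, targets)) / z
            ex = sum(wt * xb for wt, (_, xb, _) in zip(weights, targets)) / z
            flow[ya][xa] = (ex - xa, ey - ya)
    return flow

# Toy data (hypothetical): every position in A has a unique one-hot feature,
# and B is A shifted one pixel to the right, wrapping around at the border.
H, W = 3, 4

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

feat_a = [[one_hot(y * W + x, H * W) for x in range(W)] for y in range(H)]
feat_b = [[feat_a[y][(x - 1) % W] for x in range(W)] for y in range(H)]
flow = soft_argmax_flow(feat_a, feat_b)
```

With these idealized features the matching is unambiguous, so the recovered flow is close to the true one-pixel shift. Real visible and thermal features are far less discriminative across modalities, which is the gap the paper's self-correlation and cross-correspondence modules are designed to address.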
