XoFTR: Cross-modal Feature Matching Transformer
Önder Tuzcuoğlu, Aybora Köksal, Buğra Sofu, Sinan Kalkan, A. Aydın Alatan
TL;DR
XoFTR addresses cross-modal local feature matching between visible and thermal images with a two-stage training regime: masked image modeling pre-training followed by augmentation-based fine-tuning on pseudo-thermal data. The method uses a coarse-to-fine pipeline with sub-pixel refinement to match robustly across scale, viewpoint, and texture differences, and it combines one-to-many and one-to-one assignments to improve match reliability. A new dataset, METU-VisTIR, is introduced to evaluate cross-modal matching under diverse weather and viewpoint conditions. Experiments show state-of-the-art performance on pose and homography estimation benchmarks, and ablations support the contribution of each component. The work is relevant to cross-modal localization and mapping in outdoor environments, where thermal imaging is more robust to lighting and weather than visible imaging.
Abstract
We introduce XoFTR, a cross-modal, cross-view method for local feature matching between thermal infrared (TIR) and visible images. Unlike visible images, TIR images are less susceptible to adverse lighting and weather conditions, but they are difficult to match against visible images because of significant texture and intensity differences. Current hand-crafted and learning-based methods for visible-TIR matching fall short in handling viewpoint, scale, and texture diversity. To address this, XoFTR incorporates masked image modeling pre-training and fine-tuning with pseudo-thermal image augmentation to handle the modality differences. Additionally, we introduce a refined matching pipeline that adjusts for scale discrepancies and enhances match reliability through sub-pixel-level refinement. To validate our approach, we collect a comprehensive visible-thermal dataset and show that our method outperforms existing methods on many benchmarks.
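As a rough illustration of the masked image modeling idea mentioned above, the following sketch hides a random subset of non-overlapping patches in an image, which is the kind of input a network would be trained to reconstruct during pre-training. It is a generic example and not the authors' implementation: the function name `mask_random_patches`, the patch size, the mask ratio, and the choice to fill masked patches with zeros are all assumptions made for illustration.

```python
import numpy as np


def mask_random_patches(image, patch_size=16, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper for illustration only; patch_size, mask_ratio and
    zero-filling are assumed values, not taken from the paper.

    Returns the masked copy of the image and a boolean patch-grid mask
    (True marks a masked patch).
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Choose which patches to hide, without repetition.
    flat_mask = np.zeros(num_patches, dtype=bool)
    flat_mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    patch_mask = flat_mask.reshape(grid_h, grid_w)

    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(
        np.repeat(patch_mask, patch_size, axis=0), patch_size, axis=1
    )

    masked = image.copy()
    masked[pixel_mask] = 0  # works for grayscale (H, W) and color (H, W, C)
    return masked, patch_mask
```

In a pre-training loop of this kind, the masked image would be fed to the network and the loss computed on the hidden patches, so the returned `patch_mask` indicates where the reconstruction is evaluated.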
