XoFTR: Cross-modal Feature Matching Transformer

Önder Tuzcuoğlu, Aybora Köksal, Buğra Sofu, Sinan Kalkan, A. Aydın Alatan

TL;DR

XoFTR tackles the challenging problem of cross-modal local feature matching between visible and thermal images by introducing a two-stage training regime that combines masked image modeling pre-training with augmentation-based fine-tuning on pseudo-thermal data. The method employs a coarse-to-fine pipeline with sub-pixel refinement, enabling robust matching across scale, viewpoint, and texture differences, and integrates one-to-many and one-to-one assignments to improve reliability. A new METU-VisTIR dataset is proposed to evaluate cross-modal matching under diverse weather and viewpoint conditions, and extensive experiments show state-of-the-art performance on pose and homography estimation benchmarks, with strong ablations confirming the effectiveness of each component. The work has practical implications for cross-modal localization and mapping in outdoor environments where thermal imaging offers robustness to lighting and weather, while aligning with existing deep-learning frameworks for efficiency and scalability.

Abstract

We introduce XoFTR, a cross-modal, cross-view method for local feature matching between thermal infrared (TIR) and visible images. Unlike visible images, TIR images are less susceptible to adverse lighting and weather conditions, but they are difficult to match because of significant texture and intensity differences. Current hand-crafted and learning-based methods for visible-TIR matching fall short in handling variations in viewpoint, scale, and texture. To address this, XoFTR incorporates masked image modeling pre-training and fine-tuning with pseudo-thermal image augmentation to handle the modality differences. Additionally, we introduce a refined matching pipeline that adjusts for scale discrepancies and enhances match reliability through sub-pixel-level refinement. To validate our approach, we collect a comprehensive visible-thermal dataset and show that our method outperforms existing methods on many benchmarks.

Paper Structure (16 sections, 12 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Our XoFTR provides significant improvements over LoFTR (Sun et al., 2021) on visible and thermal image pairs. Only the inlier matches after RANSAC are shown, and matches with epipolar error below $5\times 10^{-4}$ are drawn in green.
  • Figure 2: Overview of the proposed method. XoFTR consists of four modules: (1) A CNN backbone which extracts features at scales of $1/8$, $1/4$, and $1/2$. (2) The coarse-level matching module (CLMM), which correlates the features and creates coarse-level match predictions (at $1/8$ scale), allowing one-to-one and one-to-many assignment. (3) The fine-level matching module (FLMM), which re-matches coarse-level match predictions at $1/2$ scale and creates fine-level match predictions, filtering low-confidence matches. (4) The sub-pixel refinement module (SPRM) for refining fine-level match predictions at the sub-pixel level with a regression mechanism, preventing a point in one image from matching with multiple points in the other image.
  • Figure 3: Visualization of reconstructed images using the MIM pretext task. Input images are from Hwang et al. (2015).
  • Figure 4: Pseudo-thermal image samples generated with the proposed augmentation method together with real counterparts.
  • Figure 5: Visualization of some images from our dataset.
  • ...and 2 more figures
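
The Figure 1 caption colors a match green when its epipolar error is below $5\times 10^{-4}$. The caption does not say which error definition is used, so the following is only a minimal sketch under the assumption that it is the squared symmetric epipolar distance computed in normalized (intrinsics-removed) image coordinates. The function names, the toy pose, and the 3D points are hypothetical and chosen purely for illustration.

```python
import numpy as np

THRESH = 5e-4  # threshold quoted in the Figure 1 caption


def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def symmetric_epipolar_distance(pts0, pts1, E):
    """Squared symmetric epipolar distance per match (assumed definition).

    pts0, pts1: (N, 2) normalized image coordinates in image 0 and image 1.
    E: (3, 3) essential matrix satisfying p1^T E p0 = 0 for true matches.
    """
    p0 = np.hstack([pts0, np.ones((len(pts0), 1))])  # homogeneous, (N, 3)
    p1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    Ep0 = p0 @ E.T   # row i is E @ p0[i]: epipolar line in image 1
    Etp1 = p1 @ E    # row i is E.T @ p1[i]: epipolar line in image 0
    d = np.sum(p1 * Ep0, axis=1)  # algebraic residual p1^T E p0
    return d ** 2 * (1.0 / (Ep0[:, 0] ** 2 + Ep0[:, 1] ** 2)
                     + 1.0 / (Etp1[:, 0] ** 2 + Etp1[:, 1] ** 2))


# Toy setup (hypothetical): camera 1 is camera 0 translated along x.
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R

X0 = np.array([[0.0, 0.0, 4.0],
               [1.0, -0.5, 5.0],
               [-1.0, 0.5, 3.0],
               [0.5, 1.0, 6.0]])      # 3D points in the camera-0 frame
X1 = X0 @ R.T + t                     # same points in the camera-1 frame
pts0 = X0[:, :2] / X0[:, 2:]
pts1 = X1[:, :2] / X1[:, 2:]
pts1[3, 1] += 0.1                     # corrupt the last match on purpose

errs = symmetric_epipolar_distance(pts0, pts1, E)
inlier_mask = errs < THRESH           # True would be drawn in green
print(errs, inlier_mask)
```

In this toy case the three geometrically consistent matches fall below the threshold and the deliberately corrupted one does not. With real data, pixel coordinates would first be normalized with each camera's intrinsics, which matters for visible-thermal pairs because the two sensors typically have different intrinsics.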