
A Light-weight Transformer-based Self-supervised Matching Network for Heterogeneous Images

Wang Zhang, Tingting Li, Yuntian Zhang, Gensheng Pei, Xiruo Jiang, Yazhou Yao

TL;DR

Matching heterogeneous remote sensing images (visible and NIR) is hampered by nonlinear radiometric differences and limited labeled data. The paper introduces LTFormer, a light-weight descriptor network based on a pyramid vision transformer and trained in a self-supervised manner with a triplet loss, LT Loss, that pulls anchor and positive patches together while pushing negatives away. Key contributions include a self-supervised triplet data construction, a compact transformer-based descriptor with reduced channel counts, and the LT Loss, which improves embedding discriminability. Empirically, LTFormer outperforms traditional handcrafted descriptors and is competitive with state-of-the-art learning-based methods, while requiring no annotated data and little computation, which makes it practical for remote sensing image matching.

Abstract

Matching visible and near-infrared (NIR) images remains a significant challenge in remote sensing image fusion, and the nonlinear radiometric differences between heterogeneous remote sensing images make the task even more difficult. Deep learning has gained substantial attention in computer vision in recent years, but many methods rely on supervised learning and require large amounts of annotated data, which is frequently limited in remote sensing image matching. To address this challenge, this paper proposes a novel keypoint descriptor approach that obtains robust feature descriptors via a self-supervised matching network. A light-weight transformer network, termed LTFormer, is designed to generate deep-level feature descriptors. Furthermore, we implement an innovative triplet loss function, LT Loss, to further enhance matching performance. Our approach outperforms conventional hand-crafted local feature descriptors and is competitive with state-of-the-art deep learning-based methods, even when annotated data is scarce.
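The abstract does not give the exact form of LT Loss, so the following is only a minimal sketch of the generic triplet margin loss that such losses build on: the anchor descriptor is pulled toward the positive and pushed away from the negative by at least a margin. The function names, the L2 normalization step, and the margin value of 1.0 are assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a descriptor to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def euclidean(u, v):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    This is the standard formulation, NOT the paper's LT Loss, whose
    exact definition is not stated in the abstract.
    """
    a, p, n = (l2_normalize(x) for x in (anchor, positive, negative))
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)


# Toy 2-D descriptors: the positive points the same way as the anchor,
# the negative points the opposite way, so the margin is already satisfied.
easy = triplet_margin_loss([1.0, 0.0], [2.0, 0.0], [-1.0, 0.0])

# Swapping the roles violates the margin and yields a positive loss.
hard = triplet_margin_loss([1.0, 0.0], [-1.0, 0.0], [1.0, 0.0])
```

In a training loop the same quantity would be averaged over a batch of triplets and minimized with respect to the descriptor network's weights.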

Paper Structure (23 sections, 5 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Overview of our LTFormer framework. The feature point detector uses the default SIFT algorithm, while our model generates the feature descriptors. Self-supervised training begins by forming triplets of these descriptors to obtain correspondences. The framework thus extracts robust deep feature descriptors for matching keypoints between visible and near-infrared images.
  • Figure 2: Overview of the self-supervised training process. Using triplet descriptors, LT Loss operates in the feature space to pull anchor patches as close as possible to positive patches while pushing them away from negative patches.
  • Figure 3: Samples of the WHU-OPT-SAR dataset. The optical image is obtained by merging a visible image with a near-infrared image.
  • Figure 4: Sample triplet dataset showing triplet morphology after different homography transformations.
  • Figure 5: Qualitative comparison of LTFormer with traditional methods such as SIFT, ORB, AKAZE and BRISK.
  • ...and 1 more figure
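The Figure 4 caption mentions homography transformations in the construction of triplets. As a rough illustration of what such a transformation does, the sketch below maps the corners of a patch through a 3x3 homography with perspective division. The sampler `random_homography`, its parameter ranges, and the 64x64 patch size are hypothetical choices for this example; the paper's actual sampling scheme is not described in this summary.

```python
import math
import random


def apply_homography(H, points):
    """Map 2-D points through a 3x3 homography, including perspective division."""
    warped = []
    for x, y in points:
        xh = H[0][0] * x + H[0][1] * y + H[0][2]
        yh = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        warped.append((xh / w, yh / w))
    return warped


def random_homography(rng, max_angle=0.2, max_shift=4.0, max_persp=1e-3):
    """Hypothetical sampler: small rotation, translation, and perspective terms."""
    t = rng.uniform(-max_angle, max_angle)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    px = rng.uniform(-max_persp, max_persp)
    py = rng.uniform(-max_persp, max_persp)
    return [
        [math.cos(t), -math.sin(t), tx],
        [math.sin(t), math.cos(t), ty],
        [px, py, 1.0],
    ]


# Corners of an assumed 64x64 anchor patch (pixel coordinates).
corners = [(0.0, 0.0), (63.0, 0.0), (63.0, 63.0), (0.0, 63.0)]

# Warping the same patch gives the geometry of a candidate positive sample;
# a negative sample would be taken from a different image location.
rng = random.Random(0)
positive_corners = apply_homography(random_homography(rng), corners)
```

In practice the warped corners would drive an image resampling step (for example a perspective warp in an image library) to produce the positive patch itself.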