UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants

Yide Di, Yun Liao, Hao Zhou, Kaijun Zhu, Qing Duan, Junhui Liu, Mingyu Lu

TL;DR

UFM tackles unified feature matching across diverse image modals by introducing a Multimodal Image Assistant (MIA) Transformer that augments a generic FFN with modal-specific assistants and shared attention mechanisms. A data augmentation pipeline and a staged pre-training strategy address data sparsity and modality imbalance, enabling effective fine-tuning for both same-modal and cross-modal tasks. The approach employs a coarse-to-fine dense matching framework with epipolar and cycle-consistency losses, achieving strong generalization across benchmarks while remaining computationally efficient. The results demonstrate competitive or superior performance on both same- and cross-modal matching, with practical implications for multimodal vision tasks and downstream applications.

Abstract

Image feature matching, a foundational task in computer vision, remains challenging for multimodal image applications, often necessitating intricate training on specific datasets. In this paper, we introduce a Unified Feature Matching pre-trained model (UFM) designed to address feature matching challenges across a wide spectrum of modal images. We present Multimodal Image Assistant (MIA) transformers, finely tunable structures adept at handling diverse feature matching problems. UFM exhibits versatility in addressing both feature matching tasks within the same modal and those across different modals. Additionally, we propose a data augmentation algorithm and a staged pre-training strategy to effectively tackle challenges arising from sparse data in specific modals and imbalanced modal datasets. Experimental results demonstrate that UFM excels in generalization and performance across various feature matching tasks. The code will be released at: https://github.com/LiaoYun0x0/UFM.

Paper Structure

This paper contains 17 sections, 15 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Illustration of feature matching of UFM. When working with specific data, the pre-trained backbone is frozen, and only the corresponding modal assistants need to be fine-tuned for feature matching. The multi-modal assistants contain both same-modal assistants and different-modal matching assistants.
  • Figure 2: The overview of UFM and the MIA Transformer. The UFM model encompasses all the processes involved in image enhancement, feature extraction, coarse matching, and fine matching. The MIA Transformer is utilized in both the coarse and fine matching stages.
  • Figure 3: Data augmentation is applied both geometrically and in terms of intensity. Geometrically, the images are mirrored, flipped, rotated, and randomly cropped. For intensity augmentation, random noise is added, and random masking is applied. Finally, a square matrix (the GT matrix) of dimensions $N\times N$ represents the correspondence of matching points between the two images; $\operatorname{GT}(i, j)$ denotes the element in the $i$-th row and $j$-th column of the GT matrix. The input image pairs shown are optical and SAR image pairs, used as an example.
  • Figure 4: Illustration of Pre-Training. The pre-training of NIR and SAR images is taken here as an example. Pre-training consists of 3 stages: (1) pre-train the general FFN, (2) pre-train all X-X assistants, and (3) pre-train all X-Y assistants.
  • Figure 5: Fine-tuning on same-modal feature matching tasks. The X-FFN and Y-FFN represent the assistants for any two different image modalities pre-trained in the second stage of Fig. 4. The fine-tuning of the X-modal image and the fine-tuning of the Y-modal image are independent of each other.
  • ...and 8 more figures
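
The Figure 3 caption describes the GT matrix only in words. The following is a minimal sketch of how such a matrix could be built. It assumes each image is split into a row-major grid of patches (so $N$ is the number of patches) and that the only augmentation is a horizontal mirror. Both are illustrative assumptions, and the function name `gt_matrix_for_mirror` is hypothetical; this is not the authors' implementation.

```python
import numpy as np


def gt_matrix_for_mirror(grid_h: int, grid_w: int) -> np.ndarray:
    """Return an N x N matrix (N = grid_h * grid_w) with GT[i, j] = 1 when
    patch i of image A corresponds to patch j of image B.

    Assumption for illustration: image B is the horizontal mirror of image A,
    and patches are indexed in row-major order.
    """
    n = grid_h * grid_w
    gt = np.zeros((n, n), dtype=np.uint8)
    for r in range(grid_h):
        for c in range(grid_w):
            i = r * grid_w + c                 # patch index in image A
            j = r * grid_w + (grid_w - 1 - c)  # where that patch lands in image B
            gt[i, j] = 1
    return gt


gt = gt_matrix_for_mirror(2, 3)
print(gt)
```

Under these assumptions every row and every column of the matrix contains exactly one 1. Cropping or masking, as in the caption, would instead leave some rows or columns empty, because the corresponding patches have no match in the other image.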