MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, Xiaowei Zhou

TL;DR

The paper introduces a universal cross-modality image matching framework that scales up pre-training with synthetic cross-modal training signals and a mixture of data from multiple sources to train detector-free Transformer matchers. By jointly leveraging multi-view geometry, video trajectories, and warping of single images, plus pixel-aligned cross-modal stimulus data, the approach generalizes well to unseen cross-modality registration tasks in medical imaging, histopathology, remote sensing, and autonomous systems. Empirical results across nine datasets show large, consistent gains over state-of-the-art baselines, with relative improvements reaching hundreds of percent on some metrics, and a single set of pre-trained weights performs well without task-specific fine-tuning. The framework thus offers a scalable path to robust, high-precision cross-modality matching and broader multi-modality analysis in science and engineering. The paper also discusses limitations and directions for future work, such as fine-tuning for extreme viewpoint gaps.

Abstract

Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence analysis and beyond.
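
The TL;DR lists warping of single images as one source of training pairs. The sketch below illustrates that idea under stated assumptions: the warp is taken to be a random homography, and the function names and the `max_shift` parameter are hypothetical, since the paper's actual warp model and sampling ranges are not given in this summary. Because the warp is known, ground-truth correspondences between the original and the warped view come for free.

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct linear transform: find H (3x3) with dst ~ H @ src from 4+ point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def random_homography(height, width, max_shift=0.15, rng=None):
    """Assumed warp model: jitter the four image corners by up to max_shift of the image size."""
    rng = np.random.default_rng() if rng is None else rng
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype=np.float64,
    )
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * [width, height]
    return homography_from_points(corners, corners + jitter)

def warp_points(H, pts):
    """Map (N, 2) pixel coordinates through the homography H."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def sample_correspondences(H, height, width, n=100, rng=None):
    """Sample pixels in the source image and keep those that land inside the warped image."""
    rng = np.random.default_rng() if rng is None else rng
    pts = rng.uniform([0, 0], [width - 1, height - 1], size=(n, 2))
    warped = warp_points(H, pts)
    inside = (
        (warped[:, 0] >= 0) & (warped[:, 0] <= width - 1)
        & (warped[:, 1] >= 0) & (warped[:, 1] <= height - 1)
    )
    return pts[inside], warped[inside]

# Usage: ground-truth matches for a 640x480 image and its (hypothetical) warped copy.
rng = np.random.default_rng(0)
H = random_homography(480, 640, rng=rng)
src_pts, dst_pts = sample_correspondences(H, 480, 640, n=200, rng=rng)
```

In a full pipeline the second image would be resampled with the same `H` (for example with OpenCV's `cv2.warpPerspective`) and then converted to another modality, so that `src_pts` and `dst_pts` supervise a cross-modal pair. That second step is outside this sketch.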

Paper Structure (50 sections, 7 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Capabilities of the image matching model pre-trained by our framework. Green lines indicate the corresponding pixel locations identified between images. Using the same network weights, the Transformer-based detector-free matcher ELoFTR (wang2024eloftr) exhibits impressive generalization across a wide range of unseen real-world single- and cross-modality matching tasks, benefiting diverse applications in disciplines such as (a) medical image analysis, (b) histopathology, (c) remote sensing, and autonomous systems including (d) UAV positioning, (e) autonomous driving, and more. The figure is best viewed in color and zoomed in.
  • Figure 2: Comparisons on cross-modality tomography image registration datasets. Our trained models are compared with four representative baselines. Parts a and b show the results on the Harvard Brain dataset and the Liver CT-MR dataset, respectively. The upper left of each part shows the success rate (SR) curve under different thresholds, and the lower left shows detailed comparisons with relative improvements on the SR@10 pixels metric. Qualitative comparisons of predicted matches and aligned images are shown on the right side of each part. The matches are colored by match error, where green means the error is less than 5 pixels. For a full table of quantitative comparisons with baselines, see the full-results table in the Extended Data.
  • Figure 3: Results on cross-modality histology registration and retina image registration tasks. Part a shows the results of the histology registration task evaluated on the ANHIR dataset, and Part b shows the results of the retina image registration task on the FIRE dataset. For each part, the upper table shows the quantitative comparisons with state-of-the-art baselines, whereas the lower figure shows the matching and registration results of our trained models.
  • Figure 4: Results on visible-thermal registration tasks. Our trained models are compared with four representative baselines. Parts a, b, c show the results on remote sensing, aerial view, and ground view scenes, respectively. The left column of each part shows the quantitative comparisons with baselines using success rates (SR) over a range of thresholds, as well as detailed comparisons with relative improvements using SR@10 pixels for Part a and SR@10$^\circ$ for Parts b, c. The right column shows the qualitative comparisons with baselines in terms of matching quality and registration error. Green matches have a match error of less than 5 pixels in Part a and an epipolar error of less than $3 \times 10^{-3}$ in Parts b, c. The aligned images and warping errors are shown in Part a, and the pose estimation errors are shown in Parts b, c. For a full table of quantitative comparisons with baselines, see the full-results table in the Extended Data.
  • Figure 5: Results on visible-SAR and visible-vectorized map registration tasks, shown in Parts a and b, respectively. Our trained models are compared with four representative baselines. The left column shows the quantitative comparisons with baselines using success rate (SR) metrics over a range of thresholds, as well as detailed comparisons using SR@10 pixels with the relative improvements of our methods over the baselines. The right column compares the matching quality and the images aligned by the transformations recovered from matches. For Part b, the matches are colored by match error, where green means the error is within 5 pixels. For a full table of quantitative comparisons with baselines, see the full-results table in the Extended Data.
  • ...and 2 more figures
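
The captions above report success rates (SR) at error thresholds, relative improvements over baselines, and matches colored by an epipolar-error threshold. The sketch below shows one way such quantities might be computed. It is not the paper's evaluation code: the per-pair error definition, the use of the symmetric epipolar distance in normalized camera coordinates, and all function names are assumptions made for illustration, and the numbers in the usage lines are made up.

```python
import numpy as np

def success_rate(errors, thresholds):
    """Fraction of image pairs whose registration error is below each threshold."""
    errors = np.asarray(errors, dtype=np.float64)
    return {t: float(np.mean(errors < t)) for t in thresholds}

def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

def symmetric_epipolar_distance(pts0, pts1, E):
    """Symmetric epipolar distance for matches in normalized camera coordinates.

    pts0, pts1: (N, 2) arrays; E: (3, 3) essential matrix with p1^T E p0 = 0.
    """
    pts0 = np.asarray(pts0, dtype=np.float64)
    pts1 = np.asarray(pts1, dtype=np.float64)
    p0 = np.hstack([pts0, np.ones((len(pts0), 1))])
    p1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    Ep0 = p0 @ E.T    # epipolar lines in image 1
    Etp1 = p1 @ E     # epipolar lines in image 0
    residual = np.sum(p1 * Ep0, axis=1)
    return residual**2 * (
        1.0 / (Ep0[:, 0] ** 2 + Ep0[:, 1] ** 2)
        + 1.0 / (Etp1[:, 0] ** 2 + Etp1[:, 1] ** 2)
    )

# Usage with made-up numbers (not results from the paper).
errors_px = [2.0, 8.0, 12.0, 30.0]
sr = success_rate(errors_px, thresholds=[5, 10, 20])   # {5: 0.25, 10: 0.5, 20: 0.75}
gain = relative_improvement(ours=0.5, baseline=0.2)    # about 150 percent

# Pure sideways translation between two cameras: E = [t]_x with t = (1, 0, 0).
E = np.array([[0, 0, 0], [0, 0, -1], [0, 1, 0]], dtype=np.float64)
dist = symmetric_epipolar_distance([[0.1, 0.05]], [[0.6, 0.06]], E)
is_green = dist < 3e-3   # threshold quoted in the Figure 4 caption
```

Reporting SR over a range of thresholds, as the figures do, avoids tying the comparison to a single cutoff; the relative improvement simply normalizes the gap by the baseline value, which is why a weak baseline can yield gains of several hundred percent.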