MINIMA: Modality Invariant Image Matching

Jiangwei Ren, Xingyu Jiang, Zizhuo Li, Dingkang Liang, Xin Zhou, Xiang Bai

TL;DR

MINIMA tackles cross-view and cross-modality image matching by decoupling model learning from data scarcity through a data engine that synthesizes a large multimodal dataset, MD-syn, from RGB data. The method pre-trains on RGB data and then fine-tunes on randomly selected cross-modal pairs using three backbone matchers to produce variants like MINIMA_LG, MINIMA_LoFTR, and MINIMA_RoMa. Across 19 cross-modal cases and both in-domain and zero-shot settings, MINIMA demonstrates substantial improvements over baselines and modality-specific approaches, validating the scalability and generalization of the approach. The work provides a practical, data-centric path to universal multimodal image matching and releases both the MD-syn dataset and code for community use.

Abstract

Image matching across both views and modalities plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, which leads to poor generalization. In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, MINIMA aims to enhance universal performance from the perspective of scaling up data. For this purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and zero-shot matching tasks, including $19$ cross-modal cases, demonstrate that MINIMA can significantly outperform the baselines and even surpass modality-specific methods. The dataset and code are available at https://github.com/LSXI7/MINIMA.
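The idea of training on randomly selected modality pairs while inheriting the RGB labels can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the released MINIMA code: the modality names, the `RGBPair` container, the `generated_path` file-naming scheme, and the choice to always draw two distinct modalities are all hypothetical.

```python
import os
import random
from dataclasses import dataclass

# Hypothetical modality list; the actual set in MD-syn may differ.
MODALITIES = ["rgb", "infrared", "depth"]


@dataclass(frozen=True)
class RGBPair:
    """One RGB training pair plus its matching supervision (hypothetical layout)."""
    image_a: str
    image_b: str
    label: str  # stand-in for the matching labels (e.g. a path to them)


def generated_path(rgb_path: str, modality: str) -> str:
    """Map an RGB image path to its generated counterpart (assumed naming scheme)."""
    if modality == "rgb":
        return rgb_path
    stem, ext = os.path.splitext(rgb_path)
    return f"{stem}_{modality}{ext}"


def sample_cross_modal_pair(pair: RGBPair, rng: random.Random) -> dict:
    """Pick two distinct modalities at random and reuse the RGB pair's label.

    The label is reused unchanged on the assumption that each generated
    image stays pixel-aligned with the RGB image it was produced from.
    """
    mod_a, mod_b = rng.sample(MODALITIES, 2)
    return {
        "image_a": generated_path(pair.image_a, mod_a),
        "image_b": generated_path(pair.image_b, mod_b),
        "label": pair.label,
        "modalities": (mod_a, mod_b),
    }
```

A training loop under these assumptions would call `sample_cross_modal_pair` once per RGB pair per iteration, so the same labels supervise many different modality combinations.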

Paper Structure (30 sections, 1 equation, 11 figures, 12 tables)

Figures (11)

  • Figure 1: Overall Image Matching Accuracy and Efficiency on Six Datasets of Real Cross-modal Image Pairs. AUC of the pose error (@$10^\circ$) or reprojection error (@$10$px) is used for accuracy evaluation, while Pairs Per Second is used for efficiency test. Left: AUCs on each dataset of representative methods are reported. Right: average performance is summarized, wherein different colors indicate matching pipelines of sparse, semi-dense, and dense matching, while our MINIMA is marked as $\bigstar$. Using only synthetic multimodal data created by our data engine, MINIMA can generalize to real cross-modal scenes with large improvements.
  • Figure 2: Qualitative Results on Real Cross-modal Image Pairs. Our methods MINIMA$_\mathrm{LG}$ (sparse) and MINIMA$_\mathrm{RoMa}$ (dense) are compared with the sparse matching pipelines ReDFeat deng2022redfeat and OmniGlue jiang2024omniglue, and the semi-dense matcher XoFTR tuzcuouglu2024xoftr. ReDFeat and XoFTR are cross-modal methods, and OmniGlue is known for its generalization ability. Matches generated by each method are drawn, where red lines indicate matches whose epipolar error (pose) or projection error (homography) exceeds $5\times 10^{-4}$ or $3$ pixels, respectively. Details are recorded in the top-left of each image pair, including the geometric errors obtained with default RANSAC estimation and the (# correct match / # match).
  • Figure 3: Overview of the Proposed MINIMA Pipeline: Trained Once to Handle Any Cross-modal Matching Task. The Data Engine generates a large multimodal matching dataset, which supports training matching models to obtain cross-modal ability.
  • Figure A1: Visualization Results of Infrared Generation on MSRS. The first two columns are real RGB and Infrared images.
  • Figure A2: Visualization Results of Infrared Generation on M3FD. The first two columns are real RGB and Infrared images.
  • ...and 6 more figures
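
The Figure 1 caption reports accuracy as the AUC of the pose error (@$10^\circ$) or reprojection error (@$10$px). The exact evaluation protocol is not given in this excerpt, so the following is only a sketch of one convention common in the image matching literature: the area under the cumulative error curve up to the threshold, normalized by the threshold. The function name `error_auc` is hypothetical.

```python
from typing import Sequence


def error_auc(errors: Sequence[float], threshold: float) -> float:
    """Normalized area under the cumulative recall-vs-error curve on [0, threshold].

    Assumed convention: recall at error e is the fraction of pairs with
    error <= e, and the curve is integrated with the trapezoidal rule.
    """
    if len(errors) == 0:
        raise ValueError("errors must not be empty")
    if threshold <= 0:
        raise ValueError("threshold must be positive")

    sorted_errors = sorted(errors)
    n = len(sorted_errors)

    # Curve points start at (0, 0) and include every error below the threshold.
    xs = [0.0]
    ys = [0.0]
    for i, err in enumerate(sorted_errors):
        if err >= threshold:
            break
        xs.append(float(err))
        ys.append((i + 1) / n)

    # Extend the curve flat to the threshold.
    xs.append(float(threshold))
    ys.append(ys[-1])

    # Trapezoidal integration, then normalization so a perfect score is 1.0.
    area = sum(
        (xs[k + 1] - xs[k]) * (ys[k + 1] + ys[k]) / 2.0
        for k in range(len(xs) - 1)
    )
    return area / threshold
```

Under this convention, `error_auc(pose_errors_deg, 10.0)` would correspond to an AUC@$10^\circ$ score and `error_auc(reproj_errors_px, 10.0)` to an AUC@$10$px score, where both input lists are hypothetical per-pair error values.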