CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

Tangfei Liao, Xiaoqin Zhang, Guobao Xiao, Min Li, Tao Wang, Mang Ye

TL;DR

CorrMAE tackles the cost and data challenges of pre-training for correspondence pruning by introducing masked correspondence reconstruction. It extends Masked Autoencoder with a dual-branch reconstruction mechanism and a bi-level CorrFormer encoder to handle unordered, irregular correspondences, aided by an alignment loss and a task-driven fine-tuning pipeline. The approach yields state-of-the-art gains on downstream tasks such as camera pose estimation, visual localization, and correspondence pruning benchmarks, while remaining data-efficient and transfer-friendly. Overall, CorrMAE provides a practical, plug-and-play pre-training framework that lowers data requirements while improving downstream geometric estimation performance.

Abstract

Pre-training has emerged as a simple yet powerful methodology for representation learning across various domains. However, due to the expensive training cost and limited data, pre-training has not yet been extensively studied in correspondence pruning. To tackle these challenges, we propose a pre-training method that acquires a generic inlier-consistent representation by reconstructing masked correspondences, providing a strong initial representation for downstream tasks. Toward this objective, a modicum of true correspondences naturally serves as input, thus significantly reducing pre-training overhead. In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored for the pre-training of correspondence pruning. CorrMAE involves two main phases, i.e., correspondence learning and matching point reconstruction, and guides the reconstruction of masked correspondences by learning the consistency of the visible correspondences. Herein, we employ a dual-branch structure with an ingenious positional encoding to reconstruct unordered and irregular correspondences. In addition, an encoder with a bi-level design is proposed for correspondence learning, which offers enhanced consistency learning capability and transferability. Extensive experiments show that the model pre-trained with CorrMAE outperforms prior work on multiple challenging benchmarks. Meanwhile, CorrMAE is primarily a task-driven pre-training method, and can achieve notable improvements for downstream tasks by pre-training on the targeted dataset. We hope this work can provide a starting point for correspondence pruning pre-training.
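The masking step that the abstract describes can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `mask_correspondences` and `branch_targets`, the mask ratio of 0.6, and the way the two branches are paired (each branch takes one endpoint of a masked correspondence as a query and the other endpoint as the reconstruction target) are all hypothetical choices made here for clarity.

```python
import numpy as np


def mask_correspondences(corrs, mask_ratio=0.6, seed=0):
    """Randomly split an (N, 4) array of correspondences (x1, y1, x2, y2)
    into a visible subset and a masked subset.

    mask_ratio is a hypothetical value, not one taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = corrs.shape[0]
    num_masked = int(round(n * mask_ratio))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return corrs[visible_idx], corrs[masked_idx], visible_idx, masked_idx


def branch_targets(masked):
    """Assumed dual-branch pairing: each branch gets one endpoint as a
    positional query and must reconstruct the matching endpoint."""
    src, tgt = masked[:, :2], masked[:, 2:]
    branch_a = (src, tgt)  # query: source point, target: matching point
    branch_b = (tgt, src)  # query: target point, target: source point
    return branch_a, branch_b


# Ten toy correspondences in 4D; the encoder would only see `visible`.
corrs = np.arange(40, dtype=float).reshape(10, 4)
visible, masked, visible_idx, masked_idx = mask_correspondences(corrs)
(query_a, target_a), (query_b, target_b) = branch_targets(masked)
```

Because the input is an unordered set rather than a grid of patches, the sketch keeps the coordinates themselves as the only positional information available to a decoder, which is the situation the positional encoding mentioned above has to handle.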

Paper Structure (22 sections, 7 equations, 6 figures, 7 tables)


Figures (6)

  • Figure 1: (a) Comparison of pre-training costs between the conventional method, i.e., the initial correspondence classification task, and our proposed method. Several graph-based correspondence pruning methods [Zhao2021, Dai2022, liu2023ncm, dai2024mgnet] are used as encoders. We report training results averaged by batch size, measured on an NVIDIA Tesla V100 GPU. (b) Comparison of the previous learning paradigm and our pretraining-finetuning paradigm for correspondence pruning. A correspondence is drawn in green if it is an inlier and in red if it is an outlier.
  • Figure 2: The overview of our method.
  • Figure 3: The pipeline of our CorrMAE. Given a set of true correspondences selected by an empirical geometric threshold, CorrMAE aims to obtain inlier representations with strong generalization through the masked correspondence reconstruction task. The design details of each phase of CorrMAE are introduced in Section \ref{sec:corrmae}. Note that we introduce the concepts of source and target images only to distinguish the two branches. The pipeline itself does not take images as input, only true correspondences (4D).
  • Figure 4: Illustration of our proposed CorrFormer encoder. During fine-tuning, we integrate the CorrFormer encoder into the iterative network and employ a pruning strategy [Zhao2021] to maximize its capabilities.
  • Figure 5: Examples of reconstruction results for masked correspondences. The left column shows the original correspondences, the middle column shows the remaining (visible) correspondences, and the right column shows the reconstruction results for the masked correspondences.
  • ...and 1 more figures
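
The Figure 3 caption mentions that true correspondences are selected by an empirical geometric threshold. The source does not state which geometric measure or threshold is used, so the sketch below assumes one common choice: the symmetric squared epipolar distance under a known fundamental (or essential) matrix `F`, with a hypothetical threshold of `1e-4`. The function names are illustrative only.

```python
import numpy as np


def symmetric_epipolar_distance(F, pts1, pts2):
    """Symmetric squared epipolar distance for (N, 2) point arrays.

    F maps points in image 1 to epipolar lines in image 2 (x2^T F x1 = 0).
    """
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])  # homogeneous points in image 1
    x2 = np.hstack([pts2, ones])  # homogeneous points in image 2
    Fx1 = x1 @ F.T                # rows are F x1: epipolar lines in image 2
    Ftx2 = x2 @ F                 # rows are F^T x2: epipolar lines in image 1
    algebraic = np.sum(x2 * Fx1, axis=1) ** 2
    return algebraic * (
        1.0 / (Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2)
        + 1.0 / (Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2)
    )


def select_true_correspondences(F, corrs, threshold=1e-4):
    """Keep the rows of an (N, 4) correspondence array whose distance is
    below the (assumed) threshold."""
    dist = symmetric_epipolar_distance(F, corrs[:, :2], corrs[:, 2:])
    keep = dist < threshold
    return corrs[keep], keep


# Toy geometry: pure translation along x, so matching points share a row (y).
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
corrs = np.array([[0.1, 0.2, 0.5, 0.2],   # consistent with F
                  [0.1, 0.2, 0.5, 0.7]])  # violates the epipolar constraint
inliers, keep = select_true_correspondences(F, corrs)
dist = symmetric_epipolar_distance(F, corrs[:, :2], corrs[:, 2:])
```

In this toy case only the first correspondence survives the threshold, and the surviving 4D rows are what a masked reconstruction pipeline of this kind would take as input.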