Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching

Hongkai Chen, Zixin Luo, Yurun Tian, Xuyang Bai, Ziyu Wang, Lei Zhou, Mingmin Zhen, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan

TL;DR

This work tackles robust semi-dense image matching under cross-view deformations by introducing AffineFormer, a Transformer-based matcher with affine-based deformable local attention and selective global-local fusion. It combines an intermediate flow regression with a piece-wise affine field that shapes local sampling, so that cross-view updates follow the estimated geometry, and adds a spatial softmax loss to enforce spatial consistency. The full model achieves state-of-the-art performance among semi-dense methods at a cost similar to LoFTR, while a slim variant reaches the LoFTR baseline's performance with only about 15% of the computation cost and 18% of the parameters. Experiments on two-view pose estimation and visual localization cover both indoor and outdoor datasets.

Abstract

Identifying robust and accurate correspondences across images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformers. In this paper, we propose several improvements upon this paradigm. Firstly, we introduce affine-based local attention to model cross-view deformations. Secondly, we present selective fusion to merge local and global messages from cross attention. Apart from network structure, we also identify the importance of enforcing spatial smoothness in loss design, which has been omitted by previous works. Based on these augmentations, our network demonstrates strong matching capacity under different settings. The full version of our network achieves state-of-the-art performance among semi-dense matching methods at a similar cost to LoFTR, while the slim version reaches the LoFTR baseline's performance with only 15% of the computation cost and 18% of the parameters.
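
To make the idea of affine-based local sampling more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a local affine map can be obtained by a least-squares fit to flow correspondences inside one patch, and that this map is then used to project a small window of source positions into the target image, where tokens would be sampled for local attention. The function names `fit_local_affine` and `project_window`, the window radius, and the least-squares fit itself are illustrative assumptions.

```python
import numpy as np


def fit_local_affine(src_pts, dst_pts):
    """Least-squares affine map dst ~ A @ src + t from N >= 3 point pairs.

    src_pts, dst_pts: arrays of shape (N, 2) holding (x, y) positions.
    Hypothetical helper; the paper's estimator may differ.
    """
    ones = np.ones((src_pts.shape[0], 1))
    design = np.hstack([src_pts, ones])                 # (N, 3)
    params, *_ = np.linalg.lstsq(design, dst_pts, rcond=None)  # (3, 2)
    A = params[:2].T                                    # (2, 2) linear part
    t = params[2]                                       # (2,) translation
    return A, t


def project_window(center, A, t, radius=1):
    """Project a (2r+1) x (2r+1) window around `center` into the target image.

    Returns an array of shape ((2r+1)**2, 2) with target-image positions.
    """
    offsets = np.arange(-radius, radius + 1)
    dx, dy = np.meshgrid(offsets, offsets)
    window = np.stack([dx.ravel(), dy.ravel()], axis=1) + center  # source grid
    return window @ A.T + t


if __name__ == "__main__":
    # Synthetic patch: source grid points and their flow-induced targets.
    A_true = np.array([[1.2, 0.1], [-0.2, 0.9]])
    t_true = np.array([3.0, -1.0])
    xs, ys = np.meshgrid(np.arange(8.0), np.arange(8.0))
    src = np.stack([xs.ravel(), ys.ravel()], axis=1)
    dst = src @ A_true.T + t_true

    A, t = fit_local_affine(src, dst)
    samples = project_window(np.array([4.0, 4.0]), A, t, radius=1)
    print(samples.shape)
```

In this sketch the sampling window is sheared and scaled by the fitted map rather than copied as an axis-aligned square, which is the behaviour the abstract attributes to modelling cross-view deformations. In a real network the projected positions would feed a differentiable sampler such as bilinear interpolation.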

Paper Structure

This paper contains 21 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Visualization of the proposed deformable attention. Through piece-wise deformation estimation, we project source patches (left) to the target image (right) to sample tokens in local attention.
  • Figure 2: The overall structure of our proposed network. The network adopts iterative global-local attention operations to pass cross-view messages at both global and local scales. After identifying coarse level matches at $1/8$ resolution, a convolution refiner follows to predict correspondence residuals.
  • Figure 3: Visualization of image matches from LoFTR (Sun et al., 2021), ASpanFormer (Chen et al., 2022) and our method, where green lines represent inlier matches and red lines represent outlier matches.
  • Figure 4: Heatmap for selective local fusion score.