XFeat: Accelerated Features for Lightweight Image Matching

Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, Erickson R. Nascimento

TL;DR

XFeat addresses the need for fast, robust visual matching on resource-constrained devices by proposing a featherweight CNN with a separate keypoint detection branch and a lightweight semi-dense matching refinement. The architecture maintains high input resolution through an efficient downsampling strategy and produces a dense 64-D descriptor map $\mathbf{F}$ and a reliability map $\mathbf{R}$, enabling both sparse and semi-dense matching, with a coarse-to-fine refinement that predicts pixel-level offsets. Training combines descriptor learning via dual-softmax, reliability supervision, offset refinement supervision, and teacher-based keypoint distillation, controlled by a weighted loss $\mathcal{L} = \alpha \mathcal{L}_{ds} + \beta \mathcal{L}_{rel} + \gamma \mathcal{L}_{fine} + \delta \mathcal{L}_{kp}$. Empirically, XFeat achieves up to $5\times$ faster inference than lightweight baselines while maintaining competitive accuracy across relative pose estimation, homography, and visual localization, and runs in real-time on CPU-only hardware without specialized optimizations, enabling practical deployment in AR and mobile robotics.
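The dual-softmax descriptor term and the weighted sum above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the common formulation in which the loss is the negative log-likelihood of the ground-truth pairs under both a row-wise and a column-wise softmax of the descriptor similarity matrix. The function names and the default weights of 1.0 are hypothetical placeholders.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def dual_softmax_loss(desc_a, desc_b):
    """Mean negative log-likelihood of the true pairs in both directions.

    Assumption: desc_a[i] and desc_b[i] describe the same scene point,
    so the correct matches lie on the diagonal of the similarity matrix.
    """
    n = len(desc_a)
    # Similarity matrix: sim[i][j] = dot(desc_a[i], desc_b[j]).
    sim = [[sum(x * y for x, y in zip(a, b)) for b in desc_b] for a in desc_a]
    sim_t = [list(col) for col in zip(*sim)]
    loss = 0.0
    for i in range(n):
        loss -= log_softmax(sim[i])[i]    # direction a_i -> b_i (rows)
        loss -= log_softmax(sim_t[i])[i]  # direction b_i -> a_i (columns)
    return loss / n


def total_loss(l_ds, l_rel, l_fine, l_kp,
               alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Weighted sum of the four terms; the default weights are placeholders."""
    return alpha * l_ds + beta * l_rel + gamma * l_fine + delta * l_kp


if __name__ == "__main__":
    descriptors = [[5.0, 0.0], [0.0, 5.0]]
    print(dual_softmax_loss(descriptors, descriptors))        # correct pairing
    print(dual_softmax_loss(descriptors, descriptors[::-1]))  # swapped pairing
```

Correctly paired descriptors give a loss near zero, while swapping the pairing makes the loss large, which is the behavior the descriptor term is meant to reward.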

Abstract

We introduce a lightweight and accurate architecture for resource-efficient visual correspondence. Our method, dubbed XFeat (Accelerated Features), revisits fundamental design choices in convolutional neural networks for detecting, extracting, and matching local features. Our new model satisfies a critical need for fast and robust algorithms suitable to resource-limited devices. In particular, accurate image matching requires sufficiently large image resolutions - for this reason, we keep the resolution as large as possible while limiting the number of channels in the network. Besides, our model is designed to offer the choice of matching at the sparse or semi-dense levels, each of which may be more suitable for different downstream applications, such as visual navigation and augmented reality. Our model is the first to offer semi-dense matching efficiently, leveraging a novel match refinement module that relies on coarse local descriptors. XFeat is versatile and hardware-independent, surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy, proven in pose estimation and visual localization. We showcase it running in real-time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at www.verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24.
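The trade-off between resolution and channel count described in the abstract can be made concrete with the standard cost estimate for a convolution layer: roughly $H \cdot W \cdot C_{in} \cdot C_{out} \cdot k^2$ multiply-accumulates. The sketch below uses that estimate; the function name and the channel counts are illustrative assumptions, not the actual XFeat configuration.

```python
def conv_cost(height, width, c_in, c_out, kernel=3):
    """Approximate multiply-accumulate count of one k x k convolution.

    Assumes stride 1 and a same-size output, and ignores bias terms.
    """
    return height * width * c_in * c_out * kernel * kernel


if __name__ == "__main__":
    # Illustrative layers at VGA resolution (480 x 640); not XFeat's real widths.
    wide = conv_cost(480, 640, 32, 32)
    narrow = conv_cost(480, 640, 8, 8)
    print("cost ratio wide/narrow at full resolution:", wide // narrow)

    # Halving the resolution frees enough budget to double both channel counts.
    print(conv_cost(480, 640, 8, 8) == conv_cost(240, 320, 16, 16))
```

Because the cost grows with the product of input and output channels, keeping early high-resolution layers narrow and widening only after downsampling holds the per-layer cost roughly constant.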

Paper Structure (31 sections, 7 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: In XFeat, accuracy meets efficiency. XFeat delivers a great trade-off between speed and relative pose estimation accuracy on the Megadepth-1500 dataset, as evidenced by the Pareto-frontier curve in orange. Its lightweight architecture enables real-time feature extraction in GPU-free settings and on resource-constrained devices without hardware-specific optimizations. Inference speed is measured on a budget-friendly laptop (Intel(R) i5-1135G7 @ 2.40GHz CPU) at VGA resolution. $^*$ denotes semi-dense extraction.
  • Figure 2: Sparse (top) and semi-dense (bottom) matching. XFeat stands out with its dual ability to perform both sparse and semi-dense matching, providing fast features for a wide range of applications from visual localization with sparse matches to pose estimation and 3D reconstruction where denser correspondences deliver additional constraints and a more complete representation.
  • Figure 3: Accelerated feature extraction network architecture. XFeat extracts a keypoint heatmap $\mathbf{K}$, a compact 64-D dense descriptor map $\mathbf{F}$, and a reliability heatmap $\mathbf{R}$. It achieves unparalleled speed via early downsampling and shallow convolutions, followed by deeper convolutions in later encoders for robustness. Contrary to typical methods, it separates keypoint detection into a distinct branch, using $1 \times 1$ convolutions on an $8 \times 8$ tensor-block-transformed image for fast processing.
  • Figure 4: Match refinement module for dense matching setting. This module learns to predict pixel-level offsets by only considering as input pairs of nearest neighbors from the original coarse-level features at $1/8$ of original spatial resolution, significantly saving memory and compute.
  • Figure 5: Qualitative results on Megadepth-1500. XFeat$^{*}$ and XFeat demonstrate exceptional robustness against variations in viewpoint and illumination. This is especially evident in challenging scenarios where heavy methods like DISK$^{*}$ break, while XFeat$^{*}$ provides accurate relative poses $16\times$ faster in semi-dense settings with a comparable number of local features.
  • ...and 4 more figures
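
The caption of Figure 3 mentions an $8 \times 8$ tensor-block transform ahead of the keypoint branch. A minimal sketch of such a transform follows, assuming it is a space-to-depth rearrangement in which every non-overlapping $8 \times 8$ cell of a grayscale image is flattened into 64 channels. The function name `blocks_to_channels` and the row-major ordering inside each cell are illustrative assumptions.

```python
def blocks_to_channels(image, block=8):
    """Rearrange an H x W image (list of rows) into an (H/block) x (W/block)
    grid in which every cell holds the block*block pixels it covers."""
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image sides must be multiples of the block size")
    grid = []
    for top in range(0, height, block):
        row_cells = []
        for left in range(0, width, block):
            # Flatten one block x block cell in row-major order.
            cell = [image[top + dy][left + dx]
                    for dy in range(block) for dx in range(block)]
            row_cells.append(cell)
        grid.append(row_cells)
    return grid


if __name__ == "__main__":
    # Toy 16 x 24 image whose pixel value encodes its position.
    image = [[y * 24 + x for x in range(24)] for y in range(16)]
    grid = blocks_to_channels(image)
    print(len(grid), len(grid[0]), len(grid[0][0]))  # grid rows, grid cols, channels
```

After this rearrangement, a $1 \times 1$ convolution sees all 64 pixels of a cell at a single spatial location, so the branch can work at $1/8$ resolution without discarding any input pixels.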