R2D2: Repeatable and Reliable Detector and Descriptor

Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, Martin Humenberger

TL;DR

R2D2 tackles the fundamental mismatch between repeatability and discriminativeness in local features by jointly learning a detector, descriptor, and a reliability predictor. It introduces a dense per-pixel descriptor with two confidence maps: repeatability S and reliability R, trained with self-supervised losses (cosine-similarity and peakiness for repeatability; AP-based ranking with κ-weighted reliability for descriptors). The method achieves state-of-the-art performance on HPatches and Aachen Day-Night, particularly benefiting tasks requiring robust matching under viewpoint and illumination changes, while maintaining a compact descriptor size. The approach is validated through extensive ablations and a localization pipeline, demonstrating practical impact for visual localization and 3D reconstruction tasks.

Abstract

Interest point detection and local feature description are fundamental steps in many computer vision applications. Classical methods for these tasks are based on a detect-then-describe paradigm where separate handcrafted methods are used to first identify repeatable keypoints and then represent them with a local descriptor. Neural networks trained with metric learning losses have recently caught up with these techniques, focusing on learning repeatable saliency maps for keypoint detection and learning descriptors at the detected keypoint locations. In this work, we argue that salient regions are not necessarily discriminative, and therefore can harm the performance of the description. Furthermore, we claim that descriptors should be learned only in regions for which matching can be performed with high confidence. We thus propose to jointly learn keypoint detection and description together with a predictor of the local descriptor discriminativeness. This allows us to avoid ambiguous areas and leads to reliable keypoint detections and descriptions. Our detection-and-description approach, trained with self-supervision, can simultaneously output sparse, repeatable and reliable keypoints that outperform state-of-the-art detectors and descriptors on the HPatches dataset. It also establishes a record on the recently released Aachen Day-Night localization dataset.

Paper Structure

This paper contains 20 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Toy examples to illustrate the key difference between repeatability (2nd column) and reliability (3rd column) for a given image. Repeatable regions in the first image are only located near the black triangle, however, all patches containing it are equally reliable. In contrast, all squares in the checkerboard pattern are salient hence repeatable, but none of them is discriminative due to self-similarity. Both confidence maps were estimated by our network.
  • Figure 2: Overview of our network for jointly learning repeatable and reliable matches.
  • Figure 3: Sample repeatability heatmaps obtained when training the repeatability losses $\mathcal{L}_{peaky}$ and $\mathcal{L}_{rep}$ with different patch size $N$. Red and green colors denote low and high values, respectively.
  • Figure 4: MMA@3 and M-score for different patch sizes $N$ on the HPatches dataset, as a function of the number of retained keypoints $K$ per image.
  • Figure 5: For one given input image (1st row), we show the repeatability (2nd row) and reliability heatmaps (3rd row) extracted at a single scale, overlaid onto the original image. Valid keypoints (both repeatable and reliable) are shown as crosses in the first image.
  • ...and 2 more figures
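
The captions above refer to keypoints that are "both repeatable and reliable" and to a budget of $K$ retained keypoints per image. A minimal sketch of one way such a selection could work is given below. It assumes (this excerpt does not confirm it) that candidates are strict local maxima of the repeatability map S and that they are ranked by the product S * R. The function name `select_keypoints` and the `radius` parameter are hypothetical and chosen for illustration only.

```python
import numpy as np

def select_keypoints(S, R, K, radius=1):
    """Pick up to K keypoints at strict local maxima of S, ranked by S * R.

    S, R: float arrays of shape (H, W), the repeatability and reliability maps.
    Assumption: the ranking score is the product S * R (illustrative choice).
    """
    H, W = S.shape
    # Pad with -inf so that border pixels are compared only to real neighbors.
    padded = np.pad(S, radius, mode="constant", constant_values=-np.inf)
    is_max = np.ones((H, W), dtype=bool)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[radius + dy : radius + dy + H,
                              radius + dx : radius + dx + W]
            # Strict comparison: flat plateaus produce no keypoints.
            is_max &= S > neighbor
    ys, xs = np.nonzero(is_max)
    scores = S[ys, xs] * R[ys, xs]
    order = np.argsort(-scores)[:K]  # highest combined score first
    keypoints = np.stack([ys[order], xs[order]], axis=1)  # rows are (y, x)
    return keypoints, scores[order]
```

Under this scoring, a pixel with a high repeatability peak but low reliability (for example, a checkerboard corner as in Figure 1) is ranked below a slightly less salient but more discriminative one, which is the behavior the toy examples are meant to illustrate.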