DaD: Distilled Reinforcement Learning for Diverse Keypoint Detection

Johan Edstedt, Georg Bökman, Mårten Wadenbäck, Michael Felsberg

TL;DR

DaD tackles descriptor-free keypoint detection for Structure-from-Motion by training a keypoint detector with reinforcement learning and a balanced top-K sampling strategy. It uncovers two emergent detectors—light and dark—and fuses them via point-wise maximum knowledge distillation to form DaD, a diverse, descriptor-free detector. Across MegaDepth1500, ScanNet1500, and HPatches, DaD achieves state-of-the-art performance, especially in few-keypoint scenarios, without relying on SfM tracks or descriptors. The method offers a scalable, self-supervised solution that strengthens two-view reconstruction pipelines while addressing inherent biases in single-type detectors.

Abstract

Keypoints are what enable Structure-from-Motion (SfM) systems to scale to thousands of images. However, designing a keypoint detection objective is a non-trivial task, as SfM is non-differentiable. Typically, an auxiliary objective involving a descriptor is optimized. This, however, induces a dependency on the descriptor, which is undesirable. In this paper we propose a fully self-supervised and descriptor-free objective for keypoint detection, through reinforcement learning. To ensure training does not degenerate, we leverage a balanced top-K sampling strategy. While this already produces competitive models, we find that two qualitatively different types of detectors emerge, which are only able to detect light and dark keypoints respectively. To remedy this, we train a third detector, DaD, that optimizes the Kullback-Leibler divergence of the pointwise maximum of both light and dark detectors. Our approach significantly improves upon SotA across a range of benchmarks. Code and model weights are publicly available at https://github.com/parskatt/dad
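The distillation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`normalize`, `pointwise_max_target`, `kl_divergence`), the renormalization of the pointwise maximum, the direction of the KL divergence, and the toy 6-pixel "image" are all assumptions made for illustration.

```python
import math

def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def pointwise_max_target(p_light, p_dark):
    """Element-wise max of two teacher distributions.

    Renormalizing the result to sum to 1 is an assumption of this sketch.
    """
    return normalize([max(a, b) for a, b in zip(p_light, p_dark)])

def kl_divergence(target, student, eps=1e-12):
    """KL(target || student) over a flattened pixel grid (direction assumed)."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(target, student)
        if t > 0
    )

# Toy 1-D "image" with 6 pixels (hypothetical values).
p_light = normalize([8, 1, 1, 1, 1, 1])  # light teacher: peak at pixel 0
p_dark = normalize([1, 1, 1, 1, 8, 1])   # dark teacher: peak at pixel 4

target = pointwise_max_target(p_light, p_dark)
uniform_student = [1 / 6] * 6

loss_uniform = kl_divergence(target, uniform_student)  # untrained student
loss_perfect = kl_divergence(target, target)           # student equals target
```

In this toy setting the target keeps both teacher peaks (pixels 0 and 4), and the loss falls to zero only when the student reproduces both, which is the behavior the distilled detector is meant to have.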

Paper Structure

This paper contains 46 sections, 4 theorems, 34 equations, 12 figures, 7 tables.

Key Result

Theorem I.5

The set of local maxima of a keypoint distribution $p$ is

Figures (12)

  • Figure 1: Overview of DaD. We present a method to train a keypoint detector that requires neither a descriptor, nor supervision from SfM tracks, yet achieves SotA performance. We use reinforcement learning (\ref{sec:rl}, \ref{sec:reward}, \ref{sec:sampling}) to iteratively improve our detector through a two-view repeatability reward in combination with a simple regularization objective (\ref{sec:regularization}). We find that two types of detectors, which detect only light and dark keypoints respectively, emerge from optimizing the RL objective (\ref{sec:emerge}). This is problematic, as many repeatable keypoints are missed. We tackle this by combining the detectors through point-wise maximum knowledge distillation (\ref{sec:distill}) to a final powerful and diverse keypoint detector, which we call DaD. DaD sets a new state-of-the-art for keypoint detection, as our experiments in \ref{sec:results} show.
  • Figure 2: Qualitative example of DaD keypoint detections. We find a fundamental issue with previous rotation-invariant self-supervised detectors, which lets them detect only light or only dark keypoints (see \ref{sec:emerge} for details). We remedy this through point-wise maximum knowledge distillation (see \ref{sec:distill}). As can be seen in the figure, our approach has no such issue; see, e.g., the light keypoints on the cross (left zoom-in) and the dark keypoints on the building edge (right zoom-in). A qualitative comparison with previous self-supervised detectors (where this issue occurs) is presented in \ref{suppl:qualitative-comparison}.
  • Figure 3: Light vs dark keypoints. Detectors can significantly increase the expected reward by choosing either dark keypoints, where the pixel intensity is low, or light keypoints, where the pixel intensity is high. It turns out that either of these choices produces approximately the same expected reward. However, we argue that this is an undesirable property, e.g., because inversions can occur naturally due to day-night changes, or because certain images may be dominated by either dark or light keypoints.
  • Figure 4: Enforcing rotation invariance causes light/dark keypoint detectors also in ALIKED. Top: Detections of ALIKED trained on upright images. Bottom: Detections of ALIKED trained with rotation augmentation. Remarkably, we observe that enforcing rotation invariance is what causes the emergence of light/dark keypoint detectors, and that this holds also for ALIKED, which uses a different objective and architecture than ours.
  • Figure 5: Why max is good. Given two detectors $p$ and $q$, we would like their ensemble to retain the original keypoints. While averaging or multiplying the distributions typically changes the shape and locations of the keypoints, the max operation preserves the peaks significantly better.
  • ...and 7 more figures
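
The intuition in the Figure 5 caption can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the paper's experiment: the two detectors are modeled as 1-D Gaussian bumps on a 9-pixel grid with peaks at pixels 3 and 5, and the helper names (`gaussian_bump`, `local_maxima`) and the width `sigma=1.5` are hypothetical choices.

```python
import math

def gaussian_bump(center, sigma, size):
    """Normalized 1-D Gaussian-shaped distribution on pixels 0..size-1."""
    raw = [math.exp(-((x - center) ** 2) / (2 * sigma ** 2)) for x in range(size)]
    total = sum(raw)
    return [r / total for r in raw]

def local_maxima(values):
    """Indices of strict interior local maxima."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]

# Two hypothetical detectors with nearby but distinct keypoints.
p = gaussian_bump(center=3, sigma=1.5, size=9)
q = gaussian_bump(center=5, sigma=1.5, size=9)

mean_ensemble = [(a + b) / 2 for a, b in zip(p, q)]
product_ensemble = [a * b for a, b in zip(p, q)]
max_ensemble = [max(a, b) for a, b in zip(p, q)]

print(local_maxima(mean_ensemble))     # [4]: the two peaks merge and shift
print(local_maxima(product_ensemble))  # [4]: a single peak between the originals
print(local_maxima(max_ensemble))      # [3, 5]: both original peaks survive
```

Only the pointwise maximum keeps a peak at each original keypoint location; averaging and multiplying both collapse the pair into a single peak at pixel 4, where neither detector fired most strongly.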

Theorems & Definitions (13)

  • Definition I.1: Keypoint
  • Definition I.2: Subsumed keypoint
  • Definition I.3: Partner keypoints
  • Definition I.4: Keypoint distribution
  • Theorem I.5
  • proof
  • Definition I.6: Partner distributions
  • Theorem I.7: No extra maxima
  • proof
  • Theorem I.8
  • ...and 3 more