SuperPoint: Self-Supervised Interest Point Detection and Description

Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich

TL;DR

The paper addresses robust, repeatable interest-point detection and description for multi-view geometry. It introduces SuperPoint, a fully-convolutional network that jointly detects keypoints and computes 256-d descriptors in a single forward pass. Training bootstraps from synthetic data (the MagicPoint detector trained on Synthetic Shapes) and then uses Homographic Adaptation to self-label unlabeled real images, which enables synthetic-to-real transfer. Key contributions are a two-headed, shared-encoder architecture, a self-supervised training pipeline, and state-of-the-art homography estimation on HPatches compared with LIFT, SIFT, and ORB, particularly under illumination changes, at real-time speed (70 FPS on $480\times640$ images with a Titan X GPU). The approach targets learned, end-to-end feature extraction for SLAM, SfM, and image matching.

Abstract

This paper presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT and ORB.
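
As a rough illustration of the Homographic Adaptation idea named in the abstract, the sketch below runs a detector on several randomly warped copies of an image, maps each response back to the original frame, and averages the results per pixel. Everything in it is an assumption made for illustration: `toy_detector` is a stand-in gradient-product response (not MagicPoint), `random_homography` draws small similarity-plus-perspective perturbations, `warp` uses nearest-neighbour sampling, and averaging over the samples that cover each pixel is one plausible aggregation, not necessarily the authors' exact procedure.

```python
import numpy as np

def warp(image, H, out_shape):
    """Warp `image` by homography H (source xy -> destination xy).

    Nearest-neighbour inverse mapping. Returns the warped image and a
    boolean mask of destination pixels that fall inside the source.
    """
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N, homogeneous
    src = np.linalg.inv(H) @ dst
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < image.shape[1]) & (sy >= 0) & (sy < image.shape[0])
    out = np.zeros(h * w, dtype=float)
    out[valid] = image[sy[valid], sx[valid]]
    return out.reshape(h, w), valid.reshape(h, w)

def random_homography(rng, shape):
    """Hypothetical sampler: small rotation, scale, shift, and perspective."""
    h, w = shape
    angle = rng.uniform(-0.3, 0.3)
    scale = rng.uniform(0.9, 1.1)
    tx, ty = rng.uniform(-0.05, 0.05, 2) * np.array([w, h])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    cx, cy = (w - 1) / 2, (h - 1) / 2
    H = np.array([[c, -s, cx - c * cx + s * cy + tx],
                  [s,  c, cy - s * cx - c * cy + ty],
                  [0.0, 0.0, 1.0]])
    # Tiny perspective terms keep the projective scale close to 1.
    H[2, :2] = rng.uniform(-1e-4, 1e-4, 2)
    return H

def toy_detector(image):
    """Stand-in detector (assumption): large where both gradients are large."""
    gy, gx = np.gradient(image.astype(float))
    return np.abs(gx * gy)

def homographic_adaptation(image, detector, num_h=16, seed=0):
    """Average detector responses over random homographies, per pixel."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    acc = detector(image).astype(float)   # identity warp is the first sample
    count = np.ones((h, w))
    for _ in range(num_h - 1):
        H = random_homography(rng, (h, w))
        warped, _ = warp(image, H, (h, w))
        heat = detector(warped)
        # Map the response back to the original frame with the inverse warp.
        unwarped, valid = warp(heat, np.linalg.inv(H), (h, w))
        acc += unwarped
        count += valid
    return acc / count

# Usage: a white square on a black background.
image = np.zeros((64, 64))
image[20:44, 20:44] = 1.0
heatmap = homographic_adaptation(image, toy_detector, num_h=16)
```

Thresholding `heatmap` would give pseudo-ground-truth point labels for the image. In practice a library routine with bilinear interpolation would replace the hand-written `warp`, which is kept here only so the sketch has no dependency beyond NumPy.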

Paper Structure

This paper contains 24 sections, 15 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: SuperPoint for Geometric Correspondences. We present a fully-convolutional neural network that computes SIFT-like 2D interest point locations and descriptors in a single forward pass and runs at 70 FPS on $480\times640$ images with a Titan X GPU.
  • Figure 2: Self-Supervised Training Overview. In our self-supervised approach, we (a) pre-train an initial interest point detector on synthetic data and (b) apply a novel Homographic Adaptation procedure to automatically label images from a target, unlabeled domain. The generated labels are used to (c) train a fully-convolutional network that jointly extracts interest points and descriptors from an image.
  • Figure 3: SuperPoint Decoders. Both decoders operate on a shared and spatially reduced representation of the input. To keep the model fast and easy to train, both decoders use non-learned upsampling to bring the representation back to $\mathbb{R}^{H\times W}$.
  • Figure 4: Synthetic Pre-Training. We use our Synthetic Shapes dataset consisting of rendered triangles, quadrilaterals, lines, cubes, checkerboards, and stars each with ground truth corner locations. The dataset is used to train the MagicPoint convolutional neural network, which is more robust to noise when compared to classical detectors.
  • Figure 5: Homographic Adaptation. Homographic Adaptation is a form of self-supervision for boosting the geometric consistency of an interest point detector trained with convolutional neural networks. The entire procedure is mathematically defined in Equation \ref{eqn:ha}.
  • ...and 10 more figures
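
The Figure 3 caption mentions non-learned upsampling from a spatially reduced representation back to $\mathbb{R}^{H\times W}$. One way this can work for the detector head is a channel softmax followed by a depth-to-space reshape, sketched below. The specifics are assumptions not stated in this excerpt: a cell size of 8, 65 channels per cell (64 pixel positions plus one "no interest point" bin), and row-major ordering of positions within a cell. The function name `detector_head_to_heatmap` is hypothetical.

```python
import numpy as np

def detector_head_to_heatmap(logits, cell=8):
    """Turn coarse logits (cell*cell + 1, Hc, Wc) into a full-resolution
    heatmap (Hc*cell, Wc*cell) with no learned parameters."""
    channels, hc, wc = logits.shape
    if channels != cell * cell + 1:
        raise ValueError("expected cell*cell + 1 channels")
    # Numerically stable softmax over the channel axis.
    z = logits - logits.max(axis=0, keepdims=True)
    prob = np.exp(z)
    prob /= prob.sum(axis=0, keepdims=True)
    # Drop the last channel (the assumed "no interest point" bin).
    prob = prob[:-1]
    # Depth to space: channel k = dy*cell + dx maps to pixel
    # (y*cell + dy, x*cell + dx) of coarse cell (y, x).
    prob = prob.reshape(cell, cell, hc, wc)   # (dy, dx, y, x)
    prob = prob.transpose(2, 0, 3, 1)         # (y, dy, x, dx)
    return prob.reshape(hc * cell, wc * cell)

# Usage: a 3 x 4 grid of cells gives a 24 x 32 heatmap.
logits = np.zeros((65, 3, 4))
logits[10, 1, 2] = 20.0   # strong response at position (dy=1, dx=2) of cell (1, 2)
heatmap = detector_head_to_heatmap(logits)
```

Because the reshape only rearranges values, the upsampling adds no parameters and no interpolation blur. Each cell's probabilities sum to at most 1, with the remainder held by the dropped bin.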