D2-Net: A Trainable CNN for Joint Detection and Description of Local Features

Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler

TL;DR

D2-Net introduces a trainable CNN that jointly provides dense descriptors and pixel-level detections. Detection is postponed until after the dense descriptor maps have been computed and is performed directly on those maps, using a multiscale image pyramid at test time and soft detection scores during training. Descriptors $\hat{\mathbf{d}}_{ij}$ are obtained by L2-normalizing the activations at each spatial position, and keypoints are detected through a soft scoring mechanism computed across all feature channels. Both are trained with an extended triplet-margin loss that couples descriptor discrimination with repeatable detections. Trained on pixel correspondences from the MegaDepth dataset and using a VGG16 backbone with dilated convolutions at test time, the approach achieves state-of-the-art performance on Aachen Day-Night and InLoc while remaining competitive on image matching and 3D reconstruction benchmarks. The results indicate that describe-and-detect with a shared representation yields robust local features suitable for large-scale SfM and visual localization under challenging illumination and indoor conditions.

Abstract

In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
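The dual role described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `describe_and_detect`, the 3x3 neighbourhood, and the strict-maximum tie handling are assumptions made for illustration. It takes a feature tensor of shape `(n, h, w)`, reads an L2-normalized descriptor at every pixel, and marks a pixel as a detection when its strongest channel is also a spatial local maximum within that channel.

```python
import numpy as np

def describe_and_detect(features):
    """Sketch: use one (n, h, w) feature tensor as both descriptor and detector.

    Returns (descriptors, mask): descriptors has shape (n, h, w) with unit-norm
    channel vectors, mask is a boolean (h, w) map of detections.
    """
    features = np.asarray(features, dtype=float)
    n, h, w = features.shape

    # Descriptor role: the channel vector at each pixel, L2-normalized.
    norms = np.linalg.norm(features, axis=0, keepdims=True)
    descriptors = features / np.maximum(norms, 1e-12)

    # Detector role, step 1: pick the strongest channel at each pixel.
    best_channel = features.argmax(axis=0)
    best_value = features.max(axis=0)

    # Step 2: maximum over the 8 spatial neighbours, computed per channel.
    # Padding with -inf means out-of-image neighbours never win (assumption).
    padded = np.pad(features, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    neighbour_max = np.full((n, h, w), -np.inf)
    for di in range(3):
        for dj in range(3):
            if (di, dj) == (1, 1):
                continue  # skip the centre pixel itself
            neighbour_max = np.maximum(neighbour_max, padded[:, di:di + h, dj:dj + w])

    # A pixel is a detection if its strongest response strictly beats all
    # neighbours in the same channel (strictness is an assumption that
    # avoids firing on flat regions).
    rows, cols = np.indices((h, w))
    mask = best_value > neighbour_max[best_channel, rows, cols]
    return descriptors, mask

# Toy input: two channels, one isolated peak in each.
toy = np.zeros((2, 5, 5))
toy[0, 2, 2] = 5.0
toy[1, 1, 3] = 3.0
toy_descriptors, toy_mask = describe_and_detect(toy)
```

On the toy input, only the two peak positions are marked as detections, and the descriptor at each peak is the unit vector of its dominant channel.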

Paper Structure

This paper contains 18 sections, 13 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Examples of matches obtained by the D2-Net method. The proposed method can find image correspondences even under significant appearance differences caused by strong changes in illumination such as day-to-night, changes in depiction style or under image degradation caused by motion blur.
  • Figure 2: Comparison between different approaches for feature detection and description. Pipeline (a) corresponds to different variants of the two-stage detect-then-describe approach. In contrast, our proposed pipeline (b) uses a single CNN which extracts dense features that serve as both descriptors and detectors.
  • Figure 3: Proposed detect-and-describe (D2) network. A feature extraction CNN $\mathcal{F}$ is used to extract feature maps that play a dual role: (i) local descriptors $\mathbf{d}_{ij}$ are simply obtained by traversing all the $n$ feature maps $D^k$ at a spatial position $(i,j)$; (ii) detections are obtained by performing a non-local-maximum suppression on a feature map followed by a non-maximum suppression across each descriptor. During training, keypoint detection scores $s_{ij}$ are computed from a soft local-maximum score $\alpha$ and a ratio-to-maximum score per descriptor $\beta$.
  • Figure 4: Evaluation on HPatches [HPATCHES] image pairs. For each method, the mean matching accuracy (MMA) as a function of the matching threshold (in pixels) is shown. We also report the mean number of detected features and the mean number of mutual nearest neighbor matches. Our approach achieves the best overall performance after a threshold of $6.5$px, both using a single (SS) and multiple (MS) scales.
  • Figure 5: Evaluation on the Aachen Day-Night dataset [Sattler2012Image, Sattler2017Benchmarking]. We report the percentage of images registered within given error thresholds. Our approach improves upon state-of-the-art methods by a significant margin under strict pose thresholds.
  • ...and 7 more figures
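
The training-time scores named in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the reference code: it follows the definitions in the D2-Net paper as we understand them (a softmax over a spatial neighbourhood for $\alpha$, a ratio to the per-pixel channel maximum for $\beta$, a maximum over channels of their product, then image-level normalization). The function name `soft_detection_scores`, the 3x3 neighbourhood, the zero-padded border handling, and the assumption of non-negative (post-ReLU) features with at least one positive value are ours.

```python
import numpy as np

def soft_detection_scores(features):
    """Sketch: soft detection scores s_ij from an (n, h, w) feature tensor.

    Assumes non-negative features with at least one positive entry.
    """
    features = np.asarray(features, dtype=float)
    n, h, w = features.shape

    # alpha: soft local-maximum score, a softmax over each 3x3 neighbourhood
    # within a channel. The global shift cancels in the ratio and only keeps
    # the exponentials from overflowing.
    exp = np.exp(features - features.max())
    # Zero padding: out-of-image neighbours add nothing to the denominator
    # (border handling is an assumption of this sketch).
    padded = np.pad(exp, ((0, 0), (1, 1), (1, 1)))
    neighbourhood_sum = sum(
        padded[:, di:di + h, dj:dj + w] for di in range(3) for dj in range(3)
    )
    alpha = exp / neighbourhood_sum

    # beta: ratio-to-maximum score across the channels at each pixel.
    channel_max = features.max(axis=0, keepdims=True)
    beta = features / np.maximum(channel_max, 1e-12)

    # Combine both criteria, keep the best channel per pixel, and normalize
    # so that the scores sum to one over the image.
    gamma = (alpha * beta).max(axis=0)
    return gamma / gamma.sum()

# Toy input: flat responses with a single strong peak in channel 0.
toy_features = np.ones((2, 5, 5))
toy_features[0, 2, 2] = 4.0
toy_scores = soft_detection_scores(toy_features)
```

Because every operation here is differentiable almost everywhere, scores of this kind can be used as weights inside a training loss, which is the role the caption assigns to $s_{ij}$.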