D2-Net: A Trainable CNN for Joint Detection and Description of Local Features
Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler
TL;DR
D2-Net is a trainable CNN that produces dense descriptors and pixel-level keypoint detections from a single set of feature maps: instead of detecting keypoints first on low-level image structure, it postpones detection until the descriptor maps have been computed. Each descriptor \hat{d}_{ij} is obtained by L2-normalizing the channel activations at position (i, j), and keypoints are selected from the same maps through a soft detection score computed across channels, with a multiscale pyramid to handle scale changes. Training uses an extended triplet margin loss that couples descriptor discrimination with detection repeatability. The network is trained on pixel correspondences from the MegaDepth dataset and uses a VGG16 backbone, with dilated convolutions at test time. It achieves state-of-the-art performance on Aachen Day-Night and InLoc while remaining competitive on image matching and 3D reconstruction benchmarks. The results indicate that a describe-and-detect approach with a shared representation yields robust, scalable local features suited to large-scale SfM and visual localization under challenging illumination and indoor conditions.
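To make the describe-and-detect idea concrete, here is a minimal sketch in plain Python of how descriptors and soft detection scores could both be read off one feature map. The summary above states only that descriptors are L2-normalized activations and that detection is a soft, multi-channel score. The specific scoring used here (a spatial soft local-max over a 3x3 neighborhood, multiplied by a ratio-to-max across channels, then normalized over the image) is an assumption based on one reading of the method and is not the authors' implementation. The function names `l2_normalize` and `soft_detection_scores` are hypothetical, and activations are assumed to be non-negative, as after a ReLU.

```python
import math


def l2_normalize(feature_map):
    """Return unit-length descriptors; feature_map[i][j] is a list of channel activations."""
    out = []
    for row in feature_map:
        out_row = []
        for d in row:
            norm = math.sqrt(sum(v * v for v in d))
            # Leave all-zero activation vectors as zeros to avoid dividing by zero.
            out_row.append([v / norm if norm > 0 else 0.0 for v in d])
        out.append(out_row)
    return out


def soft_detection_scores(feature_map):
    """Assumed soft detection score per pixel, normalized to sum to 1 over the image."""
    h, w, n = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    gamma = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            d = feature_map[i][j]
            channel_max = max(d)
            if channel_max <= 0:
                continue  # no positive activation at this pixel, score stays 0
            best = 0.0
            for k in range(n):
                # Spatial soft local-max: softmax weight of (i, j) within its
                # 3x3 neighborhood in channel k (neighborhood clipped at borders).
                # Note: exp() of raw activations can overflow for large values;
                # a real implementation would rescale the activations first.
                denom = sum(
                    math.exp(feature_map[a][b][k])
                    for a in range(max(0, i - 1), min(h, i + 2))
                    for b in range(max(0, j - 1), min(w, j + 2))
                )
                alpha = math.exp(d[k]) / denom
                # Channel-wise ratio-to-max: how dominant channel k is at (i, j).
                beta = d[k] / channel_max
                best = max(best, alpha * beta)
            gamma[i][j] = best
    total = sum(sum(row) for row in gamma)
    if total == 0:
        return gamma
    return [[g / total for g in row] for row in gamma]


# Toy 3x3 feature map with 2 channels and one strong response at the center.
fmap = [[[0.1, 0.1] for _ in range(3)] for _ in range(3)]
fmap[1][1] = [2.0, 0.1]

descriptors = l2_normalize(fmap)
scores = soft_detection_scores(fmap)
```

In this toy input the center pixel is both a spatial peak and dominated by a single channel, so it receives the largest score. Because the same activations feed `l2_normalize` and `soft_detection_scores`, detection and description share one representation, which is the point of postponing detection.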
Abstract
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
