D2S: Representing sparse descriptors and 3D coordinates for camera relocalization

Bach-Thuan Bui, Huy-Hoang Bui, Dinh-Tuan Tran, Joo-Ho Lee

TL;DR

This study proposes a direct learning-based approach that uses a simple network named D2S to represent complex local descriptors and their scene coordinates. It demonstrates the ability to generalize beyond training data, including in scenarios involving day-to-night transitions and domain shifts.

Abstract

State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors against 3D point clouds. However, these procedures can incur significant costs in inference, storage, and updates over time. In this study, we propose a direct learning-based approach that uses a simple network named D2S to represent complex local descriptors and their scene coordinates. Our method is simple and cost-effective: it requires only a single RGB image for localization at test time and only a lightweight model to encode a complex sparse scene. D2S combines a simple loss function with graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform binary-semantic classification of sparse descriptors. Additionally, we propose a simple outdoor dataset to evaluate the capabilities of visual localization methods in scene-specific generalization and self-updating from unlabeled observations. Our approach outperforms previous regression-based methods in both indoor and outdoor environments. It demonstrates the ability to generalize beyond training data, including in scenarios involving day-to-night transitions and domain shifts. The source code, trained models, dataset, and demo videos are available at https://thpjp.github.io/d2s.

Paper Structure (18 sections, 11 equations, 8 figures, 4 tables, 1 algorithm)

Figures (8)

  • Figure 1: D2S learning and inference pipeline. D2S learns from labeled descriptors and can self-update with unlabeled ones. At test time, D2S receives a set of local descriptors and generates robust 3D coordinates for highly accurate re-localization.
  • Figure 2: D2S architecture. The proposed D2S pipeline directly learns to represent 3D coordinates from sparse keypoint descriptors. It comprises a scene-agnostic feature extractor that extracts sparse descriptors from images, followed by a learnable scene-specific module that regresses their 3D coordinates and reliability scores using multi-layer graph attention and a shared MLP.
  • Figure 3: Self-Attention Results. The visualization shows the attention scores $\alpha_{ij}$ at graph layer 5. It demonstrates that D2S has succeeded in learning to focus on reliable features while disregarding the uncertain features.
  • Figure 4: Reliability Prediction Results. Red points denote descriptors with low predicted reliability for localization, whereas green points denote those predicted to be robust.
  • Figure 5: Proposed challenging Ritsumeikan BKC dataset. The dataset consists of different image sequences used for training (daytime), testing (indicated in blue, with green for high domain shifts and cyan for nighttime), and unlabeled sequences for self-supervision.
  • ...and 3 more figures
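
The scene-specific module described in the Figure 2 caption (graph attention over sparse descriptors, followed by a shared head that outputs a 3D coordinate and a reliability score per descriptor) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention layer with standard scaled dot-product scores, a fully connected graph, a single linear layer standing in for the shared MLP, and random untrained weights. All function names and dimensions below are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(descs, Wq, Wk, Wv):
    """One single-head self-attention layer over all descriptors.

    Returns the updated descriptors and the attention weights alpha[i][j]
    (assumed here to be scaled dot-product scores passed through softmax).
    """
    q = [matvec(Wq, d) for d in descs]
    k = [matvec(Wk, d) for d in descs]
    v = [matvec(Wv, d) for d in descs]
    scale = math.sqrt(len(q[0]))
    n = len(descs)
    updated, alphas = [], []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / scale for j in range(n)]
        alpha = softmax(scores)
        # Weighted sum of values, added back to the descriptor (residual).
        msg = [sum(alpha[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        updated.append([d + m for d, m in zip(descs[i], msg)])
        alphas.append(alpha)
    return updated, alphas


def shared_head(feat, W_out):
    """Shared linear head: three coordinate outputs plus one reliability logit."""
    x, y, z, logit = matvec(W_out, feat)
    reliability = 1.0 / (1.0 + math.exp(-logit))  # squash to (0, 1)
    return (x, y, z), reliability


def rand_matrix(rows, cols, rng):
    """Random weights, for illustration only (no training happens here)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, n = 8, 5  # hypothetical descriptor size and keypoint count
descs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
Wq, Wk, Wv = (rand_matrix(dim, dim, rng) for _ in range(3))
W_out = rand_matrix(4, dim, rng)

feats, alphas = self_attention(descs, Wq, Wk, Wv)
preds = [shared_head(f, W_out) for f in feats]
for (xyz, rel) in preds:
    print([round(c, 3) for c in xyz], round(rel, 3))
```

The same head weights are applied to every descriptor, which is what "shared" means in this sketch. In a trained model, descriptors with a low reliability score would presumably be discarded before pose estimation; that filtering step is not shown.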