
A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios

Huy-Hoang Bui, Bach-Thuan Bui, Quang-Vinh Tran, Yasuyuki Fujii, Joo-Ho Lee

TL;DR

A-SCoRe tackles robust visual localization with an attention-based scene coordinate regression (SCR) model that operates on descriptor maps, so it can be trained with either dense depth data or sparse SfM models. It combines a CNN-based image encoder with a transformer to produce rich, spatially informed descriptors, which a shared MLP then maps to 3D scene coordinates in both the dense and sparse training regimes. The approach achieves competitive accuracy with substantially fewer parameters and less storage than many state-of-the-art methods, particularly in indoor scenes, while remaining usable across modalities. Although outdoor results lag behind structure-based methods, the design offers practical benefits for mobile robots that need lightweight, multi-modal localization pipelines, and it leaves clear paths for improving speed and robustness in future work.

Abstract

Visual localization is a crucial component of many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven accurate for visual localization, their storage and compute requirements are a burden. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which we argue misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture, called A-SCoRe, an attention-based model that leverages attention at the descriptor-map level to produce meaningful, semantically rich 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-SCoRe to operate in different kinds of environments and conditions and to achieve the level of flexibility that is important for mobile robots. Results show that our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are publicly available in our repository: https://github.com/ais-lab/A-SCoRe.

Paper Structure

This paper contains 14 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overall architecture of the proposed method, A-SCoRe.
  • Figure 2: Comparison of maps built from depth and from SfM. The sparse SfM map clearly omits much information where the number of samples is low (upper part of the stairs). The image shows a textureless region, which causes difficulty for sparse SCR approaches.
  • Figure 3: A-SCoRe shared image encoder. From left to right, an image $\mathbf{I}_i$ passes through the convolutional layers denoted as $\mathcal{F}_c$.
  • Figure 4: Illustration of the dense mode. Each feature in the attention feature map is mapped to a scene coordinate using the MLP network $\Phi$.
  • Figure 5: Illustration of the A-SCoRe sparse mode. The keypoint detector outputs 2D pixel locations, which are used to bilinearly sample descriptors from the attention feature map. If the keypoint detector contains trainable parameters, they are frozen.
  • ...and 2 more figures
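
The captions for Figures 4 and 5 describe two ways of reading descriptors from the attention feature map before the shared MLP $\Phi$ turns them into scene coordinates: every map cell in dense mode, and bilinearly sampled keypoint locations in sparse mode. The following is a minimal pure-Python sketch of that difference, not the authors' implementation. All function names are hypothetical, the feature map is a nested list indexed `[row][col][channel]`, keypoints are assumed to be already scaled to feature-map coordinates, and `linear_head` is a single linear layer standing in for $\Phi$.

```python
from typing import List, Sequence, Tuple

Descriptor = List[float]
FeatureMap = List[List[Descriptor]]  # indexed [row][col][channel]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> Descriptor:
    """Interpolate a descriptor at a continuous location (x = column, y = row).

    Coordinates are clamped to the map border (an assumption of this sketch).
    """
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)                          # top-left neighbour
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # bottom-right neighbour
    ax, ay = x - x0, y - y0                          # fractional offsets
    channels = len(fmap[0][0])
    return [
        (1 - ay) * ((1 - ax) * fmap[y0][x0][c] + ax * fmap[y0][x1][c])
        + ay * ((1 - ax) * fmap[y1][x0][c] + ax * fmap[y1][x1][c])
        for c in range(channels)
    ]


def dense_descriptors(fmap: FeatureMap) -> List[Descriptor]:
    """Dense mode: one descriptor per feature-map cell, in row-major order."""
    return [desc for row in fmap for desc in row]


def sparse_descriptors(
    fmap: FeatureMap, keypoints: Sequence[Tuple[float, float]]
) -> List[Descriptor]:
    """Sparse mode: one bilinearly sampled descriptor per (x, y) keypoint."""
    return [bilinear_sample(fmap, x, y) for x, y in keypoints]


def linear_head(
    desc: Descriptor, weights: Sequence[Sequence[float]], bias: Sequence[float]
) -> List[float]:
    """Hypothetical stand-in for the MLP: one linear layer with 3 outputs (X, Y, Z)."""
    return [sum(w * d for w, d in zip(row, desc)) + b for row, b in zip(weights, bias)]


# Toy 2x2 feature map with a single channel.
fmap = [[[0.0], [1.0]], [[2.0], [3.0]]]
weights, bias = [[1.0], [2.0], [3.0]], [0.0, 0.0, 0.0]

# The same head is applied to descriptors from either mode.
dense_coords = [linear_head(d, weights, bias) for d in dense_descriptors(fmap)]
sparse_coords = [
    linear_head(d, weights, bias) for d in sparse_descriptors(fmap, [(0.5, 0.5)])
]
```

Because both modes end in the same per-descriptor head, the only thing that changes between them in this sketch is which locations are read from the feature map, which is consistent with the summary's description of a shared MLP across the dense and sparse regimes.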