
Leveraging Spatial Attention and Edge Context for Optimized Feature Selection in Visual Localization

Nanda Febri Istighfarin, HyungGi Jo

TL;DR

This work introduces an attention network that selectively targets informative regions of the image to improve feature selection, and combines its output with edge detection, thereby improving 2D-3D correspondence and overall localization performance.

Abstract

Visual localization determines an agent's precise position and orientation within an environment using visual data. It has become a critical task in robotics, particularly in applications such as autonomous navigation, because an agent's pose can be determined with cost-effective sensors such as RGB cameras. Recent visual localization methods employ scene coordinate regression to determine the agent's pose. However, these methods attempt to regress 2D-3D correspondences across the entire image, even though not all regions provide useful information. To address this issue, we introduce an attention network that selectively targets informative regions of the image. Using this network, we identify the highest-scoring features to improve the feature selection process and combine the result with edge detection. This integration ensures that the features chosen for the training buffer are located within robust regions, thereby improving 2D-3D correspondence and overall localization performance. We evaluated our approach on an outdoor benchmark dataset, where it outperformed previous methods.

Paper Structure

This paper contains 14 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Proposed Network Architecture. We integrate the spatial attention network and edge detector results to ensure that only informative features are included in the training buffer.
  • Figure 2: Overview of the Spatial Attention Network. Our spatial attention network contains two main parts: the spatial attention calculation and the attention integration. Using this network, we choose the top 30% of features to be included in the training buffer.
  • Figure 3: Comparison of features sampled for the training buffer. (a) Features sampled by ACE, (b) features sampled using only the spatial attention network, (c) features sampled using only the edge detector, (d) features sampled using both the spatial attention network and the edge detector. Our proposed method (d) avoids featureless areas that provide no informative features for the network.
  • Figure 4: Comparison of ACE's sampled features and our spatial attention network's sampled features. (a) Features sampled using ACE's method, highlighting missed details. (b) Features identified exclusively by our spatial attention network, capturing areas overlooked by ACE.
  • Figure 5: Cambridge Landmarks dataset. (a) St. Mary's Church, (b) Shop Facade, (c) Old Hospital, and (d) King's College.
  • ...and 3 more figures
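
The selection idea described in the captions for Figures 2 and 3 (keep the top 30% of features by attention score, then restrict them to edge regions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Sobel-based edge detector, the edge threshold of 0.25, the use of an intersection to combine the two masks, and the assumption that the attention map and the image share one resolution are all hypothetical choices made for the example. The summary above does not say which edge detector is used or how the two results are merged.

```python
import numpy as np


def top_fraction_mask(scores, fraction=0.3):
    """Boolean mask marking the highest-scoring `fraction` of entries.

    Ties at the threshold are all kept, so slightly more than
    `fraction` of the entries can be selected.
    """
    k = max(1, int(round(fraction * scores.size)))
    # The k-th largest score serves as the selection threshold.
    threshold = np.partition(scores.ravel(), -k)[-k]
    return scores >= threshold


def sobel_edge_mask(gray, threshold=0.25):
    """Boolean edge mask from the normalized Sobel gradient magnitude.

    Assumption: a plain Sobel filter stands in for the unspecified
    edge detector.
    """
    p = np.pad(gray.astype(np.float64), 1, mode="edge")
    # Horizontal gradient: right column minus left column.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Vertical gradient: bottom row minus top row.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak == 0:
        # A constant image has no edges.
        return np.zeros(magnitude.shape, dtype=bool)
    return magnitude / peak >= threshold


def select_features(attention, gray, fraction=0.3, edge_threshold=0.25):
    """Return (row, col) positions that pass both the attention and edge tests.

    Assumption: the two masks are combined by intersection.
    """
    keep = top_fraction_mask(attention, fraction) & sobel_edge_mask(gray, edge_threshold)
    return np.argwhere(keep)


# Toy inputs: attention increases row by row; the image has one vertical step edge.
attention = np.arange(100, dtype=np.float64).reshape(10, 10)
gray = np.zeros((10, 10))
gray[:, 5:] = 1.0
selected = select_features(attention, gray)
```

In this toy case the attention mask keeps the bottom three rows and the edge mask keeps the two columns adjacent to the step, so only positions in their overlap survive. Featureless areas are discarded even when their attention score is high, which is the behavior the Figure 3 caption attributes to variant (d).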