Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU

Yicheng Lin, Yunlong Jiang, Xujia Jiao, Bin Han

TL;DR

This work tackles long-term visual localization under appearance changes with a hierarchical framework that combines real-time handcrafted feature tracking for relative pose with learned keypoints, extracted only on selected keyframes, for absolute pose. A unified extraction module gives different learned keypoint networks the same input and output interface, while a hierarchical pose optimization fuses handcrafted and learned observations within a local map to correct accumulated error, all on a CPU. The approach improves global localization accuracy across seasonal changes while remaining practical on a CPU, with a reported 47% average error reduction under photometric variation. The results suggest that the strategy is robust and broadly applicable to industrial robotics, and they point toward unified feature representations and more efficient learned-keypoint architectures.

Abstract

Robust long-term visual localization in complex industrial environments is critical for mobile robotic systems. Existing approaches face limitations: handcrafted features are illumination-sensitive, learned features are computationally intensive, and semantic- or marker-based methods are environmentally constrained. Handcrafted and learned features share similar representations but differ functionally. Handcrafted features are optimized for continuous tracking, while learned features excel in wide-baseline matching. Their complementarity calls for integration rather than replacement. Building on this, we propose a hierarchical localization framework. It leverages real-time handcrafted feature extraction for relative pose estimation. In parallel, it employs selective learned keypoint detection on optimized keyframes for absolute positioning. This design enables CPU-efficient, long-term visual localization. Experiments proceed in three validation phases. The first establishes feature complementarity through comparative analysis. The second profiles computational latency across algorithm stages on CPU platforms. The final evaluation under photometric variations (including seasonal transitions and diurnal cycles) demonstrates a 47% average error reduction with significantly improved localization consistency. The code implementation is publicly available at https://github.com/linyicheng1/ORB_SLAM3_localization.

Paper Structure

This paper contains 19 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An intuitive comparison of handcrafted and learned features. (a) shows the matching results of ORB features in a tunnel with repetitive textures, while (b) shows the matching results of D2-Net under lighting variations. Features that are easy to track help maintain stable localization across consecutive frames, while features that are easy to match enable robust matching over long-term lighting changes.
  • Figure 2: Hierarchical Pipeline of Visual Localization. Visual localization encompasses two primary phases: mapping and positioning. Initially, a conventional Structure-from-Motion (SfM) pipeline is employed to construct a learning-based feature map. Subsequently, multi-condition image sequences captured under varying seasonal and weather conditions are utilized for localization. Within the hierarchical localization framework, traditional ORB features facilitate continuous inter-frame tracking and relative pose estimation, enabling real-time construction of a handcrafted feature map. A subset of keyframes is then selected for learning-based feature extraction and subsequent matching with the prior map. The final positioning is achieved through an optimization process that minimizes reprojection errors between local keyframes and both handcrafted and learned feature maps, thereby determining the camera's precise location within the pre-established map.
  • Figure 3: Unified keypoint extraction process. The networks for learning different keypoints are unified into standard input and output interfaces. The input is a color image of size $H\times W \times 3$, and the output consists of a score map of size $H\times W \times 1$ and a descriptor map of size $H/8 \times W/8 \times D$. The SuperPoint network takes grayscale images as input, so a conversion is applied beforehand. Its output is a tensor of size $H/8\times W/8\times 65$, which needs to be processed via an unfold operation to achieve the standard form. The descriptor map of ALIKE is downsampled to obtain the standard size. The score map of D2-Net is obtained by applying a softmax operation on the descriptor map, followed by upsampling.
  • Figure 4: Local BA optimization problem. The dark green map points represent the prior visual map, which also serves as the global map. The pink map points represent the handcrafted-feature map constructed in real time. The light green cameras represent the camera poses near the current frame, referred to as the local map, whose poses will be optimized. The yellow cameras represent older cameras, which are fixed to provide continuity constraints. The projection errors of the global map points are used to estimate the accumulated error in the local map and the poses within the local map. The local map points are used solely to optimize the keyframe poses within the local map.
  • Figure 5: Comparison of localization trajectories across seasons. The comparison of localization trajectories obtained from handcrafted keypoint maps, learned keypoint maps, and the proposed hierarchical localization method intuitively demonstrates the effectiveness of the proposed approach. Notably, the localization results provided by AirSLAM are unavailable at certain moments, resulting in discrete and non-continuous trajectories.
  • ...and 1 more figure
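
The Figure 3 caption says the SuperPoint output of size $H/8\times W/8\times 65$ must be rearranged into the standard $H\times W \times 1$ score map. A minimal NumPy sketch of one way to do this is given below. It assumes the commonly used SuperPoint post-processing: a softmax over the 65 channels, removal of the last "no keypoint" channel, and a rearrangement of the remaining 64 channels into $8\times 8$ pixel cells. Whether the authors' unfold step matches this exactly is an assumption, and the function name `superpoint_to_score_map` is hypothetical.

```python
import numpy as np


def superpoint_to_score_map(semi: np.ndarray, cell: int = 8) -> np.ndarray:
    """Convert raw detector output (Hc, Wc, cell*cell + 1) into a
    full-resolution score map of shape (Hc*cell, Wc*cell, 1).

    Hypothetical helper: assumes the last channel is the "no keypoint"
    bin, as in the usual SuperPoint post-processing.
    """
    hc, wc, channels = semi.shape
    if channels != cell * cell + 1:
        raise ValueError("expected cell*cell + 1 channels")

    # Numerically stable softmax over the channel axis.
    shifted = semi - semi.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    prob = exp / exp.sum(axis=-1, keepdims=True)

    # Drop the "no keypoint" channel, leaving one score per pixel of each cell.
    prob = prob[..., :-1]

    # (Hc, Wc, cell, cell) -> (Hc, cell, Wc, cell) -> (Hc*cell, Wc*cell)
    heat = prob.reshape(hc, wc, cell, cell)
    heat = heat.transpose(0, 2, 1, 3)
    heat = heat.reshape(hc * cell, wc * cell)

    # Trailing axis gives the H x W x 1 layout named in the caption.
    return heat[..., np.newaxis]
```

For a $480\times 640$ image the input would have shape `(60, 80, 65)` and the result shape `(480, 640, 1)`, so keypoints from any of the unified networks can be selected from the score map by the same downstream code.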