SELC: Self-Supervised Efficient Local Correspondence Learning for Low Quality Images

Yuqing Wang, Yan Wang, Hailiang Tang, Xiaoji Niu

TL;DR

SELC addresses the need for accurate yet efficient feature matching in SLAM by proposing a lightweight, patch-based CNN that learns dense local descriptors without manual annotations. It integrates traditional tracking signals via a hybrid self-supervision paradigm and enforces both intra-frame and inter-frame consistency through a combination of keypoint, heat-map, and dense descriptor losses, plus single-frame and multi-frame consistency losses. The approach yields strong short-term accuracy and mitigates long-term feature drift while remaining efficient: it matches the computational efficiency of state-of-the-art learned methods at low resolutions and, through pyramid inference, is 2-10x more efficient on high-resolution inputs. Evaluations on MegaDepth, KITTI, HPatches, and EuRoC demonstrate competitive repeatability and particularly notable efficiency improvements for high-resolution imagery, making the method well-suited for resource-constrained visual localization and SLAM pipelines.

Abstract

Accurate and stable feature matching is critical for computer vision tasks, particularly in applications such as Simultaneous Localization and Mapping (SLAM). While recent learning-based feature matching methods have demonstrated promising performance in challenging spatiotemporal scenarios, they still face inherent trade-offs between accuracy and computational efficiency in specific settings. In this paper, we propose a lightweight feature matching network designed to establish sparse, stable, and consistent correspondence between multiple frames. The proposed method eliminates the dependency on manual annotations during training and mitigates feature drift through a hybrid self-supervised paradigm. Extensive experiments validate three key advantages: (1) Our method operates without dependency on external prior knowledge and seamlessly incorporates its hybrid training mechanism into original datasets. (2) Benchmarked against state-of-the-art deep learning-based methods, our approach maintains equivalent computational efficiency at low-resolution scales while achieving a 2-10x improvement in computational efficiency for high-resolution inputs. (3) Comparative evaluations demonstrate that the proposed hybrid self-supervised scheme effectively mitigates feature drift in long-term tracking while maintaining consistent representation across image sequences.

Paper Structure

This paper contains 19 sections, 20 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The pipeline of the proposed method. In the preprocessing stage, fixed-size image patches surrounding the keypoints are extracted. The method achieves exceptional speed through shallow convolutional operations, followed by the generation of a compact 32-dimensional dense descriptor map $\mathbf{D}$ in the subsequent encoding phase. Sub-pixel feature locations are obtained through similarity computation and differentiable feature extraction.
  • Figure 2: Single Consistency Loss. The red dots represent the matched features. For each tracked point in the first frame, corresponding image patches at different locations of the ground truth in the second frame are identified. A cost function is constructed using the coordinates $\tilde{\mathbf{p}}^i$ of multiple randomly shifted image patches relative to each other.
  • Figure 3: Multiple Consistency Loss. The red points represent positions obtained from bidirectional optical flow, while the green points denote target tracking points. Utilizing the feature maps over multiple epochs, the cost function is computed between the frame-by-frame extracted feature point coordinates $\tilde{\mathbf{p}}$ and similarity map $\mathbf{C}$, as well as the cross-frame extracted feature point coordinates $\tilde{\mathbf{p}}'$ and similarity map $\mathbf{C}'$.
  • Figure 4: Different network inference architectures. The left side shows single-patch direct inference, and the right side shows pyramid-patch coarse-to-fine inference.
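
The final step described in the Figure 1 caption (similarity computation followed by differentiable sub-pixel extraction) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the captions do not say which similarity measure or extraction operator is used, so cosine similarity and a softmax-weighted expectation (soft-argmax) are assumed here, and the function names `similarity_map` and `soft_argmax` as well as the `temperature` value are hypothetical.

```python
import numpy as np


def similarity_map(query_desc, dense_desc):
    """Cosine similarity between one descriptor (C,) and a dense map (C, H, W).

    Assumption: cosine similarity stands in for the paper's similarity map C.
    """
    q = query_desc / (np.linalg.norm(query_desc) + 1e-8)
    d = dense_desc / (np.linalg.norm(dense_desc, axis=0, keepdims=True) + 1e-8)
    # Contract the channel axis, leaving an (H, W) similarity map.
    return np.tensordot(q, d, axes=([0], [0]))


def soft_argmax(sim, temperature=0.05):
    """Expected (x, y) position under a softmax over the similarity map.

    Assumption: soft-argmax stands in for "differentiable feature extraction".
    The result is a real-valued (sub-pixel) coordinate.
    """
    h, w = sim.shape
    logits = (sim - sim.max()) / temperature  # subtract max for stability
    prob = np.exp(logits)
    prob /= prob.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((prob * xs).sum()), float((prob * ys).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a 32-dimensional dense descriptor map D over a 16x16 patch.
    dense = rng.standard_normal((32, 16, 16))
    query = dense[:, 5, 9]  # descriptor taken at row 5, column 9
    x, y = soft_argmax(similarity_map(query, dense))
    print(f"estimated location: x={x:.3f}, y={y:.3f}")
```

Because the estimate is an expectation rather than a hard argmax, it varies smoothly with the descriptors, which is what would allow coordinate-based losses such as those in Figures 2 and 3 to be backpropagated. A lower `temperature` sharpens the distribution toward the single best match.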