Segment Anything Model is a Good Teacher for Local Feature Learning

Jingqian Wu, Rongtao Xu, Zach Wood-Doughty, Changwei Wang, Shibiao Xu, Edmund Y. Lam

TL;DR

SAMFeat uses the Segment Anything Model (SAM) as a teacher during training to improve local feature learning by injecting category-agnostic semantic information and fine-grained edge cues. It introduces three SAM-derived mechanisms: Attention-weighted Semantic Relation Distillation ($L_{dis}$), Weakly Supervised Contrastive Learning Based on Semantic Grouping ($L_{wsc}$), and Edge Attention Guidance ($L_{edge}$). Their losses are summed as $L_g = L_{dis}+L_{edge}+L_{wsc}$ and combined with the detection and descriptor losses to form the total loss $L = L_g + L_{det} + L_{des}$. Training is end-to-end, with SAM providing intermediate supervision; SAM is not used at inference, so it adds no inference cost. Across the HPatches, Aachen Day-Night, and ETH 3D benchmarks, SAMFeat achieves state-of-the-art or competitive performance, which suggests that a large-scale foundation model can serve as an effective training-time teacher for robust local feature learning. The work shows that high-level semantic priors and precise edges from SAM can improve keypoint detection and description without sacrificing runtime efficiency.
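
The way the loss terms combine can be sketched in a few lines of Python. This is only an illustration of the two sums stated above, assuming unit weights on every term as the formulas are written; the class name `SamFeatLosses` and its field names are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass
class SamFeatLosses:
    """Container for the five scalar loss terms named in the summary.

    Hypothetical helper for illustration only; not the authors' API.
    """
    dis: float   # L_dis: attention-weighted semantic relation distillation
    wsc: float   # L_wsc: weakly supervised contrastive learning
    edge: float  # L_edge: edge attention guidance
    det: float   # L_det: keypoint detection loss
    des: float   # L_des: descriptor loss

    def guidance(self) -> float:
        # L_g = L_dis + L_edge + L_wsc (the SAM-derived terms)
        return self.dis + self.edge + self.wsc

    def total(self) -> float:
        # L = L_g + L_det + L_des
        return self.guidance() + self.det + self.des


losses = SamFeatLosses(dis=0.5, wsc=0.25, edge=0.125, det=1.0, des=2.0)
print(losses.guidance(), losses.total())
```

Only the three terms in `guidance()` depend on SAM, which is consistent with SAM being needed during training but not at inference.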

Abstract

Local feature detection and description, which aim to detect and describe keypoints in "any scene" and for "any downstream task", play an important role in many computer vision tasks. Data-driven local feature learning methods rely on pixel-level correspondence for training, which is challenging to acquire at scale and thus hinders further improvements in performance. In this paper, we propose SAMFeat, which introduces SAM (Segment Anything Model), a foundation model trained on 11 million images, as a teacher to guide local feature learning and thus achieve higher performance on limited datasets. To do so, first, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which distills feature relations carrying the category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description through semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which uses semantic groupings derived from SAM as weakly supervised signals to optimize the metric space of local descriptors. Third, we design Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge regions identified by SAM. SAMFeat's performance on tasks such as image matching on HPatches and long-term visual localization on Aachen Day-Night demonstrates its superiority over previous local features. The released code is available at https://github.com/vignywang/SAMFeat.
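
To make the idea of distilling feature relations more concrete, here is a minimal sketch in plain Python. The abstract does not give the exact ASRD formulation, so everything specific here is an assumption: relations are taken to be pairwise cosine similarities, the attention weight of a pair is the product of two per-point weights, and the penalty is a weighted mean absolute difference. The function names are hypothetical.

```python
import math


def cosine_relation_matrix(features):
    """Pairwise cosine similarity between feature vectors (list of lists)."""
    # Guard against zero vectors by falling back to a norm of 1.0.
    norms = [math.sqrt(sum(v * v for v in f)) or 1.0 for f in features]
    n = len(features)
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def weighted_relation_loss(student_feats, teacher_feats, attention):
    """Attention-weighted mean absolute difference of relation matrices.

    Assumed form, for illustration only: the pair (i, j) is weighted by
    attention[i] * attention[j].
    """
    rel_s = cosine_relation_matrix(student_feats)
    rel_t = cosine_relation_matrix(teacher_feats)
    total, weight_sum = 0.0, 0.0
    for i in range(len(rel_s)):
        for j in range(len(rel_s)):
            w = attention[i] * attention[j]
            total += w * abs(rel_s[i][j] - rel_t[i][j])
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Student sees two identical points; teacher says they are unrelated.
student = [[1.0, 0.0], [1.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 1.0]]
print(weighted_relation_loss(student, teacher, attention=[1.0, 1.0]))
```

Because only relations between points are compared, the student and teacher features may have different dimensionality, which is one reason relation-based distillation suits a small local feature network paired with a large encoder.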

Paper Structure (17 sections, 13 equations, 8 figures, 10 tables, 1 algorithm)

Figures (8)

  • Figure 1: (a): Difference between segment anything model and common semantic segmentation model. (b): Schematic diagram of proposed SAMFeat.
  • Figure 2: The overview of our SAMFeat, which performs feature detection, description, edge depiction, and feature distillation end-to-end.
  • Figure 3: The detailed overview of SAMFeat. Notice that SAM is only applied in the training phase, while there is no computational cost in the inference phase.
  • Figure 4: Schematic diagrams of Relationship matrix and Attention map.
  • Figure 5: Example of Semantic Grouping. Different colored stars represent sampling points in different semantic groupings.
  • ...and 3 more figures