SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition

Shunpeng Chen, Changwei Wang, Rongtao Xu, Xingtian Pei, Yukun Song, Jinzhou Lin, Wenhao Xu, Jingyi Zhang, Li Guo, Shibiao Xu

TL;DR

SAGE tackles Visual Place Recognition under severe appearance and viewpoint variation by introducing a slow-thinking training paradigm that continuously revises hard-neighborhood sampling via an online geo-visual graph. It combines Soft Probing for discriminative local patch emphasis, an InteractHead for cross-image associations, and greedy weighted sampling to focus training on dense, challenging neighborhoods, all built on a frozen DINOv2 backbone with parameter-efficient fine-tuning. Across eight benchmarks, SAGE achieves state-of-the-art results with high parameter efficiency, including 100% Recall@10 on SPED with 4096-D descriptors. The work emphasizes data-driven training dynamics and points toward adaptive sampling in deep metric learning for visual localization.

Abstract

Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, sample organization during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training, we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves state-of-the-art (SOTA) results across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED using only 4096-D global descriptors. Code and models will be released upon acceptance.
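
The graph construction and sampling steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fusion rule (a convex mix of an exponential distance decay and clipped cosine similarity) and the names geo_scale, alpha, top_k, and cluster_size are assumptions made for illustration only. In the setting described by the abstract, the graph would be rebuilt during training from the current embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_affinity(coords, embeddings, geo_scale=25.0, alpha=0.5, top_k=3):
    """Fuse geographic proximity and visual similarity, then keep top_k edges per node.

    geo_scale, alpha, and the fusion formula are hypothetical choices.
    """
    n = len(coords)
    dense = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            geo = math.exp(-math.dist(coords[i], coords[j]) / geo_scale)
            vis = max(cosine(embeddings[i], embeddings[j]), 0.0)
            dense[i][j] = dense[j][i] = alpha * geo + (1.0 - alpha) * vis
    # An edge survives if either endpoint ranks it among its top_k neighbors.
    sparse = [[0.0] * n for _ in range(n)]
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: dense[i][j], reverse=True)
        for j in ranked[:top_k]:
            sparse[i][j] = sparse[j][i] = dense[i][j]
    return sparse

def greedy_weighted_sample(aff, cluster_size):
    """Seed from the highest-affinity unused node, then grow a clique greedily."""
    unused = set(range(len(aff)))
    clusters = []
    while unused:
        # Seed: largest total affinity to the remaining nodes (ties go to lower index).
        seed = max(unused,
                   key=lambda i: (sum(aff[i][j] for j in unused if j != i), -i))
        cluster = [seed]
        unused.remove(seed)
        while len(cluster) < cluster_size:
            # Clique constraint: a candidate must be linked to every current member.
            candidates = [j for j in unused
                          if all(aff[j][m] > 0.0 for m in cluster)]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda j: (sum(aff[j][m] for m in cluster), -j))
            cluster.append(best)
            unused.remove(best)
        clusters.append(cluster)
    return clusters
```

Each returned cluster could serve as one training batch of mutually similar, and therefore hard to separate, places. The brute-force pairwise loop is quadratic and only suitable for toy inputs.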

Paper Structure

This paper contains 19 sections, 8 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Performance and parameter efficiency of SAGE. (a–d) Recall@1 across four datasets at different global descriptor dimensions; SAGE achieves the best performance regardless of backbone and descriptor size. (e) Parameter comparison. By freezing DINOv2, SAGE substantially reduces trainable parameters compared to methods that employ adapters or partial encoder tuning, demonstrating high efficiency. (f) Recall@1 performance compared with EMVP across the datasets.
  • Figure 2: SAGE overview. (a) Pipeline: a frozen DINOv2 with PEFT outputs tokens; SoftP amplifies informative patches, and InteractHead applies cross-image attention to form a robust global descriptor. (b) Online Graph Creation: each epoch builds a geo–visual affinity graph, keeping top-k neighbors and updating edges as embeddings evolve. (c) Greedy Weighted Sampling: seed by average affinity and expand cliques by adding the most connected nodes. (d) SoftP: A lightweight module that uses residual weighting to emphasize discriminative features prior to aggregation.
  • Figure 3: Visualization of spatial feature clustering using t-SNE for four methods and comparison of Average Intra-class Distance (AID). Numbers next to each class indicate intra-class distance (ID).
  • Figure 4: Qualitative results. SAGE consistently retrieves correct database images under severe challenges.
  • Figure 5: Visual comparison of importance heatmaps. Overall, SoftP focuses more strongly than other methods on fine-grained regions with high discriminative value.
  • ...and 4 more figures
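
The residual weighting that the Figure 2(d) caption attributes to SoftP can be sketched as follows. This is an illustrative guess at the general idea, not the paper's module: the linear probe, the sigmoid gate, and the rule x -> (1 + gate) * x are assumptions, and simple mean pooling with L2 normalization stands in for the bilinear aggregation mentioned in the abstract. The names soft_probe, probe_w, probe_b, and mean_pool_l2 are hypothetical.

```python
import math

def soft_probe(patches, probe_w, probe_b=0.0):
    """Residually reweight patch descriptors: x -> (1 + sigmoid(w.x + b)) * x.

    The linear probe and sigmoid gate are hypothetical; in training, probe_w
    and probe_b would be learned parameters.
    """
    reweighted = []
    for x in patches:
        score = sum(w * v for w, v in zip(probe_w, x)) + probe_b
        gate = 1.0 / (1.0 + math.exp(-score))
        # Residual form: the original patch is kept and a gated copy is added.
        reweighted.append([(1.0 + gate) * v for v in x])
    return reweighted

def mean_pool_l2(patches):
    """Stand-in aggregation: average the patches, then L2-normalize."""
    dim = len(patches[0])
    pooled = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm else pooled

# Two toy patches; the probe favors the first feature dimension.
patches = [[1.0, 0.0], [0.0, 1.0]]
descriptor = mean_pool_l2(soft_probe(patches, probe_w=[4.0, 0.0]))
```

Because the gate is added to an identity term, a patch is never suppressed below its original magnitude; patches the probe scores highly simply contribute more to the pooled descriptor.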