StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition
Yanqing Shen, Sanping Zhou, Jingwen Fu, Ruotong Wang, Shitao Chen, Nanning Zheng
TL;DR
StructVPR++ addresses visual place recognition under strong environmental variation by distilling segmentation-derived structure and semantics into RGB global representations. The method decouples label features from global features and uses a two-stage training scheme with group-partitioned, sample-weighted distillation to produce a label-aware RGB model that performs competitively with two-stage methods while remaining RGB-only at deployment. Key contributions include segmentation label map encoding (SLME), explicit semantic alignment via label features, a group-partition strategy, and a sample-based weighting function for distillation. Empirical results across MSLS, Nordland, and Pitts30k show consistent recall gains (5–23% at Recall@1) and favorable latency, highlighting practical improvements in accuracy-efficiency trade-offs for real-world VPR systems.
Abstract
Visual place recognition is a challenging task for autonomous driving and robotics, which is usually considered as an image retrieval problem. A commonly used two-stage strategy involves global retrieval followed by re-ranking using patch-level descriptors. Most deep learning-based methods in an end-to-end manner cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can utilize more explicit structural and semantic information in one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.
