StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition

Yanqing Shen, Sanping Zhou, Jingwen Fu, Ruotong Wang, Shitao Chen, Nanning Zheng

TL;DR

StructVPR++ addresses visual place recognition under strong environmental variation by distilling segmentation-derived structure and semantics into RGB global representations. The method decouples label features from global features and uses a two-stage training scheme with group-partitioned, sample-weighted distillation to produce a label-aware RGB model that performs competitively with two-stage methods while remaining RGB-only at deployment. Key contributions include segmentation label map encoding (SLME), explicit semantic alignment via label features, a group-partition strategy, and a sample-based weighting function for distillation. Empirical results across MSLS, Nordland, and Pitts30k show consistent recall gains (5–23% at Recall@1) and favorable latency, highlighting practical improvements in accuracy-efficiency trade-offs for real-world VPR systems.

Abstract

Visual place recognition is a challenging task for autonomous driving and robotics, and it is usually treated as an image retrieval problem. A commonly used two-stage strategy performs global retrieval followed by re-ranking with patch-level descriptors. Most end-to-end deep learning methods cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can exploit more explicit structural and semantic information in a one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, which enables explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5–23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.
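Since the abstract frames place recognition as image retrieval and reports results as Recall@1, a short sketch of how Recall@N is commonly computed may help. This is a minimal illustration, not the authors' evaluation code: the function name `recall_at_n`, the toy rankings, and the ground-truth sets are all hypothetical, and it assumes the usual convention that a query counts as recognized when at least one of its top-N retrieved database images is a true match.

```python
from typing import Sequence, Set


def recall_at_n(ranked_ids: Sequence[Sequence[int]],
                positives: Sequence[Set[int]],
                n: int) -> float:
    """Fraction of queries with at least one true match in the top-n results."""
    if len(ranked_ids) != len(positives):
        raise ValueError("expected one ranking and one positive set per query")
    if not ranked_ids:
        raise ValueError("need at least one query")
    hits = 0
    for ranking, truth in zip(ranked_ids, positives):
        # A query counts as recognized if any top-n candidate is a true match.
        if any(db_id in truth for db_id in ranking[:n]):
            hits += 1
    return hits / len(ranked_ids)


# Hypothetical toy data: 4 queries, database ids already sorted by
# descriptor similarity (most similar first).
rankings = [[3, 7, 1], [5, 2, 9], [4, 5, 0], [8, 6, 2]]
truths = [{3}, {9}, {1}, {6, 2}]

r1 = recall_at_n(rankings, truths, 1)
r3 = recall_at_n(rankings, truths, 3)
print(r1, r3)
```

In this toy setup only the first query has a true match at rank 1, while three of the four queries have one within the top 3, so Recall@3 is higher than Recall@1.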

Paper Structure

This paper contains 25 sections, 17 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Examples of query images and ground truths. The marked number represents the recall performance of two pre-trained branches on ground truths. (a) shows the scene with illumination variation and seasonal changes, where segmentation images are more recognizable. (b) shows the scene with changing perspectives, where RGB images are more recognizable.
  • Figure 2: Challenging cases with large viewpoint variations from the same place. Existing methods often fail to align semantic meanings between image pairs, and structural information alone is not enough for recognition when the images have little overlap.
  • Figure 3: Illustration of the proposed pipeline. We first train two branches with VPR supervision to extract structural and semantic knowledge, respectively. Next, offline group partition is performed using the frozen branches, and weights are assigned to the samples. Then, in Stage II, weighted knowledge distillation and VPR supervision are combined into a single loss to train the RGB model, which yields the label-aware RGB features. During the inference phase, StructVPR++ uses only the model trained in Stage II and does not perform segmentation. More importantly, the pipeline can maximize the efficiency of distilling high-quality knowledge. SLME denotes the pre-encoding of segmentation images into a standard CNN input format. MLP, LC, MC, and LMC are network modules used to obtain the corresponding features. In the seg-branch, A refers to the multi-level down-sampled masks, and Weights denotes the shared weighting module for the separate label features. The schematic diagram on the lower left briefly illustrates the high-level differences between structural and semantic information.
  • Figure 4: Visualization of incorporated semantic classes. Shown from left to right are RGB images, segmentation images with incorporated labels, and original segmentation images. The label space after incorporation is cleaner for VPR.
  • Figure 5: Flowchart of feature computation. Visualization of multi-level concatenation layer and label-level concatenation layer.
  • ...and 5 more figures
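
The Stage II objective described in the Figure 3 caption (VPR supervision plus sample-weighted distillation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulation: the squared L2 distance, the normalization by the sum of weights, the balancing factor `lam`, and all names and numbers are hypothetical. The paper's actual group partition and weighting function are not given in this section, so the per-sample weights are simply taken as inputs.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_distillation_loss(student: Sequence[Sequence[float]],
                               teacher: Sequence[Sequence[float]],
                               weights: Sequence[float]) -> float:
    """Weighted mean of per-sample student/teacher feature distances.

    Samples with larger weights (assumed: more reliable teacher knowledge)
    contribute more; a weight of 0 removes a sample from the loss.
    """
    total_w = sum(weights)
    if total_w == 0:
        return 0.0
    weighted = sum(w * squared_l2(s, t)
                   for s, t, w in zip(student, teacher, weights))
    return weighted / total_w


def combined_loss(vpr_loss: float, distill_loss: float, lam: float = 1.0) -> float:
    """VPR supervision plus distillation, balanced by a hypothetical factor lam."""
    return vpr_loss + lam * distill_loss


# Hypothetical features for two training samples.
student_feats = [[1.0, 0.0], [0.0, 1.0]]   # RGB model being trained (student)
teacher_feats = [[1.0, 1.0], [0.0, 3.0]]   # frozen seg-branch (teacher)
sample_weights = [0.75, 0.25]              # assumed output of the offline weighting step

kd = weighted_distillation_loss(student_feats, teacher_feats, sample_weights)
total = combined_loss(vpr_loss=0.5, distill_loss=kd, lam=2.0)
print(kd, total)
```

Because the weights come from an offline step with frozen branches, they act as constants during Stage II; only the student features would receive gradients in a real training loop.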