D$^{2}$-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

Zheyuan Zhang, Jiwei Zhang, Boyu Zhou, Linzhimeng Duan, Hong Chen

TL;DR

The paper introduces D$^{2}$-VPR, a parameter-efficient visual place recognition method built on visual foundation models. It uses a two-stage training strategy (knowledge distillation followed by fine-tuning), a Distillation Recovery Module that aligns teacher and student features, and a Top-Down-attention-based Deformable Aggregator for adaptive region pooling. The approach achieves competitive performance on standard VPR benchmarks while cutting parameters by about 64% and MACs by about 63% relative to CricaVPR, which eases deployment on resource-constrained devices. The work shows how distillation and deformable, semantically guided pooling can preserve foundation-model strengths in a lean VPR architecture.

Abstract

Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2's exceptional feature generalization capabilities, but it is often accompanied by increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose $D^{2}$-VPR, a $D$istillation- and $D$eformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters and achieving a more favorable performance-efficiency trade-off. First, we employ a two-stage training strategy that integrates knowledge distillation and fine-tuning, and we introduce a Distillation Recovery Module (DRM) to better align the feature spaces of the teacher and student models, thereby minimizing the knowledge lost in transfer. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROI) used for aggregation, thereby improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves competitive performance compared to state-of-the-art approaches while reducing the parameter count by approximately 64.2% and MACs by about 62.6% relative to CricaVPR. Code is available at https://github.com/tony19980810/D2VPR.
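The retrieval formulation in the first sentence of the abstract, together with the Recall@K metric commonly reported for VPR, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `retrieve_top_k` and `recall_at_k`, the use of cosine similarity, and the toy descriptors are all assumptions made for illustration, and the global descriptors are assumed to have been computed already by some VPR model.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_desc, db_desc, k=5):
    """Return, per query, the indices of the k most similar database descriptors.

    query_desc: (num_queries, dim) array of global descriptors (assumed given).
    db_desc:    (num_db, dim) array of reference descriptors (assumed given).
    """
    sim = l2_normalize(query_desc) @ l2_normalize(db_desc).T
    # Sort by descending similarity and keep the first k columns.
    return np.argsort(-sim, axis=1)[:, :k]


def recall_at_k(top_k, positives):
    """Fraction of queries with at least one true match among their top-k results.

    positives: one collection of ground-truth database indices per query
    (hypothetical here; in practice derived from the geo-tags).
    """
    hits = [bool(set(row.tolist()) & set(pos)) for row, pos in zip(top_k, positives)]
    return float(np.mean(hits))


# Toy data: four reference descriptors and two queries.
db = np.eye(4)
queries = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.2, 0.8]])
top2 = retrieve_top_k(queries, db, k=2)
score = recall_at_k(top2, positives=[[0], [1]])
```

In this toy setup the first query retrieves its true match and the second does not, so `score` is 0.5. At realistic database sizes, the dense similarity matrix would typically be replaced by an approximate nearest-neighbor index.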

Paper Structure

This paper contains 18 sections, 11 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The comparison of average R@5 against multiply-accumulate operations (MACs) and parameter count on Pitts30k, MSLS-val, and SPED (with image size of 224×224) demonstrates that our model achieves competitive performance despite significantly reduced MACs and parameter count, striking an effective trade-off.
  • Figure 2: Two training stages of our VPR model.
  • Figure 3: Top-down-attention-based deformable aggregator.
  • Figure 4: Qualitative VPR comparison results. Our method demonstrates competitive performance compared to these DINOv2-based SOTA models in challenging cases: long-term appearance changes (first row), drastic lighting variations (second row), perceptual aliasing (third row), and viewpoint changes (fourth row). Green marks a correct match and red an incorrect one. Key matching regions are highlighted with red dashed boxes.
  • Figure 5: Method comparison of inference and computational speed on AmsterTime. The inference time includes the inference of both the database and the query set. The batch size is 32. SelaVPR performs calculations in a two-stage manner. PCA is not used here.
  • ...and 2 more figures