Distillation Improves Visual Place Recognition for Low Quality Images

Anbang Yang, Ge Jin, Junjie Huang, Yao Wang, John-Ross Rizzo, Chen Feng

TL;DR

The paper tackles the problem of visual place recognition (VPR) performance degradation when query images are transmitted at low quality due to bandwidth constraints. It introduces a model-agnostic, two-branch knowledge-distillation framework that transfers information from a teacher operating on high-quality images to a student processing low-quality inputs, using Inter-Channel Correlation Knowledge Distillation (ICKD), Mean Squared Error (MSE), and a Weakly Supervised Triplet Ranking Loss, with a composite objective $L = L_{\theta 1} + \alpha L_{\theta 2} + \beta L_{\theta 3}$. The authors validate the approach across multiple VPR methods and datasets under JPEG compression, resolution reduction, and video quantization, and also curate a video-based VPR dataset to address data scarcity. Results show significant recall improvements in most settings, demonstrating the method's generalization to different degradations and modalities, and highlighting a practical path for reliable VPR in resource-constrained environments. This work advances VPR by combining distillation with multi-loss supervision to preserve discriminative descriptor structure when input quality is compromised.

Abstract

Real-time visual localization often utilizes online computing, for which query images or videos are transmitted to remote servers for visual place recognition (VPR). However, limited network bandwidth necessitates image-quality reduction and thus the degradation of global image descriptors, reducing VPR accuracy. We address this issue at the descriptor extraction level with a knowledge-distillation methodology that learns feature representations from high-quality images to extract more discriminative descriptors from low-quality images. Our approach includes the Inter-channel Correlation Knowledge Distillation (ICKD) loss, Mean Squared Error (MSE) loss, and Triplet loss. We validate the proposed losses on multiple VPR methods and datasets subjected to JPEG compression, resolution reduction, and video quantization. We obtain significant improvements in VPR recall rates under all three tested modalities of lowered image quality. Furthermore, we fill a gap in VPR literature on video-based data and its influence on VPR performance. This work contributes to more reliable place recognition in resource-constrained environments.
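As a rough illustration of how the three losses named above could be combined, the following NumPy sketch computes an ICKD-style term on latent feature maps, an MSE term on global descriptors, and a simplified triplet term, then sums them with two weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `(C, H, W)` feature-map shape, the `1 / C^2` normalization of the ICKD term, the single-positive/single-negative triplet form, the margin value, and the assignment of the weights `alpha` and `beta` to particular terms are all assumptions made for illustration.

```python
import numpy as np

def ickd_loss(z_teacher, z_student):
    """Squared difference between inter-channel correlation matrices.

    Both inputs are feature maps shaped (C, H, W) (assumed layout).
    The 1 / C^2 normalization is an assumption.
    """
    c = z_teacher.shape[0]
    f_t = z_teacher.reshape(c, -1)   # (C, H*W)
    f_s = z_student.reshape(c, -1)
    g_t = f_t @ f_t.T                # (C, C) channel-to-channel correlations
    g_s = f_s @ f_s.T
    return float(np.sum((g_s - g_t) ** 2) / (c * c))

def mse_loss(v_teacher, v_student):
    """Mean squared error between two global descriptors."""
    return float(np.mean((v_student - v_teacher) ** 2))

def triplet_loss(v_query, v_pos, v_neg, margin=0.1):
    """Simplified triplet loss with one positive and one negative.

    A weakly supervised ranking loss would typically pick the best
    positive among several candidates; that step is omitted here.
    """
    d_pos = float(np.sum((v_query - v_pos) ** 2))
    d_neg = float(np.sum((v_query - v_neg) ** 2))
    return max(0.0, d_pos - d_neg + margin)

def composite_loss(l1, l2, l3, alpha=1.0, beta=1.0):
    """Weighted sum l1 + alpha * l2 + beta * l3 (weight placement assumed)."""
    return l1 + alpha * l2 + beta * l3
```

In a two-branch setup, the teacher inputs would come from the high-quality image and the student inputs from its low-quality counterpart, with only the student receiving gradient updates; a training loop would need an autodiff framework rather than plain NumPy.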

Paper Structure (27 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Enhanced Low Quality Image Retrieval: After knowledge distillation from high quality images, this JPEG-compressed query image from the Mapillary SLS dataset (top center) is correctly localized by multiple VPR methods (top row), whereas they failed beforehand (bottom row).
  • Figure 2: Proposed Knowledge Distillation Methodology: Through ICKD loss between latent codes $z^h$ and $z^l$, MSE loss between global descriptors $v^h$ and $v^l$, and triplet loss on $v^l$, the student branch $f_s$ learns the teacher branch $f_t$'s knowledge of high quality images $I^h$ to extract more discriminative, representative $v^l$ from low quality images $I^l$.
  • Figure 3: Recall Rates for All Losses: For each method, we compare $f_s$ after distillation with loss combinations as well as pretrained weights. Different marker shapes indicate different combinations of ICKD and MSE losses, whereas including the triplet loss is shown with a dashed line. For readability, the y-axis of each subplot is independently scaled.
  • Figure 4: Activation Maps: The selected $I^l$ queries (top of figure) are successfully recalled by $f_s$, while $f_t$ fails. For CricaVPR, the focus of its ICKD-trained $f_s$ is visualized by averaging its $z^l$ across the channel dimension, following its authors' approach. For NetVLAD, we averaged a weighted sum of its MSE-trained $z^l$ via tentative cluster assignments from $g^{\phi}_s$. For MixVPR, since its authors did not visualize activations, we masked small regions of $I^l$ and treated the resulting change in its ICKD-trained $v^l$ as activations, which creates block-like patches on its heatmap.
  • Figure 5: Results on Our Dataset: The recall rates of Pitts250k-trained $f_s$ are obtained for each quantization parameter. For ease of visualization, each quantization parameter is converted into the corresponding video bitrate, with lower bitrate indicating higher quantization. The left two plots were calculated with a VPR threshold of $1$m, and the right two with a $10$m threshold.
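
The recall rates and distance thresholds mentioned in the captions above can be made concrete with a small sketch. It assumes the common VPR convention that a query counts as correctly localized when at least one of its top-N retrieved database images lies within the distance threshold of the query's ground-truth position; the source does not spell out the exact protocol, so that convention, the function name `recall_at_n`, the planar metric coordinates, and the toy data are all assumptions.

```python
import numpy as np

def recall_at_n(q_desc, db_desc, q_pos, db_pos, n=1, threshold_m=10.0):
    """Fraction of queries with a top-n retrieval within threshold_m meters.

    q_desc, db_desc: descriptor arrays shaped (num_items, dim).
    q_pos, db_pos:   planar positions in meters shaped (num_items, 2).
    """
    hits = 0
    for desc, pos in zip(q_desc, q_pos):
        # Rank database images by descriptor distance to the query.
        desc_dist = np.linalg.norm(db_desc - desc, axis=1)
        top = np.argsort(desc_dist)[:n]
        # Geographic distance from the query to each retrieved image.
        geo_dist = np.linalg.norm(db_pos[top] - pos, axis=1)
        hits += bool(np.any(geo_dist <= threshold_m))
    return hits / len(q_desc)

# Toy data (hypothetical): two database images 100 m apart, two queries.
db_desc = np.array([[1.0, 0.0], [0.0, 1.0]])
db_pos = np.array([[0.0, 0.0], [100.0, 0.0]])
q_desc = np.array([[0.9, 0.1], [0.1, 0.9]])
q_pos = np.array([[0.5, 0.0], [0.0, 0.0]])

r1 = recall_at_n(q_desc, db_desc, q_pos, db_pos, n=1, threshold_m=1.0)
r2 = recall_at_n(q_desc, db_desc, q_pos, db_pos, n=2, threshold_m=1.0)
```

Here the second query retrieves the wrong place first, so it only counts as a hit once the top two results are considered; tightening the threshold from 10 m to 1 m makes the same check stricter on position rather than on ranking.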