TSCM: A Teacher-Student Model for Vision Place Recognition Using Cross-Metric Knowledge Distillation

Yehui Shen; Mingmin Liu; Huimin Lu; Xieyuanli Chen

TL;DR

The paper tackles robust visual place recognition under environmental variations by introducing TSCM, a teacher-student framework that uses cross-metric knowledge distillation to bridge the gap between a high-capacity teacher and a lightweight student. The approach integrates a ResNet-ViT-InterTransformer teacher with a compact ResNet-based student and a novel cross-metric loss $L_{\text{total}} = L_{\text{hard}} + L_{\text{soft}} + L_{\text{cm}}$, where $L_{\text{cm}} = \sum_i d(S(a_i) - T(p_i)) + d(S(p_i) - T(a_i))$, to enforce cross-model descriptor relationships. Experiments on Pittsburgh30k and Pittsburgh250k demonstrate that the student not only approaches but can exceed the teacher's VPR accuracy while offering substantially reduced parameters and faster inference, achieving descriptor generation in about $1.3$ ms and matching in under $0.6$ ms per query for a 10k database. The work shows strong ablations confirming the efficacy of cross-metric KD over traditional KD methods and underscores its potential for real-time robotic navigation on resource-constrained platforms. The code is released to facilitate adoption and reproducibility.
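To make the cross-metric term concrete, here is a minimal sketch in plain Python. It follows the formula quoted above, but several details are assumptions: $d(\cdot)$ is taken to be the Euclidean norm of the difference vector, descriptors are plain lists of floats, and the function names (`cross_metric_loss`, `total_loss`) are hypothetical. The summary does not define $L_{\text{hard}}$ and $L_{\text{soft}}$, so they are passed in as precomputed numbers rather than implemented.

```python
import math


def l2_norm(vec):
    """Euclidean norm; an assumed stand-in for the distance d(.)."""
    return math.sqrt(sum(x * x for x in vec))


def diff(u, v):
    """Element-wise difference of two equal-length descriptors."""
    return [a - b for a, b in zip(u, v)]


def cross_metric_loss(s_anchor, s_pos, t_anchor, t_pos):
    """L_cm = sum_i d(S(a_i) - T(p_i)) + d(S(p_i) - T(a_i)).

    s_* are student descriptors and t_* are teacher descriptors,
    one descriptor per training sample i.
    """
    total = 0.0
    for sa, sp, ta, tp in zip(s_anchor, s_pos, t_anchor, t_pos):
        # Student anchor vs. teacher positive, and student positive
        # vs. teacher anchor: the two cross-model pairs in the formula.
        total += l2_norm(diff(sa, tp)) + l2_norm(diff(sp, ta))
    return total


def total_loss(l_hard, l_soft, l_cm):
    """L_total = L_hard + L_soft + L_cm (unweighted sum, as in the summary)."""
    return l_hard + l_soft + l_cm


if __name__ == "__main__":
    # Toy 2-D descriptors for a single (anchor, positive) pair.
    l_cm = cross_metric_loss(
        s_anchor=[[0.0, 0.0]], s_pos=[[1.0, 0.0]],
        t_anchor=[[0.0, 0.0]], t_pos=[[3.0, 4.0]],
    )
    print(l_cm)
```

In a real training loop these descriptors would be tensors from the two networks, with the teacher held fixed; the sketch only illustrates which descriptor pairs the cross-metric term compares.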

Abstract

Visual place recognition (VPR) plays a pivotal role in autonomous exploration and navigation of mobile robots within complex outdoor environments. While cost-effective and easily deployed, camera sensors are sensitive to lighting and weather changes, and even slight image alterations can greatly affect VPR efficiency and precision. Existing methods overcome this by exploiting powerful yet large networks, leading to significant consumption of computational resources. In this paper, we propose a high-performance teacher and lightweight student distillation framework called TSCM. It exploits our devised cross-metric knowledge distillation to narrow the performance gap between the teacher and student models, maintaining superior performance while enabling minimal computational load during deployment. We conduct comprehensive evaluations on large-scale datasets, namely Pittsburgh30k and Pittsburgh250k. Experimental results demonstrate the superiority of our method over baseline models in terms of recognition accuracy and model parameter efficiency. Moreover, our ablation studies show that the proposed knowledge distillation technique surpasses other counterparts. The code of our method has been released at https://github.com/nubot-nudt/TSCM.

Paper Structure (11 sections, 9 equations, 7 figures, 4 tables)

This paper contains 11 sections, 9 equations, 7 figures, 4 tables.

Introduction
Related Work
Our Approach
Teacher-Student Model
Cross-Metric Knowledge Distillation
Experimental Evaluation
Experimental Setup
VPR Performance
Ablation Studies and Insights
Computational Efficiency and Runtime
Conclusion

Figures (7)

Figure 1: Our model consists of cross-metric knowledge distillation and place recognition deployment. It uses cross-metric learning to transfer knowledge from teacher to student offline. During online inference, it uses the lightweight student model to generate descriptors from the input image and identify potential places by comparing them against those stored in the database.
Figure 2: The pipeline overview of our proposed teacher network. It processes input images through a multi-stage feature extraction process using ResNet and Vision Transformer (ViT). The extracted features are further processed by NetVLAD and MLPs to create initial descriptors. The final global descriptor is the concatenation of features from all branches.
Figure 3: The structure of the Inter-Transformer Encoder
Figure 4: The overview of the student network
Figure 5: Comparison of different knowledge distillation (KD) strategies on triplet loss. $T(\cdot)$ represents the output of the teacher model, and $S(\cdot)$ represents the output of the student model. $a$, $p$, $n$ represent anchor, positive, and negative samples. Solid lines indicate distances to be reduced, while dashed lines indicate distances to be increased. The left figure shows the traditional KD method. The middle one is a fully connected knowledge distillation. The right figure illustrates our proposed strategy, which exploits constraints within each model and facilitates cross-model interactions, enhancing supervision for knowledge distillation in place recognition.
...and 2 more figures
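
The online stage described in the Figure 1 caption (generate a descriptor for the query image, then compare it against stored descriptors) can be sketched as a nearest-neighbor search. This is an illustrative sketch only: the L2 normalization, the Euclidean distance, the brute-force scan, and the function name `top_k_places` are assumptions, not details confirmed by the paper summary.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_places(query, database, k=1):
    """Return indices of the k database descriptors closest to the query.

    Brute-force Euclidean search over normalized descriptors; a real
    deployment would more likely use a vectorized or indexed search.
    """
    q = l2_normalize(query)
    scored = []
    for idx, desc in enumerate(database):
        d = l2_normalize(desc)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, d)))
        scored.append((dist, idx))
    scored.sort()  # ascending distance, ties broken by index
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    # Hypothetical 2-D descriptors standing in for stored places.
    database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k_places([0.9, 0.1], database, k=2))
```

Recall@K style evaluation would then check whether any of the returned indices corresponds to a true match for the query location.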
