TAT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition

Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan

TL;DR

The paper addresses the resource bottleneck of transformer-based visual place recognition (VPR) for real-time SLAM loop closure on low-power platforms. It introduces TAT-VPR, a ternary-quantized ViT backbone with an adaptive activation-sparsity gate, trained with a two-stage distillation pipeline from a full-precision BoQ teacher and then fine-tuned for retrieval. The sparsity gate lets inference cost be adjusted at run time, cutting tera-operations (TOPs) by up to 40% and memory by about 5× while keeping Recall@1 near the state of the art on standard VPR benchmarks. The results hold up under appearance changes, making the model a practical, adaptive VPR option for micro-UAV and embedded SLAM stacks.

Abstract

TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop closure. By fusing ternary weights with a learned activation-sparsity gate, the model can reduce computation by up to 40% at run time without degrading Recall@1. The proposed two-stage distillation pipeline preserves descriptor quality, letting the model run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.
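
To make "ternary weights" concrete, the sketch below maps a small weight vector to codes in {-1, 0, +1} with a single floating-point scale. It is illustrative only: the absmean scaling rule and the names `ternarize` and `dequantize` are assumptions, and the abstract does not specify the quantizer TAT-VPR actually uses.

```python
def ternarize(weights, eps=1e-8):
    """Map a flat list of floats to codes in {-1, 0, +1} plus one scale.

    Assumption: absmean scaling (scale = mean of |w|). The paper's own
    quantizer may differ; this only illustrates the idea.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]


weights = [0.9, -0.05, -1.2, 0.3, 0.0, -0.6]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes)
```

Because every code is -1, 0, or +1, a matrix product against such weights needs only additions, subtractions, and skips, followed by one multiplication by the scale.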

Paper Structure

This paper contains 8 sections, 3 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of the TAT-VPR pre-training pipeline. A full-precision DINOv2-BoQ teacher (purple, frozen) provides token-level supervision to a ternary student transformer (green). During training, the student applies a top-k sparse activation filter. A distillation loss is computed between teacher and student tokens to guide compression-aware representation learning.
  • Figure 2: (A) Recall@1 versus Tera-Operations (TOPs) for a feature-extraction forward pass, showing TAT-VPR curves at activation sparsity levels from 0% up to 60%. (B) Recall@1 versus memory footprint on the Pitts30k dataset, highlighting memory savings from ternary-weight backbones.
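
The two mechanisms named in the Figure 1 caption, a top-k activation filter and a token-level distillation loss, can be sketched as follows. This is a minimal illustration under stated assumptions: the filter keeps the largest-magnitude entries of each token vector, and the loss is a plain mean squared error. The caption does not give the exact gating rule or loss form, and the names `topk_filter` and `token_mse` are hypothetical.

```python
def topk_filter(activations, sparsity):
    """Zero all but the largest-magnitude entries of one token vector.

    `sparsity` is the fraction of entries dropped (0.0 keeps everything).
    Assumption: selection is by absolute value within each token.
    """
    n = len(activations)
    keep = n - int(sparsity * n)
    # Rank indices by magnitude, largest first, and keep the top ones.
    ranked = sorted(range(n), key=lambda i: abs(activations[i]), reverse=True)
    kept = set(ranked[:keep])
    return [a if i in kept else 0.0 for i, a in enumerate(activations)]


def token_mse(teacher_tokens, student_tokens):
    """Mean squared error over all entries of all tokens (assumed loss form)."""
    total, count = 0.0, 0
    for t_tok, s_tok in zip(teacher_tokens, student_tokens):
        for t, s in zip(t_tok, s_tok):
            total += (t - s) ** 2
            count += 1
    return total / count


token = [0.1, -2.0, 0.5, 3.0, -0.2]
sparse_token = topk_filter(token, sparsity=0.4)  # drops the 2 smallest of 5

teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 0.0]]
loss = token_mse(teacher, student)
print(sparse_token, loss)
```

Raising `sparsity` at inference time zeroes more activations, which is the knob that Figure 2(A) sweeps from 0% to 60%.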