TeTRA-VPR: A Ternary Transformer Approach for Compact Visual Place Recognition

Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan

TL;DR

TeTRA tackles the challenge of running high-accuracy visual place recognition on memory- and compute-constrained robotics platforms. It introduces a two-stage training pipeline that quantizes a Vision Transformer backbone to ternary weights and binarizes the final embeddings, aided by progressive distillation and multi-level supervision from a strong teacher. The approach delivers up to 69% memory reduction and 35% latency reduction with equal or improved recall@1 on standard VPR benchmarks, outperforming efficient CNN-based baselines while remaining competitive with full-precision transformers. The work demonstrates a practical route to deploy robust VPR in real-world, power-constrained robotics and points to future extensions in sequential and two-stage retrieval systems.

Abstract

Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
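The two compression steps named in the abstract, ternary weights and binarized embeddings, can be illustrated with a minimal NumPy sketch. The abstract does not specify the quantizer, so the absmean scaling in `ternarize`, the sign threshold in `binarize`, and all function names here are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one float scale.

    Assumption: absmean scaling; the source does not state the rule used.
    """
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def binarize(emb):
    """Sign-binarize embeddings (rows) and pack 8 bits into each byte."""
    bits = (emb > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_search(query_codes, db_codes):
    """Return the nearest database index per query and the full distance matrix."""
    xor = np.bitwise_xor(query_codes[:, None, :], db_codes[None, :, :])
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1)  # count differing bits
    return dist.argmin(axis=1), dist

# Hypothetical data: 5 reference descriptors of 64 dimensions each.
rng = np.random.default_rng(0)
db = rng.standard_normal((5, 64))
db_codes = binarize(db)               # shape (5, 8): 8 bytes per descriptor
query_codes = binarize(db[[3, 1]])    # queries identical to references 3 and 1
nearest, dist = hamming_search(query_codes, db_codes)
```

Under these assumptions, a 1024-dimensional descriptor occupies 128 bytes as packed bits, compared with 4096 bytes in float32, and matching reduces to XOR and bit counting rather than floating-point dot products.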

Paper Structure

This paper contains 20 sections, 16 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: TeTRA block diagram illustrating the ternary and binary training pipeline. The pre-training stage employs distillation-based, progressive quantization-aware training with attention, classification, and patch token losses. During fine-tuning, all backbone layers except the last are frozen while supervised contrastive learning is performed using a multi-similarity loss.
  • Figure 2: Radar plot of normalized metrics comparing memory efficiency (Mem, the inverse of memory usage), matching speed (Lat), and R@1 accuracy across multiple datasets. Higher values indicate better performance for each metric. The results show that DINOv2-based models consume more resources, whereas TeTRA achieves higher R@1 accuracy, especially on appearance-change datasets, while using less memory than CosPlace.
  • Figure 3: Line plot demonstrating the trade-off between recall@1 accuracy and resource consumption (latency and memory usage) on the Tokyo247 dataset. Each line represents a single model assessed across different descriptor sizes, with selected points annotated to indicate the descriptor dimension (e.g., 1024D for 1024 dimensions).
  • Figure 4: Trade-offs in the VPR pipeline. This line plot illustrates Recall@1 accuracy, memory consumption, and latency, capturing the total computational overhead of feature extraction and matching. The figure highlights the balance between achieving high recognition performance and managing resources.
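
The Figure 1 caption names a multi-similarity loss for the fine-tuning stage. As a rough illustration of what that family of losses computes, the sketch below implements the basic multi-similarity formulation over a batch of embeddings with place labels. The hyperparameter values (`alpha`, `beta`, `lam`), the omission of hard-pair mining, and the function name are assumptions; the caption gives no such details.

```python
import numpy as np

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Basic multi-similarity loss without pair mining.

    Assumptions: alpha, beta, and lam are illustrative defaults,
    not values taken from the paper.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                # cosine similarities
    same = labels[:, None] == labels[None, :]
    eye = np.eye(len(labels), dtype=bool)
    pos_mask = same & ~eye                           # same place, not self
    neg_mask = ~same                                 # different place
    # Penalize positives whose similarity falls below lam ...
    pos_term = np.log1p(np.sum(np.exp(-alpha * (sim - lam)) * pos_mask, axis=1)) / alpha
    # ... and negatives whose similarity rises above lam.
    neg_term = np.log1p(np.sum(np.exp(beta * (sim - lam)) * neg_mask, axis=1)) / beta
    return float(np.mean(pos_term + neg_term))

# Hypothetical batch: two places, two views each.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
loss_matched = multi_similarity_loss(emb, np.array([0, 0, 1, 1]))
loss_mismatched = multi_similarity_loss(emb, np.array([0, 1, 0, 1]))
```

With labels that agree with the geometry of the embeddings the loss is small, and it grows when views of the same place are far apart or views of different places are close, which is the behavior a contrastive fine-tuning stage relies on.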