Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition

Yu Kiu Lau, Chao Chen, Ge Jin, Chen Feng

TL;DR

Adapt-STformer targets flexible, efficient Sequential Visual Place Recognition (Seq-VPR). Its core component, the Recurrent Deformable Transformer Encoder (Recurrent-DTE), fuses spatio-temporal information across frames in temporal order within a single module. The framework uses a CCT384 backbone for efficient feature extraction and aggregates the refined per-frame features with SeqGeM and SeqVLAD to form a discriminative sequential descriptor. On Nordland, Oxford, and NuScenes, it improves recall by up to 17% while cutting sequence extraction time by 36% and memory usage by 35% relative to the second-best baseline, which supports real-time use under varying sequence lengths. The result is a practical foundation for scalable Seq-VPR that adapts to real-world constraints without sacrificing performance.

Abstract

Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively; however, existing approaches prioritize performance at the expense of flexibility and efficiency. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. To our knowledge, no existing transformer-based Seq-VPR method achieves both flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable seq-lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% compared to the second-best baseline.
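To illustrate why a recurrent design handles variable seq-lengths with constant per-step state, here is a minimal sketch of the data flow only. It is not the authors' implementation: `fuse_step` is a hypothetical placeholder (a simple blend) standing in for the deformable-attention fusion of Recurrent-DTE, and `gem_pool` is ordinary generalized-mean pooling over the time axis, which we assume resembles SeqGeM-style aggregation. All function names and the parameters `alpha` and `p` are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def fuse_step(state: Vector, frame: Vector, alpha: float = 0.5) -> Vector:
    """Hypothetical stand-in for one recurrent fusion step.

    Blends the carried state with the current frame's features. The actual
    Recurrent-DTE uses deformable attention; this placeholder only mimics
    the frame-by-frame data flow.
    """
    return [alpha * s + (1.0 - alpha) * f for s, f in zip(state, frame)]


def recurrent_refine(frames: List[Vector], alpha: float = 0.5) -> List[Vector]:
    """Process frames in temporal order, carrying a single state forward."""
    if not frames:
        raise ValueError("sequence must contain at least one frame")
    state = list(frames[0])
    refined = [state]
    for frame in frames[1:]:
        state = fuse_step(state, frame, alpha)
        refined.append(state)
    return refined


def gem_pool(vectors: List[Vector], p: float = 3.0, eps: float = 1e-6) -> Vector:
    """Generalized-mean pooling across the time axis (assumed aggregation)."""
    dim = len(vectors[0])
    pooled = []
    for d in range(dim):
        # Clamp to eps so the fractional power stays well defined.
        mean_pow = sum(max(v[d], eps) ** p for v in vectors) / len(vectors)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def describe_sequence(frames: List[Vector], alpha: float = 0.5, p: float = 3.0) -> Vector:
    """Map a sequence of any length to one fixed-size descriptor."""
    return gem_pool(recurrent_refine(frames, alpha), p)


# The descriptor size depends only on the feature dimension, not on seq-length.
short = describe_sequence([[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]])
long = describe_sequence([[1.0, 2.0], [3.0, 1.0], [2.0, 2.0], [0.5, 4.0], [1.5, 1.5]])
print(len(short), len(long))
```

Because the loop carries one state vector from frame to frame, the same function accepts sequences of any length, and the working memory per step does not grow with seq-length in this sketch.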
