Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition

Yu Kiu Lau, Chao Chen, Ge Jin, Chen Feng

TL;DR

Adapt-STformer addresses the practical need for flexible, efficient Seq-VPR by introducing the Recurrent Deformable Transformer Encoder (Recurrent-DTE), which fuses spatio-temporal information across frames in temporal order within a single module. The framework uses a CCT384 backbone for efficient feature extraction and aggregates per-frame refinements with SeqGeM and SeqVLAD to form a discriminative sequential descriptor. Empirical results on Nordland, Oxford, and NuScenes show notable gains in recall and substantial improvements in inference speed and memory efficiency compared to strong baselines, demonstrating real-time viability under varying sequence lengths. This work provides a practical foundation for scalable, flexible Seq-VPR that can adapt to real-world constraints without sacrificing performance.

Abstract

Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively; however, existing approaches prioritize performance at the expense of flexibility and efficiency. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. To our knowledge, no existing transformer-based Seq-VPR method achieves both flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable seq-lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% compared to the second-best baseline.

Paper Structure

This paper contains 9 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison of Seq-VPR architectures. Top: Non-transformer methods feed backbone features directly to an aggregator. Middle: STformer handles spatial and temporal modeling with separate encoders before aggregation. Bottom: The proposed Adapt-STformer unifies spatio-temporal modeling in the Recurrent-DTE module before aggregation.
  • Figure 2: Proposed architecture of Adapt-STformer. A frame sequence $S$ enters the Encoder Stage, where the CCT384 backbone tokenizes each input frame, producing $F=\{f_t\}_{t=1}^{L}$. The Recurrent-DTE Stage performs iterative recurrent spatio-temporal refinement on $F$: at iteration $t$, the previous DTE output provides the queries $Q_t=\hat{f}_{t-1}$ and the current frame supplies the keys and values $K_t=V_t=f_t$, with $Q_{t=1}=f_{t=1}+\Delta$ at the first iteration; this recurrence yields $\{\hat{f}_t\}_{t=1}^{L}$. The Aggregation Stage stacks $\{\hat{f}_t\}$ into a tensor $\hat{F}$, applies SeqGeM to obtain $\tilde{F}$, and processes it with SeqVLAD into the final sequential descriptor.
  • Figure 3: Qualitative comparison of attention maps between our method and STformer on NuScenes. STformer fails in these cases, whereas our method succeeds in VPR matching. In the attention maps, blue denotes low-focus regions and a yellow-to-red gradient denotes high-focus regions for the respective models.
  • Figure 4: Feature refinement across Adapt-STformer's modules. Subfigures (b)--(d) are snapshots of the model's attention at successive inference stages, indicated by pixelated areas.
  • Figure 5: VPR performance under time constraints. We analyze how changing the inference-time constraint affects the performance of Adapt-STformer and other SOTA VPR methods by streaming queries at rates from 20 FPS to 60 FPS.
  • ...and 1 more figure
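
The recurrence in the Figure 2 caption ($Q_1=f_1+\Delta$, then $Q_t=\hat{f}_{t-1}$ with $K_t=V_t=f_t$) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: ordinary scaled dot-product attention with a residual connection stands in for the deformable attention of the DTE, $\Delta$ is passed in as a fixed offset with the same shape as $f_1$ (the caption does not define it further), and the names `cross_attention` and `recurrent_refine` are hypothetical.

```python
import math
from typing import List

Token = List[float]   # one token embedding of dimension D
Frame = List[Token]   # tokens of one frame, shape (N, D)


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Frame, keys: Frame, values: Frame) -> Frame:
    """Scaled dot-product attention plus a residual connection.

    ASSUMPTION: a simple stand-in for the deformable attention used by
    the paper's DTE; the residual connection is also an assumption.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
        out.append([a + b for a, b in zip(q, attended)])
    return out


def recurrent_refine(frames: List[Frame], delta: Frame) -> List[Frame]:
    """Apply Q_1 = f_1 + delta, Q_t = f_hat_{t-1}, K_t = V_t = f_t.

    Returns the refined features f_hat_1 .. f_hat_L, one per input frame.
    ASSUMPTION: delta has the same shape as f_1.
    """
    # First-iteration queries: the first frame's tokens shifted by delta.
    queries = [[a + b for a, b in zip(tok, off)] for tok, off in zip(frames[0], delta)]
    refined = []
    for f_t in frames:
        f_hat = cross_attention(queries, f_t, f_t)  # keys = values = current frame
        refined.append(f_hat)
        queries = f_hat                             # next queries = previous output
    return refined


def make_sequence(length: int, n_tokens: int = 4, dim: int = 3) -> List[Frame]:
    """Build a toy frame sequence of the given length (illustrative data only)."""
    return [
        [[0.1 * (t + 1) * (n + 1) + 0.01 * c for c in range(dim)] for n in range(n_tokens)]
        for t in range(length)
    ]


zero_delta = [[0.0] * 3 for _ in range(4)]
for seq_len in (3, 5):
    outputs = recurrent_refine(make_sequence(seq_len), zero_delta)
    print(seq_len, len(outputs))
```

Because the loop carries only the previous output forward, the same function handles a sequence of 3 frames and a sequence of 5 frames without any change, which is the flexibility to seq-length that the abstract describes. In this sketch, only one frame's keys and values are attended to at each step.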