EDTformer: An Efficient Decoder Transformer for Visual Place Recognition

Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu

TL;DR

This work tackles visual place recognition by rethinking global feature aggregation with a decoder-based approach. EDTformer uses a cascade of simplified transformer decoder blocks and a learnable query set to extract context-rich, discriminative global descriptors from backbone features, while LoPA refines a frozen vision foundation model (DINOv2) in a memory- and parameter-efficient way. Empirical results across diverse benchmarks show state-of-the-art recall at top ranks with reduced descriptor size and lower training memory, demonstrating robustness to viewpoint, illumination, and modality changes. The combination of a lightweight decoder aggregator and efficient backbone adaptation offers a practical, scalable solution for single-stage VPR in resource-constrained settings. This approach has implications for real-world localization systems where fast, accurate place recognition is essential and training resources are limited.

Abstract

Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches aggregate the deep features extracted from a backbone using currently prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), giving little attention to the transformer decoder. However, we argue that the decoder's strong capability to capture contextual dependencies and generate accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly produce robust and discriminative global representations. Specifically, we formulate the deep features as the keys and values, and a set of learnable parameters as the queries. EDTformer can fully utilize the contextual information within the deep features, then gradually decode and aggregate the effective features into the learnable queries to output the global representations. Moreover, to provide more powerful deep features for EDTformer and further improve robustness, we use the foundation model DINOv2 as the backbone and propose a Low-rank Parallel Adaptation (LoPA) method to enhance its performance in VPR, which progressively refines the intermediate features of the backbone in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods, which add a costly re-ranking step. Code will be available at https://github.com/Tong-Jin01/EDTformer.
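
To make the aggregation idea concrete, the following is a minimal NumPy sketch of decoder-style aggregation as the abstract and the Figure 2 caption describe it: backbone features act as keys and values, a set of learnable queries is refined by self-attention and cross-attention, and two linear layers followed by flattening and L2 normalization give the global descriptor. It is an illustration, not the authors' implementation. The single-head attention without projection weights, the residual connections, the absence of normalization layers, the random stand-in weights, and all sizes and names (`aggregate`, `w_in`, `w_dim`, `w_num`) are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention; q is (M, d), k and v are (N, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def aggregate(features, params):
    """Map (N, c) backbone features to a 1-D, L2-normalized global descriptor."""
    kv = features @ params["w_in"]        # linear transform; reused as keys and values
    q = params["queries"]                 # (M, d) learnable queries
    for _ in range(params["num_blocks"]):
        q = q + attention(q, q, q)        # self-attention among the queries (assumed residual)
        q = q + attention(q, kv, kv)      # cross-attention: queries read the image features
    q = q @ params["w_dim"]               # (M, d) -> (M, d_out): dimensionality reduction
    q = params["w_num"] @ q               # (M, d_out) -> (M_out, d_out): fewer queries
    desc = q.reshape(-1)                  # flatten
    return desc / np.linalg.norm(desc)    # L2-normalize

# Toy sizes chosen only for illustration (hypothetical, not from the paper).
rng = np.random.default_rng(0)
N, c, d, M, d_out, M_out = 12, 16, 8, 6, 4, 3
params = {
    "w_in": rng.standard_normal((c, d)) / np.sqrt(c),
    "queries": rng.standard_normal((M, d)),
    "w_dim": rng.standard_normal((d, d_out)) / np.sqrt(d),
    "w_num": rng.standard_normal((M_out, M)) / np.sqrt(M),
    "num_blocks": 2,
}
features = rng.standard_normal((N, c))    # stand-in for backbone patch features
descriptor = aggregate(features, params)
```

Because the descriptors are L2-normalized, the dot product of two descriptors equals their cosine similarity, so retrieval from the database can be done with a plain nearest-neighbor search.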

Paper Structure

This paper contains 18 sections, 13 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Performance comparison (Recall@1) on multiple benchmark datasets between our method and current state-of-the-art VPR methods: MixVPR [mixvpr], EigenPlaces [eigenplaces], CricaVPR [cricavpr], SelaVPR [selavpr] and BoQ [BoQ]. $\ddag$ We reproduce the results of BoQ by strictly following its training pipeline, except that we use the same image sizes as our method for training ($224\times224$) and inference ($322\times322$). Our EDTformer consistently shows clear advantages over other methods in diverse VPR scenarios, including viewpoint variations and condition changes (MSLS [msls]), severe lighting changes (Tokyo24/7 [tokyo247]), various low-quality and high-scene-depth place images (SPED [sped]), place image modality changes (AmsterTime [amstertime]) and varying weather conditions (SVOX [svox]).
  • Figure 2: Our pipeline to produce the robust and discriminative global representation for single-stage VPR. First, the frozen backbone with Low-rank Parallel Adaptation (i.e., DINOv2 with LoPA) is employed to extract powerful deep features of the input image. Next, the features undergo a linear transformation and are fed into each cross-attention layer as the keys and values. Additionally, we initialize a set of learnable queries as the input queries for the first self-attention layer. After passing through $L$ of our simplified decoder blocks, we obtain the learned queries, which have aggregated the crucial contextual features for the VPR task. Then we use two fully connected layers: one for dimensionality reduction and the other for further information aggregation by adjusting the number of queries. Finally, the output features are flattened and L2-normalized as the global representation of the place image.
  • Figure 3: Illustration of our Low-rank Parallel Adaptation method. (a) is a standard transformer encoder block in ViT. (b) is the popular PETL method based on adapters, which usually inserts the trainable adapters into the encoder blocks of the frozen backbone. (c) is our proposed LoPA method. The intermediate features from each encoder block of the frozen DINOv2 are sequentially fed to the corresponding adaptation function together with the output from the previous adaptation function.
  • Figure 4: The R@1 and inference time comparison of different single-stage methods on Pitts30k. Inference time is measured on the same NVIDIA GeForce RTX 4090 GPU for all methods.
  • Figure 5: Qualitative results. In these challenging scenarios, our method successfully retrieves the correct images, while other methods commonly return incorrect places. For the first two examples, although some other methods obtain images geographically close to the query image, they exceed the set threshold (25 m). For the third and fourth examples, despite image modality changes between the query and database images, our method can still retrieve the correct places by capturing the invariant and discriminative buildings. For the fifth and sixth examples, the query images are captured in natural scenes, suffering from severe condition variations and lacking discriminative landmarks. Nevertheless, our method can still match the correct place. In the seventh and eighth examples, other methods commonly return an incorrect result due to the severe lighting changes. However, our method produces robust and discriminative global descriptors that handle this problem effectively. For the last two examples, all methods fail when facing extremely difficult scenarios, in which viewpoint changes, domain variations, occlusions, dynamic objects and perceptual aliasing arise simultaneously.
  • ...and 1 more figures
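
The parallel adaptation described in the Figure 3 caption can be sketched as follows. This is one possible reading, not the authors' code: the frozen blocks run unmodified, and a separate low-rank side path receives each block's intermediate features together with the previous adaptation output. The additive way the two inputs are combined, the ReLU bottleneck, the final sum of backbone and side outputs, the toy `frozen_block`, and all sizes are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_blocks, tokens, dim, rank = 4, 10, 32, 4   # toy sizes (hypothetical)

# Stand-ins for the frozen encoder blocks: fixed weights that are never updated.
frozen = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(num_blocks)]
# Trainable low-rank factors, one (down, up) pair per encoder block.
down = [rng.standard_normal((dim, rank)) * 0.1 for _ in range(num_blocks)]
up = [rng.standard_normal((rank, dim)) * 0.1 for _ in range(num_blocks)]

def frozen_block(x, w):
    """Toy replacement for a ViT encoder block (assumption, not the real block)."""
    return x + np.tanh(x @ w)

def lopa_forward(x, frozen, down, up):
    h = x                        # frozen backbone path
    a = np.zeros_like(x)         # parallel adaptation path
    for w, d, u in zip(frozen, down, up):
        h = frozen_block(h, w)   # the block itself contains no adapter
        # Adaptation function i sees block i's features plus the previous
        # adaptation output, squeezed through a rank-`rank` bottleneck.
        a = a + np.maximum((h + a) @ d, 0.0) @ u
    return h + a                 # refined features (assumed combination rule)

x = rng.standard_normal((tokens, dim))
refined = lopa_forward(x, frozen, down, up)

trainable = sum(d.size + u.size for d, u in zip(down, up))
frozen_params = sum(w.size for w in frozen)
```

In this toy setup the side path holds `2 * dim * rank` trainable values per block against `dim * dim` frozen ones, which illustrates the parameter saving of a low-rank design. Under this reading the side path never feeds back into the frozen blocks, so a training framework would not need to backpropagate through the backbone.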