Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

Feng Lu, Tong Jin, Canming Ye, Yunpeng Liu, Xiangyuan Lan, Chun Yuan

TL;DR

This work tackles robust global image representation for visual place recognition (VPR) by replacing the traditional backbone-plus-aggregator paradigm with implicit aggregation inside the transformer backbone. It introduces ImAge, which prepends $M$ aggregation (agg) tokens to the patch tokens within a ViT, so that all tokens interact via self-attention and the agg tokens output by the last block form an $M \times D$ global descriptor. Key innovations include a token insertion strategy that places the agg tokens at the junction of the frozen and trainable blocks, and an initialization method using $k$-means centers with $L_2$ normalization. Empirical results across multiple VPR datasets show that ImAge surpasses explicit aggregators in accuracy and efficiency, ranks 1st on the MSLS challenge leaderboard, and generalizes well across diverse environments.

Abstract

Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm achieved overwhelming dominance in the CNN era and remains widely used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era; that is, we can obtain robust global descriptors with the backbone alone. Specifically, we introduce a set of learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens are jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information from the patch tokens into the aggregation tokens. Finally, we take only these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert the additional tokens, as well as how to initialize them, remain open issues worthy of further exploration. To this end, we also propose the optimal token insertion strategy and token initialization method derived from empirical studies. Experimental results show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard. The code is available at https://github.com/lu-feng/image.
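The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `ToyBlock` is a hypothetical single-head transformer block with random weights standing in for a real ViT block, and the names `implicit_aggregate` and `insert_at` are assumptions introduced here. The sketch only shows the data flow: agg tokens are prepended before a chosen block, all tokens pass through the remaining blocks together, and the first $M$ output tokens are concatenated into a descriptor of length $M \cdot D$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class ToyBlock:
    """Hypothetical single-head pre-norm block with random weights (illustration only)."""
    def __init__(self, dim, rng):
        s = dim ** -0.5
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(0.0, s, (dim, dim)) for _ in range(4)
        )
        self.w1 = rng.normal(0.0, s, (dim, 4 * dim))
        self.w2 = rng.normal(0.0, (4 * dim) ** -0.5, (4 * dim, dim))

    def __call__(self, x):
        # Self-attention over ALL tokens (agg + patch) in the sequence.
        h = layer_norm(x)
        q, k, v = h @ self.wq, h @ self.wk, h @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        x = x + (attn @ v) @ self.wo
        # Two-layer MLP with ReLU and a residual connection.
        return x + np.maximum(layer_norm(x) @ self.w1, 0.0) @ self.w2

def implicit_aggregate(patch_tokens, blocks, agg_tokens, insert_at):
    """Prepend agg tokens before block `insert_at`; return the concatenated agg outputs."""
    if not 0 <= insert_at < len(blocks):
        raise ValueError("insert_at must index an existing block")
    m = agg_tokens.shape[0]
    x = patch_tokens
    for i, block in enumerate(blocks):
        if i == insert_at:
            x = np.concatenate([agg_tokens, x], axis=0)  # agg tokens come first
        x = block(x)
    return x[:m].reshape(-1)  # M tokens of width D -> vector of length M * D

rng = np.random.default_rng(0)
N, D, M = 16, 32, 8                      # toy sizes, not the paper's settings
blocks = [ToyBlock(D, rng) for _ in range(4)]
patches = rng.normal(size=(N, D))
agg = rng.normal(size=(M, D))            # learnable parameters in a real model
descriptor = implicit_aggregate(patches, blocks, agg, insert_at=2)
print(descriptor.shape)                  # (256,)
```

Because the agg tokens share the self-attention layers with the patch tokens, no separate aggregation module appears anywhere in the sketch; the only additional parameters are the $M \times D$ token values themselves.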

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison of three explicit aggregation methods and our ImAge. All methods use DINOv2-base-register as the backbone and are trained on the GSV-Cities dataset. ImAge achieves the best Recall@1 with the smallest descriptor dimension and the lowest inference time. Meanwhile, there is no extra explicit aggregator in our ImAge model.
  • Figure 2: Illustration of the previous paradigm and our ImAge paradigm. (a) The backbone-plus-aggregator paradigm with a traditional aggregator. (b) The backbone-plus-aggregator paradigm with a query-based aggregator that introduces a set of queries to learn global information from the patch tokens. (c) Our ImAge only prepends a set of aggregation tokens to the patch tokens before a specific block in the transformer backbone, making them interact globally via self-attention to achieve implicit aggregation. Notably, these aggregation tokens are simply initialized by the $k$-means algorithm.
  • Figure 3: Illustration of 4 insertion strategies for agg tokens. (a) Agg tokens are added before all transformer blocks. (b) Agg tokens are added at the junction between frozen and trainable blocks (our strategy). (c) Agg tokens are added at a deeper tunable block. (d) Agg tokens are added incrementally across multiple blocks rather than all at once.
  • Figure 4: Qualitative results. In these four challenging scenarios (involving dynamic objects, severe viewpoint variations, condition changes, etc.), our proposed ImAge method consistently retrieves the correct results from the database, while other methods all return the wrong images.
  • Figure 5: Visualization of the attention weights of our agg tokens to patch tokens. The first column (a) shows the input images. Columns 2-5 (b) each display the attention weights of a single agg token to all patch tokens (reshaped to restore spatial position), so each image shows the attention of only one agg token. The last column (c) shows the merged attention of all 8 agg tokens. The first five examples (i.e., the first five rows) show five different places, with buildings, vegetation, and dynamic interference. While different agg tokens attend to distinct regions (or objects) in the images, they consistently focus on stable and discriminative areas (e.g., buildings and vegetation) and largely ignore variable elements (e.g., cars). The sixth and seventh examples show two images taken at the same place in different seasons. Our agg tokens consistently focus on buildings (and some discriminative regions where the terrain and railroad tracks change). The last two examples demonstrate that agg tokens consistently focus on buildings and landmarks even under severe lighting changes.
  • ...and 3 more figures
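
The Figure 2 caption notes that the aggregation tokens are initialized by the $k$-means algorithm, and the TL;DR adds $L_2$ normalization. A minimal sketch of one way this could look is given below. Several details are assumptions rather than statements from the source: that the clustered vectors are patch-token features collected at the insertion point, that the normalization is applied to the resulting centers, and the helper names `kmeans_centers` and `init_agg_tokens`. The clustering is plain Lloyd's algorithm written out in NumPy.

```python
import numpy as np

def kmeans_centers(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, D) array of cluster centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct feature vectors chosen at random.
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every center.
        d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):             # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def init_agg_tokens(features, m):
    """Assumed recipe: m k-means centers, each scaled to unit L2 norm."""
    centers = kmeans_centers(features, m)
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)

# Stand-in for patch-token features gathered from sample images (random here).
sample_features = np.random.default_rng(1).normal(size=(500, 32))
agg_init = init_agg_tokens(sample_features, m=8)
print(agg_init.shape)                    # (8, 32)
```

In a trainable model, the returned array would serve only as the starting value of the agg-token parameters, which are then updated together with the trainable blocks.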