BoQ: A Place is Worth a Bag of Learnable Queries

Amar Ali-Bey, Brahim Chaib-draa, Philippe Giguère

TL;DR

Bag-of-Queries (BoQ) learns a set of global queries designed to capture universal place-specific attributes. It surpasses two-stage retrieval methods such as Patch-NetVLAD, TransVPR and R2Former, while being orders of magnitude faster and more efficient.

Abstract

In visual place recognition, accurately identifying and matching images of locations under varying environmental conditions and viewpoints remains a significant challenge. In this paper, we introduce a new technique, called Bag-of-Queries (BoQ), which learns a set of global queries designed to capture universal place-specific attributes. Unlike existing methods that employ self-attention and generate the queries directly from the input features, BoQ employs distinct learnable global queries, which probe the input features via cross-attention, ensuring consistent information aggregation. In addition, our technique provides an interpretable attention mechanism and integrates with both CNN and Vision Transformer backbones. The performance of BoQ is demonstrated through extensive experiments on 14 large-scale benchmarks. It consistently outperforms current state-of-the-art techniques including NetVLAD, MixVPR and EigenPlaces. Moreover, as a global retrieval technique (one-stage), BoQ surpasses two-stage retrieval methods, such as Patch-NetVLAD, TransVPR and R2Former, all while being orders of magnitude faster and more efficient. The code and model weights are publicly available at https://github.com/amaralibey/Bag-of-Queries.
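
The contrast drawn in the abstract can be written out in standard scaled dot-product attention notation. The projection matrices $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, the dimension $d$, and the single-head form are assumptions made here for illustration and may differ from the paper's own equations; $\mathbf{X}$ denotes the input features and $\mathbf{Q}$ the learned queries.

$$\text{self-attention:}\quad \mathbf{O} = \operatorname{softmax}\!\left(\frac{(\mathbf{X}\mathbf{W}_Q)(\mathbf{X}\mathbf{W}_K)^\top}{\sqrt{d}}\right)\mathbf{X}\mathbf{W}_V$$

$$\text{learned queries:}\quad \mathbf{O} = \operatorname{softmax}\!\left(\frac{\mathbf{Q}\,(\mathbf{X}\mathbf{W}_K)^\top}{\sqrt{d}}\right)\mathbf{X}\mathbf{W}_V, \qquad \mathbf{Q}\ \text{a trained parameter, independent of}\ \mathbf{X}$$

In the first form the queries change with every input image. In the second, the same $\mathbf{Q}$ probes every image, which is the sense in which the aggregation is consistent, and the output has one row per learned query regardless of how many local features the image yields.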

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Recall@1 performance comparison between our proposed technique, Bag-of-Queries (BoQ), and current state-of-the-art methods: Conv-AP [ali2022gsv], CosPlace [berton2022rethinking], MixVPR [ali2023mixvpr] and EigenPlaces [berton2023eigenplaces]. ResNet-50 is used as the backbone for all techniques. BoQ consistently achieves better performance under various environmental conditions such as viewpoint changes (Pitts-250k [torii2013visual], MapillarySLS [warburg2020mapillary]), seasonal changes (Nordland [zaffar2021vpr]), historical locations (AmsterTime [yildiz2022amstertime]) and extreme lighting and weather conditions (SVOX [Berton_2021_svox]).
  • Figure 2: Overall architecture of the Bag-of-Queries (BoQ) model. The input image is first processed by a backbone network to extract its local features, which are then sequentially refined in a cascade of Encoder units. Each BoQ block contains a set of learnable queries $\mathbf{Q}$ (Learned Bag of Queries), which undergo self-attention to integrate their shared information. The refined features $\mathbf{X}^i$ are then processed through cross-attention with $\mathbf{Q}$ for selective aggregation. Outputs from all BoQ blocks $(\mathbf{O}^1, \mathbf{O}^2, \dots, \mathbf{O}^L)$ are concatenated and linearly projected. The final global descriptor is L2-normalized to optimize it for subsequent similarity search.
  • Figure 3: Visualization of the cross-attention weights between the input images and the learned queries. The three examples are from the Nordland, Pitts30k and MSLS datasets, respectively. We selected four queries (among $64$) from the second BoQ block of a trained network. Vertically, we can see how the input image is aggregated by each query. The aggregation is done through the product of the attention weights with the input feature maps, resulting in one aggregated descriptor per query. Horizontally, we can see in each row how each query spans the input image. For example, the first query attends more to fine-grained details, while the second attends more to large areas of the input images.
  • Figure 4: Detailed architecture of our model using ResNet-50 backbone and two BoQ blocks.
  • Figure 5: Weather and occlusions. The first row displays four images of the same location captured at different times, illustrating changes in the environment. Each of the following four rows shows the cross-attention scores between one selected learned query and the feature maps of the respective input image. In these heatmaps, regions with higher attention scores are indicated in warmer colors (red/yellow), signifying areas where the query focuses more intensely.
  • ...and 2 more figures
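
The data flow described in the Figure 2 and Figure 3 captions can be sketched in a few lines of NumPy. This is a minimal, forward-only illustration and not the authors' implementation: the names `BoQBlockSketch`, `encoder_standin` and `boq_descriptor` are hypothetical, attention is single-head with no learned key/value projections, the Encoder unit is replaced by one residual self-attention step, the residual on the query self-attention is an assumption, and all weights are random rather than trained.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    d = q.shape[-1]
    w = softmax(q @ k.T / np.sqrt(d), axis=-1)  # shape (num_q, num_k), rows sum to 1
    return w @ v, w

def encoder_standin(x):
    # Assumption: a stand-in for an Encoder unit (residual self-attention only).
    return x + attention(x, x, x)[0]

class BoQBlockSketch:
    """Hypothetical block holding a bag of queries (random here, trained in practice)."""
    def __init__(self, num_queries, dim, rng):
        self.queries = rng.standard_normal((num_queries, dim)) * 0.02

    def __call__(self, x):
        # Queries first attend to each other to share information (residual assumed).
        q = self.queries + attention(self.queries, self.queries, self.queries)[0]
        # Cross-attention: queries probe the local features x of shape (N, dim).
        out, weights = attention(q, x, x)
        return out, weights  # out: one aggregated descriptor per query

def boq_descriptor(x, blocks, proj):
    outs = []
    for block in blocks:
        x = encoder_standin(x)         # refine features before each block
        o, _ = block(x)
        outs.append(o.reshape(-1))     # flatten (num_queries, dim)
    desc = np.concatenate(outs) @ proj # concatenate all block outputs, then project
    return desc / np.linalg.norm(desc) # L2-normalize the global descriptor

rng = np.random.default_rng(0)
x = rng.standard_normal((400, 32))     # e.g. a 20x20 feature map with 32 channels
blocks = [BoQBlockSketch(8, 32, rng) for _ in range(2)]
proj = rng.standard_normal((2 * 8 * 32, 128)) * 0.02
desc = boq_descriptor(x, blocks, proj)
print(desc.shape)
```

Each row of `weights` is a distribution over the 400 spatial positions, so reshaping a row to 20x20 would give a heatmap of the kind shown in Figures 3 and 5. Because the descriptors are L2-normalized, the dot product of two descriptors equals their cosine similarity, which is the usual basis for nearest-neighbor retrieval.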