VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition

Ahmad Khaliq, Ming Xu, Stephen Hausler, Michael Milford, Sourav Garg

TL;DR

VLAD-BuFF tackles Visual Place Recognition by addressing two core issues of VLAD-based aggregation: visual burstiness and the computational burden of high-dimensional local features. It introduces a burstiness-aware weighting mechanism based on intra-image self-similarity (soft count) and a PCA-initialized pre-pool projection that reduces descriptor dimensionality before aggregation, enabling faster retrieval with negligible loss in recall. Across nine public benchmarks, VLAD-BuFF achieves state-of-the-art Recall@1 and Recall@5, and its performance remains robust even when local feature dimensions are reduced by up to 12×, highlighting significant efficiency gains for real-time applications. Qualitative analyses reveal that the learned weighting downweights repetitive patterns such as shadows while upweighting distinctive elements, supporting the method’s potential applicability to broader VLAD-style and multi-scale aggregations in VPR.

Abstract

Visual Place Recognition (VPR) is a crucial component of many visual localization pipelines for embodied agents. VPR is often formulated as an image retrieval task aimed at jointly learning local features and an aggregation method. The current state-of-the-art VPR methods rely on VLAD aggregation, which can be trained to learn a weighted contribution of features through their soft assignment to cluster centers. However, this process has two key limitations. Firstly, the feature-to-cluster weighting does not account for over-represented repetitive structures within a cluster, e.g., shadows or window panes; this phenomenon is also referred to as the 'burstiness' problem, classically solved by discounting repetitive features before aggregation. Secondly, feature-to-cluster comparisons are compute-intensive for state-of-the-art image encoders with high-dimensional local features. This paper addresses these limitations by introducing VLAD-BuFF with two novel contributions: i) a self-similarity based feature discounting mechanism to learn Burst-aware features within end-to-end VPR training, and ii) Fast Feature aggregation by reducing local feature dimensions specifically through PCA-initialized learnable pre-projection. We benchmark our method on 9 public datasets, where VLAD-BuFF sets a new state of the art. Our method is able to maintain its high recall even for 12x reduced local feature dimensions, thus enabling fast feature aggregation without compromising on recall. Through additional qualitative studies, we show how our proposed weighting method effectively downweights the non-distinctive features. Source code: https://github.com/Ahmedest61/VLAD-BuFF/.
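
The burst-aware weighting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `burst_weights` and `vlad`, the sigmoid-based soft count, and the parameters `scale`, `bias`, `power` and `alpha` are assumptions chosen for illustration. In the paper such parameters would be learned end-to-end; here they are fixed constants. The idea shown is only that a feature with many near-duplicates in the same image gets a large soft count and therefore a small weight in the aggregation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def burst_weights(feats, scale=10.0, bias=-8.0, power=0.5):
    """Discount features that have many near-duplicates in the same image.

    feats: (N, D) L2-normalized local features.
    scale, bias, power: hypothetical constants (illustrative, not from the paper).
    Returns (N,) positive weights; larger soft count -> smaller weight.
    """
    sim = feats @ feats.T                                  # cosine similarity, (N, N)
    near = 1.0 / (1.0 + np.exp(-(scale * sim + bias)))     # soft "is a neighbour" indicator
    soft_count = near.sum(axis=1)                          # includes the feature itself
    return soft_count ** (-power)

def vlad(feats, centers, weights=None, alpha=10.0):
    """Soft-assignment VLAD with optional per-feature weights.

    feats: (N, D), centers: (K, D). Returns a unit-norm (K * D,) descriptor.
    """
    assign = softmax(alpha * feats @ centers.T, axis=1)    # (N, K) soft assignment
    if weights is not None:
        assign = assign * weights[:, None]                 # apply burst discount
    resid = feats[:, None, :] - centers[None, :, :]        # (N, K, D) residuals
    v = (assign[:, :, None] * resid).sum(axis=0)           # (K, D) weighted sum
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-12)  # per-cluster normalization
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                 # global L2 normalization

# Toy image: five identical "shadow" features (a burst) and one distinct feature.
feats = np.array([[1.0, 0.0, 0.0]] * 5 + [[0.0, 1.0, 0.0]])
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
w = burst_weights(feats)
descriptor = vlad(feats, centers, weights=w)
```

In this toy input the five repeated features share one weight that is lower than the weight of the single distinct feature, which mirrors the behaviour the abstract describes for shadows and window panes.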

Paper Structure (41 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: An illustration of our proposed method. Left: within a typical VLAD aggregation pipeline, we propose a pre-pool PCA projection layer (dotted box) to reduce the local feature dimensions (from $D$ to $D'$), enabling a compute-efficient aggregation. For a given cluster ($C_k$) on a unit hypersphere, residual vectors (gray arrows) are shown between centroids (tiny circles) and local features (colored squares). Right: for a set of local features assigned to cluster $C_0$, we show their weighting pattern through histograms (color corresponds to local features) as well as zoomed image patches with overlaid masks (yellow represents a high weight value). For vanilla NetVLAD's weighting based solely on soft assignment, several features on shadows (blue in histogram) cause a burst and are weighted highly, as opposed to our proposed method which balances out the weights on the burst but has a high weight value for features on the signboard (green in histogram). This variation in the weighting pattern changes the relative contribution of local features to the aggregated vector. This can be observed from the change in the direction of the aggregated vector shown as a black arrow in $C_0$ at the far right; our method moves the aggregated vector closer to the highly-weighted signboard feature.
  • Figure 3: Qualitative Analysis: Columns represent a query image, an incorrect match (negative) selected by vanilla NetVLAD and a correct match (positive) selected by VLAD-BuFF. Mask colors represent weight values with lowest starting in blue and increasing through green to yellow. The image patches (bottom) are scaled versions of the local features marked with red boxes (top). Our proposed burstiness weighting selects specific features while avoiding repetitive patterns, as opposed to vanilla soft-assignment weighting.
  • Figure 4: In this example from the Nordland dataset, both the vanilla soft assignment (2nd row) and our proposed method (3rd row) select image regions lying just beneath the tree canopies (first column). In the positive image (third column), both methods select some features on the railway tracks (highlighted in yellow). However, our method downweights these features relative to those below the tree canopies (more yellow than tracks), thus improving the query-positive matching.
  • Figure 5: In this example from the St Lucia dataset, vanilla soft assignment results in several highly-weighted features, including those on trees' shadows (query and negative) and vehicles (positive). However, our proposed weighting (3rd row) downweights all the features found on shadows and vehicles, and instead selects a signboard with a consistent weighting pattern between the query-positive pair.
  • Figure 6: In this example from the Pitts30k dataset, we highlight the improved behavior of local features learnt through VLAD-BuFF with burstiness weighting. To this end, we show the soft assignment weighting using both the vanilla NetVLAD model and the proposed VLAD-BuFF model. The weighting patterns for the negative and the positive remain similar across the rows. However, in the query image, it can be observed that VLAD-BuFF's soft assignment selects the overlapping region between the tree and the building at the bottom center of the image, whereas NetVLAD's soft assignment selects the trees at the bottom right. The former is more consistent with the feature selection in the positive, thus improving the query-positive matching. Note that the burstiness weighting (4th row) had only a slight impact on top of the soft assignment weighting (3rd row), thus the variation in the weighting pattern between NetVLAD and VLAD-BuFF is attributed more to feature-to-centroid distances than feature-to-feature distances in this case.
  • ...and 1 more figures
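
The pre-pool PCA projection from $D$ to $D'$ shown in Figure 1 can be sketched as a linear layer whose weights are initialized from principal components. The snippet below is a rough illustration under stated assumptions, not the authors' code: `pca_init` and `project` are hypothetical helper names, the sizes `D = 768` and `D_out = 64` are illustrative values chosen to give a 12x reduction, and random vectors stand in for real local features. In an end-to-end setup, `mean` and `W` would presumably become learnable parameters after this initialization.

```python
import numpy as np

def pca_init(samples, out_dim):
    """Compute a PCA-based initialization for a linear projection.

    samples: (M, D) local features gathered from training images.
    Returns (mean, W) with mean of shape (D,) and W of shape (D, out_dim).
    """
    mean = samples.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:out_dim].T

def project(feats, mean, W):
    """Project (N, D) features to (N, out_dim) and re-normalize each row."""
    z = (feats - mean) @ W
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)

rng = np.random.default_rng(0)
D, D_out = 768, 64                      # hypothetical sizes: 768 / 64 = 12x reduction
samples = rng.standard_normal((1000, D))  # stand-in for real local features
mean, W = pca_init(samples, D_out)

feats = rng.standard_normal((100, D))
low = project(feats, mean, W)           # (100, 64), ready for VLAD aggregation

# Feature-to-cluster comparisons scale with N * K * D, so the saving per image
# equals the reduction factor (K is a hypothetical number of clusters).
N, K = feats.shape[0], 64
speedup = (N * K * D) / (N * K * D_out)
```

Because the aggregated VLAD vector has $K \times D'$ entries, the same reduction factor also applies to the size of the final descriptor before any further post-processing.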