AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval

Pavel Suma, Giorgos Kordopatis-Zilos, Ahmet Iscen, Giorgos Tolias

TL;DR

AMES addresses memory-efficient re-ranking for instance-level image retrieval with a transformer-based architecture that computes image-to-image similarity from two local descriptor sets using an asymmetric, memory-conscious design. A projection f, which combines binarization with a re-mapping to the real coordinate space, forms the input tokens. Together with a learnable matching token, these tokens pass through alternating self- and cross-attention that captures intra- and inter-image interactions, and a sigmoid classifier produces the final similarity. The model supports full-precision and binarized variants, employs a global-local ensemble during re-ranking, and uses distillation from a richer teacher to a lighter student to preserve performance under tight memory budgets. Empirical results on GLDv2 and ROxford/RParis demonstrate superior memory-performance trade-offs: a binary, distillation-enabled AMES achieves competitive accuracy with orders-of-magnitude memory reductions, and a universal model remains robust to varying test-time descriptor counts.
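The projection f described above can be illustrated with a minimal sketch. It assumes sign thresholding for the binarization step and a re-mapping to values in {-1, +1}; neither choice is confirmed by this summary, and the function names `binarize`, `remap` and `project` are hypothetical. The point is only to show why the database side needs to store nothing but the packed bits.

```python
from typing import List


def binarize(descriptor: List[float]) -> int:
    """Offline step (assumed sign thresholding): pack one bit per dimension."""
    bits = 0
    for i, value in enumerate(descriptor):
        if value > 0:
            bits |= 1 << i  # bit i is set when dimension i is positive
    return bits


def remap(bits: int, dim: int) -> List[float]:
    """Online step (assumed mapping): unpack bits into a {-1.0, +1.0} vector."""
    return [1.0 if (bits >> i) & 1 else -1.0 for i in range(dim)]


def project(descriptor: List[float]) -> List[float]:
    """Hypothetical f = remap(binarize(.)); only the int from binarize is stored."""
    return remap(binarize(descriptor), len(descriptor))


stored = binarize([0.3, -1.2, 0.0, 2.5])
print(stored)            # 9 (bits 0 and 3 are set)
print(remap(stored, 4))  # [1.0, -1.0, -1.0, 1.0]
```

In this sketch a d-dimensional descriptor costs d bits in storage, and the real-valued token is reconstructed only when the similarity is computed.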

Abstract

This work investigates the problem of instance-level image retrieval re-ranking with the constraint of memory efficiency, ultimately aiming to limit memory usage to 1KB per image. Departing from the prevalent focus on performance enhancements, this work prioritizes the crucial trade-off between performance and memory requirements. The proposed model uses a transformer-based architecture designed to estimate image-to-image similarity by capturing interactions within and across images based on their local descriptors. A distinctive property of the model is the capability for asymmetric similarity estimation. Database images are represented with a smaller number of descriptors compared to query images, enabling performance improvements without increasing memory consumption. To ensure adaptability across different applications, a universal model is introduced that adjusts to a varying number of local descriptors during the testing phase. Results on standard benchmarks demonstrate the superiority of our approach over both hand-crafted and learned models. In particular, compared with current state-of-the-art methods that overlook their memory footprint, our approach not only attains superior performance but does so with a significantly reduced memory footprint. The code and pretrained models are publicly available at: https://github.com/pavelsuma/ames
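To make the 1KB-per-image target concrete, the following sketch works through the storage arithmetic. The descriptor dimensionality (128), the 32-bit float format for the full-precision case, and the reading of 1KB as 1024 bytes are illustrative assumptions that are not stated in this summary; the helper names are hypothetical.

```python
def bytes_per_image(num_descriptors: int, dim: int, bits_per_value: int) -> int:
    """Storage for one image's local descriptors, rounded up to whole bytes."""
    total_bits = num_descriptors * dim * bits_per_value
    return (total_bits + 7) // 8


def max_descriptors(budget_bytes: int, dim: int, bits_per_value: int) -> int:
    """Largest number of descriptors that fits in the given byte budget."""
    return (budget_bytes * 8) // (dim * bits_per_value)


DIM = 128      # assumed descriptor dimensionality, for illustration only
BUDGET = 1024  # assumed interpretation of "1KB"

print(bytes_per_image(100, DIM, 32))    # 51200 bytes: 100 float32 descriptors
print(bytes_per_image(100, DIM, 1))     # 1600 bytes: 100 binary descriptors
print(bytes_per_image(30, DIM, 1))      # 480 bytes: 30 binary descriptors
print(max_descriptors(BUDGET, DIM, 1))  # 64 binary descriptors fit in the budget
```

Under these assumptions, binarization alone cuts storage by a factor of 32, and keeping fewer descriptors on the database side than on the query side reduces it further without touching the query representation.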

Paper Structure

This paper contains 14 sections, 6 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: AMES estimates the image-to-image similarity based on local descriptor sets. Top: 100 (query) vs. 100 (database) descriptors. Bottom: memory-efficient and asymmetric variant with 100 vs. 30 local descriptors. Circle size reflects descriptor importance within AMES; descriptors of the common object get higher importance.
  • Figure 2: Overview of the AMES model, which estimates image-to-image similarity given two local descriptor sets, one of which is smaller. Descriptors are processed by a projection ($f$) composed of binarization ($b$) and re-mapping to the real coordinate space ($r$). During testing, binarization is performed offline and re-mapping online; therefore, only the binary vectors need to be stored for the database images. The descriptors, together with a learnable matching token, form the input token set for a transformer-based architecture, which, together with a binary classifier, estimates the similarity. Intra-image (self) and inter-image (cross) attention are performed sequentially by standard self-attention blocks with appropriate masking.
  • Figure 3: Performance vs. memory trade-off on $\mathcal{R}$OP+1M (top) and GLDv2 (bottom). All methods use global descriptors with PQ8 for initial ranking and ensemble similarity to re-rank $m=1600$ images. We vary the number of local descriptors $L_x^{\text{\tiny test}}$ for database images; the values are shown as text labels for one representative variant. Binary and full-precision local descriptors are denoted by bin and fp, respectively. All methods are trained and tested by us, within the same implementation framework.
  • Figure 3: Impact of asymmetry on performance. Experiment with varying number of descriptors $L_q^{\text{\tiny test}}$ (query) and $L_x^{\text{\tiny test}}$ (database) during testing. Global similarity performance is shown for reference.
  • Figure 4: Universal vs. specific AMES models. Single specific: one model trained with a fixed number of local descriptors $L_x^{\text{\tiny train}}$ (value in brackets) and tested with varying $L_x^{\text{\tiny test}}$. Many specific: six models in total, trained and tested per number of local descriptors ($L_x^{\text{\tiny test}}=L_x^{\text{\tiny train}}$). All models are without distillation.
  • ...and 11 more figures
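
The Figure 2 caption mentions that self- and cross-attention are both realized by standard self-attention blocks with masking. A minimal sketch of such masks is given below. The token layout (matching token first, then query descriptors, then database descriptors) and the treatment of the matching token (it attends to, and is visible to, every token in both mask types) are assumptions made for illustration, and `build_masks` is a hypothetical helper, not the authors' code.

```python
from typing import List, Tuple

Mask = List[List[bool]]  # mask[i][j] is True when token i may attend to token j


def build_masks(num_query: int, num_db: int) -> Tuple[Mask, Mask]:
    """Return (self_mask, cross_mask) over [match token | query | database]."""
    # Group id per token: 0 = matching token, 1 = query image, 2 = database image.
    group = [0] + [1] * num_query + [2] * num_db
    n = len(group)
    self_mask = [[False] * n for _ in range(n)]
    cross_mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if group[i] == 0 or group[j] == 0:
                # Assumption: the matching token is unrestricted in both masks.
                self_mask[i][j] = True
                cross_mask[i][j] = True
            elif group[i] == group[j]:
                self_mask[i][j] = True   # intra-image: same descriptor set
            else:
                cross_mask[i][j] = True  # inter-image: the other descriptor set
    return self_mask, cross_mask


# Asymmetric setting from Figure 1: 100 query vs. 30 database descriptors.
self_mask, cross_mask = build_masks(num_query=100, num_db=30)
print(len(self_mask))                      # 131 tokens in total
print(self_mask[1][2], cross_mask[1][2])   # True False (both in the query set)
print(self_mask[1][101], cross_mask[1][101])  # False True (query vs. database)
```

Because the two masks share one token sequence, the same attention block type can alternate between intra-image and inter-image interaction simply by switching the mask, and the two descriptor sets do not need to have the same size.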