
ELViS: Efficient Visual Similarity from Local Descriptors that Generalizes Across Domains

Pavel Suma, Giorgos Kordopatis-Zilos, Yannis Kalantidis, Giorgos Tolias

Abstract

Large-scale instance-level training data is scarce, so models are typically trained on domain-specific datasets. Yet in real-world retrieval, they must handle diverse domains, making generalization to unseen data critical. We introduce ELViS, an image-to-image similarity model that generalizes effectively to unseen domains. Unlike conventional approaches, our model operates in similarity space rather than representation space, promoting cross-domain transfer. It leverages local descriptor correspondences, refines their similarities through an optimal transport step with data-dependent gains that suppress uninformative descriptors, and aggregates strong correspondences via a voting process into an image-level similarity. This design injects strong inductive biases, yielding a simple, efficient, and interpretable model. To assess generalization, we compile a benchmark of eight datasets spanning landmarks, artworks, products, and multi-domain collections, and evaluate ELViS as a re-ranking method. Our experiments show that ELViS outperforms competing methods by a large margin in out-of-domain scenarios and on average, while requiring only a fraction of their computational cost. Code available at: https://github.com/pavelsuma/ELViS/

Paper Structure

This paper contains 20 sections, 6 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Performance vs. time. Average performance across 8 datasets and multiple domains for fixed numbers of re-ranked images indicated with text labels. All models are trained on the landmarks domain (GLDv2). Runtime is estimated from model latencies reported in Table \ref{tab:flops}.
  • Figure 2: Detailed overview of ELViS. The similarity matrix is refined using optimal transport with descriptor-dependent dustbin gains. The strongest local similarities per descriptor are then selected and transformed element-wise by a learned function $f$, before being sum-aggregated into a scalar global similarity. During training, a modified BCE loss with a learnable function $g$ reshapes the penalty curve; $g$ is used only for training and is discarded at inference.
  • Figure 3: Shape of the learned univariate functions $f$ (left) and $g$ (right). Although parameterized as MLPs, both functions learn well-behaved scalar transformations that effectively separate matching and non-matching distributions. The distributions of input values are visualized separately for positive and negative image pairs, sampled during training.
  • Figure 4: Visualization of the 25 strongest correspondences (votes) between descriptors $s_i, s_j$ before (left) and after (right) refinement with optimal transport. Red (yellow) represents high (low) similarity. The left panel uses raw similarity values in $\mathbf{S}$; the right panel uses values in $\mathbf{S}^\prime$ after passing them through $f$. Heatmaps show the dustbin values obtained by evaluating $h$ densely for all patches in both images; bright values indicate a large dustbin gain and uninformative descriptors.
  • Figure 5: Hybrid architecture combining descriptor-based and similarity-based processing. In the hybrid model, AMES performs intra-image and inter-image descriptor processing with $5$ transformer blocks, and the refined output tokens are subsequently fed into ELViS.
  • ...and 7 more figures
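The pipeline described in the abstract and Figure 2 (refine the local-descriptor similarity matrix with optimal transport and dustbin gains, select the strongest correspondences per descriptor, transform them element-wise, and sum into a global similarity) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the Sinkhorn variant, the temperature, `top_k`, and the use of `tanh` as a stand-in for the learned function $f$ (and zero gains as a stand-in for the learned $h$) are all assumptions.

```python
import numpy as np

def sinkhorn_with_dustbin(S, gains, iters=10, temp=0.05):
    """Refine a similarity matrix via Sinkhorn normalization,
    augmented with a dustbin row/column scored by per-descriptor gains."""
    n, m = S.shape
    aug = np.zeros((n + 1, m + 1))
    aug[:n, :m] = S
    aug[:n, m] = gains[0]   # dustbin column for image-1 descriptors
    aug[n, :m] = gains[1]   # dustbin row for image-2 descriptors
    K = np.exp(aug / temp)
    for _ in range(iters):
        K /= K.sum(axis=1, keepdims=True)  # row normalization
        K /= K.sum(axis=0, keepdims=True)  # column normalization
    return K[:n, :m]        # drop the dustbin before voting

def elvis_similarity(d1, d2, f=np.tanh, top_k=25):
    """Image-level similarity from two sets of L2-normalized local descriptors."""
    S = d1 @ d2.T                                   # raw descriptor similarities
    gains = (np.zeros(len(d1)), np.zeros(len(d2)))  # stand-in for learned h(.)
    S_ref = sinkhorn_with_dustbin(S, gains)
    # voting: keep the strongest correspondence per descriptor, both directions
    votes = np.concatenate([S_ref.max(axis=1), S_ref.max(axis=0)])
    votes = np.sort(votes)[-top_k:]
    return float(f(votes).sum())                    # element-wise f, then sum
```

Because the model operates purely on similarity values rather than on the descriptors themselves, the same learned functions can be applied to descriptors from any domain, which is the intuition behind the cross-domain transfer claim.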