Exploiting Distribution Constraints for Scalable and Efficient Image Retrieval

Mohammad Omama, Po-han Li, Sandeep P. Chinchali

TL;DR

This paper tackles the scalability and efficiency gap in image retrieval by leveraging off-the-shelf foundation models and introducing two unsupervised techniques: AE-SVC, which imposes orthogonality, mean-centering, and unit-variance constraints on a projection autoencoder to minimize the variance of cosine similarities, and (SS)$_2$D, which distills a teacher cosine-space into multiple smaller, adaptive embeddings via KL divergence. Theoretical analysis shows that minimizing the cosine-variance enhances discriminative power, and empirical results across four datasets and several foundation models show AE-SVC achieves up to 16% retrieval gains, with (SS)$_2$D delivering up to 10% improvements at smaller embedding sizes and approaching an upper bound set by per-dimension distillation. The approach improves retrieval speed and storage efficiency by enabling smaller embeddings without retraining dataset-specific models, making foundation-model-based image retrieval more scalable for robotics and vision applications. Overall, the combination of variance-aware representation learning and one-shot adaptive embedding distillation offers a practical path to scalable, high-performance image retrieval in diverse domains.
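
The three AE-SVC constraints named above (mean-centering, unit variance, orthogonality) can be written as penalties on a batch of latent vectors. The following is a minimal sketch, not the authors' implementation: the squared-error form of each penalty, the function names `svc_penalties` and `whiten`, and the toy data are all assumptions made for illustration. PCA whitening stands in for the trained autoencoder here only because it is a closed-form linear map that satisfies all three constraints, which makes the penalties easy to sanity-check.

```python
import numpy as np

def svc_penalties(z):
    """Return (mean, variance, covariance) penalties for latents z of shape (n, d).

    Assumed penalty forms (not taken from the paper):
      mean penalty       = squared norm of the per-dimension mean
      variance penalty   = squared deviation of each variance from 1
      covariance penalty = squared off-diagonal covariance entries
    """
    mu = z.mean(axis=0)
    zc = z - mu
    cov = zc.T @ zc / (z.shape[0] - 1)
    var = np.diag(cov)
    off_diag = cov - np.diag(var)
    mean_pen = float(np.sum(mu ** 2))
    var_pen = float(np.sum((var - 1.0) ** 2))
    cov_pen = float(np.sum(off_diag ** 2))
    return mean_pen, var_pen, cov_pen

def whiten(x):
    """PCA whitening: a linear map that meets all three constraints exactly."""
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the principal axes, then rescale each axis to unit variance.
    return xc @ eigvecs / np.sqrt(eigvals)

rng = np.random.default_rng(0)
# Toy embeddings with a shifted mean, unequal variances, and correlated dimensions.
x = rng.normal(size=(500, 4)) @ np.diag([5.0, 2.0, 1.0, 0.5]) + 3.0
x[:, 1] += 0.8 * x[:, 0]

raw = svc_penalties(x)            # all three penalties are large
white = svc_penalties(whiten(x))  # all three penalties are numerically zero
print("raw:     ", raw)
print("whitened:", white)
```

In a training setup, penalties of this kind would typically be added to a reconstruction loss and minimized by gradient descent; the weighting between terms is left open in this sketch.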

Abstract

Image retrieval is crucial in robotics and computer vision, with downstream applications in robot place recognition and vision-based product recommendations. Modern retrieval systems face two key challenges: scalability and efficiency. State-of-the-art image retrieval systems train specific neural networks for each dataset, an approach that lacks scalability. Furthermore, since retrieval speed is directly proportional to embedding size, existing systems that use large embeddings lack efficiency. To tackle scalability, recent works propose using off-the-shelf foundation models. However, these models, though applicable across datasets, fall short in achieving performance comparable to that of dataset-specific models. Our key observation is that, while foundation models capture necessary subtleties for effective retrieval, the underlying distribution of their embedding space can negatively impact cosine similarity searches. We introduce Autoencoders with Strong Variance Constraints (AE-SVC), which, when used for projection, significantly improves the performance of foundation models. We provide an in-depth theoretical analysis of AE-SVC. Addressing efficiency, we introduce Single-shot Similarity Space Distillation ((SS)$_2$D), a novel approach to learn embeddings with adaptive sizes that offers a better trade-off between size and performance. We conducted extensive experiments on four retrieval datasets, including Stanford Online Products (SoP) and Pittsburgh30k, using four different off-the-shelf foundation models, including DinoV2 and CLIP. AE-SVC demonstrates up to a $16\%$ improvement in retrieval performance, while (SS)$_2$D shows a further $10\%$ improvement for smaller embedding sizes.
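
The distillation idea behind (SS)$_2$D can be illustrated with a small loss function. This is a rough sketch under stated assumptions rather than the paper's exact objective: it assumes that the smaller embeddings are prefixes of one student embedding, that cosine similarities are turned into distributions with a row-wise softmax, and that a temperature of 0.1 is used. The names `ss2d_style_loss`, `row_softmax`, and `kl_rows` are hypothetical.

```python
import numpy as np

def cosine_sim(a):
    """Pairwise cosine similarity matrix for the rows of a."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    return a @ a.T

def row_softmax(s, temperature):
    """Turn each row of a similarity matrix into a probability distribution."""
    s = s / temperature
    s = s - s.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

def kl_rows(p, q, eps=1e-12):
    """Mean over rows of KL(p || q)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def ss2d_style_loss(teacher, student, sizes, temperature=0.1):
    """Sum of KL terms, one per prefix size (prefix structure is an assumption)."""
    p = row_softmax(cosine_sim(teacher), temperature)
    total = 0.0
    for m in sizes:
        # The first m student dimensions must reproduce the teacher's similarity space.
        q = row_softmax(cosine_sim(student[:, :m]), temperature)
        total += kl_rows(p, q)
    return total

rng = np.random.default_rng(0)
teacher = rng.normal(size=(32, 16))
student = teacher.copy()  # untrained stand-in: full size matches, prefixes do not

loss_full = ss2d_style_loss(teacher, student, sizes=[16])
loss_nested = ss2d_style_loss(teacher, student, sizes=[4, 8, 16])
print(loss_full, loss_nested)
```

With the student set equal to the teacher, the full-size term vanishes while the 4- and 8-dimensional prefixes still incur a positive loss. Training the student to reduce that remaining loss is what would make every prefix usable as an embedding on its own.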

Paper Structure

This paper contains 21 sections, 17 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Two-step pipeline for the proposed approach. (A) AE-SVC (see the AE-SVC method section) trains an autoencoder with our constraints to improve foundation model embeddings. (B) (SS)$_2$D (see the (SS)$_2$D section) uses the improved embeddings from AE-SVC to learn adaptive embeddings for improved retrieval at any embedding size. (C) Once trained, (SS)$_2$D can be directly applied to foundation model embeddings to generate adaptive embeddings for improved retrieval. (D) AE-SVC (orange) boosts performance significantly, while (SS)$_2$D (green) further enhances results with smaller embeddings. Dino (blue) achieves optimal performance at 9 GFLOPs, whereas (SS)$_2$D on top of AE-SVC achieves similar performance at only 2.5 GFLOPs.
  • Figure 2: In standard dimensionality reduction (e.g., PCA), cosine similarity is disproportionately influenced by high-variance dimensions, leading to poor retrieval. Given a task of matching the query to the correct clothing type, a query image of a white man in a tank top may be incorrectly matched to a white man in a half-shirt because the person dimension dominates. Ideally, both orthogonal dimensions should have an equal influence on cosine similarity.
  • Figure 3: AE-SVC reduces the variance of cosine similarity distributions in both foundation (a) and dataset-specific models (b), with a more significant shift in foundation models (a). This results in greater improvement in retrieval performance for the foundation model (Dino) compared to the dataset-specific model (Cosplace), as shown in (c).
  • Figure 4: AE-SVC significantly improves the retrieval performance of foundation models. AE-SVC (solid lines) consistently outperforms the off-the-shelf foundation models, i.e., PCA (dashed lines), on four datasets, achieving a 15.5% average improvement in retrieval performance.
  • Figure 5: Applying (SS)$_2$D over AE-SVC leads to a further performance boost at lower embedding sizes. Compared to VAE and SSD, (SS)$_2$D offers superior single-shot dimensionality reduction, achieving up to a 10% improvement at smaller embedding sizes and closely approaching SSD's theoretical upper bound.
  • ...and 4 more figures
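
The effect described in the Figure 2 caption can be reproduced with a two-dimensional toy example. This is an illustrative sketch only: the "person" and "clothing" axes, the per-dimension scales of 10 and 1, and the gallery entries are invented values, not data from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query, gallery):
    """Return the gallery key with the highest cosine similarity to the query."""
    scores = {name: cosine(query, vec) for name, vec in gallery.items()}
    return max(scores, key=scores.get)

# Axis 0 = "person" feature, axis 1 = "clothing" feature (hypothetical axes).
# Values are given in unit-variance coordinates.
query_unit = np.array([1.0, 1.0])
gallery_unit = {
    "same_person_wrong_clothing": np.array([1.0, -1.0]),
    "other_person_right_clothing": np.array([0.2, 1.0]),
}

# Assumed raw embedding space: the person axis has 10x the spread of the clothing axis.
scale = np.array([10.0, 1.0])
query_raw = query_unit * scale
gallery_raw = {name: vec * scale for name, vec in gallery_unit.items()}

raw_match = best_match(query_raw, gallery_raw)          # high-variance axis dominates
equalized_match = best_match(query_unit, gallery_unit)  # both axes weigh equally
print("raw space:      ", raw_match)
print("equal variances:", equalized_match)
```

In the raw space the query is matched to the same person in the wrong clothing, because the high-variance person axis dominates the dot product. Once both axes have equal variance, the match switches to the entry with the correct clothing.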