QuARI: Query Adaptive Retrieval Improvement

Eric Xing, Abby Stylianou, Robert Pless, Nathan Jacobs

TL;DR

QuARI targets fine-grained instance retrieval in very large image collections with a query-adaptive retrieval mechanism. A transformer-based hypernetwork predicts, for each query, a linear projection $T$ and a transformed query $\mathbf{q}'$. Because $T$ is low-rank, it can be applied cheaply to gallery embeddings, giving a fast per-query reweighting of the embedding space. Trained with semi-positive sample mining and a symmetric contrastive loss, QuARI achieves substantial gains over static task adaptation and traditional re-ranking on the ILIAS and INQUIRE benchmarks while keeping query-time computation low. The results indicate that small, query-conditioned adaptations of global vision-language embeddings can markedly improve retrieval performance at scale, with efficiency suitable for real-time or large-scale deployment.
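The summary above names a symmetric contrastive loss but does not give its form. As a rough illustration only, the sketch below uses the common CLIP-style symmetric InfoNCE formulation, which may differ from what the paper actually uses. The function names, the `temperature` value, and the assumption that row `i` of `queries` matches row `i` of `positives` (both already projected into the adapted space) are all hypothetical.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(queries, positives, temperature=0.07):
    """CLIP-style symmetric InfoNCE (an assumed form, not taken from the paper).

    queries:   (B, d) array, one adapted query embedding per row.
    positives: (B, d) array, row i is the positive for query i.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature      # (B, B) cosine similarities
    diag = np.arange(q.shape[0])          # matched pairs sit on the diagonal
    # Query-to-positive direction: softmax over each row.
    loss_q2p = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Positive-to-query direction: softmax over each column.
    loss_p2q = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_q2p + loss_p2q)
```

Averaging the two directions makes the loss invariant to swapping the roles of queries and positives, which is what "symmetric" refers to in this sketch.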

Abstract

Massive-scale pretraining has made vision-language models increasingly popular for image-to-image and text-to-image retrieval across a broad collection of domains. However, these models do not perform well when used for challenging retrieval tasks, such as instance retrieval in very large-scale image collections. Recent work has shown that linear transformations of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. In this paper, we explore a more extreme version of this specialization by learning to map a given query to a query-specific feature space transformation. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings, making it effective for large-scale retrieval or re-ranking. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
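To make the cost argument concrete, here is a minimal NumPy sketch of re-ranking with a per-query low-rank projection. It assumes the projection is factored as $T = UV^\top$ with rank $r \ll d$ and that scoring uses cosine similarity against the transformed query; the factorization, the scoring rule, and all function and parameter names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, guarding against zero norms."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def rerank_with_low_rank_projection(gallery, q_prime, U, V, top_k=5):
    """Score gallery embeddings under T = U @ V.T without forming T.

    gallery: (N, d) precomputed gallery embeddings, one per row.
    q_prime: (d,)   transformed query embedding.
    U, V:    (d, r) assumed low-rank factors of the per-query projection.
    Returns the top_k gallery indices and the full score vector.
    """
    # For a row-stacked gallery G, applying T to every row gives G @ V @ U.T.
    # Multiplying by V first costs O(N*d*r) instead of O(N*d*d).
    projected = (gallery @ V) @ U.T
    scores = l2_normalize(projected) @ l2_normalize(q_prime)
    top = np.argsort(-scores)[:top_k]
    return top, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, r, n = 64, 8, 10_000                 # hypothetical sizes
    gallery = rng.normal(size=(n, d))
    U, V = rng.normal(size=(d, r)), rng.normal(size=(d, r))
    q_prime = rng.normal(size=d)
    top, scores = rerank_with_low_rank_projection(gallery, q_prime, U, V)
    print(top, scores[top])
```

In this sketch the gallery embeddings are computed once and reused for every query; only the two small matrix products depend on the query, which is why a linear per-query adaptation stays cheap even for millions of images.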

Paper Structure

This paper contains 35 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We propose a new query-specific approach to retrieval, QuARI. QuARI dynamically adapts embeddings per-query to significantly improve retrieval performance compared to non-specific retrieval with general-purpose embedding features like CLIP, and domain-specific retrieval with transformations learned for a specific domain, with little computational overhead. Figure 2 shows the details of the Query Adaptation Network.
  • Figure 2: An overview of our query adaptation network. A zero initialization of the transformation matrix is tokenized by columns and passed to a transformer backbone with a conditioning token to obtain refined columns. This process is repeated $L$ times, refining the transformation.
  • Figure 3: t-SNE visualizations comparing original features and QuARI features.
  • Figure 4: Comparison of re-ranking performance and inference cost for image-to-image retrieval on the ILIAS dataset (left) and text-to-image retrieval on the INQUIRE dataset (right).