QuARI: Query Adaptive Retrieval Improvement
Eric Xing, Abby Stylianou, Robert Pless, Nathan Jacobs
TL;DR
QuARI targets fine-grained instance retrieval in very large image collections with a query-adaptive retrieval mechanism. For each query, a transformer-based hypernetwork predicts a low-rank linear projection $T$ and a transformed query $\mathbf{q}'$; $T$ is then applied to the gallery embeddings, giving a fast, per-query reweighting of the embedding space. Trained with semi-positive sample mining and a symmetric contrastive loss, QuARI achieves substantial gains over static task adaptation and traditional re-ranking on the ILIAS and INQUIRE benchmarks while keeping query-time computation low. The results indicate that small, query-conditioned adaptations of global vision-language embeddings can markedly improve retrieval performance at scale, at a cost compatible with real-time or large-scale deployment.
Abstract
Massive-scale pretraining has made vision-language models (VLMs) increasingly popular for image-to-image and text-to-image retrieval across a broad collection of domains. However, these models do not perform well when used for challenging retrieval tasks, such as instance retrieval in very large-scale image collections. Recent work has shown that linear transformations of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. In this paper, we explore a more extreme version of this specialization by learning to map a given query to a query-specific feature space transformation. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings, making it effective for large-scale retrieval or re-ranking. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
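To illustrate why a linear, low-rank, query-specific transformation is cheap to apply at gallery scale, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the transformation is factored as $T = U V^\top$ with rank $r \ll d$, and that scoring is a plain dot product between $T\mathbf{g}_i$ and $\mathbf{q}'$; both are assumptions made here for illustration. The names `score_gallery`, `U`, `V`, and `q_prime` are hypothetical, and random matrices stand in for the hypernetwork's output.

```python
import numpy as np


def score_gallery(gallery, U, V, q_prime):
    """Score every gallery embedding against a query under T = U @ V.T.

    Uses <T g, q'> = (V^T g) . (U^T q'), so the d x d matrix T is never
    formed. Cost is O(N * d * r) for N gallery items instead of O(N * d^2).
    """
    projected = gallery @ V      # (N, r): gallery mapped into the rank-r space
    query_r = U.T @ q_prime      # (r,):  query mapped into the same space
    return projected @ query_r   # (N,):  one score per gallery item


# Stand-ins for real data: random gallery embeddings and random factors
# (assumption: in practice U, V, and q_prime would be predicted per query).
rng = np.random.default_rng(0)
N, d, r = 1000, 64, 8
gallery = rng.standard_normal((N, d))
U = rng.standard_normal((d, r))
V = rng.standard_normal((d, r))
q_prime = rng.standard_normal(d)

scores = score_gallery(gallery, U, V, q_prime)

# Reference computation that materializes the full d x d transformation.
T = U @ V.T
reference = (gallery @ T.T) @ q_prime

# Indices of the five highest-scoring gallery items.
top_k = np.argsort(-scores)[:5]
```

Under these assumptions the factored form and the full-matrix form give the same scores up to floating-point error, so the per-query work grows with the rank $r$ rather than with $d^2$. If scoring used cosine similarity instead of a dot product, the transformed gallery vectors would also need to be normalized, which this sketch does not do.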
