
IRGen: Generative Modeling for Image Retrieval

Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Jingdong Wang, Baining Guo

TL;DR

IRGen introduces a unified, end-to-end generative framework for image retrieval by reframing retrieval as autoregressive generation of image identifiers. It uses a semantic image tokenizer that maps a global feature $f_{cls}$ to a short token sequence of length $M$ via residual quantization. A Vision Transformer–based encoder and a Transformer decoder predict the tokens $l_1, \dots, l_M$ from a query image, optimizing $p(l_1, \dots, l_M \mid x_1, \theta)$ in an autoregressive fashion. Experiments on three standard benchmarks and million-scale datasets demonstrate state-of-the-art accuracy and scalable throughput, with the potential to bypass reranking in certain deployment scenarios.
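
To make the tokenizer step concrete, the following is a minimal sketch of greedy residual quantization, assuming M pre-trained codebooks and squared Euclidean distance. The function names, the codebook layout, and the distance choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def residual_quantize(feature, codebooks):
    """Greedy residual quantization (illustrative sketch, not the authors' code).

    feature:   1-D array of shape (dim,), e.g. a global image feature.
    codebooks: list of M arrays, each of shape (num_codes, dim).
    Returns the M token ids and the residual left after the last level.
    """
    residual = np.asarray(feature, dtype=np.float64).copy()
    tokens = []
    for codebook in codebooks:
        # Pick the code closest to what is still unexplained.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        # The next level quantizes only the remaining error.
        residual = residual - codebook[idx]
    return tokens, residual


def reconstruct(tokens, codebooks):
    """Approximate the feature as the sum of the selected codes."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))
```

Each level encodes the error left by the previous ones, so earlier tokens carry coarser information than later ones; a coarse-to-fine identifier of this kind is a natural target for left-to-right decoding.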

Abstract

While generative modeling has become prevalent across numerous research fields, its integration into the realm of image retrieval remains largely unexplored and underjustified. In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling and employing a sequence-to-sequence model. This approach is harmoniously aligned with the current trend towards unification in research, presenting a cohesive framework that allows for end-to-end differentiable searching. This, in turn, facilitates superior performance via direct optimization techniques. The development of our model, dubbed IRGen, addresses the critical technical challenge of converting an image into a concise sequence of semantic units, which is pivotal for enabling efficient and effective search. Extensive experiments demonstrate that our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks as well as two million-scale datasets, yielding significant improvement compared to prior competitive retrieval methods. In addition, the notable surge in precision scores facilitated by generative modeling presents the potential to bypass the reranking phase, which is traditionally indispensable in practical retrieval workflows.

Paper Structure

This paper contains 15 sections, 3 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Left: the training pipeline of the encoder-decoder architecture for autoregressive retrieval. The training objective is to autoregressively predict the identifier of the query's nearest-neighbor image. Right: the beam search procedure with code length 2 and beam size K = 3.
  • Figure 2: Precision-Recall (TPR) curve comparison for different methods on the (a) In-shop Clothes, (b) CUB200, and (c) Cars196 datasets.
  • Figure 3: Comparison of MRR at 1, 2, 4, and 8 for different methods on the (a) In-shop Clothes, (b) CUB200, and (c) Cars196 datasets.
  • Figure 4: Precision comparison on large-scale datasets: ImageNet and Places365.
  • Figure 5: Illustrating the search speed using beam search.
  • ...and 4 more figures
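
The beam search described in the Figure 1 caption (code length 2, beam size K = 3) can be sketched as follows. This is a minimal illustration under stated assumptions: `step_log_probs` is a hypothetical stand-in for the decoder's next-token distribution, and restricting candidates to prefixes of known identifiers via `valid_ids` is an assumption added here so that every returned sequence names a database image.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Prefix = Tuple[int, ...]


def beam_search(
    step_log_probs: Callable[[Prefix], Dict[int, float]],
    code_length: int,
    beam_size: int,
    valid_ids: Optional[Iterable[Prefix]] = None,
) -> List[Tuple[Prefix, float]]:
    """Return up to beam_size identifiers ranked by total log-probability.

    step_log_probs(prefix) is a hypothetical callable returning
    {token: log p(token | query, prefix)} for the next position.
    """
    valid_prefixes = None
    if valid_ids is not None:
        # Every prefix of every known identifier is allowed during decoding.
        valid_prefixes = {
            ident[:i] for ident in valid_ids for i in range(1, code_length + 1)
        }

    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(code_length):
        candidates = []
        for prefix, score in beams:
            for token, log_p in step_log_probs(prefix).items():
                extended = prefix + (token,)
                if valid_prefixes is None or extended in valid_prefixes:
                    candidates.append((extended, score + log_p))
        # Keep only the beam_size highest-scoring partial identifiers.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

Because the returned list is already ordered by sequence log-probability, it can serve directly as a ranked retrieval result of size K.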