TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models

Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, Tat-Seng Chua

TL;DR

The paper presents TIGeR-ONE, a training-free, autoregressive framework that unifies text-to-image generation and retrieval within a single Large Multimodal Model (LMM) and adds an autonomous decision mechanism that selects between the generated and retrieved images. It builds on the intrinsic cross-modal discriminative abilities of LMMs, introducing three proxies for semantic similarity and performing generative retrieval through forward beam search with reverse re-ranking. TIGeR-Bench, a dedicated benchmark, covers eight domains spanning creative and knowledge-intensive content. Experiments on TIGeR-Bench, Flickr30K, and MS-COCO show stronger unified performance and robust retrieval compared with state-of-the-art baselines. The approach is training-free and model-agnostic, so it can be applied to diverse LMMs to deliver either newly generated or retrieved images.

Abstract

How humans can acquire images effectively and efficiently is a perennial question. A classic solution is text-to-image retrieval from an existing database; however, a limited database typically lacks creativity. By contrast, recent breakthroughs in text-to-image generation make it possible to produce attractive and counterfactual visual content, but generation struggles to synthesize knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval, proposing a unified framework for both tasks with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient, training-free generative retrieval method for text-to-image retrieval. Subsequently, we unify generation and retrieval autoregressively and propose an autonomous decision mechanism that chooses the better match between the generated and retrieved images as the response to the text prompt. To standardize the evaluation of unified text-to-image generation and retrieval, we construct TIGeR-Bench, a benchmark spanning both creative and knowledge-intensive domains. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, Flickr30K and MS-COCO, demonstrate the superiority of the proposed framework.
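
The decision step described above can be pictured as a comparison of two scores. The sketch below is a minimal illustration, not the authors' implementation: `Candidate`, `choose_response`, and the word-overlap scorer `toy_similarity` are hypothetical names, and the toy scorer merely stands in for the similarity proxies that the paper derives from the LMM itself.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Candidate:
    """One image candidate; every field is a hypothetical stand-in."""
    image_id: str
    source: str            # "generated" or "retrieved"
    tags: FrozenSet[str]   # toy proxy for what the image depicts


Scorer = Callable[[str, Candidate], float]


def toy_similarity(prompt: str, candidate: Candidate) -> float:
    """Fraction of prompt words found in the candidate's tags.

    Assumption: a real system would use an LMM-derived score instead.
    """
    words = set(prompt.lower().split())
    if not words:
        return 0.0
    return len(words & candidate.tags) / len(words)


def choose_response(prompt: str, generated: Candidate, retrieved: Candidate,
                    similarity: Scorer = toy_similarity) -> Candidate:
    """Return the higher-scoring candidate; a tie keeps the generated image."""
    return max((generated, retrieved), key=lambda c: similarity(prompt, c))


prompt = "a photo of the eiffel tower at night"
generated = Candidate("gen-0", "generated", frozenset({"tower", "night"}))
retrieved = Candidate("db-17", "retrieved",
                      frozenset({"photo", "eiffel", "tower", "night"}))
best = choose_response(prompt, generated, retrieved)
print(best.source, best.image_id)
```

In this toy case the knowledge-intensive prompt is matched more closely by the database image, so the retrieved candidate is returned. Swapping in a different `similarity` callable changes the decision rule without touching the selection logic.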

Paper Structure

This paper contains 32 sections, 4 equations, 13 figures, 20 tables.

Figures (13)

  • Figure 1: TIGeR-ONE unifies T2I-G and T2I-R through a single LMM in a training-free, autoregressive way, with a decision mechanism that adaptively selects between generated and retrieved images based on the user prompt. We also construct TIGeR-Bench, which encompasses eight creative and knowledge-intensive domains, to facilitate a comprehensive evaluation of TIGeR.
  • Figure 2: Overview of the TIGeR-ONE framework for unifying text-to-image generation and retrieval. Images from the database are first tokenized into discrete codes, and a lookup table maintains the correspondence between discrete codes and images. The given prompt $X$ is fed into an LMM, and Forward Beam Search is performed to retrieve and generate images in parallel. The prompt and the obtained images are then fed into the same LMM for Reverse Re-Ranking and Decision-making.
  • Figure 3: The influence of the debiasing factor $\eta$ in Eqn. \ref{eqn:x_to_y_debias} on the forward ranking performance of SEED-LLaMA and LaVIT on the MS-COCO dataset. The best performance is achieved around $\eta = 1$.
  • Figure 4: Retrieval performance on MS-COCO with different beam sizes and re-ranking strategies. Light and dark dashed lines denote the forward and reverse ranking performance, respectively. S: SEED-LLaMA. L: LaVIT.
  • Figure 5: Comparison of retrieval efficiency quantified by the number of processed prompts per second among CLIP (ViT-B/32), GILL, and the proposed generative retrieval method based on SEED-LLaMA.
  • ...and 8 more figures
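
The retrieval branch outlined in the Figure 2 caption (discrete image codes, a code-to-image lookup table, forward beam search, then reverse re-ranking) can be sketched as a beam search constrained to code sequences that exist in the database. This is a hedged illustration only: the function names, the trie layout, and the assumption that every image has the same number of codes are mine, and `toy_log_prob` and `reverse_scores` are made-up stand-ins for scores an LMM would provide. The parallel generation branch is not modeled.

```python
import math
from typing import Callable, Dict, List, Tuple

Code = Tuple[int, ...]
LogProb = Callable[[Code, int], float]  # (prefix, next token) -> log-probability


def build_trie(codes: List[Code]) -> dict:
    """Nest dictionaries so each path from the root spells one database code."""
    trie: dict = {}
    for seq in codes:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie


def constrained_beam_search(trie: dict, log_prob: LogProb, length: int,
                            beam_size: int) -> List[Tuple[Code, float]]:
    """Keep the best `beam_size` prefixes, expanding only tokens present in the trie."""
    beams: List[Tuple[Code, float]] = [((), 0.0)]
    for _ in range(length):
        expanded = []
        for prefix, score in beams:
            node = trie
            for tok in prefix:          # walk down to the node for this prefix
                node = node[tok]
            for tok in node:            # only continuations that exist in the database
                expanded.append((prefix + (tok,), score + log_prob(prefix, tok)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_size]
    return beams


def retrieve(lookup: Dict[Code, str], log_prob: LogProb, beam_size: int) -> List[str]:
    """Forward pass: beam-search the codes, then map each code back to its image."""
    length = len(next(iter(lookup)))    # assumption: all codes share one length
    beams = constrained_beam_search(build_trie(list(lookup)), log_prob,
                                    length, beam_size)
    return [lookup[code] for code, _ in beams]


def rerank(image_ids: List[str], reverse_score: Callable[[str], float]) -> List[str]:
    """Reverse pass: reorder the shortlist by a second, image-to-text score."""
    return sorted(image_ids, key=reverse_score, reverse=True)


# Toy data: three database images and a scorer that favors the code (1, 2, 4).
lookup = {(1, 2, 3): "cat.jpg", (1, 2, 4): "dog.jpg", (5, 6, 7): "car.jpg"}
target = (1, 2, 4)


def toy_log_prob(prefix: Code, tok: int) -> float:
    return math.log(0.7 if tok == target[len(prefix)] else 0.1)


forward = retrieve(lookup, toy_log_prob, beam_size=2)
reverse_scores = {"cat.jpg": -1.0, "dog.jpg": -2.0, "car.jpg": -3.0}
final = rerank(forward, reverse_scores.__getitem__)
print(forward, final)
```

Because the search never leaves the trie, every finished beam is guaranteed to correspond to a stored image, which is what lets the lookup table turn a generated code sequence into a retrieval result. The toy reverse scores are chosen to flip the forward order, showing that re-ranking only reorders the shortlist and cannot recover an image the beam search already dropped.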