TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models
Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, Tat-Seng Chua
TL;DR
The paper presents TIGeR-ONE, a training-free, autoregressive framework that unifies text-to-image generation and retrieval within a single Large Multimodal Model (LMM) and adds an autonomous decision mechanism that selects between the generated and the retrieved image. The method builds on the intrinsic cross-modal discriminative abilities of LMMs: it introduces three proxies for semantic similarity and performs generative retrieval with forward beam search followed by reverse re-ranking. A dedicated benchmark, TIGeR-Bench, evaluates performance across eight domains spanning creative and knowledge-intensive content. Experiments on TIGeR-Bench, Flickr30K, and MS-COCO show stronger unified performance and robust retrieval compared with state-of-the-art baselines. Because the approach is training-free and model-agnostic, it can be applied to different LMMs to return either a newly generated image or a retrieved one, depending on which better fits the prompt.
Abstract
How humans can effectively and efficiently acquire images is a perennial question. A classic solution is text-to-image retrieval from an existing database; however, a finite database typically cannot supply creative content. By contrast, recent breakthroughs in text-to-image generation have made it possible to produce attractive and counterfactual visual content, but generation struggles to synthesize knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval, proposing a unified framework for both tasks with a single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method that performs text-to-image retrieval in a training-free manner. Subsequently, we unify generation and retrieval autoregressively and propose an autonomous decision mechanism that chooses the better match between the generated and retrieved images as the response to the text prompt. To standardize the evaluation of unified text-to-image generation and retrieval, we construct TIGeR-Bench, a benchmark spanning both creative and knowledge-intensive domains. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority of our proposed framework.
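The decision step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the `choose_response` function, and the `similarity` callable are hypothetical names introduced here, and the similarity function stands in for whichever LMM-derived proxy is used to score how well an image matches the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one candidate image."""
    image_id: str
    source: str  # "generated" or "retrieved"


def choose_response(
    prompt: str,
    generated: Candidate,
    retrieved: Sequence[Candidate],
    similarity: Callable[[str, Candidate], float],
) -> Candidate:
    """Return the candidate that best matches the prompt.

    `similarity` is an assumed stand-in for an LMM-based text-image
    matching score; higher means a better match.
    """
    # The generated image is listed first, so on an exact tie max()
    # keeps it. This tie-breaking rule is a choice made for this sketch.
    candidates = [generated, *retrieved]
    return max(candidates, key=lambda c: similarity(prompt, c))
```

In this sketch both kinds of candidate are scored by the same function, so the choice between generating and retrieving reduces to a single arg-max over comparable scores.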
