Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua

TL;DR

This work introduces AVG, an autoregressive token-to-voken framework for text-to-image retrieval that recasts retrieval as generation of image-derived vokens. A cross-modal aligned image tokenizer injects both visual content and high-level semantics into a compact voken sequence, while a discriminative training objective guides generation toward ranking quality rather than just token prediction. Empirical results on Flickr30K and MS-COCO show AVG outperforming prior generative methods and delivering strong efficiency, approaching or exceeding two-tower baselines in top-rank performance with better scalability. The ablations and analyses highlight the critical roles of semantic alignment and discriminative learning, as well as practical guidance on codebook size, voken length, and beam search for real-world retrieval workloads.

Abstract

Text-to-image retrieval is a fundamental task in multimedia processing that aims to retrieve semantically relevant cross-modal content. Traditional studies have typically treated this task as a discriminative problem, matching text and images either through a cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line: it assigns each image a unique string identifier and generates the target identifier as the retrieval result. Despite its great potential, existing generative approaches are limited by three issues: insufficient visual information in identifiers, misalignment with high-level semantics, and a learning gap with respect to the retrieval target. To address these issues, we propose an autoregressive voken generation method named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and formulates text-to-image retrieval as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens that serves as the image's identifier while remaining aligned with both the visual information and the high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval target, we incorporate discriminative training to adjust the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.
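To make the token-to-voken formulation concrete, the following is a minimal sketch of how a query could be mapped to image identifiers by beam search over voken sequences. It is an illustration under stated assumptions, not the authors' implementation: the prefix-trie constraint is a technique commonly used in generative retrieval and is assumed here, `toy_log_prob` is a hypothetical stand-in for the autoregressive model's next-voken scores, and the voken IDs and image names are invented for the example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Prefix = Tuple[int, ...]


def build_trie(identifiers: Sequence[Sequence[int]]) -> dict:
    """Build a nested-dict prefix trie over all valid voken sequences."""
    root: dict = {}
    for seq in identifiers:
        node = root
        for voken in seq:
            node = node.setdefault(voken, {})
    return root


def constrained_beam_search(
    trie: dict,
    log_prob: Callable[[Prefix, int], float],
    length: int,
    beam_size: int,
) -> List[Tuple[Prefix, float]]:
    """Return the top `beam_size` voken sequences that exist in the trie.

    Assumes every identifier has exactly `length` vokens.
    """
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-probability, trie node)
    for _ in range(length):
        candidates = []
        for prefix, score, node in beams:
            # Only expand vokens that continue some real image identifier.
            for voken, child in node.items():
                candidates.append(
                    (prefix + (voken,), score + log_prob(prefix, voken), child)
                )
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return [(prefix, score) for prefix, score, _ in beams]


# Hypothetical image set: each image is identified by a 3-voken sequence.
image_vokens: Dict[str, Prefix] = {
    "img_a": (3, 1, 4),
    "img_b": (3, 1, 5),
    "img_c": (2, 7, 1),
}
voken_to_image = {seq: name for name, seq in image_vokens.items()}


def toy_log_prob(prefix: Prefix, voken: int) -> float:
    """Stand-in for a text-conditioned model that favors the sequence (3, 1, 5)."""
    target = (3, 1, 5)
    return math.log(0.7) if voken == target[len(prefix)] else math.log(0.1)


trie = build_trie(list(image_vokens.values()))
results = constrained_beam_search(trie, toy_log_prob, length=3, beam_size=2)
ranked_images = [voken_to_image[seq] for seq, _ in results]
print(ranked_images)
```

In this sketch the cost of a query depends on the beam size and identifier length rather than on a comparison against every image embedding, which is one plausible reading of why a generative retriever can scale well with image set size.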

Paper Structure (23 sections, 10 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Illustrations of three paradigms for cross-modal retrieval. Both the one-tower and two-tower frameworks match the text and image for retrieval, while generative cross-modal retrieval generates the identifiers of images, e.g., image IDs, as the retrieval results.
  • Figure 2: An overview of the proposed AVG method. AVG tokenizes each image into a sequence of vokens via cross-modal aligned image tokenization and then performs discrimination-modified token-to-voken generation.
  • Figure 3: Illustration of the cross-modal aligned image tokenization. We introduce the semantic alignment loss, $\mathcal{L}_{align}$, to ensure that the learned vokens are assigned based on both an image's visual information and its high-level semantics.
  • Figure 4: The efficiency of CLIP (two-tower), GRACE (generative), and AVG (generative) varies with image set size, measured in terms of queries processed per second. AVG demonstrates superior efficiency with large image sets.
  • Figure 5: Cases of image tokenization. Four pairs of images illustrate our cross-modal aligned image tokenizer. In each pair, semantically related words in the captions are highlighted in red. Visually and semantically similar images are assigned similar voken IDs. (a) shows a pair of irrelevant images, while (d) shows the pair of images with the highest similarity.