SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs

Haoxuan Li, Yi Bin, Yunshan Ma, Guoqing Wang, Yang Yang, See-Kiong Ng, Tat-Seng Chua

TL;DR

SemCORE tackles semantic insufficiency in generative cross-modal retrieval with two components: Structured Natural Language Identifiers (SIDs), which fuse global (macroscopic) and lexical (fine-grained) semantics, and a Generative Semantic Verification (GSV) module for precise discrimination. By training the model to generate SID tokens and applying Trie-based constrained decoding, SemCORE achieves strong performance on text-to-image and image-to-text tasks, surpassing state-of-the-art generative baselines and approaching traditional methods in image-to-text (I2T) retrieval. Key contributions include a unified framework for both retrieval directions, a principled SID construction using K-Means and KeyBERT, and a verification step that exploits MLLMs for fine-grained semantic matching. The results on Flickr30K and MS-COCO demonstrate the practical impact of semantic-aware generative retrieval, with ablations and analyses guiding future work on model scale and identifier design in dynamic, large-scale datasets.
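To make the Trie-based constrained decoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes identifiers are plain word sequences, uses greedy search rather than beam search, and replaces the MLLM's next-token scores with a hypothetical `toy_score` function; the example identifiers and the `Trie` and `constrained_greedy_decode` names are illustrative only.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"  # hypothetical end-of-identifier marker


class Trie:
    """Prefix tree over tokenized identifiers."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}

    def insert(self, tokens: Sequence[str]) -> None:
        node = self
        for tok in list(tokens) + [EOS]:
            node = node.children.setdefault(tok, Trie())

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that may follow `prefix`; empty if the prefix is not in the trie."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        return sorted(node.children)


def constrained_greedy_decode(
    trie: Trie,
    score_fn: Callable[[Sequence[str], str], float],
    max_len: int = 16,
) -> List[str]:
    """Pick the best-scoring token at each step, restricted to valid continuations."""
    prefix: List[str] = []
    for _ in range(max_len):
        allowed = trie.allowed_next(prefix)
        if not allowed:
            break
        best = max(allowed, key=lambda tok: score_fn(prefix, tok))
        if best == EOS:
            break
        prefix.append(best)
    return prefix


# Hypothetical identifiers, invented for illustration.
trie = Trie()
trie.insert(["outdoor", "sports", "dog", "frisbee"])
trie.insert(["outdoor", "sports", "dog", "beach"])
trie.insert(["indoor", "kitchen", "woman", "cooking"])

# Stand-in for model scores: the "model" likes "cat", which no identifier contains.
TOY_SCORES = {"cat": 9.0, "outdoor": 2.0, "indoor": 1.0, "frisbee": 3.0, "beach": 2.0}


def toy_score(prefix: Sequence[str], token: str) -> float:
    return TOY_SCORES.get(token, 0.0)


result = constrained_greedy_decode(trie, toy_score)
print(result)
```

Because every step is restricted to the children of the current trie node, the decoder can only emit an identifier that was inserted, regardless of how highly the scorer rates out-of-vocabulary tokens such as "cat" here.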

Abstract

Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and images via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still suffer from insufficient semantic information in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
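For readers unfamiliar with the metric, Recall@K can be sketched as below. This is a minimal illustration under the common convention that a query counts as a hit when at least one relevant target appears among the top K results, with the score reported as a percentage; the `recall_at_k` helper and the query and image names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Percentage of queries with at least one relevant target in the top-k results."""
    hits = sum(1 for query, items in ranked.items() if relevant[query] & set(items[:k]))
    return 100.0 * hits / len(ranked)


# Toy rankings, invented for illustration: query -> retrieved targets, best first.
ranked = {
    "q1": ["img3", "img1"],
    "q2": ["img2", "img7"],
    "q3": ["img9", "img4"],
    "q4": ["img5", "img8"],
}
# Ground-truth targets per query (a set, since one query may have several matches).
relevant = {"q1": {"img1"}, "q2": {"img2"}, "q3": {"img4"}, "q4": {"img6"}}

r_at_1 = recall_at_k(ranked, relevant, k=1)
r_at_2 = recall_at_k(ranked, relevant, k=2)
print(r_at_1, r_at_2)
```

Under this convention, a gain stated in "points" is usually an absolute difference between two such percentages.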

Paper Structure

This paper contains 28 sections, 3 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustrations of existing paradigms for cross-modal retrieval. Both one-tower and two-tower frameworks perform retrieval based on certain similarity-based metrics, while the generative framework directly generates the target ID as retrieval result. Existing generative methods typically use hierarchical clustering for identifier construction.
  • Figure 2: An overview of the proposed SemCORE framework (illustrating the text-to-image retrieval process, with the image-to-text process being analogous). Specifically, we integrate a clustering algorithm with keyword extraction techniques to construct Structured natural language IDentifiers (SID), as illustrated in (a). Furthermore, we introduce a Generative Semantic Verification (GSV) strategy for nuanced semantic discrimination, as indicated by the dotted line in (b).
  • Figure 3: Performance with respect to the cluster size of global ID and the length of lexical ID on Flickr30K dataset.
  • Figure 4: Illustration of the structured natural language identifier (SID). The SID comprises two components: global ID and lexical ID. Global IDs are highlighted in blue, while lexical IDs are highlighted in pink. The generated lexical IDs are closely aligned with the corresponding image content. Subfigures (g) and (h) generate the SID of each other.
  • Figure 5: Illustration of additional generative retrieval examples. Items from (j) to (o) correspond to the image-to-text retrieval task. The generated SIDs are closely aligned with the corresponding target content. Global IDs within each SID are highlighted in blue, while lexical IDs are marked in pink. The global ID keeps each SID informative, while the lexical ID keeps it discriminative.
  • ...and 1 more figure