SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs

Haoxuan Li; Yi Bin; Yunshan Ma; Guoqing Wang; Yang Yang; See-Kiong Ng; Tat-Seng Chua

TL;DR

SemCORE tackles semantic insufficiency in generative cross-modal retrieval on two fronts. It introduces Structured natural language IDentifiers (SIDs) that combine Global (macroscopic) and Lexical (fine-grained) semantics, and it adds a Generative Semantic Verification (GSV) module for precise target discrimination. The model is trained to generate SID tokens and decodes under Trie constraints, giving strong performance on text-to-image (T2I) and image-to-text (I2T) tasks: it surpasses state-of-the-art generative baselines and approaches traditional methods in I2T retrieval. Key contributions include a unified framework for both retrieval directions, a principled SID construction using K-Means and KeyBERT, and a verification step that exploits MLLMs for fine-grained semantic matching. Results on Flickr30K and MS-COCO, together with ablations and analyses, demonstrate the practical impact of semantic-aware generative retrieval and guide future work on model scale and identifier design for dynamic, large-scale datasets.
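
To illustrate what Trie-constrained decoding means in general, here is a minimal sketch. It is not SemCORE's implementation: the identifiers, the word-level tokens, the `<eos>` marker, and the toy scoring function are all hypothetical stand-ins for a real tokenizer and MLLM. The idea it shows is that, at each step, the decoder may only pick a token that extends some valid identifier, so the output always maps to an existing target.

```python
from typing import Callable, Dict, Iterable, List, Sequence

END = "<eos>"  # hypothetical end-of-identifier marker


class IdentifierTrie:
    """Prefix tree over tokenized identifiers (illustrative only)."""

    def __init__(self, sequences: Iterable[Sequence[str]]) -> None:
        self.root: Dict[str, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})
            node[END] = {}  # mark a complete identifier

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Return the tokens that may follow `prefix`, or [] if the prefix is invalid."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node.keys())


def constrained_greedy_decode(
    trie: IdentifierTrie,
    score_fn: Callable[[List[str], str], float],
) -> List[str]:
    """Greedy decoding restricted to tokens the trie allows at each step."""
    prefix: List[str] = []
    while True:
        candidates = trie.allowed_next(prefix)
        if not candidates:
            return prefix
        best = max(candidates, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)


# Hypothetical identifiers, tokenized at word level for readability.
identifiers = [
    ["outdoor", "dog", "frisbee"],
    ["outdoor", "dog", "beach"],
    ["indoor", "cat", "sofa"],
]
trie = IdentifierTrie(identifiers)

# Toy stand-in for model scores; "zebra" scores highest but is never a
# candidate because no identifier contains it.
toy_scores = {"zebra": 1.0, "outdoor": 0.9, "indoor": 0.1, "dog": 0.8,
              "frisbee": 0.7, "beach": 0.3, END: 0.5}
decoded = constrained_greedy_decode(trie, lambda prefix, tok: toy_scores.get(tok, 0.0))
print(decoded)
```

In practice the same constraint is usually applied inside beam search rather than greedy decoding, by masking the logits of disallowed tokens at every step; the greedy loop above is only the simplest variant.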

Abstract

Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and images via embedding-based similarity calculations, recent advances in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict the identifiers corresponding to input queries, without explicit indexing. Despite this potential, current generative CMR approaches still suffer from insufficient semantic information in both identifier construction and the generation process. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy that enables fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
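
For readers unfamiliar with the metric, the sketch below shows how Recall@K is commonly computed in cross-modal retrieval: the percentage of queries for which at least one relevant target appears among the top K results. The abstract does not spell out its evaluation protocol, so this definition is an assumption based on common convention, and the query IDs, image IDs, and rankings are made-up toy data.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Percentage of queries with at least one relevant target in the top k."""
    hits = 0
    for query_id, candidates in ranked.items():
        if any(c in relevant[query_id] for c in candidates[:k]):
            hits += 1
    return 100.0 * hits / len(ranked)


# Toy rankings (hypothetical): each query maps to its retrieved targets, best first.
ranked = {
    "q1": ["img3", "img1"],
    "q2": ["img2", "img5"],
    "q3": ["img9", "img4"],
    "q4": ["img7", "img8"],
}
relevant = {"q1": {"img1"}, "q2": {"img2"}, "q3": {"img4"}, "q4": {"img0"}}

r1 = recall_at_k(ranked, relevant, k=1)  # only q2 is correct at rank 1
r2 = recall_at_k(ranked, relevant, k=2)  # q1, q2, and q3 hit within the top 2
print(r1, r2)
```

Under this convention, a gain of 8.65 points in Recall@1 corresponds to 8.65 percentage points more queries whose top-ranked result is a relevant target.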
