CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling
Minghui Fang, Shengpeng Ji, Jialong Zuo, Hai Huang, Yan Xia, Jieming Zhu, Xize Cheng, Xiaoda Yang, Wenrui Liu, Gang Wang, Zhenhua Dong, Zhou Zhao
TL;DR
CART introduces a generative cross-modal retrieval framework that replaces score-based matching with autoregressive generation of semantic identifiers. It discretizes multimodal data into coarse tokens via K-Means and fine tokens via RQ-VAE, augmented by a unique-token prefix system, and aligns queries through caption-enhanced inputs. A two-branch coarse-to-fine feature fusion mechanism enables effective interaction between queries and candidates within a compact autoregressive decoder, while training uses a consistency-regularized objective and constrained beam search. Empirical results across text-to-image/audio/video tasks show CART achieving strong retrieval performance with stable throughput, outperforming several single-tower, dual-tower, and generative baselines and validating the importance of the identifier design and fusion strategy.
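As a rough illustration of the constrained beam search mentioned above, the sketch below restricts decoding to token sequences that are valid candidate identifiers by walking a prefix trie. It is a minimal sketch under stated assumptions, not the authors' implementation: `build_trie`, `constrained_beam_search`, and the scoring callback `log_prob_fn` are hypothetical names, the scorer stands in for the autoregressive decoder, and all identifiers are assumed to share the same length.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Trie = Dict[int, dict]


def build_trie(identifiers: Sequence[Tuple[int, ...]]) -> Trie:
    """Build a nested-dict prefix trie over all valid identifiers."""
    root: Trie = {}
    for ident in identifiers:
        node = root
        for tok in ident:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    trie: Trie,
    log_prob_fn: Callable[[Tuple[int, ...], int], float],
    length: int,
    beam_size: int,
) -> List[Tuple[Tuple[int, ...], float]]:
    """Beam search that only expands tokens allowed by the trie.

    log_prob_fn(prefix, tok) is a hypothetical stand-in for the decoder:
    it returns the score of emitting `tok` after `prefix`.
    """
    beams = [((), 0.0, trie)]  # (prefix, cumulative score, trie node)
    for _ in range(length):
        candidates = []
        for prefix, score, node in beams:
            # Only children of the current trie node are valid next tokens.
            for tok, child in node.items():
                new_score = score + log_prob_fn(prefix, tok)
                candidates.append((prefix + (tok,), new_score, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return [(prefix, score) for prefix, score, _ in beams]


# Toy usage: three candidate identifiers and a made-up scorer that
# prefers tokens close to 3 (purely illustrative).
ids = [(0, 1, 2), (0, 1, 3), (1, 0, 0)]
results = constrained_beam_search(
    build_trie(ids), lambda prefix, tok: -abs(tok - 3), length=3, beam_size=2
)
```

Because every expansion follows a trie edge, each returned sequence is guaranteed to correspond to an existing candidate, which is the property that lets generated identifiers serve directly as retrieval results.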
Abstract
Cross-modal retrieval aims to find instances that are semantically related to a query through interaction between data of different modalities. Traditional solutions use a single-tower or dual-tower framework to explicitly compute a score between queries and candidates, which incurs high training cost and inference latency on large-scale data. Inspired by the remarkable performance and efficiency of generative models, we propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling, which assigns an identifier to each candidate and treats generating that identifier as the retrieval target. Specifically, we explore an effective coarse-to-fine scheme that combines K-Means and RQ-VAE to discretize multimodal data into token sequences that support autoregressive generation. Further, to address the lack of explicit interaction between queries and candidates, we propose a feature fusion strategy to align their semantics. Extensive experiments demonstrate the effectiveness of the strategies in CART, which achieves excellent results in both retrieval performance and efficiency.
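The coarse-to-fine discretization described above can be sketched as follows. This is a simplified illustration under several assumptions, not the method as implemented in the paper: a small NumPy K-Means produces the coarse token, and plain residual quantization with K-Means codebooks stands in for a trained RQ-VAE (there is no encoder, decoder, or learned codebook here). The function names and the choice to quantize the full embedding at the fine levels are hypothetical.

```python
import numpy as np


def kmeans(x: np.ndarray, k: int, iters: int = 20, seed: int = 0):
    """Minimal Lloyd's K-Means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return centers, dists.argmin(axis=1)


def coarse_to_fine_ids(
    x: np.ndarray, n_coarse: int, n_levels: int, codebook_size: int
) -> np.ndarray:
    """Return one identifier per row: [coarse, fine_1, ..., fine_n_levels]."""
    # Coarse token: cluster index from K-Means over the embeddings.
    _, coarse = kmeans(x, n_coarse)
    # Fine tokens: residual quantization, a stand-in for RQ-VAE.
    residual = x.astype(float).copy()
    fine = []
    for level in range(n_levels):
        centers, labels = kmeans(residual, codebook_size, seed=level + 1)
        fine.append(labels)
        residual = residual - centers[labels]  # pass what is left to the next level
    return np.column_stack([coarse] + fine)


# Toy usage on random "embeddings" (illustrative data only).
embeddings = np.random.default_rng(0).normal(size=(200, 8))
identifiers = coarse_to_fine_ids(embeddings, n_coarse=4, n_levels=3, codebook_size=16)
```

Each row of `identifiers` is a short token sequence that an autoregressive decoder could be trained to generate. Two candidates may still collide on the same sequence in this sketch, so a real system needs an extra disambiguation step.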
