CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

Minghui Fang, Shengpeng Ji, Jialong Zuo, Hai Huang, Yan Xia, Jieming Zhu, Xize Cheng, Xiaoda Yang, Wenrui Liu, Gang Wang, Zhenhua Dong, Zhou Zhao

TL;DR

CART introduces a generative cross-modal retrieval framework that replaces score-based matching with autoregressive generation of semantic identifiers. It discretizes multimodal data into coarse tokens via K-Means and fine tokens via RQ-VAE, augmented by a unique-token prefix system, and aligns queries through caption-enhanced inputs. A two-branch coarse-to-fine feature fusion mechanism enables effective interaction between queries and candidates within a compact autoregressive decoder, while training uses a consistency-regularized objective and constrained beam search. Empirical results across text-to-image/audio/video tasks show CART achieving strong retrieval performance with stable throughput, outperforming several single-tower, dual-tower, and generative baselines and validating the importance of the identifier design and fusion strategy.
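
The identifier design described above can be illustrated with a minimal sketch. It is not the authors' implementation: a plain residual K-Means stands in for RQ-VAE (no learned encoder or decoder), the fine levels quantize the raw embedding rather than a learned latent, and the final disambiguating token is placed as a suffix purely for illustration. The names `kmeans`, `build_identifiers`, `k_coarse`, `k_fine`, and `levels` are hypothetical.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    # Index of the centroid closest to `point`.
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm; centroids start from k sampled points.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def build_identifiers(embeddings, k_coarse=2, k_fine=2, levels=2):
    # Coarse token: K-Means cluster index of the embedding.
    coarse = kmeans(embeddings, k_coarse)
    ids = [[nearest(e, coarse)] for e in embeddings]
    # Fine tokens: residual quantization (assumed stand-in for RQ-VAE).
    residuals = [list(e) for e in embeddings]
    for level in range(levels):
        codebook = kmeans(residuals, k_fine, seed=level + 1)
        for n, r in enumerate(residuals):
            j = nearest(r, codebook)
            ids[n].append(j)
            residuals[n] = [x - c for x, c in zip(r, codebook[j])]
    # Assumption: a running counter separates items whose tokens collide.
    seen, out = {}, []
    for tokens in ids:
        key = tuple(tokens)
        rank = seen.get(key, 0)
        seen[key] = rank + 1
        out.append(key + (rank,))
    return out

items = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [5.1, 4.9], [0.2, 0.1], [4.8, 5.2]]
identifiers = build_identifiers(items)
```

Each item receives one coarse token, `levels` fine tokens, and one disambiguating token, so two identical embeddings share every semantic token and differ only in the last position.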

Abstract

Cross-modal retrieval aims to find instances that are semantically related to a query through the interaction of data from different modalities. Traditional solutions use a single-tower or dual-tower framework to explicitly compute a score between queries and candidates, an approach challenged by training cost and inference latency on large-scale data. Inspired by the remarkable performance and efficiency of generative models, we propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling, which assigns an identifier to each candidate and treats generating that identifier as the retrieval target. Specifically, we explore an effective coarse-to-fine scheme that combines K-Means and RQ-VAE to discretize multimodal data into token sequences that support autoregressive generation. Further, considering the lack of explicit interaction between queries and candidates, we propose a feature fusion strategy to align their semantics. Extensive experiments demonstrate the effectiveness of the strategies in CART, which achieves excellent results in both retrieval performance and efficiency.
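
Because retrieval is framed as generating a candidate's identifier, decoding has to stay within the set of identifiers that actually exist; the TL;DR mentions constrained beam search for this. The following is a minimal sketch of one common way to do it, a prefix trie that limits which token may follow each partial sequence. It assumes fixed-length identifiers, and `score_fn` is a hypothetical stand-in for the decoder's per-token log-probability; none of these names come from the paper.

```python
import math

def build_trie(identifiers):
    # Nested dicts: each key is a token, each value the subtree after it.
    root = {}
    for ident in identifiers:
        node = root
        for tok in ident:
            node = node.setdefault(tok, {})
    return root

def constrained_beam_search(score_fn, trie, length, beam_size=2):
    # Each beam entry holds (prefix, cumulative log-prob, trie node).
    beams = [((), 0.0, trie)]
    for _ in range(length):
        candidates = []
        for prefix, logp, node in beams:
            # Only tokens that extend a valid identifier are expanded.
            for tok, child in node.items():
                step = score_fn(prefix, tok)
                candidates.append((prefix + (tok,), logp + step, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return [(prefix, logp) for prefix, logp, _ in beams]

# Toy scorer (assumption): always prefers token 1, regardless of prefix.
def toy_score(prefix, tok):
    return math.log(0.9) if tok == 1 else math.log(0.1)

valid = [(0, 1, 0), (0, 1, 1), (1, 0, 0)]
results = constrained_beam_search(toy_score, build_trie(valid), length=3)
```

Even though the toy scorer would prefer the sequence `(1, 1, 1)`, that sequence is not in the trie, so every returned hypothesis corresponds to a real candidate.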

Paper Structure

This paper contains 48 sections, 8 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Traditional single-tower / dual-tower retrieval finds the candidate closest to the query by computing scores, while generative retrieval treats generating the candidate's identifier as the retrieval target.
  • Figure 2: The coarse-to-fine semantic identifier generation strategy.
  • Figure 3: The architecture of CART.
  • Figure 4: The efficiency of CLIP, CLAP, and CART, measured by throughput (queries processed per second).
  • Figure 5: The t-SNE visualization of item embeddings that share the same token prefixes.