CoST: Contrastive Quantization based Semantic Tokenization for Generative Recommendation
Jieming Zhu, Mengqun Jin, Qijiong Liu, Zexuan Qiu, Zhenhua Dong, Xiu Li
TL;DR
CoST introduces contrastive quantization for semantic tokenization to address limitations of reconstruction-based tokenization in generative recommender systems. By replacing exact reconstruction with a batch-level contrastive objective, CoST preserves item neighborhood relationships in the token space, enabling more effective autoregressive item generation. Empirical results on MIND and Amazon Office show substantial gains over RQ-VAE baselines, with Recall@5 and NDCG@5 improving by up to roughly 43-44% on MIND. This work highlights the critical role of semantic tokenization quality in generative retrieval and suggests directions for further integrating neighborhood signals and multimodal information into tokenization.
Abstract
Embedding-based retrieval serves as a dominant approach to candidate item matching for industrial recommender systems. With the success of generative AI, generative retrieval has recently emerged as a new retrieval paradigm for recommendation, which casts item retrieval as a generation problem. Its model consists of two stages: semantic tokenization and autoregressive generation. The first stage involves item tokenization that constructs discrete semantic tokens to index items, while the second stage autoregressively generates semantic tokens of candidate items. Therefore, semantic tokenization serves as a crucial preliminary step for training generative recommendation models. Existing research usually employs a vector quantizier with reconstruction loss (e.g., RQ-VAE) to obtain semantic tokens of items, but this method fails to capture the essential neighborhood relationships that are vital for effective item modeling in recommender systems. In this paper, we propose a contrastive quantization-based semantic tokenization approach, named CoST, which harnesses both item relationships and semantic information to learn semantic tokens. Our experimental results highlight the significant impact of semantic tokenization on generative recommendation performance, with CoST achieving up to a 43% improvement in Recall@5 and 44% improvement in NDCG@5 on the MIND dataset over previous baselines.
