Universal Item Tokenization for Transferable Generative Recommendation
Bowen Zheng, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ji-Rong Wen
TL;DR
UTGRec addresses transfer in generative recommendation with a universal item tokenizer that encodes items into discrete code sequences using a multimodal large language model (MLLM) and tree-structured codebooks. It combines raw content reconstruction with collaborative knowledge integration to learn cross-domain item semantics and collaborative signals. The framework pre-trains on multiple domains and then fine-tunes on downstream domains, preserving the learned codebooks while adapting projection layers. Experiments on four public datasets show that UTGRec outperforms traditional, content-based, and existing generative baselines, with strong gains on long-tail items and in cross-domain transfer. This work advances transferable generative recommendation by enabling universal multimodal item representations and scalable cross-domain learning.
Abstract
Recently, generative recommendation has emerged as a promising paradigm, attracting significant research attention. The basic framework involves an item tokenizer, which represents each item as a sequence of codes serving as its identifier, and a generative recommender that predicts the next item by autoregressively generating the target item identifier. However, in existing methods, both the tokenizer and the recommender are typically domain-specific, limiting their ability for effective transfer or adaptation to new domains. To this end, we propose UTGRec, a Universal item Tokenization approach for transferable Generative Recommendation. Specifically, we design a universal item tokenizer for encoding rich item semantics by adapting a multimodal large language model (MLLM). By devising tree-structured codebooks, we discretize content representations into corresponding codes for item tokenization. To effectively learn the universal item tokenizer on multiple domains, we introduce two key techniques in our approach. For raw content reconstruction, we employ dual lightweight decoders to reconstruct item text and images from discrete representations to capture general knowledge embedded in the content. For collaborative knowledge integration, we assume that co-occurring items are similar and integrate collaborative signals through co-occurrence alignment and reconstruction. Finally, we present a joint learning framework to pre-train and adapt the transferable generative recommender across multiple domains. Extensive experiments on four public datasets demonstrate the superiority of UTGRec compared to both traditional and generative recommendation baselines.
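To make the idea of item tokenization concrete, the following is a minimal sketch of how a content vector might be discretized into a short sequence of codes. It uses plain multi-level residual quantization as a stand-in and is not the tree-structured codebook design of UTGRec, whose details are not given in the abstract. The function names (nearest, tokenize, reconstruct), the toy codebooks, and the example vector are all hypothetical and chosen only for illustration; in practice the codebooks would be learned and the vector would come from an MLLM encoder.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def tokenize(vec, codebooks):
    """Map a content vector to one code per level.

    Assumption: generic residual quantization, where each level
    quantizes what the previous levels left unexplained.
    """
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next level sees the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def reconstruct(codes, codebooks):
    """Approximate the original vector by summing the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-level codebooks with two entries each (hand-written here,
# learned in a real system).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.5, -0.5]],   # fine level
]
# Hypothetical content vector standing in for an encoder output.
item_vector = [1.4, 0.6]

item_codes = tokenize(item_vector, toy_codebooks)
approx = reconstruct(item_codes, toy_codebooks)
print(item_codes, approx)
```

The resulting code sequence plays the role of the item identifier that a generative recommender would produce one code at a time. The reconstruction step only illustrates that the codes retain content information; it is unrelated to the text and image decoders described in the abstract.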
