
Enhancing Item Tokenization for Generative Recommendation through Self-Improvement

Runjin Chen, Mingxuan Ju, Ngoc Bui, Dimosthenis Antypas, Stanley Cai, Xiaopeng Wu, Leonardo Neves, Zhangyang Wang, Neil Shah, Tong Zhao

TL;DR

This paper tackles the tokenization bottleneck in generative recommendation by enabling an LLM to self-improve item identifiers during training. It introduces SIIT, a three-stage process of sequential-recommendation fine-tuning, item-identifier alignment, and identifier refinement, augmented with diverse token generation and collision-avoidance mechanisms. Across three datasets and multiple initialization schemes, SIIT yields consistent gains (an average improvement of $8\%$) by aligning tokenization with the LLM's semantic understanding. The approach is lightweight and plug-and-play, and it improves both recommendation accuracy and the semantic distinctiveness of item identifiers, enabling better generalization and diversity in generated recommendations.

Abstract

Generative recommendation systems, driven by large language models (LLMs), present an innovative approach to predicting user preferences by modeling items as token sequences and generating recommendations in a generative manner. A critical challenge in this approach is the effective tokenization of items, ensuring that they are represented in a form compatible with LLMs. Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. While text-based representations integrate seamlessly with LLM tokenization, they are often too lengthy, leading to inefficiencies and complicating accurate generation. Numerical strings, while concise, lack semantic depth and fail to capture meaningful item relationships. Tokenizing items as sequences of newly defined tokens has gained traction, but it often requires external models or algorithms for token assignment. These external processes may not align with the LLM's internal pretrained tokenization schema, leading to inconsistencies and reduced model performance. To address these limitations, we propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process. Our approach starts with item tokenizations generated by any external model and periodically adjusts these tokenizations based on the LLM's learned patterns. This alignment process ensures consistency between the tokenization and the LLM's internal understanding of the items, leading to more accurate recommendations. Furthermore, our method is simple to implement and can be integrated as a plug-and-play enhancement into existing generative recommendation systems. Experimental results on multiple datasets and using various initial tokenization strategies demonstrate the effectiveness of our method, with an average improvement of 8% in recommendation performance.
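The control flow described above (start from externally assigned identifiers, then periodically let the model adjust them while keeping identifiers unique) can be illustrated with a minimal toy sketch. It is not the authors' implementation: the names `refine_identifiers`, `train_with_refinement`, `propose`, `train_one_epoch`, and `refine_every` are hypothetical, the LLM is replaced by a lookup table of ranked candidate identifiers, and the collision rule shown (skip any candidate already held by another item) is one simple assumption about how collisions might be avoided.

```python
from typing import Callable, Dict, List, Tuple

Identifier = Tuple[int, ...]  # an item identifier as a short sequence of token ids


def refine_identifiers(
    identifiers: Dict[str, Identifier],
    propose: Callable[[str], List[Identifier]],
) -> Dict[str, Identifier]:
    """Move each item to its best-ranked proposed identifier that is free.

    `propose` stands in for the model: it returns candidate identifiers for
    an item, best first. An item keeps its old identifier if the model
    already ranks it first among free candidates or if every candidate
    collides with another item (assumed collision-avoidance rule).
    """
    refined = dict(identifiers)
    used = set(identifiers.values())  # identifiers currently held by some item
    for item, old in identifiers.items():
        for candidate in propose(item):
            if candidate == old:
                break  # model agrees with the current identifier
            if candidate not in used:
                used.discard(old)  # free the old identifier for later items
                used.add(candidate)
                refined[item] = candidate
                break
    return refined


def train_with_refinement(
    identifiers: Dict[str, Identifier],
    train_one_epoch: Callable[[Dict[str, Identifier]], None],
    propose: Callable[[str], List[Identifier]],
    num_epochs: int,
    refine_every: int,
) -> Dict[str, Identifier]:
    """Alternate training with periodic identifier refinement."""
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(identifiers)  # stand-in for recommendation fine-tuning
        if epoch % refine_every == 0:
            identifiers = refine_identifiers(identifiers, propose)
    return identifiers


# Toy usage: three items, with a fixed table of ranked candidates per item.
initial = {"a": (0, 0), "b": (0, 1), "c": (1, 0)}
ranked = {
    "a": [(0, 1), (2, 2)],  # first choice is held by "b", so "a" takes (2, 2)
    "b": [(0, 1)],          # model agrees with the current identifier
    "c": [(2, 2), (0, 0)],  # (2, 2) is now taken; (0, 0) was freed by "a"
}
result = train_with_refinement(
    initial,
    train_one_epoch=lambda ids: None,
    propose=lambda item: ranked[item],
    num_epochs=2,
    refine_every=2,
)
```

Because an item only ever moves into an identifier that no other item currently holds, the assignment stays collision-free as long as the initial identifiers are unique. In a real system, `propose` would presumably be backed by the fine-tuned model's own generation rather than a fixed table.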

Paper Structure

This paper contains 27 sections, 5 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: The overall pipeline of SIIT. We use off-the-shelf methods to initialize item identifiers, then iteratively perform sequential recommendation training, item-identifier alignment, and identifier refinement to adjust item identifiers.
  • Figure 2: Influence of the intensity of item-identifier alignment