How to Index Item IDs for Recommendation Foundation Models

Wenyue Hua, Shuyuan Xu, Yingqiang Ge, Yongfeng Zhang

TL;DR

The paper tackles how to index items for generative recommender systems built on LLMs, aiming to prevent hallucinated items and overly long outputs by designing LLM-friendly item IDs. It analyzes three trivial indexing methods (random, title, and independent indexing, the last abbreviated IID) and proposes four nontrivial strategies: Sequential Indexing (SID), Collaborative Indexing (CID), Semantic Indexing (SemID), and Hybrid Indexing (HID). These are paired with a trie-based constrained decoding mechanism that ensures every generated ID corresponds to a real item. In experiments on Amazon Beauty, Amazon Sports, and Yelp with the P5 backbone, the hybrid variants CID+IID and SemID+IID consistently outperform baselines, suggesting that encoding collaborative or semantic priors into IDs, together with compact tokenization, improves the quality of generated recommendations. The findings highlight the synergy between modern language modeling and traditional IR principles: careful item indexing is a practical lever for improving both learning and inference in recommendation foundation models. Code and data are available online.
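
The trie-based constrained decoding mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the `IDTrie` class, the `constrained_greedy_decode` function, the `<eos>` marker, and the toy `score` function (standing in for the LLM's next-token scores) are all hypothetical, and greedy decoding is used here only for brevity. The idea is that at each step the decoder may only pick tokens that extend a valid item-ID prefix, so the output always names an item in the catalog.

```python
from typing import Callable, Dict, Iterable, List, Sequence


class IDTrie:
    """Prefix tree over tokenized item IDs (hypothetical helper)."""

    END = "<eos>"  # assumed end-of-ID marker, not taken from the paper

    def __init__(self, ids: Iterable[Sequence[str]]):
        self.root: Dict[str, dict] = {}
        for tokens in ids:
            node = self.root
            # Append END so an ID that is a prefix of another stays valid.
            for tok in list(tokens) + [self.END]:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Return the tokens that may follow `prefix`, or [] if it is invalid."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


def constrained_greedy_decode(
    trie: IDTrie, score: Callable[[Sequence[str], str], float]
) -> List[str]:
    """Greedily pick the best-scoring token among those the trie allows."""
    prefix: List[str] = []
    while True:
        options = trie.allowed_next(prefix)
        if not options:  # only reachable if the trie is empty
            return prefix
        best = max(options, key=lambda tok: score(prefix, tok))
        if best == IDTrie.END:
            return prefix
        prefix.append(best)


# Toy catalog of three items, each ID split into two tokens.
catalog = [["10", "01"], ["10", "02"], ["11", "07"]]
trie = IDTrie(catalog)

# Toy scorer: a stand-in for model logits that also favors an unseen token "12".
toy_scores = {"12": 9.0, "11": 2.0, "07": 1.0}
item = constrained_greedy_decode(trie, lambda prefix, tok: toy_scores.get(tok, 0.0))
print(item)
```

Although the toy scorer prefers the token `12`, that token never appears in the trie, so it is never offered as an option and the decoded ID is always one of the catalog entries.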

Abstract

Recommendation foundation models utilize large language models (LLMs) for recommendation by converting recommendation tasks into natural language tasks. This enables generative recommendation, which directly generates the item(s) to recommend rather than calculating a ranking score for every candidate item as in traditional recommendation models, simplifying the recommendation pipeline from multi-stage filtering to single-stage filtering. To avoid generating excessively long text and hallucinated recommendations when deciding which item(s) to recommend, creating LLM-compatible item IDs that uniquely identify each item is essential for recommendation foundation models. In this study, we systematically examine the item ID creation and indexing problem for recommendation foundation models, using P5 as an example of the backbone LLM. To emphasize the importance of item indexing, we first discuss the issues of several trivial item indexing methods, such as random indexing, title indexing, and independent indexing. We then propose four simple yet effective solutions: sequential indexing, collaborative indexing, semantic (content-based) indexing, and hybrid indexing. Our study highlights the significant influence of item indexing methods on the performance of LLM-based recommendation, and our results on real-world datasets validate the effectiveness of our proposed solutions. The research also demonstrates how recent advances in language modeling and traditional IR principles such as indexing can help each other for better learning and inference. Source code and data are available at https://github.com/Wenyueh/LLM-RecSys-ID.

Paper Structure (23 sections, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Illustration of spectral clustering on the item co-appearance graph based on spectral matrix factorization.
  • Figure 2: Collaborative indexing based on the spectral clustering tree ($N=4$, $k=20$).
  • Figure 3: An example of semantic indexing.
  • Figure 4: CID Beauty ablations on $N$ (number of clusters at each level) and $k$ (maximum number of items allowed in the final cluster).
  • Figure 5: CID average length on Beauty.
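
The roles of $N$ and $k$ in the captions above can be illustrated with a small sketch of how a cluster tree turns into multi-token IDs. This is an assumption-laden illustration rather than the paper's code: `chunk_partition` is a naive stand-in that splits a list into contiguous chunks instead of running spectral clustering on the co-appearance graph, and the `<c>` token format and the `assign_ids` helper are hypothetical. Only the stopping rule (stop splitting once a cluster holds at most $k$ items) and the branching factor (at most $N$ clusters per level) follow the captions.

```python
from typing import Callable, Dict, List, Tuple

Partitioner = Callable[[List[str], int], List[List[str]]]


def chunk_partition(items: List[str], n: int) -> List[List[str]]:
    """Stand-in for spectral clustering: split into up to n contiguous chunks."""
    size = -(-len(items) // n)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def assign_ids(
    items: List[str],
    n: int,
    k: int,
    partition: Partitioner = chunk_partition,
    prefix: Tuple[str, ...] = (),
) -> Dict[str, Tuple[str, ...]]:
    """Map each item to a token path: cluster tokens, then a position token."""
    if n < 2 or k < 1:
        raise ValueError("need n >= 2 and k >= 1 for the recursion to terminate")
    ids: Dict[str, Tuple[str, ...]] = {}
    if len(items) <= k:
        # Final cluster: the last token distinguishes items within it.
        for pos, item in enumerate(items):
            ids[item] = prefix + (f"<{pos}>",)
        return ids
    # Cluster too large: split again and extend the shared prefix.
    for c, cluster in enumerate(partition(items, n)):
        ids.update(assign_ids(cluster, n, k, partition, prefix + (f"<{c}>",)))
    return ids


items = [f"item{i}" for i in range(10)]
ids = assign_ids(items, n=2, k=3)
for name in ("item0", "item4", "item9"):
    print(name, "".join(ids[name]))
```

Items that land in the same cluster share a token prefix, which is how cluster membership becomes visible to the language model. In this sketch, a larger $N$ or $k$ yields a shallower tree and therefore shorter IDs, at the cost of more distinct tokens per level.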