A Bi-Step Grounding Paradigm for Large Language Models in Recommendation Systems

Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yancheng Luo, Chong Chen, Fuli Feng, Qi Tian

TL;DR

This paper tackles the problem of evaluating LLMs for all-rank recommendation and introduces BIGRec, a bi-step grounding paradigm that connects language generation to recommendations of real items. The approach first grounds the language space to a recommendation space via instruction tuning, then grounds that space to actual items through embedding-based matching, which can be augmented with popularity and collaborative information to improve grounding. Empirical results on MovieLens10M and a Game subset show that BIGRec performs strongly in few-shot and cross-domain settings, often matching or surpassing traditional models trained on orders of magnitude more data, and that it improves further when statistical signals are injected. The work highlights the strong semantic priors of LLMs and suggests that combining semantic grounding with explicit statistical information is a promising route to practical, scalable all-rank LLM4Rec systems, with inference efficiency remaining an important consideration for future improvements.
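The second grounding step can be pictured as a nearest-neighbour search from the generated text to the real item catalogue. The sketch below is a minimal illustration under stated assumptions: the embeddings are hypothetical toy vectors (in practice they would come from encoding the generated tokens and each item title), Euclidean (L2) distance is assumed as the matching score, and the popularity factor `(1 + p / max_p) ** gamma` is an illustrative choice, not necessarily the paper's exact formula. `ground_to_items` and `gamma` are hypothetical names.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ground_to_items(gen_emb, item_embs, popularity=None, gamma=0.0, k=2):
    """Return the k real items closest to the embedding of the generated text.

    gen_emb:    embedding of the LLM's generated item tokens (toy vector here).
    item_embs:  dict mapping item title -> embedding of that title.
    popularity: optional dict mapping item title -> interaction count.
    gamma:      strength of the ASSUMED popularity adjustment (0 disables it).
    """
    max_pop = max(popularity.values()) if popularity else 0
    scored = []
    for title, emb in item_embs.items():
        dist = l2(gen_emb, emb)
        if popularity and gamma > 0 and max_pop > 0:
            # Assumed adjustment: dividing by a popularity factor shrinks the
            # distance of popular items, moving them up the ranking.
            dist /= (1.0 + popularity[title] / max_pop) ** gamma
        scored.append((dist, title))
    scored.sort()  # smallest (adjusted) distance first
    return [title for _, title in scored[:k]]


if __name__ == "__main__":
    items = {"Toy Story": [1.0, 0.0], "Alien": [0.7, 0.3], "Heat": [0.0, 1.0]}
    generated = [0.9, 0.1]  # hypothetical embedding of the generated tokens
    counts = {"Toy Story": 10, "Alien": 100, "Heat": 50}
    print(ground_to_items(generated, items))  # -> ['Toy Story', 'Alien']
    print(ground_to_items(generated, items, popularity=counts, gamma=2.0))  # -> ['Alien', 'Toy Story']
```

With `gamma=0` the ranking is purely semantic; raising `gamma` lets a popular but slightly more distant item ("Alien" in the toy data) overtake the closest match, which illustrates how statistical information can be folded into the grounding step without retraining the LLM.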

Abstract

As the focus on Large Language Models (LLMs) in the field of recommendation intensifies, optimizing LLMs for recommendation purposes (referred to as LLM4Rec) plays a crucial role in improving their effectiveness as recommenders. However, existing LLM4Rec approaches often assess performance on restricted candidate sets, which may not accurately reflect the models' overall ranking capabilities. In this paper, we investigate the comprehensive (all-rank) ranking capacity of LLMs and propose a two-step grounding framework called BIGRec (Bi-step Grounding Paradigm for Recommendation). It first grounds LLMs to the recommendation space by fine-tuning them to generate meaningful tokens for items, and then identifies the actual items that correspond to the generated tokens. Extensive experiments on two datasets substantiate BIGRec's superior performance, its capacity to handle few-shot scenarios, and its versatility across multiple domains. Furthermore, we observe that the marginal benefit of increasing the number of training samples is modest for BIGRec, implying that LLMs have limited capability to assimilate statistical information, such as popularity and collaborative filtering signals, owing to their strong semantic priors. These findings also underline the efficacy of integrating diverse statistical information into the LLM4Rec framework, pointing towards a potential avenue for future research. Our code and data are available at https://github.com/SAI990323/Grounding4Rec.
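The first grounding step fine-tunes the LLM on examples whose target output is the next item's title, so that it learns to generate meaningful item tokens. The sketch below shows one way such examples could be assembled from interaction sequences; it assumes an Alpaca-style `instruction`/`input`/`output` record with illustrative prompt wording, and `build_sample`, `make_dataset`, and `max_hist` are hypothetical names. The paper's actual template may differ.

```python
import json


def build_sample(history, target, max_hist=10):
    """Format one instruction-tuning record (HYPOTHETICAL template).

    history: titles of items the user interacted with, oldest first.
    target:  title of the next item, which the LLM learns to generate.
    """
    recent = history[-max_hist:]  # keep only the most recent interactions
    return {
        "instruction": ("Given the items a user has interacted with, in order, "
                        "write the title of the item they will interact with next."),
        "input": "User history: " + ", ".join(f'"{t}"' for t in recent),
        "output": f'"{target}"',
    }


def make_dataset(sequences, max_hist=10):
    """Turn every non-empty prefix of each sequence into one training sample."""
    samples = []
    for seq in sequences:
        for i in range(1, len(seq)):
            samples.append(build_sample(seq[:i], seq[i], max_hist))
    return samples


if __name__ == "__main__":
    data = make_dataset([["Toy Story", "Alien", "Heat"]])
    print(len(data))  # 2 samples from a 3-item sequence
    print(json.dumps(data[-1], indent=2))
```

Because the target is a plain-text title rather than an item ID, the fine-tuned model can emit titles that match no catalogue entry exactly; resolving those outputs to real items is the job of the second grounding step.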

Paper Structure (23 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Illustration of the BIGRec paradigm. During the first step, we ground the language space to recommendation space, which enables the model to generate token sequences of potential items including both actual and hypothetical items. During the second step, we ground the recommendation space to actual item space to provide users with suggestions for real-world items. In the second step, we can easily incorporate statistical information (e.g., popularity and collaborative information) to obtain better recommendations.
  • Figure 2: The two figures show, respectively, the distribution of items with different popularity and the GPU usage and inference time for different beam sizes on the two datasets.
  • Figure 3: Performance comparison of BIGRec trained on multiple domain data (labeled as "Multi") and BIGRec trained on single target-domain data (labeled as "Single"), shown for NDCG@K on Movie and Game domains.
  • Figure 4: Performance of SASRec, DROS, and BIGRec as training data size increases (denoted by Sample Num), along with their respective performance improvement curves relative to their initial states (1024 training samples). The bottom sub-figures showcase the recommendation performance (measured by NDCG@K), while the upper sub-figures illustrate the improvement.
  • Figure 5: Performance comparison between BIGRec with popularity injection during grounding (labeled as "Injected") and the original BIGRec. NDCG@K and HR@K metrics are displayed for different values of K.
  • ...and 1 more figure
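The figures report NDCG@K and HR@K, and Figure 4 plots improvement relative to the initial 1024-sample state. The sketch below computes these quantities assuming one held-out target item per user (so the ideal DCG is 1 and NDCG@K reduces to 1/log2(rank + 1) for a hit) and assuming relative improvement is `(score - base) / base`; the helper names are hypothetical and the paper's evaluation code may differ in detail.

```python
import math


def hr_at_k(ranked, target, k):
    """Hit Ratio@K: 1 if the held-out item appears in the top-K list, else 0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG@K with a single relevant item; the ideal DCG is 1, so NDCG = DCG."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-indexed position of the hit
    return 1.0 / math.log2(rank + 1)


def mean_metric(metric, rankings, targets, k):
    """Average a per-user metric over all users."""
    return sum(metric(r, t, k) for r, t in zip(rankings, targets)) / len(targets)


def relative_improvement(score, base_score):
    """ASSUMED definition: gain relative to the initial (e.g. 1024-sample) model."""
    return (score - base_score) / base_score


if __name__ == "__main__":
    rankings = [["a", "b", "c"], ["c", "a", "b"]]  # full ranked lists per user
    targets = ["a", "b"]                           # held-out next items
    print(mean_metric(hr_at_k, rankings, targets, 2))    # -> 0.5
    print(mean_metric(ndcg_at_k, rankings, targets, 3))  # -> 0.75
```

In the all-rank setting each `ranked` list covers the whole catalogue rather than a small sampled candidate set, which is the evaluation regime the paper argues better reflects a model's true ranking ability.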