Leveraging LLM Reasoning Enhances Personalized Recommender Systems

Alicia Y. Tsai, Adam Kraft, Long Jin, Chenwei Cai, Anahita Hosseini, Taibai Xu, Zemin Zhang, Lichan Hong, Ed H. Chi, Xinyang Yi

TL;DR

This work investigates the use of large language model (LLM) reasoning to improve personalized recommender systems (RecSys). It demonstrates that prompting LLMs to generate reasoning via zero-shot chain-of-thought (CoT) prompting, and fine-tuning with reasoning data, can improve rating prediction performance, especially when rich user history and item descriptions are available. It also introduces RecSAVER, an automatic framework for evaluating the quality of LLM reasoning without curated gold references; RecSAVER relies on a self-verification mechanism, and its alignment with human judgment is checked in dedicated studies. The results show that reasoning-enhanced RecSys benefits from larger pre-trained knowledge bases and from domain-specific information, with gains that vary by domain. Overall, RecSAVER provides a practical, human-aligned tool for assessing reasoning quality in RecSys, and the findings support the growing role of explainable, reasoning-enabled LLMs in personalized recommendation.

Abstract

Recent advancements have showcased the potential of Large Language Models (LLMs) in executing reasoning tasks, particularly facilitated by Chain-of-Thought (CoT) prompting. While tasks like arithmetic reasoning involve clear, definitive answers and logical chains of thought, the application of LLM reasoning in recommendation systems (RecSys) presents a distinct challenge. RecSys tasks revolve around subjectivity and personalized preferences, a domain in which the use of LLMs' reasoning capabilities remains under-explored. Our study explores several aspects to better understand reasoning for RecSys and demonstrates how task quality improves by utilizing LLM reasoning in both zero-shot and fine-tuning settings. Additionally, we propose RecSAVER (Recommender Systems Automatic Verification and Evaluation of Reasoning) to automatically assess the quality of LLM reasoning responses without the requirement of curated gold references or human raters. We show that our framework aligns with real human judgment on the coherence and faithfulness of reasoning responses. Overall, our work shows that incorporating reasoning into RecSys can improve personalized tasks, paving the way for further advancements in recommender system methodologies.

Paper Structure

This paper contains 30 sections, 4 equations, 5 figures, 15 tables, 1 algorithm.

Figures (5)

  • Figure 1: Landscape of recommender systems tasks, with the extent of user feedback on the vertical axis and decision-making effort on the horizontal axis. For example, a user clicking on websites expends little effort and provides little feedback about the user's satisfaction. Conversely, a user rating and reviewing products expends more effort and provides stronger satisfaction signals.
  • Figure 2: We prompt the LLM to generate a reasoning output prior to outputting the final task prediction.
  • Figure 3: Fine-tuning a model with reasoning. We first collect multiple reasoning samples by prompting a large LM. We then use the reasoning samples, combined with the original ground-truth rating labels, to fine-tune a different (potentially smaller) LM. We can optionally filter the reasoning outputs by comparing the large LM's rating predictions with the ground-truth ratings.
  • Figure 4: Overview of RecSAVER, which uses LLM-generated references and LLM self-verification. The first LLM call takes the ground-truth rating labels as additional input and generates a post hoc reasoning reference. We then make a second LLM call that takes the generated reasoning reference as input and collect a new rating prediction. We keep as verified references only those whose final rating prediction matches the ground-truth rating label. These verified references are then used to evaluate the reasoning outputs of other LLMs.
  • Figure 5: Output reasons, categorized by whether the corresponding rating prediction was correct.
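The zero-shot CoT prompting of Figure 2 and the reasoning-data fine-tuning pipeline of Figure 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, the `Rating: <1-5>` output format, the `large_lm` callable, and the helper names (`split_output`, `build_finetuning_data`) are all hypothetical, and a fake model stands in for a real LLM API.

```python
import itertools
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical CoT prompt; the paper's exact wording is not reproduced here.
COT_TEMPLATE = (
    "User history:\n{history}\n\n"
    "Item description:\n{item}\n\n"
    "Think step by step about how this user would rate the item, "
    "then end with 'Rating: <1-5>'."
)
RATING_RE = re.compile(r"Rating:\s*([1-5])")


@dataclass
class Example:
    history: str
    item: str
    true_rating: int


def split_output(output: str) -> tuple[str, Optional[int]]:
    """Split an LLM output into (reasoning, rating) using the last 'Rating: N'."""
    matches = list(RATING_RE.finditer(output))
    if not matches:
        return output.strip(), None
    last = matches[-1]
    return output[: last.start()].strip(), int(last.group(1))


def build_finetuning_data(
    examples: list[Example],
    large_lm: Callable[[str], str],
    num_samples: int = 4,
    filter_by_label: bool = True,
) -> list[dict[str, str]]:
    """Pair reasoning sampled from a large LM with ground-truth ratings."""
    data = []
    for ex in examples:
        prompt = COT_TEMPLATE.format(history=ex.history, item=ex.item)
        for _ in range(num_samples):
            reasoning, predicted = split_output(large_lm(prompt))
            # Optional filter: drop samples whose predicted rating disagrees
            # with the ground-truth label.
            if filter_by_label and predicted != ex.true_rating:
                continue
            # The training target uses the ground-truth rating, not the prediction.
            target = f"{reasoning}\nRating: {ex.true_rating}"
            data.append({"input": prompt, "target": target})
    return data


# Usage with a fake "large LM" that alternates between two canned outputs.
_canned = itertools.cycle([
    "The user loves sci-fi novels. Rating: 5",
    "The user dislikes long books. Rating: 2",
])


def fake_large_lm(prompt: str) -> str:
    return next(_canned)


data = build_finetuning_data(
    [Example(history="Liked Dune", item="A space opera", true_rating=5)],
    fake_large_lm,
    num_samples=2,
)
print(data[0]["target"])
```

With filtering enabled, only the first sample survives, because the second sample's predicted rating (2) disagrees with the label (5). The resulting input/target pairs could then be used to fine-tune the smaller LM with any standard sequence-to-sequence training loop; that step is not shown here.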
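The RecSAVER reference-generation and self-verification loop described in Figure 4 can be sketched in the same way. This is a hedged illustration, not the paper's code: the prompts, the `llm` callable, and the helper names are hypothetical, and the unigram F1 (`token_f1`) used to compare another model's reasoning with the verified references is a simple stand-in chosen for this sketch, not necessarily the metric used in the paper.

```python
import re
from collections import Counter
from typing import Callable, Optional

RATING_RE = re.compile(r"Rating:\s*([1-5])")


def parse_rating(text: str) -> Optional[int]:
    """Return the last 'Rating: N' value in text, or None if there is none."""
    matches = RATING_RE.findall(text)
    return int(matches[-1]) if matches else None


def generate_verified_references(
    examples: dict[str, tuple[str, int]],
    llm: Callable[[str], str],
    num_candidates: int = 3,
) -> dict[str, list[str]]:
    """Build label-conditioned references and keep only self-verified ones."""
    verified: dict[str, list[str]] = {}
    for ex_id, (context, true_rating) in examples.items():
        kept = []
        for _ in range(num_candidates):
            # Call 1: post hoc reasoning, with the ground-truth rating as extra input.
            reference = llm(
                f"{context}\nThe user's actual rating was {true_rating}. "
                "Explain why."
            )
            # Call 2: predict the rating from the reference alone (label withheld).
            check = llm(
                f"{context}\nReasoning: {reference}\n"
                "Give the rating as 'Rating: <1-5>'."
            )
            # Self-verification: keep the reference only if it recovers the label.
            if parse_rating(check) == true_rating:
                kept.append(reference)
        if kept:
            verified[ex_id] = kept
    return verified


def token_f1(candidate: str, reference: str) -> float:
    """Unigram F1 between two texts (a simple stand-in similarity metric)."""
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    common = sum((Counter(cand) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_reasoning(candidate: str, references: list[str]) -> float:
    """Score another model's reasoning against its best-matching verified reference."""
    return max(token_f1(candidate, ref) for ref in references)


# Usage with a deterministic fake LLM.
def fake_llm(prompt: str) -> str:
    if "actual rating was" in prompt:
        return ("The user enjoys fantasy stories." if "fantasy" in prompt
                else "The user prefers cooking books.")
    return "Rating: 5" if "fantasy" in prompt else "Rating: 4"


examples = {
    "u1-i1": ("History: liked fantasy epics. Item: a fantasy saga.", 5),
    "u2-i2": ("History: liked cookbooks. Item: a horror novel.", 2),
}
verified = generate_verified_references(examples, fake_llm, num_candidates=1)
score = score_reasoning("This user enjoys fantasy stories a lot.", verified["u1-i1"])
print(sorted(verified), round(score, 3))
```

In this toy run, the reference for "u2-i2" is discarded because re-predicting from it yields 4 rather than the true rating 2, so only "u1-i1" keeps a verified reference, and the candidate reasoning scores 2/3 against it. Taking the maximum over several verified references is one way to allow for multiple valid explanations; averaging would be another.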