RosePO: Aligning LLM-based Recommenders with Human Values

Jiayi Liao, Xiangnan He, Ruobing Xie, Jiancan Wu, Yancheng Yuan, Xingwu Sun, Zhanhui Kang, Xiang Wang

TL;DR

This work proposes Recommendation with smoothing personalized Preference Optimization (RosePO), a general framework that aligns LLM-based recommenders with customized human values during the post-training stage. It introduces a personalized smoothing factor, predicted by a preference oracle, into the optimization objective.

Abstract

Recently, there has been a growing interest in leveraging Large Language Models (LLMs) for recommendation systems, which usually adapt a pre-trained LLM to the recommendation scenario through supervised fine-tuning (SFT). However, both the pre-training and SFT stages fail to explicitly model the comparative relationships of a user's preferences on different items. To construct a "helpful and harmless" LLM-based recommender, we propose a general framework -- Recommendation with smoothing personalized Preference Optimization (RosePO), which better aligns with customized human values during the post-training stage. Specifically, in addition to the input and chosen response that naturally align with SFT data, we design a rejected sampling strategy tailored for enhancing helpfulness, along with two strategies aimed at mitigating biases to promote harmlessness. To ensure robustness against uncertain labels present in automatically constructed preference data, we introduce a personalized smoothing factor predicted by a preference oracle into the optimization objective. Evaluation on three real-world datasets demonstrates the effectiveness of our method, showcasing not only improved recommendation performance but also mitigation of semantic hallucination and popularity bias.
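
To make the idea of a personalized smoothing factor concrete, the following is a minimal sketch, not the paper's objective, which is not reproduced in this summary. It assumes a DPO-style pairwise loss with label smoothing, where each preference pair carries its own smoothing value `epsilon`. In RosePO that value would come from the preference oracle; here it is simply passed in. The function names, the `beta` default, and the exact form of the loss are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def smoothed_preference_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    epsilon: float,
    beta: float = 0.1,
) -> float:
    """Label-smoothed DPO-style loss for one (chosen, rejected) pair.

    ASSUMPTION: this form is a generic smoothed preference loss, used
    here only to illustrate a per-example smoothing factor.

    epsilon in [0, 0.5] is the assumed probability that the preference
    label is flipped. epsilon = 0 recovers the standard DPO loss;
    epsilon = 0.5 treats the pair as uninformative.
    """
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the frozen reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Weight the "chosen wins" term by (1 - epsilon) and the
    # "rejected wins" term by epsilon.
    return -(1.0 - epsilon) * log_sigmoid(margin) - epsilon * log_sigmoid(-margin)


# Hypothetical per-pair values: a confident pair and an uncertain pair.
confident = smoothed_preference_loss(-1.0, -2.0, -1.5, -1.5, epsilon=0.0)
uncertain = smoothed_preference_loss(-1.0, -2.0, -1.5, -1.5, epsilon=0.4)
```

Under this assumed form, a pair with a larger `epsilon` pushes the model less strongly toward the chosen item, which matches the stated goal of robustness to uncertain labels in automatically constructed preference data.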

Paper Structure

This paper contains 32 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Training of an LLM-based recommender.
  • Figure 2: Since this user loves romantic movies, the preference for her subsequent interaction with "The Bridges of Madison County" over "Before Sunrise" exhibits greater uncertainty than over "Toy Story".
  • Figure 3: The framework of RosePO comprises two main components: preference data construction and personalized optimization. (1) The rejected items are sampled based on specific preferences for HH. (2) We estimate the uncertainty of preference for each data point guided by a preference oracle, and inject this personalized uncertainty as a smoothing factor into the optimization objective.
  • Figure 4: HR@1 of SFT and RosePO-s on ranking semantic-similar item candidates.
  • Figure 5: Distribution of semantic bias and popularity bias for SFT and RosePO (-s and -p).
  • ...and 1 more figure