Direct Preference Optimization for LLM-Enhanced Recommendation Systems
Chao Sun, Yaobo Liang, Yaming Yang, Shilin Xu, Tianmeng Yang, Yunhai Tong
TL;DR
This work tackles the gap between LLM pretraining objectives and recommendation tasks by introducing DPO4Rec, a framework that extracts reasoning knowledge from LLMs, uses a recommender-based reward signal to evaluate that reasoning, and applies Direct Preference Optimization to fine-tune the LLM for improved recommendation alignment. The approach augments traditional sequential recommender models with LLM-derived reasoning and iteratively refines the LLM through pairwise preferences discovered via the reward signal. Across three datasets and multiple backbones, DPO4Rec yields consistent gains in re-ranking metrics and demonstrates robustness across diverse LLMs, indicating that guided instruction-following and knowledge integration can meaningfully enhance LLM-assisted recommendations. The findings suggest a practical pathway to deploy LLMs in production recommender systems with improved relevance and explainability, using a reward-informed, DPO-based alignment strategy.
Abstract
Large Language Models (LLMs) have exhibited remarkable performance across a wide range of domains, motivating research into their potential for recommendation systems. Early efforts have leveraged LLMs' rich knowledge and strong generalization capabilities via in-context learning, where recommendation tasks are framed as prompts. However, LLM performance in recommendation scenarios remains limited due to the mismatch between their pretraining objectives and recommendation tasks, as well as the lack of recommendation-specific data during pretraining. To address these challenges, we propose DPO4Rec, a novel framework that integrates Direct Preference Optimization (DPO) into LLM-enhanced recommendation systems. First, we prompt the LLM to infer user preferences from historical interactions; the inferred preferences are then used to augment traditional ID-based sequential recommendation models. Next, we train a reward model based on knowledge-augmented recommendation architectures to assess the quality of LLM-generated reasoning. Using this reward model, we score N sampled responses per prompt and select the highest- and lowest-ranked ones to construct a preference dataset for LLM fine-tuning. Finally, we apply a structure alignment strategy via DPO to align the LLM's outputs with desirable recommendation behavior. Extensive experiments show that DPO4Rec significantly improves re-ranking performance over strong baselines, demonstrating enhanced instruction-following capabilities of LLMs in recommendation tasks.
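
The pair-selection and alignment steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `build_preference_pair`, `dpo_loss`, the `reward_fn` callable, and the default `beta = 0.1` are hypothetical names and values chosen for illustration. The reward function stands in for the recommender-based reward model, and the loss is the standard per-example DPO objective computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    prompt: str,
    responses: List[str],
    reward_fn: Callable[[str, str], float],  # hypothetical stand-in for the reward model
) -> Optional[Dict[str, str]]:
    """Pick the highest- and lowest-reward responses out of N samples."""
    if len(responses) < 2:
        return None
    scored = sorted(responses, key=lambda r: reward_fn(prompt, r))
    worst, best = scored[0], scored[-1]
    # Skip prompts where the reward cannot separate the samples.
    if reward_fn(prompt, best) == reward_fn(prompt, worst):
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard per-example DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Toy reward: longer reasoning scores higher (illustration only).
    toy_reward = lambda prompt, response: float(len(response))
    samples = ["likes sci-fi", "likes sci-fi and recent space operas", "unclear"]
    print(build_preference_pair("user history: ...", samples, toy_reward))
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In this sketch the loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is the behavior the alignment step relies on. In practice the log-probabilities would come from the fine-tuned LLM and its frozen copy.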
