Direct Preference Optimization for LLM-Enhanced Recommendation Systems

Chao Sun, Yaobo Liang, Yaming Yang, Shilin Xu, Tianmeng Yang, Yunhai Tong

TL;DR

This work tackles the gap between LLM pretraining objectives and recommendation tasks by introducing DPO4Rec, a framework that extracts reasoning knowledge from LLMs, uses a recommender-based reward signal to evaluate that reasoning, and applies Direct Preference Optimization to fine-tune the LLM for improved recommendation alignment. The approach augments traditional sequential recommender models with LLM-derived reasoning and iteratively refines the LLM through pairwise preferences discovered via the reward signal. Across three datasets and multiple backbones, DPO4Rec yields consistent gains in re-ranking metrics and demonstrates robustness across diverse LLMs, indicating that guided instruction-following and knowledge integration can meaningfully enhance LLM-assisted recommendations. The findings suggest a practical pathway to deploy LLMs in production recommender systems with improved relevance and explainability, using a reward-informed, DPO-based alignment strategy.

Abstract

Large Language Models (LLMs) have exhibited remarkable performance across a wide range of domains, motivating research into their potential for recommendation systems. Early efforts have leveraged LLMs' rich knowledge and strong generalization capabilities via in-context learning, where recommendation tasks are framed as prompts. However, LLM performance in recommendation scenarios remains limited due to the mismatch between their pretraining objectives and recommendation tasks, as well as the lack of recommendation-specific data during pretraining. To address these challenges, we propose DPO4Rec, a novel framework that integrates Direct Preference Optimization (DPO) into LLM-enhanced recommendation systems. First, we prompt the LLM to infer user preferences from historical interactions, which are then used to augment traditional ID-based sequential recommendation models. Next, we train a reward model based on knowledge-augmented recommendation architectures to assess the quality of LLM-generated reasoning. Using this, we select the highest- and lowest-ranked responses from N samples to construct a dataset for LLM fine-tuning. Finally, we apply a structure alignment strategy via DPO to align the LLM's outputs with desirable recommendation behavior. Extensive experiments show that DPO4Rec significantly improves re-ranking performance over strong baselines, demonstrating enhanced instruction-following capabilities of LLMs in recommendation tasks.
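The pair-construction and alignment steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: `PreferencePair`, `build_pair`, `dpo_loss`, and the toy reward function are hypothetical names introduced here, and the reward model is assumed to be any callable that maps a prompt and a response to a scalar score. The loss is the standard DPO objective for a single pair, computed from sequence log-probabilities under the policy and a frozen reference model; the value `beta=0.1` is an illustrative default, not a setting taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container, not from the paper)."""
    prompt: str
    chosen: str
    rejected: str


def build_pair(
    prompt: str,
    responses: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> Optional[PreferencePair]:
    """Score N sampled responses and keep the highest- and lowest-ranked ones.

    Returns None when all responses receive the same reward, since such a
    pair would carry no preference signal.
    """
    if len(responses) < 2:
        raise ValueError("need at least two sampled responses")
    scored = [(reward_fn(prompt, r), r) for r in responses]
    best_score, best = max(scored, key=lambda t: t[0])
    worst_score, worst = min(scored, key=lambda t: t[0])
    if best_score == worst_score:
        return None
    return PreferencePair(prompt=prompt, chosen=best, rejected=worst)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value for illustration
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_reward(prompt: str, response: str) -> float:
    """Stand-in for the recommender-based reward model: longer text scores higher."""
    return float(len(response))


pair = build_pair(
    "Infer this user's preferences.",
    ["likes sci-fi", "likes sci-fi and slow-burn thrillers", "unknown"],
    toy_reward,
)
```

In this sketch the loss equals log 2 when the policy and the reference model assign identical log-probabilities, and it falls as the policy raises the likelihood of the chosen response relative to the rejected one. In practice the log-probabilities would come from the fine-tuned LLM and a frozen copy of it, and the reward would come from the knowledge-augmented recommender rather than a length heuristic.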

Paper Structure

This paper contains 26 sections, 1 equation, 5 figures, 1 table, and 1 algorithm.

Figures (5)

  • Figure 1: Comparison between (a) Unidirectional LLM-enhanced recommendations and (b) Bidirectional LLM-enhanced recommendations.
  • Figure 2: The overview of the proposed framework.
  • Figure 3: A prompt example designed for extracting reasoning-based knowledge about user preferences from LLMs. The blue bubble illustrates the prompt template, which becomes a complete prompt by injecting specific user-related content (orange bubble) into it. The final prompt then guides the LLM in generating inferred user preferences, as represented in the green bubble.
  • Figure 4: Ablation study of knowledge augmentation and DPO.
  • Figure 5: Analysis of iterations and samples.