On Softmax Direct Preference Optimization for Recommendation

Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua

TL;DR

This work identifies that standard LM-based recommender training optimizes a next-token objective, which underutilizes user preferences for ranking. It introduces Softmax-DPO (S-DPO), a multi-negative, listwise extension of Direct Preference Optimization that leverages a Plackett-Luce–style ranking and connects to softmax sampling, enabling hard-negative mining. The authors provide a theoretical derivation linking S-DPO to DPO and to softmax/contrastive losses, and demonstrate that multi-negative signaling yields stronger ranking gradients and better rewards for preferred items. Empirically, S-DPO consistently outperforms traditional and LM-based baselines across MovieLens, Goodreads, and LastFM, with substantial HR@1 gains and high validity of responses, suggesting practical impact for improving LM-based recommender systems.

Abstract

Recommender systems aim to predict personalized rankings based on user preference data. With the rise of Language Models (LMs), LM-based recommenders have been widely explored due to their extensive world knowledge and powerful reasoning abilities. Most LM-based recommenders convert historical interactions into language prompts, pair each prompt with a positive item as the target response, and fine-tune the LM with a language modeling loss. However, this objective fails to fully leverage preference data and is not optimized for personalized ranking tasks, which hinders the performance of LM-based recommenders. Inspired by recent advances in Direct Preference Optimization (DPO) for human preference alignment and the success of softmax loss in recommendation, we propose Softmax-DPO (S-DPO) to instill ranking information into the LM, helping LM-based recommenders distinguish preferred items from negatives rather than focusing solely on positives. Specifically, we incorporate multiple negatives in user preference data and devise an alternative version of the DPO loss tailored for LM-based recommenders, which is extended from the traditional full-ranking Plackett-Luce (PL) model to partial rankings and connected to softmax sampling strategies. Theoretically, we bridge S-DPO with the softmax loss over negative sampling and find that it has an inherent benefit of mining hard negatives, which supports its strong capabilities in recommendation tasks. Empirically, extensive experiments conducted on three real-world datasets demonstrate that S-DPO models user preference more effectively and further boosts recommendation performance while providing better rewards for preferred items. Our code is available at https://github.com/chenyuxin1999/S-DPO.
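
To make the multi-negative objective described in the abstract concrete, here is a minimal per-example sketch in plain Python. It assumes the loss takes the form $-\log \sigma\big(-\log \sum_{d} \exp(r_d - r_p)\big)$, where $r = \beta\,(\log \pi_\theta - \log \pi_{\mathrm{ref}})$ is the implicit reward of the positive item $p$ or a negative item $d$. That form is an assumption based on the abstract's description and should be checked against the paper's equations and the released code. The function names, argument names, and numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pos_logp, pos_ref_logp, neg_logp, neg_ref_logp, beta=1.0):
    """Pairwise DPO loss for one (positive, negative) pair."""
    margin = beta * (pos_logp - pos_ref_logp) - beta * (neg_logp - neg_ref_logp)
    return -log_sigmoid(margin)


def s_dpo_loss(pos_logp, pos_ref_logp, neg_logps, neg_ref_logps, beta=1.0):
    """Sketch of a softmax-style multi-negative DPO loss (assumed form).

    Inputs are sequence log-probabilities of each item under the policy
    and under the frozen reference model.
    """
    pos_reward = beta * (pos_logp - pos_ref_logp)
    # Reward gap of every negative relative to the positive item.
    gaps = [beta * (lp - ref) - pos_reward
            for lp, ref in zip(neg_logps, neg_ref_logps)]
    # Stable log-sum-exp over the gaps.
    m = max(gaps)
    lse = m + math.log(sum(math.exp(g - m) for g in gaps))
    return -log_sigmoid(-lse)


# Hypothetical log-probabilities, for illustration only.
one_neg = s_dpo_loss(-1.0, -1.5, [-2.0], [-1.8])
pairwise = dpo_loss(-1.0, -1.5, -2.0, -1.8)
two_neg = s_dpo_loss(-1.0, -1.5, [-2.0, -1.2], [-1.8, -1.6])
print(one_neg, pairwise, two_neg)
```

Under this assumed form, a single negative reduces the expression to the pairwise DPO loss, which matches the abstract's framing of S-DPO as an extension of DPO. The log-sum-exp term is dominated by the negatives with the highest implicit reward, which is one way to read the hard-negative-mining claim.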

Paper Structure

This paper contains 36 sections, 22 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Framework of S-DPO. Unlike existing methods, which fine-tune LMs with a language modeling loss that is not tailored to recommendation, S-DPO explicitly instills ranking information into LMs. Going one step further, S-DPO incorporates multiple negatives in user preference data and generalizes the pairwise DPO loss to a softmax ranking loss.
  • Figure 2: Study on S-DPO. (a) Ablation study of S-DPO compared with SFT and DPO on three datasets. (b) Comparison of validation loss trends between DPO and S-DPO on LastFM. (c) Comparison of the reward of preferred items between DPO and S-DPO on LastFM.
  • Figure 3: Effect of $\beta$ and the number of negative samples for S-DPO on LastFM. (a) Performance with varying numbers of negative samples ($\beta = 1$). (b) Performance with varying values of $\beta$, with the number of negative samples set to 3. (c) Validity with varying values of $\beta$, with the number of negative samples set to 3.