Finetuning Large Language Model for Personalized Ranking

Zhuoxi Bai, Ning Wu, Fengyu Cai, Xinyi Zhu, Yun Xiong

TL;DR

This paper introduces Direct Multi-Preference Optimization (DMPO), a streamlined framework that bridges the gap between LLM pre-training data and the requirements of recommendation tasks by simultaneously maximizing the probability of positive samples and minimizing the probability of multiple negative samples.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across various domains, motivating researchers to investigate their potential use in recommendation systems. However, directly applying LLMs to recommendation tasks has proven challenging due to the significant disparity between the data used for pre-training LLMs and the specific requirements of recommendation tasks. In this study, we introduce Direct Multi-Preference Optimization (DMPO), a streamlined framework designed to bridge the gap and enhance the alignment of LLMs for recommendation tasks. DMPO enhances the performance of LLM-based recommenders by simultaneously maximizing the probability of positive samples and minimizing the probability of multiple negative samples. We conducted experimental evaluations to compare DMPO against traditional recommendation methods and other LLM-based recommendation approaches. The results demonstrate that DMPO significantly improves the recommendation capabilities of LLMs across three real-world public datasets in few-shot scenarios. Additionally, the experiments indicate that DMPO exhibits superior generalization ability in cross-domain recommendations. A case study elucidates the reasons behind these consistent improvements and also underscores DMPO's potential as an explainable recommendation system.
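
The objective described in the abstract can be illustrated with a minimal sketch. It assumes a DPO-style formulation in which the positive item's log-probability ratio (policy versus a frozen reference model) is contrasted with the mean ratio over several negatives; the paper's exact loss is given in its equations, which are not reproduced in this summary. The function name `dmpo_loss`, its arguments, and the default `beta` are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dmpo_loss(pos_logp, pos_ref_logp, neg_logps, neg_ref_logps, beta=0.1):
    """Illustrative DPO-style loss with one positive and several negatives.

    Assumption (not taken from the paper text shown here): the negatives'
    implicit rewards are averaged before being compared with the positive's.
    All inputs are sequence log-probabilities of candidate items.
    """
    if not neg_logps or len(neg_logps) != len(neg_ref_logps):
        raise ValueError("need matching, non-empty lists of negative log-probs")

    # Implicit reward: scaled log-ratio between policy and reference model.
    pos_reward = beta * (pos_logp - pos_ref_logp)
    neg_rewards = [
        beta * (lp - ref_lp) for lp, ref_lp in zip(neg_logps, neg_ref_logps)
    ]

    # Positive reward minus the mean reward of the negatives.
    margin = pos_reward - sum(neg_rewards) / len(neg_rewards)

    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

Under these assumptions, the loss equals log 2 when the policy matches the reference on every item, and it decreases as the policy raises the positive item's probability or lowers the negatives' probabilities relative to the reference.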

Paper Structure (34 sections, 5 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: When aligning LLMs with recommendation tasks, applying Supervised Fine-Tuning (SFT) to the LLM helps maximize the probability of generating each token of the positive items. However, this method overlooks the potential benefits of comparing positive samples against multiple negative samples. As a result, although a model trained solely with SFT may improve the probability of positive items during training, it may overestimate the probability of unseen negative items during testing, leading to incorrect predictions.
  • Figure 2: We first perform SFT on the base LLM and then proceed with DMPO. SFT samples and DMPO samples are constructed as training inputs. Both SFT and DMPO are trained using LoRA. DMPO aims to maximize the probability of positive samples while simultaneously minimizing the probability of multiple negative samples.
  • Figure 3: Performance for different numbers of negative samples in DMPO. The x-axis shows the number of negative samples.
  • Figure 4: Comparison of SFT+DMPO and DMPO alone under different numbers of few-shot samples.
  • Figure 5: Performance of DMPO with different base models. In the labels "SFT+DMPO(1)" and "SFT+DMPO(3)", the number in parentheses is the number of negative samples used in DMPO.
  • ...and 3 more figures
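
The Figure 2 caption mentions constructing DMPO samples that pair a positive item with multiple negatives. A hedged sketch of what one such sample might look like follows; the field names (`prompt`, `chosen`, `rejected`), the uniform sampling of negatives, and the function `build_dmpo_sample` are assumptions for illustration, not the authors' data format.

```python
import random


def build_dmpo_sample(prompt, candidates, positive, num_negatives, rng=None):
    """Pair one positive item with `num_negatives` sampled negative items.

    Hypothetical helper: negatives are drawn uniformly without replacement
    from the candidate list, excluding the positive item.
    """
    if positive not in candidates:
        raise ValueError("positive item must be one of the candidates")

    pool = [item for item in candidates if item != positive]
    if num_negatives > len(pool):
        raise ValueError("not enough candidates to draw the requested negatives")

    rng = rng or random.Random()
    return {
        "prompt": prompt,
        "chosen": positive,
        "rejected": rng.sample(pool, num_negatives),
    }


# Example with three negatives, matching the "SFT+DMPO(3)" setting in Figure 5.
sample = build_dmpo_sample(
    prompt="Given the user's history, which item comes next?",
    candidates=["item_a", "item_b", "item_c", "item_d", "item_e"],
    positive="item_c",
    num_negatives=3,
    rng=random.Random(0),
)
```

Passing a seeded `random.Random` keeps the drawn negatives reproducible across runs.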