Data-efficient Fine-tuning for LLM-based Recommendation

Xinyu Lin; Wenjie Wang; Yongqi Li; Shuo Yang; Fuli Feng; Yinwei Wei; Tat-Seng Chua

TL;DR

This paper tackles the high cost of fine-tuning LLM-based recommender systems by introducing DEALRec, a data-pruning method that selects a small, representative subset for few-shot fine-tuning. DEALRec combines an influence score, estimated efficiently via influence functions and Hessian-vector products, with an effort score that regularizes for the learning difficulty of LLMs, using a surrogate model to bridge the gap between LLMs and traditional recommenders. The method employs a coverage-enhanced, stratified sampling strategy to preserve diverse training signals, and it is instantiated with two backends (BIGRec and TIGER) using SASRec as the surrogate model. Empirical results on three real-world datasets show that DEALRec achieves higher accuracy than coreset baselines while reducing fine-tuning time by up to 97%, with strong robustness to different surrogate models and selection ratios. Overall, DEALRec demonstrates a practical, scalable pathway for data-efficient, LLM-based recommendation in dynamic, large-scale settings.

Abstract

Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data limits their practical application. To address this challenge, few-shot fine-tuning offers a promising approach to quickly adapt LLMs to new recommendation data. We propose the task of data pruning for efficient LLM-based recommendation, aimed at identifying representative samples tailored for LLMs' few-shot fine-tuning. While coreset selection is closely related to the proposed task, existing coreset selection methods often rely on suboptimal heuristic metrics or entail costly optimization on large-scale recommendation data. To tackle these issues, we introduce two objectives for the data pruning task in the context of LLM-based recommendation: 1) high accuracy aims to identify the influential samples that can lead to high overall performance; and 2) high efficiency underlines the low costs of the data pruning process. To pursue the two objectives, we propose a novel data pruning method based on two scores, i.e., influence score and effort score, to efficiently identify the influential samples. Particularly, the influence score is introduced to accurately estimate the influence of sample removal on the overall performance. To achieve low costs of the data pruning process, we use a small-sized surrogate model to replace LLMs to obtain the influence score. Considering the potential gap between the surrogate model and LLMs, we further propose an effort score to prioritize some hard samples specifically for LLMs. Empirical results on three real-world datasets validate the effectiveness of our proposed method. In particular, the proposed method uses only 2% of the samples to surpass full-data fine-tuning, reducing time costs by 97%.
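
The selection step summarized above (an influence score regularized by an effort score, followed by stratified sampling for coverage) can be illustrated with a minimal sketch. This is not the authors' implementation: the additive combination, the weight `lam`, the equal-width binning, and the function names `combined_score` and `stratified_select` are all assumptions made for illustration, and the scores are taken as precomputed inputs.

```python
import numpy as np

def combined_score(influence, effort, lam=0.3):
    """Overall score, assuming an additive combination (lam is a hypothetical weight)."""
    return np.asarray(influence, dtype=float) + lam * np.asarray(effort, dtype=float)

def stratified_select(scores, budget, num_bins=5, seed=0):
    """Pick up to `budget` sample indices spread across equal-width score bins."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    edges = np.linspace(scores.min(), scores.max(), num_bins + 1)
    # Map each sample to a bin id in [0, num_bins - 1] using the interior edges.
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, num_bins - 1)
    bins = [np.flatnonzero(bin_ids == b) for b in range(num_bins)]
    bins = [members for members in bins if len(members) > 0]
    # Visit the smallest bins first so their unused quota rolls over to larger bins.
    bins.sort(key=len)
    selected, remaining = [], budget
    for i, members in enumerate(bins):
        quota = remaining // (len(bins) - i)
        take = min(quota, len(members))
        picked = rng.choice(members, size=take, replace=False)
        selected.extend(picked.tolist())
        remaining -= take
    return sorted(selected)

# Toy usage with random stand-ins for the two scores (not real data).
rng = np.random.default_rng(1)
toy_scores = combined_score(rng.normal(size=200), rng.random(200))
subset = stratified_select(toy_scores, budget=20)
```

Sampling within each bin, rather than taking only the top-scored samples, is one simple way to keep low-score regions of the data represented in the pruned subset.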

Paper Structure (22 sections, 12 equations, 7 figures, 4 tables, 2 algorithms)

Introduction
Task Formulation
DEALRec
Influence Score
Gap Regularization
Few-shot Fine-tuning
Experiment
Experimental Settings
Datasets.
Baselines.
Implementation.
Overall Performance (RQ1)
In-depth Analysis
Ablation Study (RQ2).
Robustness on different surrogate model (RQ2).
...and 7 more sections

Figures (7)

Figure 1: (a) reveals that BIGRec achieves remarkable performance with only hundreds of samples. (b) shows the low costs of surrogate models.
Figure 2: Overview of DEALRec. DEALRec first trains a surrogate model on the full training samples. Subsequently, it calculates the influence score, which is then regularized by the effort score, to identify influential samples.
Figure 3: (a) depicts the different learning ability due to the prior knowledge in LLMs. (b) presents the distributions of effort scores of LLM and surrogate model on Games dataset.
Figure 4: Ablation study of the influence score, effort score, and coverage-enhanced sample selection strategy.
Figure 5: Performance of DEALRec with different selection ratio $r$ w.r.t. accuracy and efficiency on Games.
...and 2 more figures
