Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

Mingqiao Zhang; Qiyao Peng; Yumeng Wang; Chunyuan Liu; Hongtao Liu

TL;DR

This paper identifies and investigates a previously overlooked issue in LLM-based recommendation: benchmark data leakage. It shows that leakage is a critical, previously unaccounted-for factor that can distort the measurement of true model performance.

Abstract

The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. Leakage occurs when an LLM is exposed to, and potentially memorizes, benchmark datasets during pre-training or fine-tuning, which artificially inflates performance metrics so that they no longer reflect true model performance. To validate this phenomenon, we simulate diverse leakage scenarios by continuing the pre-training of foundation models on strategically blended corpora that include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual effect of data leakage. When the leaked data is domain-relevant, it induces substantial but spurious performance gains that exaggerate the model's capability. In contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings show that data leakage is a critical, previously unaccounted-for factor in LLM-based recommendation that can distort the measurement of true model performance. We release our code at https://github.com/yusba1/LLMRec-Data-Leakage.
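As a rough illustration of what a "strategically blended" corpus could look like, the following sketch samples in-domain and out-of-domain interactions at a fixed ratio and renders them as plain text. The function names (`build_mixed_corpus`, `to_text`), the `(user, item, label)` record layout, the `id_ratio` parameter, and the text template are all assumptions made for this example; the paper's actual sampling strategy and prompt format are not specified here and may differ.

```python
import random


def build_mixed_corpus(in_domain, out_of_domain, id_ratio, total, seed=0):
    """Sample `total` records, a fraction `id_ratio` of them in-domain.

    Hypothetical helper: the record layout and ratio-based sampling are
    assumptions for illustration, not the paper's documented procedure.
    """
    if not 0.0 <= id_ratio <= 1.0:
        raise ValueError("id_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the blend reproducible
    n_id = round(total * id_ratio)
    n_ood = total - n_id
    if n_id > len(in_domain) or n_ood > len(out_of_domain):
        raise ValueError("not enough records to sample without replacement")
    mixed = rng.sample(in_domain, n_id) + rng.sample(out_of_domain, n_ood)
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


def to_text(record):
    """Render one (user, item, label) record as a training sentence.

    The template is a placeholder; any real format is an assumption.
    """
    user, item, label = record
    verdict = "liked" if label == 1 else "disliked"
    return f"User {user} {verdict} item {item}."
```

Varying `id_ratio` between 0 and 1 would move the simulated leakage from purely domain-irrelevant to purely domain-relevant, which is the axis along which the abstract reports opposite effects.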

Paper Structure (22 sections, 9 equations, 3 figures, 6 tables)

Introduction
Related Work
Large Language Models
LLM-based Recommendation
Methodology
Overview
Leakage Data Construction
Injecting leaked data
Loss definition
Contaminated adaptation (Dirty LLM)
LoRA parameterization and what is updated
Justification for LoRA as a controlled proxy
Experiment
Experiment settings
Datasets
...and 7 more sections

Figures (3)

Figure 1: Overview of the Experimental Framework. The workflow illustrates the construction of a Mixed Leakage Corpus through strategic sampling of In-Domain (ID) and Out-Of-Domain (OOD) data. The Clean Path represents the baseline evaluation using a frozen Base LLM, while the Dirty Path simulates benchmark contamination via LoRA injection. By comparing the Clean and Dirty Recommenders, we identify the Triple Effect of Leakage: spurious performance gains, stability, or degradation.
Figure 2: AUC under different data injection methods
Figure 3: UAUC under different data injection methods
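Figures 2 and 3 report AUC and UAUC. The page does not define either metric, so the sketch below assumes the common convention: AUC is computed over all test interactions pooled together, while UAUC is the AUC computed separately for each user and then averaged, skipping users whose labels are all positive or all negative. The function names and the tie-handling rule (ties count as half a win) are choices made for this example, not details taken from the paper.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as 0.5. Returns None when one class is missing, since
    AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def uauc(user_ids, labels, scores):
    """Average of per-user AUC values (assumed definition of UAUC).

    Users with only one label class are skipped.
    """
    by_user = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        by_user[u][0].append(y)
        by_user[u][1].append(s)
    per_user = [auc(ys, ss) for ys, ss in by_user.values()]
    per_user = [a for a in per_user if a is not None]
    return sum(per_user) / len(per_user) if per_user else None
```

The pairwise loop is quadratic in the number of interactions and is meant only to make the definition explicit; a rank-based implementation would be preferable for real evaluation sets.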
