Shapley Value-driven Data Pruning for Recommender Systems

Yansen Zhang, Xiaokun Zhang, Ziqiang Cui, Chen Ma

TL;DR

This work tackles noisy implicit feedback in recommender systems by shifting from intent-based denoising to a principled data-valuation approach. It introduces SVV, which uses FastSHAP to estimate the marginal training utility of each user–item interaction via Shapley values, enabling real-time pruning of low-utility data. A simulated noise protocol provides a verifiable ground truth for evaluating denoising effectiveness, and extensive experiments on four real datasets show that SVV outperforms existing denoising methods in accuracy and robustness while preserving training-critical interactions. The findings offer interpretable, model-driven guidance for data cleaning in recommender pipelines with practical implications for improving robustness and learning efficiency.

Abstract

Recommender systems often suffer from noisy interactions like accidental clicks or popularity bias. Existing denoising methods typically identify users' intent in their interactions, and filter out noisy interactions that deviate from the assumed intent. However, they ignore that interactions deemed noisy could still aid model training, while some "clean" interactions offer little learning value. To bridge this gap, we propose Shapley Value-driven Valuation (SVV), a framework that evaluates interactions based on their objective impact on model training rather than subjective intent assumptions. In SVV, a real-time Shapley value estimation method is devised to quantify each interaction's value based on its contribution to reducing training loss. Afterward, SVV highlights the interactions with high values while downplaying low ones to achieve effective data pruning for recommender systems. In addition, we develop a simulated noise protocol to examine the performance of various denoising approaches systematically. Experiments on four real-world datasets show that SVV outperforms existing denoising methods in both accuracy and robustness. Further analysis also demonstrates that SVV can preserve training-critical interactions and offer interpretable noise assessment. This work shifts denoising from heuristic filtering to principled, model-driven interaction valuation.
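To make the idea of valuing an interaction by its contribution to reducing loss concrete, the following is a minimal sketch of exact Shapley values on a toy problem. It is not the SVV estimator: the paper relies on a fast, amortized approximation, whereas this sketch enumerates every ordering, which is only feasible for a handful of points. The mean-predictor model, the validation target of 1.0, and the names `utility` and `exact_shapley` are all assumptions made for illustration.

```python
from itertools import permutations


def utility(labels):
    """Toy value function (an assumption): negative squared validation error
    of a model that predicts the mean of its training labels.
    The validation target is fixed at 1.0; an empty training set predicts 0.0."""
    prediction = sum(labels) / len(labels) if labels else 0.0
    return -(prediction - 1.0) ** 2


def exact_shapley(labels):
    """Average each point's marginal gain in utility over all orderings."""
    n = len(labels)
    values = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        subset = []
        prev = utility(subset)
        for idx in order:
            subset.append(labels[idx])
            cur = utility(subset)
            values[idx] += cur - prev  # marginal contribution of point idx
            prev = cur
    return [v / len(orders) for v in values]


# Three consistent training points and one "noisy" point (label 0.0).
labels = [1.0, 1.0, 1.0, 0.0]
shapley = exact_shapley(labels)

# Prune points whose estimated contribution is not positive.
keep = [i for i, v in enumerate(shapley) if v > 0]
```

In this toy setting the point with label 0.0 receives a negative value and is pruned, while the three consistent points share the remaining utility equally. The values also sum to the utility of the full set minus the utility of the empty set, which is the efficiency property of Shapley values.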

Paper Structure

This paper contains 23 sections, 20 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Performance comparison of Clean training (all interactions, rating scores in [1, 5]) and Clean3 training (only interactions with rating scores $\geq$ 3) in terms of Recall@10 and NDCG@10 over DAE.
  • Figure 2: Contrasting Shapley value-based exclusion strategies: segment vs. cumulative impact on recommendation performance (Recall@10 and NDCG@10) and model training (Value Function) on the CDs dataset.
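Both figures report Recall@10 and NDCG@10. As a rough sketch of how these metrics are commonly computed for a single user with binary relevance, consider the code below. The exact variant used in the paper is not stated here, so the normalization choices (recall divided by the number of relevant items, ideal DCG capped at k) and the function names are assumptions.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: discounted gain of hits over the ideal ranking."""
    if not relevant_items:
        return 0.0
    # Position 0 is rank 1, so the discount is 1 / log2(position + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(k, len(relevant_items))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg


# Hypothetical ranking for one user; item IDs are illustrative only.
ranked = [3, 1, 7, 5]
relevant = {1, 5, 9}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

With k=3, only item 1 (at rank 2) is a hit, so recall is one third and NDCG is the discount at rank 2 divided by the ideal DCG of three hits. Dataset-level scores are typically the average of these per-user values.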