CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence

Chaochao Chen; Jiaming Zhang; Yizhao Zhang; Li Zhang; Lingjuan Lyu; Yuyuan Li; Biao Gong; Chenggang Yan

TL;DR

The paper addresses privacy concerns around data deletion rights in recommender systems by introducing CURE4Rec, the first comprehensive benchmark for evaluating recommendation unlearning. It defines four evaluation aspects (unlearning completeness, recommendation utility, unlearning efficiency, and recommendation fairness) under three unlearning-set selection strategies (core data, edge data, and random data), enabling consistent comparison between exact unlearning (EU) and approximate unlearning (AU) methods. The study finds that EU methods guarantee completeness but can degrade utility and fairness, while AU methods such as SCIF improve efficiency and fairness at a modest cost in completeness. These findings suggest that fairness and robustness should be considered when designing and evaluating practical unlearning systems. The authors release code and datasets to support public benchmarking and reproducibility.

Abstract

With increasing privacy concerns in artificial intelligence, regulations have mandated the right to be forgotten, granting individuals the right to withdraw their data from models. Machine unlearning has emerged as a potential solution to enable selective forgetting in models, particularly in recommender systems where historical data contains sensitive user information. Despite recent advances in recommendation unlearning, evaluating unlearning methods comprehensively remains challenging due to the absence of a unified evaluation framework and overlooked aspects of deeper influence, e.g., fairness. To address these gaps, we propose CURE4Rec, the first comprehensive benchmark for recommendation unlearning evaluation. CURE4Rec covers four aspects, i.e., unlearning Completeness, recommendation Utility, unleaRning efficiency, and recommendation fairnEss, under three data selection strategies, i.e., core data, edge data, and random data. Specifically, we consider the deeper influence of unlearning on recommendation fairness and robustness towards data with varying impact levels. We construct multiple datasets with CURE4Rec evaluation and conduct extensive experiments on existing recommendation unlearning methods. Our code is released at https://github.com/xiye7lai/CURE4Rec.
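The three data selection strategies named above can be illustrated with a minimal sketch. It is not taken from the CURE4Rec code: the function name `select_unlearning_users`, the `ratio` parameter, and in particular the reading of "core" as the most active users and "edge" as the least active users are assumptions made for illustration only; the benchmark's actual selection criterion may differ.

```python
import random
from collections import Counter


def select_unlearning_users(interactions, strategy, ratio=0.1, seed=0):
    """Pick the users whose data will form the unlearning set.

    interactions: iterable of (user_id, item_id) pairs.
    strategy: "core", "edge", or "random".

    ASSUMPTION (not from the paper): "core" means the most active users
    and "edge" the least active users, measured by interaction count.
    """
    counts = Counter(user for user, _ in interactions)
    users = sorted(counts)  # deterministic base order
    k = max(1, int(len(users) * ratio))  # size of the unlearning set

    if strategy == "random":
        # Uniform sample without replacement, reproducible via the seed.
        return sorted(random.Random(seed).sample(users, k))

    # Most active first; ties broken by user id for determinism.
    by_activity = sorted(users, key=lambda u: (-counts[u], u))
    if strategy == "core":
        return sorted(by_activity[:k])
    if strategy == "edge":
        return sorted(by_activity[-k:])
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Evaluating the same unlearning method on all three sets is what allows a comparison of its robustness to data with different levels of influence.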

Paper Structure (33 sections, 1 equation, 8 figures, 8 tables)

Introduction
Related Work
Machine Unlearning
Recommendation Unlearning
Machine Unlearning Benchmarks
CURE4Rec
Recommendation Unlearning
Evaluation Aspects
Unlearning Completeness.
Recommendation Utility.
Unlearning Efficiency.
Recommendation Fairness.
Unlearning Set Selection
Experimental Setup
Datasets
...and 18 more sections

Figures (8)

Figure 1: An illustration of CURE4Rec, a comprehensive benchmark tailored for evaluating recommendation unlearning methods. CURE4Rec evaluates unlearning methods using data with varying levels of unlearning impact on four aspects, i.e., unlearning completeness, recommendation utility, unlearning efficiency, and recommendation fairness.
Figure 2: A visualized evaluation overview of recommendation unlearning methods in four aspects ($\uparrow$), where the result is the normalized average outcome obtained across all models and datasets, using random data as the unlearning set. The recommendation fairness is measured by A-IGF (fairness between active and inactive users). The higher values represent better performance.
Figure 3: Recommendation fairness results for exact recommendation unlearning methods on WMF, where A-IGF (approaching Retrain) and shardGF ($\downarrow$) evaluate fairness at the group level and the shard level, respectively.
Figure 4: Effect of shard number in terms of multiple aspects, i.e., recommendation utility ($\uparrow$), unlearning efficiency ($\downarrow$), group-level fairness (approaching Retrain), and shard-level fairness ($\downarrow$).
Figure 5: A visualized evaluation overview of recommendation unlearning methods in four aspects ($\uparrow$), where the result is the normalized average outcome obtained across all models, using random data as the unlearning set. The recommendation fairness is measured by A-IGF (fairness between active and inactive users).
...and 3 more figures
