
MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning

Jiali Cheng, Hadi Amiri

TL;DR

This work addresses the lack of a standardized, cross-domain evaluation for machine unlearning by introducing MU-Bench, a multitask multimodal benchmark that unifies deleted data and baseline models and extends to audio, video, and biomedical domains. It provides an extensible framework, a unified taxonomy of unlearning strategies under a teacher-student paradigm, and a retrain-free evaluation protocol with standardized datasets, splits, and a leaderboard. Comprehensive experiments reveal that RandLabel and SalUn are generally robust across discriminative tasks, while Bad-T and SCRUB struggle to forget deletion data consistently, with notable challenges in audio and video modalities. The benchmark enables fair comparisons, reveals design choices that influence unlearning performance, and points to future directions in efficient methods, bias mitigation, curriculum strategies, and theoretical guarantees with real-world impact on data rights like the right to be forgotten.

Abstract

Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods has been inconsistent, employing different trained models, architectures, and sample removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on single tasks or modalities, which is not comprehensive. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation shows that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leaderboard to enable unified and scalable MU research.

Paper Structure (41 sections, 15 figures, 6 tables)

Figures (15)

  • Figure 1: The MU-Bench benchmark for machine unlearning (MU) spans a comprehensive range of tasks and modalities, including previously unexplored data types such as audio, video, and biomedical data. The open-source package of MU-Bench provides standardized (unified) data splits, implements a suite of commonly used MU methods and their design choices, enables fast experimentation and fair comparisons across MU methods, and is structured to easily incorporate new datasets and tasks in the future.
  • Figure 2: Overall average accuracy across all discriminative tasks.
  • Figure 3: Overall average performance across all generative tasks.
  • Figure 4: Transfer performances.
  • Figure 5: Scaling of $D_f$ performance.
  • ...and 10 more figures