A Survey on Unlearnable Data

Jiahao Li, Yiqiang Chen, Yunbing Xing, Yang Gu, Xiangyuan Lan

TL;DR

This survey examines Unlearnable Data (ULD), a proactive defense that responds to privacy and ownership concerns in data-driven learning by perturbing training data so that models cannot learn from it, while the data still looks unchanged to human observers. It categorizes ULD techniques along dimensions such as technical intent, data modality, task scenario, surrogate dependence, supervision, and perturbation boundedness, and it groups methods into Direct Input Perturbation, Feature Guided Perturbation, Parameter Guided Perturbation, and Hybrid Guided Perturbation. The paper reviews concrete generation methods for images, time series, text, and other modalities, and discusses targeted attacks that attempt to recover learnability, framing the field as an ongoing defense–attack arms race. It also analyzes evaluation metrics (unlearnability, imperceptibility, robustness, transferability, efficiency), standard experimental protocols, and real-world applications in data privacy and intellectual property protection, and it outlines key challenges and future directions such as scalability, interpretability, and ethical considerations. Together, these contributions offer a roadmap for advancing robust, transferable, and auditable ULD techniques across diverse ML contexts.

Abstract

Unlearnable data (ULD) has emerged as an innovative defense technique to prevent machine learning models from learning meaningful patterns from specific data, thus protecting data privacy and security. By introducing perturbations to the training data, ULD degrades model performance, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of ULD, existing surveys predominantly focus on related fields, such as adversarial attacks and machine unlearning, with little attention given to ULD as an independent area of study. This survey fills that gap by offering a comprehensive review of ULD, examining unlearnable data generation methods, public benchmarks, evaluation metrics, theoretical foundations and practical applications. We compare and contrast different ULD approaches, analyzing their strengths, limitations, and trade-offs related to unlearnability, imperceptibility, efficiency and robustness. Moreover, we discuss key challenges, such as balancing perturbation imperceptibility with model degradation and the computational complexity of ULD generation. Finally, we highlight promising future research directions to advance the effectiveness and applicability of ULD, underscoring its potential to become a crucial tool in the evolving landscape of data protection in machine learning.

Paper Structure

This paper contains 73 sections, 77 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: The Illustration of Unlearnable Data in Machine Learning.
  • Figure 2: An overview of the structure of the survey.
  • Figure 3: The timeline of unlearnable data (ULD) research and related studies. The lock symbol represents the defense method, the cross-star symbol represents the attack method, the balance symbol represents the evaluation method, and the rocket symbol represents the performance acceleration method.