A Survey on Unlearnable Data
Jiahao Li, Yiqiang Chen, Yunbing Xing, Yang Gu, Xiangyuan Lan
TL;DR
This survey addresses privacy and ownership concerns in data-driven learning by examining Unlearnable Data (ULD), a proactive defense that perturbs training data to hinder model learning while keeping the changes imperceptible to humans. It categorizes ULD techniques along six dimensions (technical intent, data modality, task scenario, surrogate dependence, supervision, and perturbation boundedness) and groups methods into Direct Input Perturbation, Feature Guided Perturbation, Parameter Guided Perturbation, and Hybrid Guided Perturbation. The paper reviews concrete generation methods for images, time series, text, and other modalities, and discusses targeted attacks that attempt to recover learnability, framing the field as an ongoing defense–attack arms race. It further analyzes evaluation metrics (unlearnability, imperceptibility, robustness, transferability, efficiency), standard experimental protocols, and real-world applications in data privacy and intellectual property protection, and it outlines key challenges and future directions such as scalability, interpretability, and ethical considerations. Together, these contributions provide a roadmap for advancing robust, transferable, and auditable ULD techniques across diverse ML contexts.
Abstract
Unlearnable data (ULD) has emerged as an innovative defense technique that prevents machine learning models from learning meaningful patterns from specific data, thereby protecting data privacy and security. By introducing perturbations to the training data, ULD degrades the performance of models trained on it, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of ULD, existing surveys predominantly focus on related fields, such as adversarial attacks and machine unlearning, and give little attention to ULD as an independent area of study. This survey fills that gap with a comprehensive review of ULD that examines unlearnable data generation methods, public benchmarks, evaluation metrics, theoretical foundations, and practical applications. We compare different ULD approaches, analyzing their strengths, limitations, and trade-offs among unlearnability, imperceptibility, efficiency, and robustness. Moreover, we discuss key challenges, such as balancing perturbation imperceptibility against model degradation and the computational complexity of ULD generation. Finally, we highlight promising future research directions to advance the effectiveness and applicability of ULD, underscoring its potential to become a crucial tool in the evolving landscape of data protection in machine learning.
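To make the idea of a bounded, hard-to-notice perturbation concrete, the following is a minimal sketch, not a method taken from the survey. It assumes flattened images with pixel values in [0, 1], an L-infinity budget of 8/255 (a common choice in the image literature, assumed here), and one fixed random noise pattern per class. The names `make_classwise_noise`, `perturb`, and `linf_distance` are hypothetical. Practical ULD methods typically optimize the noise rather than sampling it at random; this toy version only illustrates the boundedness constraint and the label-correlated structure.

```python
import random

# Assumed L-infinity budget; chosen for illustration, not specified by the survey.
EPSILON = 8 / 255


def make_classwise_noise(num_classes, num_pixels, epsilon=EPSILON, seed=0):
    """Draw one fixed noise pattern per class, each entry in [-epsilon, epsilon]."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-epsilon, epsilon) for _ in range(num_pixels)]
        for _ in range(num_classes)
    ]


def perturb(image, label, noise):
    """Add the noise pattern of the image's class, then clip back to [0, 1]."""
    return [min(1.0, max(0.0, x + d)) for x, d in zip(image, noise[label])]


def linf_distance(a, b):
    """Largest absolute per-pixel difference between two flattened images."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    image = [0.5] * 16                      # a flat gray 4x4 image, flattened
    noise = make_classwise_noise(num_classes=3, num_pixels=16)
    protected = perturb(image, label=1, noise=noise)
    print("max pixel change:", linf_distance(image, protected))
    print("budget:", EPSILON)
```

Every sample of a given class receives the same pattern, so the perturbation correlates with the label while each pixel moves by at most the budget. Whether such noise actually hinders a particular model has to be measured with the evaluation metrics the survey describes.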
