Learning from Convolution-based Unlearnable Datasets
Dohyun Kim, Pedro Sandoval-Segura
TL;DR
The paper addresses the protection of data from unauthorized training by evaluating the Convolution-based Unlearnable Dataset (CUDA), an existing method that applies class-wise blurs to render data unlearnable. It investigates whether such data stays unusable under countermeasures and shows that Random Sharpening Kernels (RSK) combined with Discrete Cosine Transform-based Frequency Filtering (FF) restore learnability, yielding substantially higher test accuracy than adversarial training on CIFAR-10, CIFAR-100, and ImageNet-100. The key insight is that the shortcut learning induced by CUDA can be broken by sharpening images and filtering their frequencies, which suggests that unlearnable-data methods must be made more robust to simple transforms. The work informs data privacy practice by showing that robust unlearnable datasets remain a moving target and that data-protection techniques need ongoing development.
Abstract
The construction of large datasets for deep learning has raised concerns regarding unauthorized use of online data, leading to increased interest in protecting data from third parties who want to use it for training. The Convolution-based Unlearnable DAtaset (CUDA) method aims to make data unlearnable by applying class-wise blurs to every image in the dataset, so that neural networks learn relations between blur kernels and labels rather than informative features for classifying clean data. In this work, we evaluate whether CUDA data remains unlearnable after image sharpening and frequency filtering, and find that this combination of simple transforms improves the utility of CUDA data for training. In particular, we observe a substantial increase in test accuracy over adversarial training for models trained with CUDA unlearnable data from CIFAR-10, CIFAR-100, and ImageNet-100. By training models to high accuracy using unlearnable data, we underscore the need for ongoing refinement of data poisoning techniques to ensure data privacy. Our method opens new avenues for enhancing the robustness of unlearnable datasets by highlighting that simple methods such as sharpening and frequency filtering are capable of breaking convolution-based unlearnable datasets.
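To make the two transforms concrete, the following is a minimal sketch of sharpening followed by DCT-based frequency filtering on a single grayscale image. It is an illustration under stated assumptions, not the authors' implementation: the 3x3 kernel form, the strength range, the low-pass mask shape, the order of the two steps, and all function names (`random_sharpening_kernel`, `frequency_filter`, `sharpen_then_filter`) are hypothetical choices that the text above does not specify.

```python
import numpy as np


def random_sharpening_kernel(rng, max_strength=1.0):
    """Sample a 3x3 sharpening kernel (assumed form: identity plus a scaled Laplacian).

    The entries sum to 1, so flat image regions are left unchanged.
    """
    a = rng.uniform(0.0, max_strength)
    return np.array([[0.0, -a, 0.0],
                     [-a, 1.0 + 4.0 * a, -a],
                     [0.0, -a, 0.0]])


def convolve2d_same(img, kernel):
    """2D convolution with edge padding, returning an array of the input's shape."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    flipped = kernel[::-1, ::-1]  # flip the kernel for true convolution
    out = np.zeros(img.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += flipped[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out


def dct_matrix(n):
    """Orthonormal DCT-II matrix D: D @ x is the DCT of x, and D.T inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d


def frequency_filter(img, keep):
    """Keep only the lowest keep x keep block of 2D DCT coefficients.

    Which frequencies to discard is an assumption made for this sketch.
    """
    h, w = img.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ img @ dw.T          # forward 2D DCT
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return dh.T @ (coeffs * mask) @ dw  # inverse 2D DCT


def sharpen_then_filter(img, rng, keep):
    """Apply a freshly sampled sharpening kernel, then the frequency filter."""
    sharpened = convolve2d_same(img, random_sharpening_kernel(rng))
    return frequency_filter(sharpened, keep)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))  # stand-in for one channel of a CUDA-perturbed image
    restored = sharpen_then_filter(image, rng, keep=16)
    print(restored.shape)
```

In a training pipeline such a function would be applied per image (and per channel) as a data augmentation, drawing a new kernel each time so that the network cannot rely on a fixed relation between blur and label.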
