Learning from Convolution-based Unlearnable Datasets

Dohyun Kim, Pedro Sandoval-Segura

TL;DR

The paper studies the Convolution-based Unlearnable Dataset (CUDA), an existing method for protecting data from unauthorized training that applies class-wise blurs so that models learn the blur kernels rather than useful features. It asks whether CUDA data stays unlearnable under simple countermeasures and shows that Random Sharpening Kernels (RSK) combined with Discrete Cosine Transform (DCT)-based Frequency Filtering (FF) restore learnability, substantially improving test accuracy over adversarial training on CIFAR-10, CIFAR-100, and ImageNet-100. The key insight is that the shortcut learning induced by CUDA can be broken by sharpening images and filtering their frequencies, which suggests that unlearnable-data methods need to be made more robust to such transforms. For data privacy practice, the result shows that robust unlearnable datasets remain a moving target and that data-protection techniques need continued development.

Abstract

The construction of large datasets for deep learning has raised concerns regarding unauthorized use of online data, leading to increased interest in protecting data from third parties who want to use it for training. The Convolution-based Unlearnable DAtaset (CUDA) method aims to make data unlearnable by applying class-wise blurs to every image in the dataset so that neural networks learn relations between blur kernels and labels, as opposed to informative features for classifying clean data. In this work, we evaluate whether CUDA data remains unlearnable after image sharpening and frequency filtering, finding that this combination of simple transforms improves the utility of CUDA data for training. In particular, we observe a substantial increase in test accuracy over adversarial training for models trained with CUDA unlearnable data from CIFAR-10, CIFAR-100, and ImageNet-100. In training models to high accuracy using unlearnable data, we underscore the need for ongoing refinement in data poisoning techniques to ensure data privacy. Our method opens new avenues for enhancing the robustness of unlearnable datasets by highlighting that simple methods such as sharpening and frequency filtering are capable of breaking convolution-based unlearnable datasets.

Paper Structure

This paper contains 17 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Sharpening and Frequency Filtering of a CIFAR-10 Image. We analyze the effect of standard and random sharpening kernels, both with a center value of 2.5. We find that the randomized sharpening kernel, denoted RSK, ensures images of the same class are sharpened differently. After sharpening, we decompose the image into spatial frequencies using DCT and filter out high frequencies (see the subsection on frequency filtering).
  • Figure 2: Comparison of the same image from different unlearnable datasets. OPS [wu2023onepixel] perturbs an image by adding noise to a single pixel. AR [sandoval-segura2022autoregressive] generates perturbations using a sliding window approach. R4 [sandoval2022poisons] is a grid-like additive perturbation. CUDA and OPS, being unbounded methods, exhibit noticeable perturbations, whereas AR and R4 have perturbations bounded by an $\ell_p$-norm.
  • Figure 3: DCT can be used to remove exact frequency bands. DCT converts an image into a spatial frequency representation of the same size. The coefficients represent increasing frequencies from top-left (lowest) to bottom-right (highest). We retain the lowest $X$% of frequency coefficients by masking out the remaining higher frequencies. Then, we apply the inverse DCT (IDCT) to transform the modified frequency representation back into an image.
  • Figure 4: Left: A standard sharpening kernel, using a center value of 5. Middle: A softer sharpening kernel with a center value of 2.5. Right: A random sharpening kernel where values are sampled independently from a normal distribution.
  • Figure 5: Image reconstruction using different percentages of frequencies kept. On the far left, we retain the lowest 10% of DCT frequencies. We progressively preserve higher frequencies, resulting in a series of images that progresses from blurry to clear. These images are constructed using an image from the clean dataset.
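
The captions for Figures 1, 3, and 4 describe a two-step transform: sharpen with a randomized 3x3 kernel, then keep only the lowest DCT frequencies. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Several details are assumptions made for illustration: the function names, the noise scale `std`, shifting the off-center kernel entries so the kernel sums to 1, edge padding, and ordering frequencies by the index sum `u + v` when choosing which coefficients to keep.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ X inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def random_sharpening_kernel(rng, center=2.5, std=0.1):
    """3x3 kernel with a fixed center and normally distributed off-center values.

    Assumption: off-center entries are shifted so the kernel sums to 1,
    which keeps flat image regions unchanged.
    """
    kernel = rng.normal(0.0, std, size=(3, 3))
    off = np.ones((3, 3), dtype=bool)
    off[1, 1] = False
    kernel[off] += (1.0 - center) / 8.0 - kernel[off].mean()
    kernel[1, 1] = center
    return kernel

def filter2d(img, kernel):
    """Correlate a 2-D float image with a 3x3 kernel (edge padding is an assumption)."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def low_pass_dct(img, keep=0.5):
    """Keep the lowest `keep` fraction of 2-D DCT coefficients, then invert.

    Assumption: "lowest" is decided by sorting coefficients on u + v,
    from top-left (lowest frequency) to bottom-right (highest).
    """
    h, w = img.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ img @ cw.T                       # forward 2-D DCT
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    order = np.argsort((u + v).ravel(), kind="stable")
    mask = np.zeros(h * w, dtype=bool)
    mask[order[:int(round(keep * h * w))]] = True
    return ch.T @ (coeffs * mask.reshape(h, w)) @ cw  # inverse 2-D DCT

def sharpen_then_filter(img, rng, keep=0.5):
    """Apply one random kernel, then the DCT low-pass, to each channel of an H x W x C image."""
    kernel = random_sharpening_kernel(rng)
    channels = [low_pass_dct(filter2d(img[..., c], kernel), keep)
                for c in range(img.shape[-1])]
    return np.stack(channels, axis=-1)
```

A call such as `sharpen_then_filter(image, np.random.default_rng(0), keep=0.5)` draws a fresh kernel per image, so two images of the same class are sharpened differently, which is the property the Figure 1 caption attributes to RSK. The `keep` value here is a placeholder, not a setting reported in the paper.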