Scale-up Unlearnable Examples Learning with High-Performance Computing

Yanfan Zhu, Issac Lyngaas, Murali Gopalakrishnan Meena, Mary Ellen I. Koran, Bradley Malin, Daniel Moyer, Shunxing Bao, Anuj Kapadia, Xiao Wang, Bennett Landman, Yuankai Huo

TL;DR

The paper addresses privacy risks from uncontrolled data collection in healthcare imaging and builds on Unlearnable Examples (UEs) and Unlearnable Clusters (UCs), mechanisms that render data resistant to unauthorized learning. It scales UC learning with Distributed Data Parallel (DDP) training on the Summit supercomputer to systematically study how batch size influences unlearnability across diverse imaging datasets. Key contributions include a DDP-optimized UC architecture, a ResNetWithFeature surrogate that maintains synchronized updates in distributed training, and public code for replication, validated on six datasets: Pets, Flowers, Flowers102, and three MedMNist subsets (PathMNist, BloodMNist, OrganMNist-A). Results show that both very large and very small batch sizes can destabilize performance and that the best batch-size configuration is highly dataset-dependent. The work provides practical guidance for dataset-specific batch-size strategies to enhance data protection and demonstrates the feasibility of large-scale UE evaluation using HPC.

Abstract

Recent AI models are structured to retain user interactions, which could inadvertently include sensitive healthcare data. In the healthcare field, particularly when radiologists use AI-driven diagnostic tools hosted on online platforms, there is a risk that medical imaging data may be repurposed for future AI training without explicit consent, raising critical privacy and intellectual property concerns around healthcare data usage. To address these privacy challenges, a novel approach known as Unlearnable Examples (UEs) has been introduced, aiming to make data unlearnable to deep learning models. A prominent method within this area, called Unlearnable Clustering (UC), has shown improved UE performance with larger batch sizes but was previously limited by computational resources. To push the boundaries of UE performance with theoretically unlimited resources, we scaled up UC learning across various datasets using Distributed Data Parallel (DDP) training on the Summit supercomputer. Our goal was to examine UE efficacy at high-performance computing (HPC) levels to prevent unauthorized learning and enhance data security, particularly exploring the impact of batch size on unlearnability. Using the computational capabilities of Summit, we conducted extensive experiments on diverse datasets such as Pets, MedMNist, Flowers, and Flowers102. Our findings reveal that both overly large and overly small batch sizes can lead to performance instability and affect accuracy. However, the relationship between batch size and unlearnability varied across datasets, highlighting the necessity for tailored batch size strategies to achieve optimal data protection. Our results underscore the critical role of selecting appropriate batch sizes based on the specific characteristics of each dataset to prevent learning and ensure data security in deep learning applications.
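
Because the study varies batch size under DDP, it may help to see how a global batch size relates to the per-process batch size. In data-parallel training each process (typically one per GPU) handles its own mini-batch, so the effective global batch is the per-process batch multiplied by the number of processes. The sketch below is illustrative only and is not the authors' code: `BatchPlan`, `plan_for_global_batch`, and `shard_indices` are hypothetical helper names, and the numbers in the usage lines are made up rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BatchPlan:
    """Hypothetical container describing a data-parallel batch layout."""
    world_size: int      # number of DDP processes, assumed one per GPU
    per_rank_batch: int  # samples each process handles per step

    @property
    def global_batch(self) -> int:
        # Effective batch size seen by one synchronized optimizer step.
        return self.world_size * self.per_rank_batch


def plan_for_global_batch(global_batch: int, world_size: int) -> BatchPlan:
    """Split a target global batch evenly across processes."""
    if world_size <= 0 or global_batch <= 0:
        raise ValueError("global_batch and world_size must be positive")
    if global_batch % world_size != 0:
        raise ValueError("global_batch must be divisible by world_size")
    return BatchPlan(world_size, global_batch // world_size)


def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Strided split of sample indices for one rank.

    Similar in spirit to an unshuffled distributed sampler; padding of
    uneven shards is deliberately left out of this sketch.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


# Illustrative numbers only: 12 processes targeting a global batch of 768.
plan = plan_for_global_batch(768, 12)
print(plan.per_rank_batch, plan.global_batch)
print(shard_indices(10, rank=1, world_size=4))
```

Under this layout, enlarging the global batch means either adding processes or raising the per-process batch, which is why larger batch sizes are tied to the available compute.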

Paper Structure (1 section, 1 figure)

Table of Contents

  1. INTRODUCTION

Figures (1)

  • Figure 1: Overview of the unlearnable methods applied to protect private pathology data. The upper section illustrates the traditional pipeline where original private pathology data is used, potentially leading to data breaches. The lower section demonstrates the application of unlearnable methods, which generate unlearnable pathology data to enhance data security and protect against unauthorized model learning.