Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning

Qi Li, Cheng-Long Wang, Yinzhi Cao, Di Wang

TL;DR

This work introduces a new task, Data-Centric Membership Inference, and proposes the first data-centric privacy inference paradigm, Data Lineage Inference (DaLI). It also introduces a metric, the Brimming score, to guide the selection of pruning methods with privacy protection in mind.

Abstract

In this work, we systematically explore the data privacy issues of dataset pruning in machine learning systems. Our findings reveal, for the first time, that even if data in the redundant set is used only before model training, its pruning-phase membership status can still be detected through attacks. Because pruning happens entirely upstream of model training, traditional privacy inference methods based on model outputs do not apply. To address this, we introduce a new task called Data-Centric Membership Inference and propose the first data-centric privacy inference paradigm, Data Lineage Inference (DaLI). Under this paradigm, we propose four threshold-based attacks: WhoDis, CumDis, ArraDis, and SpiDis. We show that even without access to downstream models, adversaries can accurately identify the redundant set with only limited prior knowledge. Furthermore, we find that different pruning methods involve varying levels of privacy leakage, and that even the same pruning method can present different privacy risks at different pruning fractions. We analyze these phenomena in depth and introduce a metric called the Brimming score to offer guidance for selecting pruning methods with privacy protection in mind.

Paper Structure

This paper contains 50 sections, 33 equations, 97 figures, 6 tables.

Figures (97)

  • Figure 1: A typical machine learning system, in which the privacy risks of data-centric operations are overlooked.
  • Figure 2: Model Confidence under Random Selection.
  • Figure 3: Model Confidence under Dataset Pruning.
  • Figure 5: A comparison between traditional membership inference and our proposed data-centric membership inference: the former focuses on inferring the training-phase membership status of the selected set, while the latter targets the pruning-phase membership status of the redundant set.
  • Figure 6: Overall pipeline of DaLI.
  • ...and 92 more figures

Theorems & Definitions (3)

  • Definition 1: Occurrence Distribution of Victim Datapool
  • Definition 2: Occurrence Distribution of Shadow Datapool
  • Definition 3: CDF and CCDF of Shadow Datapool
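
Definition 3 refers to a CDF and a CCDF, but this listing does not reproduce the paper's formulas. As a general reminder of what the two quantities are, the sketch below computes an empirical CDF and CCDF over a list of integer counts. It is not the paper's definition: the function name `empirical_cdf_ccdf`, the sample values in `occurrences`, and the convention CCDF(v) = 1 - CDF(v) = P(X > v) are illustrative assumptions.

```python
from collections import Counter


def empirical_cdf_ccdf(counts):
    """Map each distinct value v to (cdf, ccdf) for a list of integer counts.

    cdf(v)  = fraction of samples <= v
    ccdf(v) = fraction of samples >  v, i.e. 1 - cdf(v)
    (Assumed convention; the paper's own definition may differ.)
    """
    if not counts:
        raise ValueError("counts must be non-empty")
    n = len(counts)
    freq = Counter(counts)
    table = {}
    cumulative = 0
    for value in sorted(freq):
        cumulative += freq[value]  # running number of samples <= value
        cdf = cumulative / n
        table[value] = (cdf, 1.0 - cdf)
    return table


# Hypothetical occurrence counts, made up for illustration only.
occurrences = [0, 1, 1, 2, 3, 3, 3, 5]
table = empirical_cdf_ccdf(occurrences)
for value, (cdf, ccdf) in table.items():
    print(f"v={value}: CDF={cdf:.3f}, CCDF={ccdf:.3f}")
```

A threshold-based decision rule would compare some per-sample statistic against a cutoff read from such a curve. Which statistic and which direction of comparison the four attacks use is not stated in this listing.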