Differentially Private Active Learning: Balancing Effective Data Selection and Privacy

Kristian Schwethelm, Johannes Kaiser, Jonas Kuntzer, Mehmet Yigitsoy, Daniel Rueckert, Georgios Kaissis

TL;DR

This paper combines active learning (AL) with differential privacy (DP) in the standard pool-based setting by introducing differentially private active learning (DP-AL). The core innovation is step amplification, which rebalances the DP budget across training phases to maximize data utilization, paired with a joint privacy accounting framework that covers both DP-SGD training and data selection. Experiments on vision and NLP tasks show that DP-AL with uncertainty-based acquisition can outperform DP-SGD training on a randomly selected subset under privacy constraints, though the gains depend on the dataset and the privacy budget, and some acquisition strategies remain impractical under strict DP. Overall, DP-AL offers a meaningful, if constrained, path to reducing labeling costs in privacy-sensitive domains. It highlights the trade-offs between privacy, data selection accuracy, and model performance, and points to extensions to other DP training paradigms.

Abstract

Active learning (AL) is a widely used technique for optimizing data labeling in machine learning by iteratively selecting, labeling, and training on the most informative data. However, its integration with formal privacy-preserving methods, particularly differential privacy (DP), remains largely underexplored. While some works have explored differentially private AL for specialized scenarios like online learning, the fundamental challenge of combining AL with DP in standard learning settings has remained unaddressed, severely limiting AL's applicability in privacy-sensitive domains. This work addresses this gap by introducing differentially private active learning (DP-AL) for standard learning settings. We demonstrate that naively integrating DP-SGD training into AL presents substantial challenges in privacy budget allocation and data utilization. To overcome these challenges, we propose step amplification, which leverages individual sampling probabilities in batch creation to maximize data point participation in training steps, thus optimizing data utilization. Additionally, we investigate the effectiveness of various acquisition functions for data selection under privacy constraints, revealing that many commonly used functions become impractical. Our experiments on vision and natural language processing tasks show that DP-AL can improve performance for specific datasets and model architectures. However, our findings also highlight the limitations of AL in privacy-constrained environments, emphasizing the trade-offs between privacy, model accuracy, and data selection accuracy.

Paper Structure (63 sections, 7 theorems, 17 equations, 9 figures, 6 tables, 6 algorithms)

Key Result

Theorem 1

The Laplace mechanism guarantees $\varepsilon$-DP for $\beta \geq \mathrm{\Delta}_{1}/\varepsilon$.
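
To make Theorem 1 concrete, the following is a minimal sketch of the Laplace mechanism for a scalar query, using only the Python standard library. The function names (`laplace_noise`, `laplace_mechanism`) and parameters are illustrative assumptions, not code from the paper. The noise scale is set to the smallest value the theorem allows, $\beta = \mathrm{\Delta}_{1}/\varepsilon$. The sketch uses ordinary floating-point sampling and is not hardened against floating-point attacks on DP implementations.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    # expovariate takes a rate, so rate = 1/scale gives mean = scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, l1_sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise of scale beta = Delta_1 / epsilon.

    Hypothetical helper for illustration: l1_sensitivity is Delta_1, the
    largest change of the query when one record is added or removed.
    """
    if epsilon <= 0 or l1_sensitivity <= 0:
        raise ValueError("epsilon and l1_sensitivity must be positive")
    rng = rng or random.Random()
    beta = l1_sensitivity / epsilon  # smallest scale satisfying Theorem 1
    return true_value + laplace_noise(beta, rng)


# Example: privately release a count (sensitivity 1) with epsilon = 1.
rng = random.Random(0)
noisy_count = laplace_mechanism(true_value=120.0, l1_sensitivity=1.0,
                                epsilon=1.0, rng=rng)
```

A smaller $\varepsilon$ yields a larger $\beta$ and therefore noisier releases, which is the privacy-utility trade-off the theorem encodes.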

Figures (9)

  • Figure 1: Overview of the iterative active learning process. First, the model is trained on the labeled dataset $\mathcal{D}$. Then, an acquisition function uses the current model to select the most informative samples from the unlabeled dataset ($\mathcal{Q} \subseteq \mathcal{U}$), which are labeled and added to the training dataset. The AL process exposes two privacy vulnerabilities: (1) as discussed in DP literature, an adversary may use the model and training gradients to infer private information, and (2) unique to AL, an adversary could exploit the results of the acquisition function to infer the presence of specific samples in the dataset.
  • Figure 2: Privacy loss across training phases for the naive (left) and step amplification (right) DP-AL methods with a total privacy budget of $\varepsilon=8$. $\mathcal{K}_i$ denotes the group of samples added to the training dataset prior to phase $i$. Each training phase expends a different amount of the privacy budget because the sampling probabilities change. With step amplification, unlike the naive approach, all data points consume their full privacy budget.
  • Figure 3: Sampling probabilities (left) and number of training steps (right) across training phases for the naive and step amplification (SA) DP-AL method. $\mathcal{G}_{\text{old}}$ denotes the group of samples in the labeled dataset at phase $i-1$ and $\mathcal{G}_{\text{new}}$ the new samples added at phase $i$.
  • Figure 4: Privacy loss across training phases for our DP-AL implementation with a total privacy budget of $\varepsilon=8$ and a privacy budget from selection of $\varepsilon_{\text{Sel}}=2$. $\mathcal{K}_i$ denotes the group of samples added to the training dataset prior to phase $i$. Newly added samples have already used some of the privacy budget before training, due to the privacy leakage from the selection phases they were used in.
  • Figure 5: Performance comparison of entropy sampling on CIFAR-10 for different selection privacy budgets $\varepsilon_{\text{Sel}}$ under an overall privacy budget of $\varepsilon=8$. As baselines, we show standard training with a random subset and DP-AL without privatization of the selection phases. We report average $\pm$ standard deviation across 5 runs.
  • ...and 4 more figures
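
The acquisition step in Figure 1 and the entropy sampling evaluated in Figure 5 can be illustrated with a small, non-private sketch. It ranks unlabeled samples by the entropy of the model's predicted class probabilities and returns the top $k$ as the query set $\mathcal{Q} \subseteq \mathcal{U}$. The names `predictive_entropy`, `select_by_entropy`, and `pool_probs` are hypothetical, and the sketch omits the privatization of the selection phase, so it should not be read as the paper's DP-AL selection procedure.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_entropy(pool_probs, k):
    """Return indices of the k pool samples with the highest entropy.

    pool_probs: list of per-sample class-probability lists, assumed to come
    from the current model evaluated on the unlabeled pool U.
    """
    scores = [predictive_entropy(p) for p in pool_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool of four unlabeled samples with three classes each.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.30, 0.30],  # uncertain
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # close to uniform, highest entropy
]
query = select_by_entropy(pool_probs, k=2)  # → [3, 1]
```

Because the ranking depends directly on individual samples in the pool, releasing the selected indices without added noise is the second privacy vulnerability that Figure 1 describes.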

Theorems & Definitions (8)

  • Definition 1: ($\varepsilon,\delta$)-Differential Privacy
  • Theorem 1: Laplace mechanism
  • Theorem 2: Gaussian mechanism
  • Theorem 3: Post-Processing
  • Theorem 4: Basic Sequential Composition
  • Theorem 5: RDP Composition
  • Theorem 6: Parallel Composition
  • Theorem 7: Privacy Amplification by Subsampling