Online Sensitivity Optimization in Differentially Private Learning

Filippo Galli; Catuscia Palamidessi; Tommaso Cucinotta

TL;DR

The paper tackles hyperparameter tuning in differentially private learning by focusing on the gradient clipping threshold $C$ in DP-SGD. It introduces OSO-DPSGD, which treats $C_t$ as a learnable parameter and updates it via gradient-informed exponential rules while preserving DP through Gaussian mechanisms. The approach provides a privacy-efficient alternative to grid search and demonstrates competitive performance across MNIST, FashionMNIST, and AG News against fixed-threshold and fixed-quantile strategies. Key contributions include deriving private updates for $C_t$, decoupling sensitivity from DP budgets, and validating the method across varying model sizes and privacy levels.
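To make the idea of an exponential update concrete, here is a minimal sketch of a multiplicative rule for $C_t$. It is not the paper's exact update: the scalar `grad_signal`, its `sensitivity`, the learning rate `lr`, and the function name are all hypothetical placeholders. It only illustrates why an update of the form $C_{t+1} = C_t \exp(-\eta\,\tilde{s}_t)$ keeps the threshold positive, with the driving signal sanitized by Gaussian noise.

```python
import math
import random


def update_threshold(c_t, grad_signal, sensitivity, noise_multiplier, lr, rng):
    """One multiplicative (exponential) update of the clipping threshold.

    grad_signal: hypothetical scalar statistic indicating whether the
        threshold should shrink (positive) or grow (negative).
    sensitivity: assumed bound on how much one sample can change grad_signal.
    noise_multiplier: Gaussian noise standard deviation divided by sensitivity.
    """
    # Sanitize the scalar signal with the Gaussian mechanism.
    noisy_signal = grad_signal + rng.gauss(0.0, noise_multiplier * sensitivity)
    # The exponential form keeps the threshold strictly positive.
    return c_t * math.exp(-lr * noisy_signal)


rng = random.Random(0)
c = 1.0
for _ in range(5):
    c = update_threshold(c, grad_signal=0.5, sensitivity=1.0,
                         noise_multiplier=1.0, lr=0.1, rng=rng)
print(c)
```

Because the update acts on $\log C_t$, a step changes the threshold by a relative rather than an absolute amount, which is one plausible reason to prefer this form when the right scale of $C$ is unknown in advance.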

Abstract

Training differentially private machine learning models requires constraining an individual's contribution to the optimization process. This is achieved by clipping the $2$-norm of their gradient at a predetermined threshold prior to averaging and batch sanitization. This selection adversely influences optimization in two opposing ways: it either exacerbates the bias due to excessive clipping at lower values, or augments sanitization noise at higher values. The choice significantly hinges on factors such as the dataset, model architecture, and even varies within the same optimization, demanding meticulous tuning usually accomplished through a grid search. In order to circumvent the privacy expenses incurred in hyperparameter tuning, we present a novel approach to dynamically optimize the clipping threshold. We treat this threshold as an additional learnable parameter, establishing a clean relationship between the threshold and the cost function. This allows us to optimize the former with gradient descent, with minimal repercussions on the overall privacy analysis. Our method is thoroughly assessed against alternative fixed and adaptive strategies across diverse datasets, tasks, model dimensions, and privacy levels. Our results indicate that it performs comparably or better in the evaluated scenarios, given the same privacy requirements.
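The clip, average, and sanitize step described in the abstract can be sketched as follows. This is a plain-Python illustration of the usual DP-SGD recipe with a fixed threshold, not the authors' implementation. The function names are illustrative, and the noise standard deviation is assumed to be `noise_multiplier * c`, added to the sum of clipped gradients before dividing by the batch size.

```python
import math
import random


def clip(v, c):
    """Rescale v so that its 2-norm is at most c."""
    norm = math.sqrt(sum(x * x for x in v))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [x * scale for x in v]


def sanitized_mean_gradient(per_sample_grads, c, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        for j, x in enumerate(clip(g, c)):
            total[j] += x
    # After clipping, each sample contributes at most c to the 2-norm of the
    # sum, so the noise standard deviation is scaled by c.
    sigma = noise_multiplier * c
    return [(s + rng.gauss(0.0, sigma)) / n for s in total]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-sample gradients
print(sanitized_mean_gradient(grads, c=1.0, noise_multiplier=1.0, rng=rng))
```

The trade-off from the abstract is visible in the code: a small `c` shrinks the first gradient heavily (bias), while a large `c` inflates `sigma` (noise).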

Online Sensitivity Optimization in Differentially Private Learning

TL;DR

The paper tackles hyperparameter tuning in differentially private learning by focusing on the gradient clipping threshold

in DP-SGD. It introduces OSO-DPSGD, which treats

as a learnable parameter and updates it via gradient-informed exponential rules while preserving DP through Gaussian mechanisms. The approach provides a privacy-efficient alternative to grid search and demonstrates competitive performance across MNIST, FashionMNIST, and AG News against fixed-threshold and fixed-quantile strategies. Key contributions include deriving private updates for

, decoupling sensitivity from DP budgets, and validating the method across varying model sizes and privacy levels.

Abstract

Training differentially private machine learning models requires constraining an individual's contribution to the optimization process. This is achieved by clipping the

-norm of their gradient at a predetermined threshold prior to averaging and batch sanitization. This selection adversely influences optimization in two opposing ways: it either exacerbates the bias due to excessive clipping at lower values, or augments sanitization noise at higher values. The choice significantly hinges on factors such as the dataset, model architecture, and even varies within the same optimization, demanding meticulous tuning usually accomplished through a grid search. In order to circumvent the privacy expenses incurred in hyperparameter tuning, we present a novel approach to dynamically optimize the clipping threshold. We treat this threshold as an additional learnable parameter, establishing a clean relationship between the threshold and the cost function. This allows us to optimize the former with gradient descent, with minimal repercussions on the overall privacy analysis. Our method is thoroughly assessed against alternative fixed and adaptive strategies across diverse datasets, tasks, model dimensions, and privacy levels. Our results indicate that it performs comparably or better in the evaluated scenarios, given the same privacy requirements.

Paper Structure (12 sections, 1 theorem, 17 equations, 11 figures, 13 tables, 1 algorithm)

Introduction
Background
Related Works
Hyperparameter Optimization
Sensitivity Optimization
Method
Privacy Analysis
The OSO-DPSGD Algorithm
Experiments
Conclusion
Acknowledgments
Appendix - Models and Experiments

Key Result

Proposition 1

Computing the Gaussian approximations $\tilde{q}_t$ and $\tilde{g}_t$ of $\sum_{z_i \in B_{t-1}} q_{t-1}(z_i)$ and $\sum_{z_i \in B_t} \bar{g}_t(z_i)$ with noise multipliers $\nu_q$ and $\nu_g$, respectively, is equivalent (as far as privacy accounting is concerned) to the application of a single Gaussian me
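The statement above is cut off in the source, so the following is offered only as a related illustration, not as a reconstruction of the proposition. A standard property of Gaussian mechanisms is that releasing two queries with noise multipliers $\nu_q$ and $\nu_g$ costs the same, for accounting purposes, as one Gaussian mechanism with multiplier $\nu = (\nu_q^{-2} + \nu_g^{-2})^{-1/2}$. The function names below are illustrative.

```python
import math


def combined_noise_multiplier(nu_q, nu_g):
    """Effective multiplier of two Gaussian mechanisms released together."""
    return (nu_q ** -2 + nu_g ** -2) ** -0.5


def gradient_noise_multiplier(nu, nu_q):
    """Multiplier nu_g needed so that (nu_q, nu_g) together match a target nu.

    Requires nu_q > nu: the pair can never be less noisy than its target.
    """
    if nu_q <= nu:
        raise ValueError("nu_q must exceed the target multiplier nu")
    return (nu ** -2 - nu_q ** -2) ** -0.5


nu_g = gradient_noise_multiplier(nu=1.0, nu_q=5.0)
print(nu_g, combined_noise_multiplier(5.0, nu_g))
```

Under this relation, spending a small share of the budget on the auxiliary query (a large $\nu_q$) raises the gradient noise only slightly above the target $\nu$.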

Figures (11)

Figure 1: The choice of clipping threshold $C$ requires trading off a higher clipping bias at small values for larger Gaussian noise at large values. Here the clipped, averaged, noised gradient of a CNN for character recognition is compared with the true average gradient at different training iterations $t \in \{100, 250, 500, 750, 950\}$. Note that for some values the sanitized gradient may even have components pointing in the opposite direction w.r.t. the true gradient, corresponding to negative cosine similarity. The reported cosine similarity is an average over $20$ realizations of the Gaussian mechanism.
Figure 2: The Pareto frontiers of the noise multipliers to sanitize $\tilde{g}_t$ and $\tilde{q}_t$, and the chosen values given the heuristic described in the Privacy Analysis section, at different privacy requirements. This particular instance comes from the MNIST experiments described in the Experiments section.
Figure 3: Accuracy on the MNIST dataset. Higher is better.
Figure 4: Mean Squared Error on the Fashion MNIST dataset. Lower is better. All runs for $\varepsilon=1$ of FixedQuantile result in a diverging optimization and are therefore not included.
Figure 5: Accuracy on the AG News dataset. Higher is better.
...and 6 more figures
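The comparison described in the Figure 1 caption can be sketched on synthetic data. This is an assumption-laden toy, not the paper's experiment: the per-sample gradients are random vectors rather than CNN gradients, the function names are illustrative, and the noise standard deviation is assumed to be `noise_multiplier * c`. Only the averaging over 20 noise realizations follows the caption.

```python
import math
import random


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_cosine(per_sample_grads, c, noise_multiplier, rng, realizations=20):
    """Average cosine similarity between the true and the sanitized mean gradient."""
    n, dim = len(per_sample_grads), len(per_sample_grads[0])
    true_mean = [sum(g[j] for g in per_sample_grads) / n for j in range(dim)]
    clipped_sum = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, c / norm) if norm > 0 else 1.0
        for j in range(dim):
            clipped_sum[j] += g[j] * scale
    sims = []
    for _ in range(realizations):
        # Draw fresh Gaussian noise for each realization.
        noisy = [(s + rng.gauss(0.0, noise_multiplier * c)) / n for s in clipped_sum]
        sims.append(cosine_similarity(true_mean, noisy))
    return sum(sims) / realizations


rng = random.Random(0)
# Synthetic per-sample gradients with a shared nonzero mean direction.
grads = [[1.0 + rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(64)]
for c in (0.1, 1.0, 10.0, 100.0):
    print(c, mean_cosine(grads, c, noise_multiplier=1.0, rng=rng))
```

Sweeping `c` over several orders of magnitude on such data gives a rough feel for the trade-off the caption describes, although the actual curves depend on the model and the training iteration.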

Theorems & Definitions (2)

Definition 1: Differential Privacy (Dwork, 2006)
Proposition 1
