A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization

Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar, Vikash Sehwag, Prateek Mittal

TL;DR

The paper tackles the open problem of hyperparameter optimization under differential privacy for DP-SGD by introducing a private adaptive HPO method built around a new linear scaling rule. By first estimating optimal hyperparameters at cheap privacy budgets and then linearly scaling them to higher budgets, the approach dramatically reduces the privacy cost and computational burden of HPO while maintaining or improving utility. The authors provide a theoretical analysis of private gradient descent, reduce the HPO search to a one-dimensional radius r = η × T, and implement a private HPO procedure that privately extrapolates r(ε) and decomposes it into practical hyperparameters. Empirically, the method achieves state-of-the-art or competitive results across 22 CV/NLP tasks, including language modeling, with rigorous privacy accounting and demonstrated robustness to distribution shifts. This work meaningfully advances practical private training by enabling efficient, privacy-aware hyperparameter tuning that scales across tasks and privacy levels.
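The extrapolate-then-decompose idea described above can be illustrated with a small sketch. This is not the authors' implementation: it omits the privacy mechanism and accounting that the paper applies to the extrapolation step, and the function names, the `eta_max` cap, and the trial values are all hypothetical, chosen only to show the arithmetic of fitting $r(\varepsilon)$ on cheap trials and splitting $r = \eta \times T$.

```python
import math
import numpy as np


def fit_linear_rule(eps_trials, r_trials):
    """Least-squares fit of r(eps) = slope * eps + intercept.

    Non-private illustration only: the paper performs this step privately.
    """
    slope, intercept = np.polyfit(eps_trials, r_trials, deg=1)
    return float(slope), float(intercept)


def extrapolate_r(slope, intercept, eps_target):
    """Evaluate the fitted line at the target privacy budget."""
    return slope * eps_target + intercept


def decompose_r(r, eta_max=1.0):
    """Split r = eta * T into a learning rate and an integer step count.

    Hypothetical heuristic: pick the smallest T that keeps eta <= eta_max.
    """
    if r <= 0:
        raise ValueError("r must be positive")
    T = max(1, math.ceil(r / eta_max))
    eta = r / T
    return eta, T


# Made-up (eps, r) pairs standing in for the best r found by cheap trials.
eps_trials = [0.05, 0.1, 0.2]
r_trials = [2.0, 4.0, 8.0]

slope, intercept = fit_linear_rule(eps_trials, r_trials)
r_target = extrapolate_r(slope, intercept, eps_target=1.0)
eta, T = decompose_r(r_target, eta_max=1.0)
print(f"r(1.0) ~ {r_target:.2f}, eta = {eta:.3f}, T = {T}")
```

The one-dimensional search is what keeps this cheap: only $r$ is estimated from the small-$\varepsilon$ trials, and the split into $\eta$ and $T$ is a separate, deterministic choice.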

Abstract

An open problem in differentially private deep learning is hyperparameter optimization (HPO). DP-SGD introduces new hyperparameters and complicates existing ones, forcing researchers to painstakingly tune hyperparameters with hundreds of trials, which in turn makes it impossible to account for the privacy cost of HPO without destroying the utility. We propose an adaptive HPO method that uses cheap trials (in terms of privacy cost and runtime) to estimate optimal hyperparameters and scales them up. We obtain state-of-the-art performance on 22 benchmark tasks, across computer vision and natural language processing, across pretraining and finetuning, across architectures and a wide range of $\varepsilon \in [0.01,8.0]$, all while accounting for the privacy cost of HPO.
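For readers who want the meaning of the budget $\varepsilon$ spelled out, the standard $(\varepsilon,\delta)$-differential privacy definition is given below. This is the textbook form; the paper's own Definition 1.1 may be worded differently, and the symbols $M$, $D$, $D'$, and $S$ are notation introduced here rather than taken from the source.

```latex
% A randomized mechanism M is (\varepsilon, \delta)-differentially private if,
% for all datasets D, D' differing in one record and all output sets S:
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
```

Smaller $\varepsilon$ means a stronger guarantee, which is why trials at small $\varepsilon$ are "cheap" in privacy cost.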

Paper Structure (61 sections, 5 theorems, 21 equations, 20 figures, 20 tables, 3 algorithms)

Key Result

Proposition 2.1

If we take $T$ steps with noise $\sigma$ and learning rate $\eta$ to achieve a target $\varepsilon^{*}$, we can achieve a target $\hat{\varepsilon}>\varepsilon^{*}$ by either: a) Fix $T$, reduce $\sigma$, increase $\eta$; b) Increase $T$, fix $\sigma$, fix $\eta$; c) Increase $T$ slightly, reduce $\sigma$ ...

Figures (20)

  • Figure 1: Visualization of our method. We use low-cost trials (small $\varepsilon$) to estimate hyperparameters (HPs) and scale these up to the privacy budget for the final run. We combine multiple HPs together, and have a prior that the scaling is linear.
  • Figure 2: Evaluation on ImageNet-1k finetuning. Our HPO only requires paying the privacy cost once, and can then be used to find good HPs for all values of $\varepsilon>0.5$. We outperform prior work (Mehta et al.; Berrada et al., 2023) because our HPO finds better HPs, even though prior work has better non-private performance and does not report the privacy cost of their HPO.
  • Figure 3: Training the BEiT architecture on CIFAR100, the linear scaling rule produces values for $r = \eta \times T$ close to those of grid search; the performance drop is only apparent at $\varepsilon<0.2$ because of the cost of HPO, and is vanishingly small for larger $\varepsilon$.
  • Figure 4: The linear scaling rule (accounting for the privacy cost of hyperparameter tuning) is competitive with grid search (non-private, doing N trials each with the given $\varepsilon$) on the Enron Emails dataset. Left: y-axis is Perplexity (lower is better).
  • Figure 5: Heatmaps for BEiT on CIFAR100. $\varepsilon$ increases from $0.05 \rightarrow 1.0$ left to right on the grid axis, the number of iterations $T$ increases from $5 \rightarrow 100$ left to right on each individual plot's axis, and the learning rate $\eta$ increases from $0.05 \downarrow 1$ top to bottom on each individual plot's axis. As $\varepsilon$ increases, left to right, the optimal value of $\eta \times T$ increases in accordance with the new linear scaling rule. Prior work has generally operated in the top-left regime, which is often suboptimal.
  • ...and 15 more figures

Theorems & Definitions (10)

  • Definition 1.1: Differential Privacy
  • Proposition 2.1
  • Theorem 3.1
  • Theorem 4.1
  • Proposition 2.1
  • Corollary 2.2
  • proof
  • proof
  • proof
  • Example 2.3: Computing the Lipschitz constant for single-layer SGD training (sparsefed)