Practical Differentially Private Hyperparameter Tuning with Subsampling

Antti Koskela; Tejas Kulkarni

TL;DR

The paper addresses the high privacy cost and computational burden of hyperparameter tuning for differentially private (DP) machine learning. It introduces a method that tunes hyperparameters on a small random subset and extrapolates to the full data, underpinned by a Rényi differential privacy analysis. The approach reduces both the DP budget and computational overhead, outperforming the Papernot and Steinke baseline in privacy-utility trade-offs for DP-SGD and DP-Adam across standard datasets. It also supports grid search with randomized hyperparameter selection and gives tailored privacy accounting bounds, with practical implications for scalable private learning.

Abstract

Tuning the hyperparameters of differentially private (DP) machine learning (ML) algorithms often requires the use of sensitive data, and this may leak private information via the hyperparameter values. Recently, Papernot and Steinke (2022) proposed a class of DP hyperparameter tuning algorithms in which the number of random search samples is itself randomized. These algorithms still commonly increase the DP privacy parameter $\varepsilon$ considerably over non-tuned DP ML model training, and they can be computationally heavy, as evaluating each hyperparameter candidate requires a new training run. We focus on lowering both the DP bounds and the computational cost of these methods by using only a random subset of the sensitive data for the hyperparameter tuning and by extrapolating the optimal values to a larger dataset. We provide a Rényi differential privacy analysis for the proposed method and experimentally show that it consistently leads to a better privacy-utility trade-off than the baseline method by Papernot and Steinke.
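The tuning loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the tuning set is drawn by keeping each record independently with probability `q`, and that the number of candidates `K` is Poisson-distributed with mean `mu`. The helper names (`poisson_subsample`, `sample_poisson`, `tune_on_subset`) and the scorer `toy_score` are hypothetical, and `toy_score` is a stand-in for a real DP training run. No privacy accounting or hyperparameter extrapolation is performed here.

```python
import math
import random


def poisson_subsample(data, q, rng):
    """Keep each record independently with probability q (assumed sampling scheme)."""
    return [x for x in data if rng.random() < q]


def sample_poisson(mu, rng):
    """Draw K ~ Poisson(mu) with Knuth's method (adequate for small mu)."""
    threshold = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def toy_score(tuning_set, lr):
    """Hypothetical placeholder for 'train with DP on tuning_set, return a score'.

    It ignores the data and simply peaks at lr = 1e-2.
    """
    return -abs(math.log10(lr) + 2.0)


def tune_on_subset(data, candidates, q, mu, train_and_score, seed=0):
    """Random search with a random number of trials on a random subset."""
    rng = random.Random(seed)
    tuning_set = poisson_subsample(data, q, rng)
    k = sample_poisson(mu, rng)  # number of candidates to evaluate
    best = None  # stays None when k == 0 (no candidate evaluated)
    for _ in range(k):
        cand = rng.choice(candidates)  # uniform random candidate
        score = train_and_score(tuning_set, cand)
        if best is None or score > best[1]:
            best = (cand, score)
    return best, k, len(tuning_set)


if __name__ == "__main__":
    data = list(range(1000))
    result = tune_on_subset(data, [1e-3, 1e-2, 1e-1], q=0.1, mu=5.0,
                            train_and_score=toy_score, seed=1)
    print(result)
```

Each candidate is scored on the subset only, so a training run touches roughly a `q` fraction of the records. This is the intuition behind the computational savings the abstract claims.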

Paper Structure (23 sections, 9 theorems, 49 equations, 9 figures, 2 tables)

Introduction
Related Work on Hyperparameter Tuning
Our Contributions
Background: DP, DP-SGD and DP Hyperparameter Tuning
DP Hyperparameter Tuning with a Random Subset
Our Method: Small Random Subset for Tuning
Extrapolating the DP-SGD Hyperparameters
Privacy Analysis
Computational Savings
Dealing with DP-SGD Hyperparameters that Affect the DP Guarantees
Grid Search with Randomization
RDP Analysis
Experimental Results
Discussion
Full Description of Experiments
...and 8 more sections

Key Result

Lemma 3

Suppose the mechanism $\mathcal{M}$ is $(\alpha,\varepsilon' )$-RDP. Then $\mathcal{M}$ is also $(\varepsilon,\delta(\varepsilon))$-DP for arbitrary $\varepsilon\geq 0$ with
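The displayed expression for $\delta(\varepsilon)$ does not survive in this summary. As a minimal sketch, assuming the conversion of Canonne et al. (2020) that the lemma cites, $\delta(\varepsilon) = \frac{\exp\left((\alpha-1)(\varepsilon'-\varepsilon)\right)}{\alpha}\left(1-\frac{1}{\alpha}\right)^{\alpha-1}$, the bound can be evaluated numerically as below. The function names are hypothetical, and the formula should be checked against the paper before use. The Gaussian mechanism with sensitivity 1 and noise scale $\sigma$ serves as a toy example because its RDP parameter, $\alpha/(2\sigma^2)$, is standard.

```python
import math


def rdp_to_delta(alpha, rdp_eps, eps):
    """delta(eps) implied by (alpha, rdp_eps)-RDP under the assumed conversion."""
    if alpha <= 1:
        raise ValueError("alpha must be > 1")
    # Work in log space to avoid overflow for large alpha.
    log_delta = ((alpha - 1) * (rdp_eps - eps)
                 - math.log(alpha)
                 + (alpha - 1) * math.log1p(-1.0 / alpha))
    if log_delta >= 0:
        return 1.0  # the bound is vacuous; delta never needs to exceed 1
    return math.exp(log_delta)


def gaussian_rdp(alpha, sigma):
    """RDP parameter of the Gaussian mechanism with sensitivity 1."""
    return alpha / (2.0 * sigma ** 2)


def best_delta(eps, sigma, alphas):
    """Tightest delta over a grid of RDP orders."""
    return min(rdp_to_delta(a, gaussian_rdp(a, sigma), eps) for a in alphas)


if __name__ == "__main__":
    orders = [1.5, 2, 3, 4, 8, 16, 32, 64]
    print(best_delta(eps=1.0, sigma=4.0, alphas=orders))
```

Since the lemma holds for every order $\alpha$ at which the mechanism satisfies RDP, taking the minimum over a grid of orders gives the tightest bound that grid allows.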

Figures (9)

Figure 1: Comparison of the $(\varepsilon,\delta)$-bounds for variant 1 (Equation \ref{eq:composition1}) and variant 2 (Equation \ref{eq:composition2}) as a function of the subsampling ratio $q$ used to sample the tuning set $X_1$. Also shown is the $(\varepsilon,\delta)$-bound for the baseline algorithm described in Thm. \ref{thm:main_rdp_poisson}. Here $\mu$ denotes the expected number of model evaluations in the tuning algorithm.
Figure 2: Tuning the learning rate with DP-SGD. Test accuracies are averaged across 10 independent runs, and the error bars denote the standard error of the mean. The numbers in the legends give the mean training time of the baseline relative to the faster of variants 1 and 2. For example, for CIFAR-10, the average training time of the baseline method is 6.06 times that of the fastest of our methods. For perspective, we also add curves showing the privacy cost of training a single model with the optimal hyperparameters obtained from the baseline.
Figure 3: Tuning the learning rate with DP-Adam. Test accuracies are averaged across 10 independent runs, and the error bars denote the standard error of the mean. The numbers in the legends give the mean training time of the baseline relative to the faster of variants 1 and 2. For example, for FashionMNIST, the average training time of the baseline method is 9.03 times that of the fastest of our methods. For perspective, we also add curves showing the privacy cost of training a single model with the optimal hyperparameters obtained from the baseline. Figure \ref{fig:figure_tune_lr_detailed_dp_adam} (Appendix) shows a more detailed version of this plot.
Figure 4: Tuning the subsampling ratio, the number of training epochs, and the learning rate with DP-SGD. Test accuracies are averaged across 10 independent runs, and the error bars denote the standard error of the mean. The numbers in the legends give the mean training time of the baseline method relative to the faster of variants 1 and 2. For perspective, we also add curves showing the privacy cost of training a single model with the optimal hyperparameters obtained from the baseline. Figure \ref{fig:figure_tune_all_detailed_dp_sgd} (Appendix) shows a more detailed version of this plot.
Figure 5: Tuning only the learning rate with DP-SGD. Test accuracies are averaged across 10 independent runs, and the error bars denote the standard error of the mean. The numbers in the legends in the first column give the mean training time of the baseline method relative to the faster of variants 1 and 2. The second column plots the final $\varepsilon$ against the mean $\sigma$. Our methods inject significantly less noise than the baseline in all $\varepsilon$ regimes. We also observe that, owing to the tight analysis in Thm. \ref{thm:strategy1_rdp1}, $\sigma$ for variant 1 is consistently lower than for variant 2. As a result, we see slightly higher accuracy for variant 1 in many cases. The third column plots the final $\varepsilon$ against the mean optimal $\eta$. Note that, due to randomness in the candidate selection process, the optimal $\eta$ values for the three methods need not be the same. For perspective, we also add curves showing the privacy cost of training a single model with the optimal hyperparameters obtained from the baseline.
...and 4 more figures

Theorems & Definitions (17)

Definition 1
Definition 2
Lemma 3: canonne2020discrete
Theorem 4: zhu2019
Theorem 5: papernot2021
Theorem 6
Remark 7
Remark 8
Lemma 9
proof
...and 7 more
