Pre-training Differentially Private Models with Limited Public Data

Zhiqi Bu, Xinwei Zhang, Mingyi Hong, Sheng Zha, George Karypis

TL;DR

We make a key observation that DP optimizers' performance degradation can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy and, in turn, to new DP pre-trained models.

Abstract

The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method to gauge the degree of security provided to the models, its application is commonly limited to the model fine-tuning stage, due to the performance degradation when applying DP during the pre-training stage. Consequently, DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement. We make a key observation that DP optimizers' performance degradation can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10% of public data, our strategy can achieve DP accuracy of 41.5% on ImageNet-21k (with $\epsilon=8$), as well as non-DP accuracy of 55.7% and 60.0% on downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models. Our DP pre-trained models are released in the fastDP library (https://github.com/awslabs/fast-differential-privacy/releases/tag/v2.1).

Paper Structure (54 sections, 1 theorem, 50 equations, 9 figures, 8 tables, 1 algorithm)

Key Result

Lemma A.1

Consider an iterative algorithm with $\ell_2$ sensitivity $1$ at each iteration that uniformly samples data from a dataset of size $n$ with ratio $\frac{B}{n}$. If Gaussian noise ${\mathcal{N}}(0,\sigma^2{\mathbf{I}})$ is injected into the output of the algorithm at each iteration, then the algorithm satisfies $\mu$-GDP, where $S$ denotes the fixed computation budget.
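To make the setting of the lemma concrete, the following is a minimal NumPy sketch of one iteration of such an algorithm: a batch of size $B$ is sampled uniformly, each per-sample gradient is clipped to $\ell_2$ norm at most $1$ (so the summed output has sensitivity $1$), and Gaussian noise with standard deviation $\sigma$ is added. The function names `sample_batch` and `dp_sgd_step`, the learning rate `lr`, and the averaging by $B$ are illustrative assumptions, not the paper's or the fastDP library's implementation.

```python
import numpy as np


def sample_batch(n, B, rng):
    """Uniformly sample B distinct indices from a dataset of size n (ratio B/n)."""
    return rng.choice(n, size=B, replace=False)


def dp_sgd_step(params, per_sample_grads, sigma, lr, rng, clip_norm=1.0):
    """One clipped, noisy gradient step (hypothetical helper for illustration).

    per_sample_grads has shape (B, d): one gradient row per sampled example.
    """
    B = per_sample_grads.shape[0]
    # Per-sample l2 norms, shape (B, 1).
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale each row so its norm is at most clip_norm; small rows are unchanged.
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped_sum = (per_sample_grads * factors).sum(axis=0)
    # Gaussian noise N(0, (sigma * clip_norm)^2 I) added to the clipped sum.
    noise = rng.normal(0.0, sigma * clip_norm, size=clipped_sum.shape)
    private_grad = (clipped_sum + noise) / B
    return params - lr * private_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    idx = sample_batch(n=100, B=4, rng=rng)
    grads = rng.normal(size=(len(idx), 3))  # stand-in for real per-sample gradients
    print(dp_sgd_step(np.zeros(3), grads, sigma=1.0, lr=0.1, rng=rng))
```

Setting `sigma=0` gives clipped SGD without noise, and passing a very large `clip_norm` effectively disables clipping, which corresponds to the optimizer variants compared in Figure 1.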

Figures (9)

  • Figure 1: Comparison among the convergence of standard SGD, clipped SGD without noise, noisy SGD without clipping, and DP-SGD in different tasks and training stages.
  • Figure 2: Summary of results in the experiments section. The first three figures compare the downstream and few-shot performance and the data efficiency (circle radius proportional to pre-training data size) of the DP pre-trained models; the last figure shows the performance of DP pre-trained models in defending against privacy attacks (lower means stronger defense).
  • Figure 3: Noise levels by privacy accountants.
  • Figure 4: Per-sample gradient clipping, as described in the footnote on clipping.
  • Figure 5: Illustration of the different terms in the private and public loss-improvement equations. Left sub-plots depict the denominators of the two equations. Right sub-plots depict the whole terms and the optimal batch sizes.
  • ...and 4 more figures

Theorems & Definitions (7)

  • Definition 1.1 (Dwork et al., 2006; Dong et al., 2019)
  • Remark 2.2
  • Remark 3.2
  • Remark 4.1
  • Remark 4.2
  • Lemma A.1
  • proof