Private Fine-tuning of Large Language Models with Zeroth-order Optimization

Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, Prateek Mittal

TL;DR

This work tackles private fine-tuning of large language models by replacing gradient-based DP training with zeroth-order optimization that privatizes only a scalar step size. The DP-ZO method uses SPSA-like updates, Poisson subsampling, and either Gaussian or Laplace noise to ensure $(\varepsilon,\delta)$-DP, achieving memory efficiency and scalability to models up to 66B parameters. Empirical results across SQuAD, DROP, and SST2 show DP-ZO can match DP-SGD performance at similar model sizes and enable nontrivial pure $\varepsilon$-DP utility, with notable memory advantages, especially for long sequences. The approach offers a practical, scalable pathway for privacy-preserving fine-tuning of foundation models, with potential extensions to other domains and DP mechanisms.

Abstract

Differentially private stochastic gradient descent (DP-SGD) allows models to be trained in a privacy-preserving manner, but has proven difficult to scale to the era of foundation models. We introduce DP-ZO, a private fine-tuning framework for large language models that privatizes zeroth-order optimization methods. A key insight into the design of our method is that the direction of the gradient in the zeroth-order optimization we use is random and the only information from training data is the step size, i.e., a scalar. Therefore, we only need to privatize the scalar step size, which is memory-efficient. DP-ZO provides a strong privacy-utility trade-off across different tasks and model sizes that is comparable to DP-SGD in $(\varepsilon,\delta)$-DP. Notably, DP-ZO possesses significant advantages over DP-SGD in memory efficiency, and obtains higher utility in $\varepsilon$-DP when using the Laplace mechanism.
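
The scalar-privatization idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `dp_zo_step`, the toy squared loss, and all hyperparameter values are assumptions made for illustration, and Poisson subsampling and privacy accounting are omitted. Per-example loss differences along a random direction are clipped, summed, and perturbed with Gaussian noise, and the resulting noisy scalar scales the update along that same direction.

```python
import random


def dp_zo_step(theta, per_example_loss, batch, rng,
               lr=0.05, phi=1e-3, clip=1.0, sigma=1.0):
    """One illustrative DP-ZO-style update on a list of floats (a sketch only)."""
    # Random direction: drawn independently of the private data.
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + phi * zi for t, zi in zip(theta, z)]
    minus = [t - phi * zi for t, zi in zip(theta, z)]

    # Per-example loss differences are the only data-dependent quantities.
    # Clipping each scalar to [-clip, clip] bounds one example's contribution.
    clipped = [
        max(-clip, min(clip, per_example_loss(plus, x) - per_example_loss(minus, x)))
        for x in batch
    ]

    # Add Gaussian noise to the summed scalar (noise scale sigma * clip).
    noisy_sum = sum(clipped) + rng.gauss(0.0, sigma * clip)

    # Noisy scalar step size, then move along the random direction.
    step = noisy_sum / (2.0 * phi * len(batch))
    return [t - lr * step * zi for t, zi in zip(theta, z)]


def squared_loss(theta, x):
    # Hypothetical toy loss standing in for a language-model loss.
    return 0.5 * sum((t - xi) ** 2 for t, xi in zip(theta, x))


rng = random.Random(0)
batch = [[0.1] * 5, [0.2] * 5]
theta = dp_zo_step([0.0] * 5, squared_loss, batch, rng)
```

Because only one scalar per step is noised, swapping the Gaussian draw for a Laplace draw would change a single line in this sketch, which is consistent with the flexibility across DP mechanisms that the abstract mentions.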

Paper Structure (29 sections, 7 theorems, 9 equations, 6 figures, 21 tables, 2 algorithms)


Key Result

Proposition 2.2

For any function $f : \mathbb{X}^n \rightarrow \mathbb{R}$ with $l_2$ sensitivity $\Delta$, the mechanism defined as $M(X) = f(X) + z$, where $z \sim \mathcal{N}\left(0,\sigma^2\right)$, provides $(\varepsilon, \delta)$-DP where $\Phi\left(\frac{\Delta}{2\sigma}-\frac{\varepsilon\sigma}{\Delta}\right)-e^{\varepsilon}\Phi\left(-\frac{\Delta}{2\sigma}-\frac{\varepsilon\sigma}{\Delta}\right)\leq \delta$. $\Phi(t)$ is the cumulative distribution function of the standard normal distribution.
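
The $\delta$ condition in Proposition 2.2 can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function names `std_normal_cdf` and `gaussian_mechanism_delta` and the example parameter values are assumptions for illustration, not taken from the paper. It returns the left-hand side of the inequality, i.e. the smallest $\delta$ the bound certifies for a given $\varepsilon$, $\sigma$, and $\Delta$.

```python
import math


def std_normal_cdf(t):
    # Phi(t): CDF of the standard normal distribution, via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def gaussian_mechanism_delta(epsilon, sigma, sensitivity=1.0):
    """Left-hand side of the delta condition in Proposition 2.2."""
    a = sensitivity / (2.0 * sigma)
    b = epsilon * sigma / sensitivity
    return std_normal_cdf(a - b) - math.exp(epsilon) * std_normal_cdf(-a - b)


# Illustrative values only: more noise (larger sigma) gives a smaller delta.
delta_small_noise = gaussian_mechanism_delta(epsilon=1.0, sigma=1.0)
delta_large_noise = gaussian_mechanism_delta(epsilon=1.0, sigma=2.0)
```

In practice one would typically search over $\sigma$ (for example by bisection) to find the smallest noise scale meeting a target $(\varepsilon, \delta)$; that search is left out of this sketch.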

Figures (6)

  • Figure 1: Visualization of DP-ZO. The only information from private data is a scalar step size for direction with lower target function value and we only need to add noise to this scalar. This scalar privatization enjoys the benefits of flexibility with DP mechanisms, ease of implementation, and reduced computation.
  • Figure 2: DP-ZO provides a strong privacy-utility trade-off across different tasks under conservative privacy budgets. F1 is for SQuAD and DROP and accuracy is for SST2.
  • Figure 3: DP-ZO achieves performance comparable to DP-SGD at the same model size and scales seamlessly to large models such as 30B/66B, which are challenging for DP-SGD.
  • Figure 4: DP-ZO achieves non-trivial performance for $\varepsilon$-DP. In contrast, DP-SGD (Laplace) struggles to improve upon $\varepsilon=0$ (zero-shot) due to high variance.
  • Figure 5: Memory comparison of DP-ZO and DP-SGD with half-precision and gradient checkpointing. Batch size=1, gradient accumulation steps=2.
  • ...and 1 more figure

Theorems & Definitions (14)

  • Definition 2.1: $(\varepsilon,\delta)$-Differential Privacy (DP)
  • Proposition 2.2: Gaussian mechanism [dwork2014algorithmic, balle2018improving]
  • Proposition 2.3: Laplace mechanism [dwork2014algorithmic]
  • Definition 2.4: Simultaneous Perturbation Stochastic Approximation (SPSA) [Spall1992MultivariateSA]
  • Theorem 3.1
  • Proposition A.1: Basic Composition theorem [dwork2014algorithmic]
  • Proposition A.2: Privacy Amplification via Subsampling [balle2018subsampleamplify]
  • Definition A.3: Hockey-stick Divergence
  • Lemma A.4
  • Definition A.5: Optimal Privacy Curve
  • ...and 4 more