DPZero: Private Fine-Tuning of Language Models without Backpropagation

Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil, Sewoong Oh, Niao He

TL;DR

This work tackles the dual challenges of memory-intensive backpropagation and data privacy in fine-tuning large language models. It introduces DPZero, a private zeroth-order optimization algorithm that decouples gradient direction (public) from magnitude (private), employs a tighter clipping strategy, and uses scalar privacy noise, achieving nearly dimension-independent rates under a low effective rank assumption. Theoretical results show that DPZero attains convergence rates tied to the intrinsic rank $r$, with rates scaling as roughly $\tilde{O}\left(\frac{\text{poly}(\text{problem parameters})}{n\,\varepsilon} \times \sqrt{r \log(e/\delta)}\right)$, significantly improving over naive dimension-dependent zeroth-order DP methods. Empirically, DPZero demonstrates memory efficiency and competitive accuracy on private fine-tuning tasks for RoBERTa and OPT, confirming its practicality for large-scale, privacy-preserving model adaptation.

Abstract

The widespread practice of fine-tuning large language models (LLMs) on domain-specific data faces two major challenges in memory and privacy. First, as the size of LLMs continues to grow, the memory demands of gradient-based training methods via backpropagation become prohibitively high. Second, given the tendency of LLMs to memorize training data, it is important to protect potentially sensitive information in the fine-tuning data from being regurgitated. Zeroth-order methods, which rely solely on forward passes, substantially reduce memory consumption during training. However, directly combining them with standard differentially private gradient descent suffers more as model size grows. To bridge this gap, we introduce DPZero, a novel private zeroth-order algorithm with nearly dimension-independent rates. The memory efficiency of DPZero is demonstrated in privately fine-tuning RoBERTa and OPT on several downstream tasks. Our code is available at https://github.com/Liang137/DPZero.
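
To make the direction/magnitude split concrete, here is a minimal NumPy sketch of one private zeroth-order step in the spirit of the summary above. It is an illustration under stated assumptions, not the authors' implementation: the helper name `dpzero_style_step`, the sampling of the direction from the sphere of radius $\sqrt{d}$, the central finite difference, and the caller-supplied noise level `sigma` are all assumptions, and calibrating `sigma` to a target $(\varepsilon, \delta)$ is not shown.

```python
import numpy as np

def dpzero_style_step(theta, per_example_loss, data, lr, lam, clip, sigma, rng):
    """One illustrative private zeroth-order step (hypothetical helper)."""
    d = theta.shape[0]
    # Public part: a random direction drawn without looking at the data.
    u = rng.standard_normal(d)
    u *= np.sqrt(d) / np.linalg.norm(u)  # assumed: uniform on the sphere of radius sqrt(d)
    # Private part: one scalar finite difference per example, forward evaluations only.
    diffs = np.array([
        (per_example_loss(theta + lam * u, x) - per_example_loss(theta - lam * u, x)) / (2 * lam)
        for x in data
    ])
    # Clip each scalar, average, then add a single scalar Gaussian noise draw.
    clipped = np.clip(diffs, -clip, clip)
    noisy_magnitude = clipped.mean() + sigma * rng.standard_normal()
    # Move along the public direction by the privatized magnitude.
    return theta - lr * noisy_magnitude * u

# Toy usage on a quadratic per-example loss (hypothetical data).
rng = np.random.default_rng(0)
toy_loss = lambda th, x: 0.5 * float(np.sum((th - x) ** 2))
toy_data = [rng.standard_normal(20) for _ in range(32)]
theta = np.zeros(20)
for _ in range(100):
    theta = dpzero_style_step(theta, toy_loss, toy_data, lr=0.01, lam=1e-3,
                              clip=5.0, sigma=0.1, rng=rng)
```

Only the scalar magnitude touches the data, so the noise is one-dimensional regardless of the model size; the direction `u` can be regenerated from a seed instead of being stored.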

Paper Structure

This paper contains 38 sections, 9 theorems, 104 equations, 3 figures, 11 tables, 2 algorithms.

Key Result

Lemma 2.2

Let $\mathcal{A}$ be some randomized algorithm operating on a dataset $S$ and outputting a vector in $\mathbb{R}^d$. If $\mathcal{A}$ has sensitivity $\Delta:=\sup_{S \sim S'} \lVert\mathcal{A}(S) - \mathcal{A}(S')\rVert$, the mechanism that adds Gaussian noise $\mathcal{N}(0,\sigma^2\mathrm{I}_d)$
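
The noise-adding step named in the lemma can be sketched as follows. This is a hedged illustration only: `clipped_mean` and `gaussian_mechanism` are hypothetical helper names, the noise level `sigma` is left to the caller rather than derived from $\Delta$, $\varepsilon$, and $\delta$, and the sensitivity bound $2C/n$ in the comment assumes that neighboring datasets differ by replacing one of $n$ examples.

```python
import numpy as np

def clipped_mean(vectors, clip):
    """Mean of the row vectors after rescaling each row to norm at most `clip`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    return (vectors * scale).mean(axis=0)

def gaussian_mechanism(value, sigma, rng):
    """Return value + N(0, sigma^2 I_d); sigma must be calibrated by the caller."""
    return value + sigma * rng.standard_normal(value.shape)

rng = np.random.default_rng(0)
S = 5.0 * rng.standard_normal((8, 4))
S_prime = S.copy()
S_prime[3] = 5.0 * rng.standard_normal(4)  # neighbor: one example replaced

clip = 1.0
# Under replace-one neighbors, the clipped mean moves by at most 2 * clip / n.
gap = np.linalg.norm(clipped_mean(S, clip) - clipped_mean(S_prime, clip))
print(gap, 2 * clip / len(S))

private_output = gaussian_mechanism(clipped_mean(S, clip), sigma=0.5, rng=rng)
```

Clipping is what makes the sensitivity finite here: without it, a single replaced example could move the mean arbitrarily far.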

Figures (3)

  • Figure 1: Experiments on the quadratic loss with effective rank $\text{Tr}(A)$ (the low effective rank assumption). For three different modes of the effective rank, we demonstrate how the norm of the train ((a), (b), and (c)) and test ((d), (e), and (f)) gradient depends on the problem dimension. DPGD-0th (the dimension-dependent baseline algorithm) has a strong dimension dependence regardless of the effective rank, while DPZero (the dimension-free algorithm) achieves dimension-independent performance when the effective rank is small (right panel), similar to the standard first-order method DP-GD. Insights into the saturation of DPGD-0th as the dimension increases can be found in the paper's remark on its bound.
  • Figure 2: Experiments on the quadratic loss with effective rank $\text{Tr}(A)$. For three different modes, we increase the dimension and report the best loss evaluated on both the training set ((a), (b), and (c)) and the test set ((d), (e), and (f)).
  • Figure 3: Experiments on privately fine-tuning RoBERTa (125M) on SNLI with DPZero. (a) (Smoothed) training curves when fixing the stepsize to $5\times 10^{-6}$ and varying the clipping threshold from 1 to 500. The choice of clipping threshold involves a tradeoff: larger values result in unnecessarily high privacy noise, while smaller values can increase the bias in the optimization process. (b) and (c) Test loss and accuracy (%) when varying the stepsize and clipping threshold together. Consistent with first-order methods (Li et al., 2022), we observe that larger clipping necessitates smaller stepsizes, whereas smaller clipping favors larger stepsizes.
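
One way to see why the trace $\text{Tr}(A)$, rather than the dimension $d$, is the natural quantity in the quadratic experiments: for a direction $u$ drawn uniformly from the sphere of radius $\sqrt{d}$, $\mathbb{E}[u^\top A u] = \text{Tr}(A)$. The sketch below checks this numerically. The two spectra (decaying as $1/i^2$ and flat) are hypothetical choices for illustration and are not the configurations used in the figures.

```python
import numpy as np

def sphere_directions(num, d, rng):
    """Sample `num` directions uniformly from the sphere of radius sqrt(d)."""
    u = rng.standard_normal((num, d))
    return u * (np.sqrt(d) / np.linalg.norm(u, axis=1, keepdims=True))

def mean_curvature(eigs, num, rng):
    """Monte Carlo estimate of E[u^T A u] for the diagonal matrix A = diag(eigs)."""
    u = sphere_directions(num, eigs.shape[0], rng)
    return float(np.mean((u ** 2) @ eigs))

rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    decaying = 1.0 / np.arange(1, d + 1) ** 2  # trace stays below pi^2 / 6 as d grows
    flat = np.ones(d)                          # trace equals d
    print(d,
          decaying.sum(), mean_curvature(decaying, 2000, rng),
          flat.sum(), mean_curvature(flat, 2000, rng))
```

With the decaying spectrum, the average curvature seen along a random direction stays bounded as the dimension grows, while the flat spectrum scales linearly with $d$.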

Theorems & Definitions (27)

  • Definition 2.1: Differential Privacy (Dwork et al., 2006; Dwork and Roth, 2014)
  • Lemma 2.2: Advanced Composition
  • Theorem 1
  • Remark 3.2
  • Remark 3.3
  • Remark 3.4
  • Theorem 2
  • Remark 3.6
  • Theorem 3
  • Remark 4.1
  • ...and 17 more