Differentially Private Zeroth-Order Methods for Scalable Large Language Model Finetuning

Z Liu, J Lou, W Bao, Y Hu, B Li, Z Qin, K Ren

TL;DR

This work addresses the privacy-utility-scalability tradeoff in fine-tuning pretrained LLMs by moving beyond DP-SGD to differentially private zeroth-order methods. It introduces DP-ZOSO, a stagewise zeroth-order fine-tuner, and DP-ZOPO, a pruning-enabled variant that focuses updates on important directions via data-free pruning and dynamic masking. The authors provide theoretical analyses of privacy guarantees and convergence, and demonstrate strong empirical gains across encoder-only and decoder-only models on diverse tasks, often approaching full-parameter DP fine-tuning with far lower memory consumption. Dynamic ZO scale scheduling and stagewise optimization further stabilize training, while pruning strategies (static, dynamic, incremental) substantially boost utility under DP. Overall, the proposed DP-ZO family offers memory-efficient, scalable, and high-utility DP fine-tuning for large language models with broad applicability to classification, QA, translation, and summarization tasks.

Abstract

Fine-tuning on task-specific datasets is a widely embraced paradigm for harnessing the capabilities of pretrained LLMs in various downstream tasks. Given the popularity of LLM fine-tuning and its accompanying privacy concerns, differentially private (DP) fine-tuning of pretrained LLMs has been widely used to safeguard the privacy of task-specific datasets. At the design core of DP LLM fine-tuning methods lies a satisfactory tradeoff among privacy, utility, and scalability. Most existing methods build upon the seminal work of DP-SGD. Despite pushing the scalability of DP-SGD to its limit, DP-SGD-based fine-tuning methods are unfortunately limited by the inherent inefficiency of SGD. In this paper, we investigate the potential of DP zeroth-order methods for LLM fine-tuning, which avoid the scalability bottleneck of SGD by approximating the gradient with the more efficient zeroth-order gradient. Rather than treating the zeroth-order method as a drop-in replacement for SGD, this paper presents a comprehensive study, both theoretical and empirical. First, we propose the stagewise DP zeroth-order method (DP-ZOSO), which dynamically schedules key hyperparameters. This design is grounded in the synergy between the DP random perturbation and the gradient approximation error of the zeroth-order method, and its effect on the fine-tuning trajectory. Second, we propose a pruning-enabled variant (DP-ZOPO) that focuses updates on important directions via data-free pruning. We provide theoretical analysis for both proposed methods. We conduct extensive empirical analysis on both encoder-only masked language models and decoder-only autoregressive language models, achieving impressive results in terms of scalability and utility regardless of the class of tasks (compared with DPZero, DP-ZOPO improves by $4.5\%$ on SST-5 and $5.5\%$ on MNLI with RoBERTa-Large, and by $9.2\%$ on CB and $3.9\%$ on BoolQ with OPT-2.7b when $\epsilon=4$, with more significant gains on more complicated tasks).
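
To make the idea of a privatized zeroth-order gradient concrete, here is a minimal sketch of one update step. It is an illustration under stated assumptions, not the authors' DP-ZOSO or DP-ZOPO algorithm: it assumes a two-point finite-difference estimate along one shared Gaussian direction, clipping of the per-example scalar difference, and Gaussian noise added to the clipped sum. The names `dp_zo_step`, `zo_scale`, `clip`, `noise_std`, and `mask` are hypothetical, and no specific $(\epsilon,\delta)$ guarantee is claimed for the chosen noise level.

```python
import random


def dp_zo_step(theta, per_example_loss, batch, *, lr, zo_scale, clip,
               noise_std, rng, mask=None):
    """One illustrative DP zeroth-order update (hypothetical helper).

    theta            : list of floats, current parameters
    per_example_loss : callable (params, example) -> float
    batch            : list of examples
    zo_scale         : finite-difference step size (the "ZO scale")
    clip             : bound on each per-example scalar difference
    noise_std        : Gaussian noise multiplier (assumed, not calibrated here)
    mask             : optional 0/1 list; zeros freeze pruned coordinates
    """
    # Sample one shared random direction v ~ N(0, I).
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    if mask is not None:
        # Restrict the perturbation to the unpruned coordinates.
        v = [vi * mi for vi, mi in zip(v, mask)]

    plus = [t + zo_scale * vi for t, vi in zip(theta, v)]
    minus = [t - zo_scale * vi for t, vi in zip(theta, v)]

    # Per-example two-point estimate of the directional derivative,
    # clipped so that a single example has bounded influence.
    total = 0.0
    for x in batch:
        diff = (per_example_loss(plus, x) - per_example_loss(minus, x)) / (2.0 * zo_scale)
        total += max(-clip, min(clip, diff))

    # Add Gaussian noise scaled to the clipping bound, then average.
    noisy = (total + rng.gauss(0.0, noise_std * clip)) / len(batch)

    # Move along the sampled direction; only forward passes were needed.
    return [t - lr * noisy * vi for t, vi in zip(theta, v)]


def squared_error(params, example):
    """Toy per-example loss used only for the demonstration below."""
    return 0.5 * sum((p - e) ** 2 for p, e in zip(params, example))


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(8)]
    theta = [3.0] * 5
    for _ in range(200):
        theta = dp_zo_step(theta, squared_error, data, lr=0.05,
                           zo_scale=1e-3, clip=5.0, noise_std=0.5, rng=rng)
    print(theta)
```

Because only a scalar per example is clipped and noised, the step never materializes per-example gradients, which is where the memory saving over DP-SGD-style training would come from in this sketch. The optional `mask` shows how a pruning mask could confine the random direction to a subset of parameters.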

Paper Structure (30 sections, 14 theorems, 52 equations, 11 figures, 21 tables, 5 algorithms)

Key Result

Lemma 1

There exist constants $c_1$ and $c_2$ so that, given the sampling probability $q=m/n$ and the number of steps $T$, for any $\epsilon< c_1q^2T$, DP-SGD is $(\epsilon,\delta)$-differentially private for any $\delta> 0$ if we choose
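
The condition itself is cut off in this extract. For orientation only: the moments accountant result of abadi2016deep, which the theorem list gives as the source of Lemma 1, states the requirement as a lower bound on the Gaussian noise scale. The form below is taken from that source and is an assumption about how this paper's Lemma 1 continues. The symbol $\sigma$ (the DP-SGD noise scale) does not appear in the extracted text.

$$
\sigma \;\ge\; c_2\,\frac{q\sqrt{T\log(1/\delta)}}{\epsilon}
$$

Read this way, the required noise grows with the sampling probability $q$ and with $\sqrt{T}$, and shrinks as the privacy budget $\epsilon$ is relaxed.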

Figures (11)

  • Figure 1: Experiments on RoBERTa-large. We report zero-shot, DPZero zhang2024dpzero, DP-ZOPO, DP-ZOSO, DP full-parameter fine-tuning (FT), and DP prefix-tuning (FT-prefix). DP-ZOPO and DP-ZOSO both outperform zero-shot and FT-prefix with much less memory. DP-ZOPO far outperforms DPZero and DP-ZOSO, and approaches FT (detailed numbers in Table \ref{table robert}).
  • Figure 2: This figure illustrates the parameter variations in the stagewise algorithm and highlights the key difference between DP-ZOSO and DP-ZOPO, which lies in the sampling space of $\mathbf{v}$. The darker-colored dimensions represent those sampled from a normal distribution with a larger standard deviation. An "X" marks a dimension whose value is 0.
  • Figure 3: Different pruning strategies for dynamic pruning. The colored squares represent the parameters that require fine-tuning in the current stage. In the incremental strategy, the highlighted squares indicate the parameters that were fine-tuned in the previous stage and will be trained in the current stage.
  • Figure 4: GPU memory consumption and running time with OPT-1.3b and OPT-2.7b on SST-2. DP-ZOSO and DP-ZOPO require less GPU memory and less running time.
  • Figure 5: Results of fine-tuning RoBERTa-large on SNLI with zeroth-order and first-order methods. For the zeroth-order method, fine-tuning with pruning helps optimization under all private settings.
  • ...and 6 more figures

Theorems & Definitions (26)

  • Definition 1: LLM Fine-tuning
  • Definition 2: Stochastic gradient descent
  • Definition 3: Differential Privacy dwork2006calibrating
  • Definition 4: DP-SGD abadi2016deep
  • Lemma 1: Moments Accountant (Poisson Subsampling) abadi2016deep
  • Lemma 2: Privacy amplification by Poisson Subsampling and Uniform Sampling Without Replacement balle2018privacy
  • Definition 5: Zeroth-order gradient estimation spall1992multivariate
  • Lemma 3
  • Definition 6: Weakly-convex
  • Theorem 1: Privacy analysis of DP-ZOSO
  • ...and 16 more