Private Zeroth-Order Optimization with Public Data
Xuchen Gong, Tian Li
TL;DR
PAZO is a suite of public-data-assisted private zeroth-order optimizers designed to narrow the privacy-utility gap in DP training. It comprises three variants: PAZO-M (mixing private zeroth-order estimates with public gradients), PAZO-P (restricting updates to the public gradient subspace), and PAZO-S (selecting the best public gradient). All three retain the efficiency of zeroth-order methods. The theoretical analysis establishes convergence under a γ-similarity assumption between public and private data, with reduced dimension dependence. Empirically, across vision and language tasks in both pre-training and fine-tuning, PAZO achieves superior privacy/utility tradeoffs, especially in highly private settings, and up to 16× runtime speedup over first-order baselines.
Abstract
One of the major bottlenecks for deploying popular first-order differentially private (DP) machine learning algorithms (e.g., DP-SGD) lies in their high computation and memory cost, despite the existence of optimized implementations. Zeroth-order methods show promise in mitigating this overhead: they approximate gradients using only function evaluations, and are hence significantly easier to privatize. While recent works have explored zeroth-order approaches in both private and non-private settings, these approaches still suffer from relatively low utility compared with DP-SGD and have only been evaluated in limited application domains. In this work, we propose to leverage public information to guide and improve the gradient approximation of private zeroth-order algorithms. We explore a suite of public-data-assisted zeroth-order optimizers (PAZO) with minimal overhead. We provide theoretical analyses of the PAZO framework under an assumption of similarity between public and private data. Empirically, we demonstrate that PAZO achieves superior privacy/utility tradeoffs across vision and text tasks in both pre-training and fine-tuning settings, outperforming the best first-order baselines (with public data), especially in highly private regimes, while offering up to $16\times$ runtime speedup.
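To make the idea of privatizing function evaluations concrete, the following is a minimal sketch of a generic two-point zeroth-order gradient estimate in which only a per-example scalar is clipped and noised. It is an illustration under stated assumptions, not the algorithm specified in this paper: the function names `private_zo_gradient` and `mix_with_public`, the parameters `mu`, `clip`, `sigma`, and `alpha`, and the convex-combination form of the mixing step are all hypothetical choices made here for exposition.

```python
import numpy as np


def private_zo_gradient(loss_fn, params, batch, rng, mu=1e-3, clip=1.0, sigma=1.0):
    """Two-point zeroth-order gradient estimate with a privatized scalar.

    Assumption: loss_fn(params, example) returns a scalar loss for one example.
    """
    # One random direction shared by every example in the batch.
    u = rng.standard_normal(params.shape)
    # Central finite difference along u: one scalar per example.
    diffs = np.array([
        (loss_fn(params + mu * u, x) - loss_fn(params - mu * u, x)) / (2.0 * mu)
        for x in batch
    ])
    # Clip each scalar so a single example has bounded influence.
    clipped = np.clip(diffs, -clip, clip)
    # Add Gaussian noise to the sum of clipped scalars, then average.
    noisy_mean = (clipped.sum() + sigma * clip * rng.standard_normal()) / len(batch)
    # The estimate is the noisy scalar times the direction.
    return noisy_mean * u


def mix_with_public(private_est, public_grad, alpha=0.5):
    """Hypothetical mixing step: convex combination of a public gradient
    and a private zeroth-order estimate (alpha is an assumed weight)."""
    return alpha * public_grad + (1.0 - alpha) * private_est
```

Because the quantity being clipped and noised is a scalar per example rather than a full per-example gradient, the privatization step in this sketch needs no per-example gradient storage, which is the source of the efficiency argument made above. How the noise scale maps to a formal privacy budget depends on the accounting used and is not addressed here.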
