Weights Shuffling for Improving DPSGD in Transformer-based Models
Jungang Yang, Zhe Ji, Liyao Xiang
TL;DR
The paper addresses the challenge of differential privacy for high-dimensional Transformer-based models trained with DPSGD by introducing a weight-shuffling mechanism that leverages permutation invariance in MLP and Transformer blocks. It formulates a theoretical condition for $(\varepsilon, \delta)$-DP under a shuffled Gaussian mechanism and uses a Fenton-Wilkinson (FW) sum-of-lognormal approximation to handle the resulting mixture distributions, combined with Advanced Composition and Privacy Amplification by Subsampling for privacy budgeting. The authors implement Shuffled DPSGD, detailing the algorithm and a practical privacy accountant, and validate permutation invariance, approximation accuracy, and privacy guarantees through experiments across CV and NLP tasks, including privacy auditing with existing benchmarks. Empirically, Shuffled DPSGD yields higher utility than state-of-the-art baselines at the same privacy level, with modest computational overhead, demonstrating the method’s practicality for large-scale private training of Transformer-based models.
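To illustrate the sum-of-lognormal approximation mentioned above, here is a minimal sketch, assuming "FW" refers to the classical Fenton-Wilkinson moment-matching method and that the lognormal terms are independent. The function name `fenton_wilkinson` and the example parameters are hypothetical; this is not the authors' privacy accountant, only the textbook approximation it appears to build on.

```python
import math

def fenton_wilkinson(mus, sigmas):
    """Approximate a sum of independent lognormals exp(N(mu_i, sigma_i^2))
    by one lognormal exp(N(mu_z, sigma_z^2)) with the same mean and variance."""
    # Exact mean and variance of the sum (independence assumed).
    mean = sum(math.exp(m + s * s / 2) for m, s in zip(mus, sigmas))
    var = sum((math.exp(s * s) - 1) * math.exp(2 * m + s * s)
              for m, s in zip(mus, sigmas))
    # Invert the lognormal moment formulas to get the matching parameters.
    sigma_z_sq = math.log(1 + var / mean ** 2)
    mu_z = math.log(mean) - sigma_z_sq / 2
    return mu_z, math.sqrt(sigma_z_sq)

# Hypothetical example: three lognormal terms with illustrative parameters.
mus = [0.0, 0.2, -0.1]
sigmas = [0.5, 0.4, 0.6]
mu_z, sigma_z = fenton_wilkinson(mus, sigmas)

# The approximating lognormal reproduces the mean of the sum by construction.
exact_mean = sum(math.exp(m + s * s / 2) for m, s in zip(mus, sigmas))
approx_mean = math.exp(mu_z + sigma_z ** 2 / 2)
```

Moment matching keeps the first two moments exact but says nothing about the tails, which is presumably why the summary lists approximation accuracy among the things validated experimentally.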
Abstract
Differential Privacy (DP) mechanisms, especially in high-dimensional settings, often face the challenge of maintaining privacy without compromising data utility. This work introduces a shuffling mechanism in Differentially-Private Stochastic Gradient Descent (DPSGD) to enhance the utility of large models at the same privacy guarantee as the unshuffled case. Specifically, we reveal that random shuffling brings additional randomness to the trajectory of gradient descent without affecting model accuracy, thanks to the permutation invariance property: the model can be equivalently computed in both forward and backward propagation under permutation. We show that permutation indeed improves the privacy guarantee of DPSGD in theory, but tracking the exact privacy loss on a shuffled model is particularly challenging. Hence, we exploit an approximation of the sum of lognormal distributions to derive the condition under which the shuffled DPSGD meets the DP guarantee. Auditing results show that our condition offers a DP guarantee quite close to the audited privacy level, demonstrating that our approach is an effective estimate in practice. Experimental results verify our theoretical derivation and show that our mechanism improves the accuracy of DPSGD over state-of-the-art baselines on a variety of models and tasks.
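The permutation invariance property described above can be checked on a toy example. The sketch below assumes a two-layer ReLU MLP and verifies only the forward pass. All names (`mlp`, `shuffle_hidden`, the layer sizes) are hypothetical and chosen for illustration; this is not the authors' Shuffled DPSGD implementation.

```python
import random

def matvec(W, x):
    # Dense matrix-vector product on plain Python lists.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def mlp(W1, b1, W2, b2, x):
    # y = W2 * relu(W1 x + b1) + b2
    h = [max(0.0, z + b) for z, b in zip(matvec(W1, x), b1)]
    return [z + b for z, b in zip(matvec(W2, h), b2)]

def shuffle_hidden(W1, b1, W2, perm):
    # Reorder hidden units: rows of W1 and entries of b1 follow perm,
    # and columns of W2 follow the same perm so each unit keeps its weights.
    W1p = [W1[j] for j in perm]
    b1p = [b1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, b1p, W2p

rng = random.Random(0)
d_in, d_hidden, d_out = 4, 6, 3
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
b2 = [rng.gauss(0, 1) for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

perm = list(range(d_hidden))
rng.shuffle(perm)
W1p, b1p, W2p = shuffle_hidden(W1, b1, W2, perm)

y = mlp(W1, b1, W2, b2, x)
y_shuffled = mlp(W1p, b1p, W2p, b2, x)
# Outputs agree up to floating-point summation order.
max_diff = max(abs(a - b) for a, b in zip(y, y_shuffled))
```

Because the shuffled weights compute the same function, the permutation adds randomness to the released parameters without changing the model's predictions, which is the intuition the abstract relies on.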
