Weights Shuffling for Improving DPSGD in Transformer-based Models

Jungang Yang, Zhe Ji, Liyao Xiang

TL;DR

The paper tackles differentially private training of high-dimensional Transformer-based models with DPSGD by introducing a weight-shuffling mechanism that exploits the permutation invariance of MLP and Transformer blocks. It derives a theoretical condition for $(\varepsilon, \delta)$-DP under a shuffled Gaussian mechanism, using a FW-based sum-of-lognormal approximation to handle the resulting mixture distributions, and budgets privacy with Advanced Composition and Privacy Amplification by Subsampling. The authors implement Shuffled DPSGD, detail the algorithm and a practical privacy accountant, and validate permutation invariance, approximation accuracy, and privacy guarantees through experiments across CV and NLP tasks, including auditing with existing benchmarks. Empirically, Shuffled DPSGD yields higher utility than state-of-the-art baselines at the same privacy level with modest computational overhead, which supports its practicality for large-scale private training of Transformer-based models.
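
The sum-of-lognormal approximation mentioned in the summary can be illustrated with a small moment-matching sketch. It assumes that "FW" refers to the Fenton-Wilkinson method, which replaces a sum of lognormal variables with a single lognormal that has the same mean and variance. The function names, the i.i.d. setting, and the use of $\sigma = 0.25$ and $d = 10^{8}$ as the lognormal shape parameter and number of terms are illustrative assumptions, not the paper's accountant.

```python
import math

def fenton_wilkinson(mu, sigma, d):
    """Return (mu_z, sigma_z) of the lognormal matching the mean and variance
    of a sum of d i.i.d. LogNormal(mu, sigma^2) variables (hypothetical helper)."""
    # Exact first two moments of the sum of d independent terms.
    mean_sum = d * math.exp(mu + sigma**2 / 2)
    var_sum = d * math.expm1(sigma**2) * math.exp(2 * mu + sigma**2)
    # Moment matching: pick lognormal parameters with the same mean and variance.
    sigma_z_sq = math.log1p(var_sum / mean_sum**2)
    mu_z = math.log(mean_sum) - sigma_z_sq / 2
    return mu_z, math.sqrt(sigma_z_sq)

def lognormal_cdf(x, mu, sigma):
    """CDF of LogNormal(mu, sigma^2) at x > 0."""
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

# Illustrative values only (assumed interpretation of sigma and d).
mu_z, sigma_z = fenton_wilkinson(mu=0.0, sigma=0.25, d=10**8)
print(mu_z, sigma_z)
```

With a single term (`d=1`) the helper returns the original parameters, which is a quick sanity check that the moment matching is consistent.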

Abstract

Differential Privacy (DP) mechanisms, especially in high-dimensional settings, often struggle to maintain privacy without compromising data utility. This work introduces a shuffling mechanism in Differentially-Private Stochastic Gradient Descent (DPSGD) that enhances the utility of large models at the same privacy guarantee as the unshuffled case. Specifically, we reveal that random shuffling adds randomness to the trajectory of gradient descent without affecting model accuracy, thanks to the permutation invariance property: the model can be computed equivalently in both forward and backward propagation under permutation. We show that permutation indeed improves the privacy guarantee of DPSGD in theory, but tracking the exact privacy loss on the shuffled model is particularly challenging. Hence, we exploit an approximation of the sum of lognormal distributions to derive the condition under which shuffled DPSGD meets the DP guarantee. Auditing results show that our condition offers a DP guarantee quite close to the audited privacy level, demonstrating that our approach is an effective estimate in practice. Experimental results verify our theoretical derivation and show that our mechanism improves the accuracy of DPSGD over state-of-the-art baselines on a variety of models and tasks.
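
The permutation invariance property described in the abstract can be checked on a toy network. The sketch below is a minimal illustration, assuming a two-layer ReLU MLP in which the hidden units are reordered consistently in both layers; the helper names and sizes are hypothetical, and only the forward pass is covered, not the backward pass or Transformer blocks.

```python
import random

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2, using plain lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1, entries of b1, and columns of W2."""
    W1p = [W1[j] for j in perm]
    b1p = [b1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, b1p, W2p

rng = random.Random(0)
n_in, n_hidden, n_out = 4, 6, 3
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [rng.gauss(0, 1) for _ in range(n_out)]
x = [rng.gauss(0, 1) for _ in range(n_in)]

perm = list(range(n_hidden))
rng.shuffle(perm)  # a random permutation of the hidden units

y = mlp(x, W1, b1, W2, b2)
y_perm = mlp(x, *permute_hidden(W1, b1, W2, perm), b2)

# The outputs agree up to floating-point summation order.
max_gap = max(abs(a - b) for a, b in zip(y, y_perm))
print(max_gap)
```

The weights themselves change under the permutation while the function they compute does not, which is what lets shuffling add randomness without altering the model's predictions.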

Paper Structure (30 sections, 50 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: Left: two Gaussian distributions. Right: the distributions after random shuffling. $\mathcal{N}((-2,0), I)$ becomes a Gaussian mixture centered at $(-2,0)$ and $(0,-2)$, and similarly for $\mathcal{N}((2,0), I)$. We estimate the distributional distance by $\|P(x,y) - Q(x,y)\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm, over the area $x,y \in [-10, 10]$. The measured distance is 18.53 in the left panel, larger than the 13.96 measured in the right panel.
  • Figure 2: Comparison of different approximation methods for the sum of log-normal distributions. We set $\sigma=0.25$ and dimension $d = 10^{8}$ to plot the CDF curves.
  • Figure 3: Comparison of the relation between $\sigma$ and $\varepsilon$ in unshuffled and shuffled DPSGD. For (a)-(e), we calculate $\sigma$ for $\varepsilon \in \{0.25, 0.5, 1, 2, 4\}$. "GPT2-L E" denotes the GPT-2-large model trained on the E2E dataset, and "GPT2-L D" denotes the GPT-2-large model trained on the DART dataset.
  • Figure 4: Values of $\sigma$ for unshuffled ($d=1$) and shuffled DPSGD under different values of $\varepsilon$ and $d$.
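
The comparison described in the Figure 1 caption can be sketched numerically. The code below is a rough illustration, assuming that shuffling yields an equal-weight mixture over the two coordinate orderings of the mean and that the densities are sampled on a uniform grid with a step of 0.1; both choices and all function names are assumptions. The absolute values depend on the grid and any normalization, so the sketch is not expected to reproduce the reported 18.53 and 13.96, only the ordering between the two panels.

```python
import math

def gauss2d(x, y, mx, my):
    """Density of N((mx, my), I) at (x, y)."""
    return math.exp(-((x - mx) ** 2 + (y - my) ** 2) / 2) / (2 * math.pi)

def shuffled(x, y, mx, my):
    """Equal-weight mixture over both coordinate orderings of the mean
    (assumed model of a uniformly random shuffle of two coordinates)."""
    return 0.5 * gauss2d(x, y, mx, my) + 0.5 * gauss2d(x, y, my, mx)

def frobenius_gap(p, q, lo=-10.0, hi=10.0, n=201):
    """Frobenius norm of the grid of differences p - q on [lo, hi]^2."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * step
        for j in range(n):
            y = lo + j * step
            total += (p(x, y) - q(x, y)) ** 2
    return math.sqrt(total)

# Left panel: the two original Gaussians.
plain = frobenius_gap(lambda x, y: gauss2d(x, y, -2, 0),
                      lambda x, y: gauss2d(x, y, 2, 0))
# Right panel: the same pair after shuffling.
mixed = frobenius_gap(lambda x, y: shuffled(x, y, -2, 0),
                      lambda x, y: shuffled(x, y, 2, 0))
print(plain, mixed)
```

Under these assumptions the shuffled pair is closer than the original pair, which matches the direction of the effect stated in the caption.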
