Lap2: Revisiting Laplace DP-SGD for High Dimensions via Majorization Theory

Meisam Mohammady, Qin Yang, Nicholas Stout, Ayesha Samreen, Han Wang, Christopher J Quinn, Yuan Hong

TL;DR

Lap2 enables L2 clipping for Laplace DP-SGD while preserving strong privacy guarantees. It yields a multivariate privacy accountant that scales gracefully with model dimension and supports the use of thousands of moments.

Abstract

Differentially Private Stochastic Gradient Descent (DP-SGD) is a cornerstone technique for ensuring privacy in deep learning, widely used in both training from scratch and fine-tuning large-scale language models. While DP-SGD predominantly relies on the Gaussian mechanism, the Laplace mechanism remains underutilized due to its reliance on L1 norm clipping. This constraint severely limits its practicality in high-dimensional models because the L1 norm of an n-dimensional gradient can be up to sqrt(n) times larger than its L2 norm. As a result, the required noise scale grows significantly with model size, leading to poor utility or untrainable models. In this work, we introduce Lap2, a new solution that enables L2 clipping for Laplace DP-SGD while preserving strong privacy guarantees. We overcome the dimensionality-driven clipping barrier by computing coordinate-wise moment bounds and applying majorization theory to construct a tight, data-independent upper bound over the full model. By exploiting the Schur-convexity of the moment accountant function, we aggregate these bounds using a carefully designed majorization set that respects the L2 clipping constraint. This yields a multivariate privacy accountant that scales gracefully with model dimension and enables the use of thousands of moments. Empirical evaluations demonstrate that our approach significantly improves the performance of Laplace DP-SGD, achieving results comparable to or better than Gaussian DP-SGD under strong privacy constraints. For instance, fine-tuning RoBERTa-base (125M parameters) on SST-2 achieves 87.88% accuracy at epsilon=0.54, outperforming Gaussian (87.16%) and standard Laplace (48.97%) under the same budget.
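
The sqrt(n) gap between the L1 and L2 norms mentioned above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`l1_norm`, `l2_norm`, `clip`) and the all-ones test vector are illustrative assumptions, not taken from the paper.

```python
import math

def l1_norm(g):
    return sum(abs(x) for x in g)

def l2_norm(g):
    return math.sqrt(sum(x * x for x in g))

def clip(g, c, norm_fn):
    """Rescale g so that norm_fn(g) <= c; leave g unchanged if it is already within the bound."""
    norm = norm_fn(g)
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

n = 10_000
g = [1.0] * n                      # dense vector: the worst case for the L1/L2 ratio
ratio = l1_norm(g) / l2_norm(g)    # equals sqrt(n) for this vector

C = 1.0                            # hypothetical clipping threshold
g_l2 = clip(g, C, l2_norm)         # L2 norm becomes C, but the L1 norm is still C * sqrt(n)
g_l1 = clip(g, C, l1_norm)         # L1 norm becomes C, so the L2 norm shrinks to C / sqrt(n)

print(ratio, l1_norm(g_l2), l2_norm(g_l1))
```

For a dense gradient, clipping in L1 to the same threshold leaves a vector whose L2 norm is smaller by a factor of sqrt(n), which is one way to see why L1 clipping hurts the signal-to-noise ratio as the model grows.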

Paper Structure

This paper contains 23 sections, 7 theorems, 70 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 2.4

Let $q : \mathcal{D} \to \mathbb{R}$ be a query with (global) $\ell_1$-sensitivity $\Delta_1 q = \max_{d,d' : \mathrm{Adj}(d,d')} \bigl\| q(d) - q(d') \bigr\|_{1}$. If the Laplace mechanism $M_q(d, b)$ uses a scale parameter $b \ge \tfrac{\Delta_1 q}{\epsilon}$, then $M_q(d, b)$ is $\epsilon$-differentially private.
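
As a rough illustration of the scalar Laplace mechanism in Theorem 2.4, the sketch below adds Laplace noise with scale b = Δ₁q / ε to a query value. It is a minimal standard-library example, not the paper's implementation; the function names, the seed, and the example query value are assumptions made for illustration. Noise is drawn as the difference of two i.i.d. exponential variables, which follows a Laplace(0, b) distribution.

```python
import random

def laplace_noise(b, rng):
    """Draw one Laplace(0, b) sample as the difference of two Exponential(rate=1/b) draws."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def laplace_mechanism(query_value, l1_sensitivity, epsilon, rng):
    """Return (noisy value, scale b), using the smallest scale allowed by the theorem."""
    if epsilon <= 0 or l1_sensitivity <= 0:
        raise ValueError("epsilon and l1_sensitivity must be positive")
    b = l1_sensitivity / epsilon   # b >= sensitivity / epsilon holds with equality
    return query_value + laplace_noise(b, rng), b

rng = random.Random(0)             # fixed seed only to make the sketch reproducible
noisy, b = laplace_mechanism(10.0, l1_sensitivity=1.0, epsilon=0.5, rng=rng)

# Sanity check: the mean absolute deviation of Laplace(0, b) noise is b.
samples = [abs(laplace_noise(b, rng)) for _ in range(20_000)]
mean_abs = sum(samples) / len(samples)
print(b, mean_abs)
```

Halving ε doubles the scale b, so stronger privacy is paid for directly in noise magnitude. A production implementation would also need a source of randomness suited to privacy applications, which `random.Random` is not.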

Figures (6)

  • Figure 1: Comparison of $\ell_1$ and $\ell_2$ norm clipped spaces.
  • Figure 2: Illustration of two-sided privacy walls (left and right walls) during CNN training on the MNIST dataset under DP-SGD. The left wall corresponds to the high-privacy regime, where the noise scale saturates and $\delta(\epsilon)$ approaches 1, indicating limited further privacy gain. The right wall corresponds to the low-privacy regime, where the effective signal-to-noise ratio no longer improves with larger $\epsilon$. Together, these walls define the practical operating range (privacy corridor) in which training remains both private and useful.
  • Figure 3: Accuracy vs. privacy budget $\epsilon$ for different clipping norms $C$ using Lap2 (CNN on MNIST, $\delta=10^{-5}$). Increasing $C$ improves utility under moderate $\epsilon$ due to stronger signal retention, but high $C$ becomes suboptimal for very tight $\epsilon$ as noise grows superlinearly.
  • Figure 4: Accuracy of a CNN trained on MNIST under DP-SGD with Lap2 for $20$ epochs. Configurations $(q, b, c)$ were selected via grid search for each target privacy budget $\epsilon \in \{0.5, 1, 2\}$ and $\delta=10^{-5}$, with color indicating the resulting accuracy. The star marks the configuration achieving maximum accuracy for each $\epsilon$.
  • Figure 5: Gaussian and Lap2 mechanisms on the DistilGPT-2 model and E2E dataset for the generation task (batch size $B=80$, clipping value $C=2$, and $\epsilon=1$).
  • ...and 1 more figure

Theorems & Definitions (18)

  • Definition 2.1: Differential Privacy [Dwork06]
  • Definition 2.2
  • Definition 2.3: Laplace Mechanism [dwork2006our]
  • Theorem 2.4: Laplace Mechanism [Dwork10]
  • Definition 2.5: Approximate Differential Privacy [dwork2006calibrating]
  • Definition 2.6: Gaussian Mechanism [dwork2006calibrating]
  • Theorem 3.1: Privacy Loss of a Laplace Mechanism
  • proof
  • Theorem 3.2: Subsampled Uni-variate Laplace Mechanisms
  • Definition 3.3: Weak Majorization
  • ...and 8 more
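
The list above names weak majorization (Definition 3.3) without stating it. As a hedged illustration, the sketch below checks the standard textbook notion of weak (sub)majorization, which may differ in detail from the paper's definition: x is weakly majorized by y if, after sorting both in decreasing order, every partial sum of x is at most the matching partial sum of y. The sum of squares is used as a stand-in Schur-convex function; it is not the paper's moment accountant function, and all names here are illustrative.

```python
def weakly_majorized_by(x, y, tol=1e-12):
    """Return True if x is weakly (sub)majorized by y under the standard definition."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    xs = sorted(x, reverse=True)
    ys = sorted(y, reverse=True)
    sum_x = sum_y = 0.0
    for a, b in zip(xs, ys):
        sum_x += a
        sum_y += b
        if sum_x > sum_y + tol:    # a partial sum of x exceeds that of y
            return False
    return True

def sum_of_squares(v):
    """A simple Schur-convex function, used here only as a stand-in."""
    return sum(t * t for t in v)

x = [0.5, 0.5, 0.5, 0.5]           # evenly spread vector
y = [1.0, 1.0, 0.0, 0.0]           # more concentrated vector with the same total

print(weakly_majorized_by(x, y), weakly_majorized_by(y, x))
print(sum_of_squares(x), sum_of_squares(y))
```

Because a Schur-convex function takes its larger values on the more concentrated vector, an upper bound evaluated at a vector that majorizes every admissible input holds for all of them. This is the general pattern behind bounding a quantity over a constraint set by its value at a single extremal point.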