Scaling up the Banded Matrix Factorization Mechanism for Differentially Private ML

Ryan McKenna

TL;DR

DP-BandMF scales correlated-noise differential privacy for large-scale ML by introducing (i) efficient strategy optimization that avoids dense $n\times n$ matrices, (ii) banded Toeplitz strategy families for further efficiency, and (iii) distributed noise generation across thousands of machines. The approach reduces the dominant computational burdens from $O(n^3)$ time and $O(n^2)$ memory to near-linear scaling in $n$ for structured classes, enabling training with DP over extremely large iteration counts and parameter counts with negligible utility loss. Empirical results show amplified DP-BandMF outperforming DP-SGD and other scalable MF approaches across a range of settings, with the optimal number of bands $b_*$ roughly following $b_* \approx \epsilon \sqrt{n}/k$ and banded Toeplitz variants offering near-optimal performance in very large regimes. The work provides practical, scalable DP matrix-factorization tooling for large-scale private ML, with implications for federated and centralized training where privacy and efficiency must co-exist.
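The band-count rule of thumb and the banded Toeplitz structure mentioned above can be illustrated with a small sketch. It is not the paper's implementation. It assumes that $\|\mathbf{C}\|_{1,2}$ denotes the maximum column $L_2$ norm, that $k$ is the same quantity as in $b_* \approx \epsilon\sqrt{n}/k$, and it uses dense lists purely for readability (the point of the paper is to avoid dense $n \times n$ storage). All function names are hypothetical.

```python
import math

def heuristic_num_bands(epsilon, n, k):
    """Rule of thumb from the text: b* ~ epsilon * sqrt(n) / k, clipped to [1, n]."""
    b = epsilon * math.sqrt(n) / k
    return max(1, min(n, round(b)))

def banded_toeplitz(coeffs, n):
    """Dense n x n lower-triangular Toeplitz matrix with b = len(coeffs) bands.

    Entry (i, j) equals coeffs[i - j] when 0 <= i - j < b, and 0 otherwise.
    Dense storage is for illustration only.
    """
    b = len(coeffs)
    return [[coeffs[i - j] if 0 <= i - j < b else 0.0 for j in range(n)]
            for i in range(n)]

def column_norms(C):
    """L2 norm of every column of a square matrix stored as a list of rows."""
    n = len(C)
    return [math.sqrt(sum(C[i][j] ** 2 for i in range(n))) for j in range(n)]

def column_normalize(C):
    """Rescale each column to unit L2 norm (assumes a nonzero diagonal)."""
    norms = column_norms(C)
    n = len(C)
    return [[C[i][j] / norms[j] for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    # Illustrative inputs only: epsilon = 1, n = 4096, k = 16.
    print(heuristic_num_bands(1.0, 4096, 16))
    C = column_normalize(banded_toeplitz([1.0, 0.5, 0.25], 6))
    print(max(column_norms(C)))
```

In a banded Toeplitz matrix the last $b-1$ columns are truncated and therefore have smaller norms than the rest, so after column normalization every column has unit norm and, under the assumption above, the constraint $\|\mathbf{C}\|_{1,2} \leq 1$ holds with equality. The normalized matrix is still banded but no longer exactly Toeplitz.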

Abstract

Correlated noise mechanisms such as DP Matrix Factorization (DP-MF) have proven to be effective alternatives to DP-SGD in large-epsilon few-epoch training regimes. Significant work has been done to find the best correlated noise strategies, and the current state-of-the-art approach is DP-BandMF, which optimally balances the benefits of privacy amplification and noise correlation. Despite its utility advantages, severe scalability limitations prevent this mechanism from handling large-scale training scenarios where the number of training iterations may exceed $10^4$ and the number of model parameters may exceed $10^7$. In this work, we present techniques to scale up DP-BandMF along these two dimensions, significantly extending its reach and enabling it to handle settings with virtually any number of model parameters and training iterations, with negligible utility degradation.
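To make the two scaling dimensions concrete, the following is a minimal sketch of how correlated noise can be produced one iteration at a time. It assumes the common formulation in which the noise added at each step is a row of $\mathbf{C}^{-1}\mathbf{W}$, where $\mathbf{W}$ is i.i.d. Gaussian and $\mathbf{C}$ is a $b$-banded lower-triangular Toeplitz matrix, and it solves the system by forward substitution. The function names are hypothetical and this is not the paper's distributed implementation. The sketch shows that only the previous $b-1$ noise vectors need to be kept, so memory grows with $b$ times the number of parameters rather than with $n$.

```python
import random
from collections import deque

def white_noise_rows(n, dim, sigma, seed=0):
    """Yield n rows of i.i.d. Gaussian noise, each of length dim."""
    rng = random.Random(seed)
    for _ in range(n):
        yield [rng.gauss(0.0, sigma) for _ in range(dim)]

def correlated_noise_stream(coeffs, w_rows):
    """Yield the rows z_t of Z = C^{-1} W for a banded Toeplitz C.

    coeffs holds the band values c_0, ..., c_{b-1} (c_0 must be nonzero),
    so that C[t][t - j] = coeffs[j]. Forward substitution gives
    z_t = (w_t - sum_{j=1}^{b-1} c_j * z_{t-j}) / c_0.
    """
    b = len(coeffs)
    history = deque(maxlen=b - 1)  # previous rows of Z, most recent first
    for w in w_rows:
        z = list(w)
        for j, z_prev in enumerate(history, start=1):
            for d in range(len(z)):
                z[d] -= coeffs[j] * z_prev[d]
        z = [v / coeffs[0] for v in z]
        history.appendleft(z)  # the oldest row is dropped automatically
        yield z

if __name__ == "__main__":
    # Illustrative sizes only: 5 iterations, 3 parameters, 2 bands.
    for z in correlated_noise_stream([1.0, 0.5], white_noise_rows(5, 3, 1.0)):
        print(z)
```

Because each coordinate of the noise vector is processed independently of the others, the inner loop over `d` is the part that could be sharded across machines when the parameter count is large.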

Paper Structure (38 sections, 6 theorems, 10 equations, 14 figures, 2 tables, 5 algorithms)

Introduction
Background
Training Dynamics
Strategy Optimization
Scalable Strategy Optimization and Noise Generation
Efficient strategy optimization
Optimizing Banded Toeplitz Strategies
Column Normalization
Distributed Noise Generation
Empirical Results
Comparison to Prior and Concurrent Work
Optimal Number of Bands
RMSE vs. Learning Performance
Related Work
More DP-MF Variants
...and 23 more sections

Key Result

Proposition 2.1

Let $\sigma_{SGD}(\epsilon, \delta, k, n)$ denote the noise multiplier required for DP-SGD to achieve $(\epsilon, \delta)$-DP when run for $n$ iterations with sampling probability $k/n$. Given a $b$-banded strategy matrix $\mathbf{C}$ satisfying $\| \mathbf{C} \|_{1,2} \leq 1$, the DP-BandMF algorithm (alg:bandmf) satisfies

Figures (14)

Figure 1: (a-b) Ratio of RMSE of each strategy to the best strategy that scales to each setting. (a) Compares our scalable banded and banded Toeplitz strategies with other banded strategies. (b) Compares our scalable DP-BandMF with non-banded strategies as a function of $\epsilon$.
Figure 2: (a) RMSE Suboptimality Ratio (relative to full-batch DP-SGD) of DP-BandMF as a function of $b$ for various epochs, with fixed $(\epsilon, \delta) = (1, 10^{-8})$ and $n=16384$. (b) Optimal number of bands (within a factor of 2) as a function of the privacy budget and the number of epochs, fixing $n=4096$ and $\delta=10^{-8}$.
Figure 3: (a) Wall Clock Time for correlated noise generation and per-example gradient clipping for a 100M parameter BertBase model when run on $32$ TPU v3 cores. (b-c) RMSE vs. Learning Performance (evaluation cross entropy) with an adaptive optimizer (b) and a non-adaptive optimizer (c). In both (b) and (c), a 4M parameter BertTiny model is trained on the StackOverflow dataset for various noise multipliers.
Figure 4: Wall-clock time required to evaluate the total squared error objective function and its gradient as a function of $n$ for $b=16$. Does not include JIT-compile time, which is amortized over strategy optimization.
Figure 5: Streaming Linear Operator for Prefix
...and 9 more figures

Theorems & Definitions (14)

Proposition 2.1: Noise Calibration [choquette2024amplified]
Remark 2.1: Memory Overhead
Proposition 2.2: Expected Error [choquette2024amplified]
Proposition 3.1: Banded Toeplitz Expected Total Squared Error
proof
Example 3.1
Example 3.2
Example 3.3
Example B.1: DP-BandMF
Definition D.1: Streaming Linear Operator (SLO)
...and 4 more
