Gradient Descent with Linearly Correlated Noise: Theory and Applications to Differential Privacy

Anastasia Koloskova; Ryan McKenna; Zachary Charles; Keith Rush; Brendan McMahan

TL;DR

This paper analyzes gradient descent when the noise injected across iterations is linearly correlated, a scenario motivated by the DP-FTRL and MF-DP-FTRL approaches to differential privacy. It develops a restart-based analytic framework that yields tighter convergence rates for both PGD and Anti-PGD under $L$-smoothness, and shows that convergence depends in a nuanced way on the row differences of the factorization component $\mathbf{B}$ within a window of size $\tau$. Building on these insights, the authors propose a modified offline factorization objective, DP-MF$^+$, which minimizes a $\Lambda_{\tau}$-weighted noise proxy to better capture optimization performance. They demonstrate theoretical improvements and validate them with synthetic and real-data experiments, including MNIST, CIFAR-10, and Stack Overflow tasks. The results show how linearly correlated noise can be harnessed to improve privacy-utility trade-offs and guide the design of matrix-factorization-based DP mechanisms. Open questions remain, including clipping, momentum, and last-iterate convergence for broader noise structures.

Abstract

We study gradient descent under linearly correlated noise. Our work is motivated by recent practical methods for optimization with differential privacy (DP), such as DP-FTRL, which achieve strong performance in settings where privacy amplification techniques are infeasible (such as in federated learning). These methods inject privacy noise through a matrix factorization mechanism, making the noise linearly correlated over iterations. We propose a simplified setting that distills key facets of these methods and isolates the impact of linearly correlated noise. We analyze the behavior of gradient descent in this setting, for both convex and non-convex functions. Our analysis is demonstrably tighter than prior work and recovers multiple important special cases exactly (including anticorrelated perturbed gradient descent). We use our results to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.
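To make "linearly correlated noise" concrete, here is a minimal NumPy sketch. It assumes the update has the form $x_{t+1} = x_t - \gamma\,(\nabla f(x_t) + (b_{t+1} - b_t)\,\Xi)$, where $b_t$ is the $t$-th row of $\mathbf{B}$ (with $b_0 = 0$) and $\Xi$ is a matrix of i.i.d. Gaussian rows. That form is an assumption chosen to be consistent with the summary's description of row differences of $\mathbf{B}$; it is not quoted from the paper. The function names, the quadratic test objective, and all parameter values are hypothetical.

```python
import numpy as np

def row_differences(B):
    """Return the matrix whose t-th row is b_{t+1} - b_t, with b_0 = 0."""
    B_pad = np.vstack([np.zeros((1, B.shape[1])), B])  # prepend the zero row b_0
    return np.diff(B_pad, axis=0)

def correlated_noise_gd(grad, x0, B, gamma, sigma, rng):
    """Gradient descent where the noise at step t is (b_{t+1} - b_t) @ Xi.

    Assumed update (illustrative, not taken verbatim from the paper):
        x_{t+1} = x_t - gamma * (grad(x_t) + (b_{t+1} - b_t) @ Xi)
    """
    T, d = B.shape[0], x0.shape[0]
    Xi = sigma * rng.standard_normal((B.shape[1], d))  # i.i.d. Gaussian seed noise
    diffs = row_differences(B)
    x = x0.copy()
    iterates = [x.copy()]
    for t in range(T):
        x = x - gamma * (grad(x) + diffs[t] @ Xi)
        iterates.append(x.copy())
    return np.array(iterates)

T, d = 50, 5
S = np.tril(np.ones((T, T)))  # prefix-sum matrix: row differences are basis vectors
identity = np.eye(T)          # row differences are e_t - e_{t-1}

grad = lambda x: x            # gradient of the toy objective f(x) = 0.5 * ||x||^2
x0 = np.ones(d)

for name, B in [("B = S (independent noise)", S), ("B = I (anticorrelated noise)", identity)]:
    out = correlated_noise_gd(grad, x0, B, gamma=0.1, sigma=0.5,
                              rng=np.random.default_rng(0))
    print(name, "final distance to optimum:", np.linalg.norm(out[-1]))
```

With $\mathbf{B} = \mathbf{S}$ each step receives a fresh, independent noise vector, while with $\mathbf{B} = \mathbf{I}$ consecutive noise terms are differences $\xi_{t} - \xi_{t-1}$ that telescope when summed. Under the assumed update, these two choices behave like perturbed and anticorrelated perturbed gradient descent, respectively.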

Paper Structure (51 sections, 7 theorems, 82 equations, 6 figures, 2 tables)


Introduction
Related Work
Matrix mechanisms for differential privacy.
SGD with correlated noise.
SGD with biased noise.
Background
Matrix Factorization and Privacy Mechanisms
Finding good factorizations.
Finding improved factorizations.
Problem Formulation
Deriving Tighter Convergence Rates
Convergence Rates for PGD and Anti-PGD
PGD.
Anti-PGD.
Tightness.
...and 36 more sections

Key Result

Proposition 4.4

Under the noise, smoothness, and convexity assumptions (as:noise, as:smooth, and as:convex), if $\mathbf{B} = \mathbf{S}$ and $\gamma < \frac{1}{2L}$, then the output of the update rule eq:opt-setup-matrix satisfies

Figures (6)

Figure 1: Two-stage MF-DP-FTRL workflow proposed by denisov2022:matrix-fact. The user selects a workload matrix $\mathbf{A}$ representing a desired first-order optimization method. Offline, the user finds a factorization $\mathbf{B}\mathbf{C} = \mathbf{A}$, using an objective that balances ERM performance (as a function of $\mathbf{B}$) and privacy (as a function of $\mathbf{C}$). The user applies $\mathbf{A}$ to a downstream ERM task, but with linearly correlated additive noise governed by $\mathbf{B}$.
Figure 2: Comparison of the average and last gradient norms for DP-MF and DP-MF$^+$ on a random non-strongly convex quadratic function with $L = 10$.
Figure 3: Test set accuracy of various mechanisms on the MNIST and CIFAR-10 datasets.
Figure 4: Comparison of PGD and Chess-PGD with a fixed stepsize, $\gamma = 0.02$. The y-axis uses a log scale in the left panel and a linear scale in the right panel.
Figure 5: Elements of $\Lambda_{\tau}$ for $T = 12$ and $\tau = 3$.
...and 1 more figures
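
The offline factorization step described in the Figure 1 caption can be illustrated with a small NumPy sketch. It takes the prefix-sum matrix as the workload $\mathbf{A}$ and compares three factorizations $\mathbf{B}\mathbf{C} = \mathbf{A}$ using a simple proxy: the Frobenius norm of $\mathbf{B}$ times the largest column norm of $\mathbf{C}$. The choice of workload, the proxy, the helper names, and the square-root factorization are all assumptions made for illustration. This is neither the objective used by DP-MF nor the $\Lambda_{\tau}$-weighted objective of DP-MF$^+$.

```python
import numpy as np

def sensitivity(C):
    """Largest column L2 norm of C (illustrative privacy proxy, single participation)."""
    return np.linalg.norm(C, axis=0).max()

def noise_proxy(B, C):
    """Frobenius norm of B times the sensitivity of C (illustrative error proxy)."""
    return np.linalg.norm(B, "fro") * sensitivity(C)

T = 16
A = np.tril(np.ones((T, T)))  # assumed workload: prefix sums of gradients
I_T = np.eye(T)

# Square root of A as a lower-triangular Toeplitz matrix whose entries are the
# power-series coefficients of (1 - x)^(-1/2): c_0 = 1, c_k = c_{k-1} * (2k - 1) / (2k).
coeffs = np.ones(T)
for k in range(1, T):
    coeffs[k] = coeffs[k - 1] * (2 * k - 1) / (2 * k)
R = sum(coeffs[k] * np.eye(T, k=-k) for k in range(T))

factorizations = {
    "B = A, C = I": (A, I_T),
    "B = I, C = A": (I_T, A),
    "B = C = sqrt(A)": (R, R),
}
for name, (B, C) in factorizations.items():
    assert np.allclose(B @ C, A)  # every candidate must reproduce the workload
    print(f"{name}: proxy = {noise_proxy(B, C):.3f}")
```

The two trivial factorizations place all of the structure in either $\mathbf{B}$ or $\mathbf{C}$, while the square-root factorization splits it between them, which is the kind of balance between error and privacy that the caption describes.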

Theorems & Definitions (12)

Example 2.1: SGD
Example 3.1: PGD
Example 3.2: Anti-PGD
Example 3.3: Tree Aggregation DP-FTRL
Proposition 4.4: Adapted from Dekel12:sgd_convex_proof
Proposition 4.5
Theorem 4.6: non-convex
Theorem 4.7: convex
Example A.1: Chess-PGD
Lemma C.1
...and 2 more
