Learning Rate Scheduling with Matrix Factorization for Private Training

Nikita P. Kalinin, Joel Daniel Andersson

TL;DR

This work addresses private training under differential privacy when using learning rate schedules. It develops general upper and lower bounds for MaxSE and MeanSE for a broad class of learning rate schedules and introduces a learning-rate-aware Toeplitz factorization that is memory-efficient. Theoretical results show optimal or improved error rates for exponential decays, with multi-epoch extensions via banded inverses, and experiments on CIFAR-10 and IMDB validate accuracy gains over baseline prefix-sum factorizations. The findings advance private training by combining practical learning rate schedules with correlated noise through tailored factorizations, enabling higher utility under strict privacy constraints.

Abstract

We study differentially private model training with stochastic gradient descent under learning rate scheduling and correlated noise. Although correlated noise, in particular via matrix factorizations, has been shown to improve accuracy, prior theoretical work focused primarily on the prefix-sum workload. That workload assumes a constant learning rate, whereas in practice learning rate schedules are widely used to accelerate training and improve convergence. We close this gap by deriving general upper and lower bounds for a broad class of learning rate schedules in both single- and multi-epoch settings. Building on these results, we propose a learning-rate-aware factorization that achieves improvements over prefix-sum factorizations under both MaxSE and MeanSE error metrics. Our theoretical analysis yields memory-efficient constructions suitable for practical deployment, and experiments on CIFAR-10 and IMDB datasets confirm that schedule-aware factorizations improve accuracy in private training.

Paper Structure

This paper contains 27 sections, 44 theorems, 200 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $(\chi_t)_{t=1}^n$ be a sequence on $[\beta, 1]$ for some constant $\beta > 0$. For $n \geq 2$ we define [...] If either of the following two conditions holds ($c > 0$ an absolute constant): [...] then the factorization $B_{\chi} \times A_1^{1/2}$, where $B_{\chi} := A_{\chi}(A_1)^{-1/2}$, satisfies [...]
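A small NumPy sketch may help make the factorization $B_{\chi} \times A_1^{1/2}$ concrete. It relies on definitions that are not stated in this excerpt and are therefore assumptions: $A_1$ is taken to be the $n \times n$ lower-triangular all-ones (prefix-sum) matrix, $A_{\chi} = A_1 \,\mathrm{diag}(\chi_1, \dots, \chi_n)$, the schedule is the exponential decay $\chi_t = \beta^{(t-1)/(n-1)}$, and MaxSE and MeanSE use the common single-participation forms (largest row norm, respectively root-mean-square row norm, of the left factor, times the largest column norm of the right factor). All function names are hypothetical.

```python
import numpy as np


def prefix_sum_matrix(n):
    """A_1 (assumed): n x n lower-triangular matrix of ones."""
    return np.tril(np.ones((n, n)))


def sqrt_prefix_sum_matrix(n):
    """A_1^{1/2}: lower-triangular Toeplitz matrix whose entries are the
    power-series coefficients of (1 - x)^{-1/2}."""
    r = np.ones(n)
    for k in range(1, n):
        r[k] = r[k - 1] * (2 * k - 1) / (2 * k)
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, r[np.abs(diff)], 0.0)


def exponential_schedule(n, beta):
    """Assumed schedule chi_t = beta^((t-1)/(n-1)), decaying from 1 to beta."""
    return beta ** (np.arange(n) / (n - 1))


def lr_workload(chi):
    """A_chi (assumed): A_1 @ diag(chi), i.e. column s scaled by chi_s."""
    return prefix_sum_matrix(len(chi)) * chi[None, :]


def theorem1_factorization(chi):
    """Return (B_chi, C) with C = A_1^{1/2} and B_chi = A_chi @ C^{-1}."""
    C = sqrt_prefix_sum_matrix(len(chi))
    A_chi = lr_workload(chi)
    # Solve C^T X = A_chi^T for X = B_chi^T instead of forming C^{-1}.
    B = np.linalg.solve(C.T, A_chi.T).T
    return B, C


def errors(B, C):
    """MaxSE and MeanSE under the assumed single-participation definitions."""
    n = B.shape[0]
    sensitivity = np.linalg.norm(C, axis=0).max()  # largest column 2-norm
    max_se = np.linalg.norm(B, axis=1).max() * sensitivity
    mean_se = np.linalg.norm(B) / np.sqrt(n) * sensitivity
    return max_se, mean_se


chi = exponential_schedule(64, beta=0.25)
B, C = theorem1_factorization(chi)
max_se, mean_se = errors(B, C)
print(f"MaxSE = {max_se:.3f}, MeanSE = {mean_se:.3f}")
```

With a constant schedule $\chi_t = 1$ the left factor reduces to $A_1^{1/2}$, so the sketch recovers the square-root factorization of the prefix-sum workload.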

Figures (6)

  • Figure 1: Comparison of MaxSE and MeanSE errors under an exponentially decaying learning rate, for the proposed factorizations (see Table \ref{tab:exponential_decay}), with fixed matrix size $n = 2048$ and varying decay $\beta$. We refer to the approximately optimal value of MeanSE computed by dense factorization \cite{denisov2022improved} as "dense." For MaxSE, we report a lower bound since no scalable and accurate solution for its optimal value is available. The bottom row compares our learning-rate-aware factorization with the prefix-sum-based one, validating the theoretical improvements in both MeanSE and MaxSE.
  • Figure 2: Multi-participation MeanSE error with matrix size $n = 2048$. Lines are computed for bandwidth $p = 64$. For the exponential workload, we observe that with a larger participation number it becomes beneficial to optimize the factorization with respect to the learning rate decay workload. However, for the considered values of $n$ and $\beta$, we do not observe any benefit from incorporating learning rate scheduling for BISR.
  • Figure 3: CIFAR-10 results under $(9,10^{-5})$-differential privacy. (a) Validation accuracy with exponential learning rate scheduling for different learning rates in DP-SGD. We report the points corresponding to the lowest learning rate; for example, a learning rate of $1/2$ for $\beta = 1/4$ indicates that training starts with a learning rate of $2$ and decays to $1/2$. (b) Test accuracy across different matrix factorizations with exponential learning rate scheduling. Training hyperparameters are provided in Table \ref{tab:mf-hparams}.
  • Figure 4: Test accuracy of different learning rate schedulers for (a) BERT-base on IMDB and (b) CNN on CIFAR-10 under differential privacy with $\varepsilon = 4$ and $\varepsilon = 9$, respectively. Training hyperparameters are listed in Table \ref{tab:mf-schedulers}.
  • Figure 5: Comparison of different LR schedulers ($n=2048$) in single participation.
  • ...and 1 more figure

Theorems & Definitions (77)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Corollary 1
  • Corollary 2
  • Lemma 2
  • Theorem 3
  • Corollary 3
  • Theorem 4: Lower bound for multi-participation
  • Corollary 4
  • ...and 67 more