
Federated Dynamical Low-Rank Training with Global Loss Convergence Guarantees

Steffen Schotthöfer, M. Paul Laiu

TL;DR

This work builds upon dynamical low-rank splitting schemes for manifold-constrained optimization to create a global low-rank basis of network weights, which enables client training on a small coefficient matrix and incorporates a variance correction scheme.

Abstract

In this work, we propose a federated dynamical low-rank training (FeDLRT) scheme to reduce client compute and communication costs - two significant performance bottlenecks in horizontal federated learning. Our method builds upon dynamical low-rank splitting schemes for manifold-constrained optimization to create a global low-rank basis of network weights, which enables client training on a small coefficient matrix. A consistent global low-rank basis allows us to incorporate a variance correction scheme and prove global loss descent and convergence to a stationary point. Dynamic augmentation and truncation of the low-rank bases automatically optimizes computing and communication resource utilization. We demonstrate the efficiency of FeDLRT in an array of computer vision benchmarks and show a reduction of client compute and communication costs by up to an order of magnitude with minimal impacts on global accuracy.
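To make the workflow concrete, below is a minimal numpy sketch of a single FeDLRT communication round without variance correction, following the four communication steps illustrated in Figure 2. It assumes a single weight matrix $W = USV^\top$, quadratic client losses of the form $\tfrac{1}{2}\|A_c W - B_c\|_F^2$, plain gradient steps for the local coefficient updates, and QR-based orthonormalization; the function names and update rules are illustrative simplifications under those assumptions, not the authors' implementation.

```python
import numpy as np

def orthonormalize(M):
    """Column-orthonormal factor of M via reduced QR."""
    Q, _ = np.linalg.qr(M)
    return Q

def client_gradient(A, B, W):
    """Gradient of the client loss 0.5 * ||A @ W - B||_F^2 with respect to W."""
    return A.T @ (A @ W - B)

def fedlrt_round(U, S, V, client_data, lr=1e-3, local_steps=10):
    """One illustrative FeDLRT communication round without variance correction."""
    n_clients = len(client_data)
    W = U @ S @ V.T
    # Step 1: the server broadcasts the current global bases U, V to all clients.
    # Step 2: clients return basis gradients, which the server aggregates.
    G = sum(client_gradient(A, B, W) for A, B in client_data) / n_clients
    G_U, G_V = G @ V, G.T @ U                       # gradient directions for U and V
    # Step 3: the server augments and re-orthonormalizes the bases, then broadcasts them.
    U_aug = orthonormalize(np.hstack([U, G_U]))     # n x 2r
    V_aug = orthonormalize(np.hstack([V, G_V]))     # m x 2r
    S_aug = U_aug.T @ W @ V_aug                     # block-structured, cf. Lemma 1
    # Step 4: each client trains only its small 2r x 2r coefficient matrix locally...
    S_locals = []
    for A, B in client_data:
        S_c = S_aug.copy()
        for _ in range(local_steps):
            grad_W = client_gradient(A, B, U_aug @ S_c @ V_aug.T)
            S_c = S_c - lr * (U_aug.T @ grad_W @ V_aug)   # projected coefficient gradient
        S_locals.append(S_c)
    # ...and the server aggregates the client coefficients by averaging.
    S_new = sum(S_locals) / n_clients
    return U_aug, S_new, V_aug   # rank truncation back to r' <= 2r follows
```

The reason the basis gradients are aggregated before augmentation is that every client then trains against the same global augmented basis; as the abstract notes, this consistent global basis is what allows the variance correction terms and the global loss descent and convergence guarantees.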

Paper Structure

This paper contains 28 sections, 20 theorems, 73 equations, 8 figures, 2 tables, and 6 algorithms.

Key Result

Lemma 1

${\widetilde{S}}={\widetilde{U}}^\top U^t S^t V^{t,\top} {\widetilde{V}}$ takes the block form ${\widetilde{S}} = \begin{bmatrix} S^t & 0 \\ 0 & 0 \end{bmatrix}$: since the augmented bases $\widetilde{U}$ and $\widetilde{V}$ contain $U^t$ and $V^t$ as their leading columns, $\widetilde{U}^\top U^t = \begin{bmatrix} I \\ 0 \end{bmatrix}$ and $V^{t,\top}\widetilde{V} = \begin{bmatrix} I & 0 \end{bmatrix}$, which zeroes out all but the top-left block.
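This block structure is easy to check numerically. The snippet below is an illustrative verification (not from the paper's code): it builds augmented orthonormal bases that keep $U^t$ and $V^t$ as their leading columns and confirms that the projected coefficient matrix is block diagonal with $S^t$ in the top-left block and zeros elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 6, 2

# Current low-rank factors of the weight, W^t = U^t S^t V^{t,T}.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
S = rng.standard_normal((r, r))

def augment(basis, directions):
    """Append orthonormal directions orthogonal to `basis`, keeping `basis` as leading columns."""
    directions = directions - basis @ (basis.T @ directions)  # project out the current basis
    extra, _ = np.linalg.qr(directions)
    return np.hstack([basis, extra])

U_aug = augment(U, rng.standard_normal((n, r)))   # n x 2r, leading columns equal U^t
V_aug = augment(V, rng.standard_normal((m, r)))   # m x 2r, leading columns equal V^t

S_tilde = U_aug.T @ U @ S @ V.T @ V_aug           # \widetilde{S} from Lemma 1

expected = np.zeros((2 * r, 2 * r))
expected[:r, :r] = S                              # [[S^t, 0], [0, 0]]
print(np.allclose(S_tilde, expected))             # prints True
```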

Figures (8)

  • Figure 1: Federated, heterogeneous least squares regression problem (see the linear least squares regression subsection) for $C=4$ clients, $s_*=100$ iterations, learning rate $\lambda=1\rm{e}-3$ and $C$ rank-$1$ local target functions. FL methods without variance correction plateau quickly, whereas FedLin and FeDLRT with variance correction converge to $1\rm{e}-5$. FeDLRT converges faster than FedLin and has lower communication costs.
  • Figure 2: Communication of FeDLRT without variance correction. 1) Broadcast current global basis $U,V$ (blue). 2) Aggregate basis gradients $G_{c,U}, G_{c,V}$ (orange). 3) Broadcast global augmented basis $\Bar{U},\Bar{V}$ (green). 4) Aggregate individual client coefficient update $\widetilde{S}_c^{s_*}$(purple).
  • Figure 3: Scaling of communication cost (top), compute cost at a single client (middle), and client memory footprint (bottom) for $s_*=1$ client iteration and a single data point for $W\in\mathbb{R}^{n\times n}$ with $n=512$. The costs drop by orders of magnitude after the amortization point of $r\approx 200$, which is $40\%$ of full rank. The numerical evaluations in the experiments section show that, in practice, the matrix ranks are typically below the amortization threshold.
  • Figure 4: Comparison between FeDLRT with simplified variance correction and FedLin in the homogeneous linear least squares regression test. Each line represents the median result over $20$ random initializations with $C$ clients. The plots from left to right show the rank evolution, the distance to the global optimizer, the global loss values for FeDLRT, and the global loss values for FedLin. The results show that FeDLRT converges faster in this low-rank test case by identifying (and never underestimating) the target rank $r=4$ early in the training; a sketch of the rank truncation step behind this rank evolution follows this list.
  • Figure 5: Comparisons for training ResNet18 on the CIFAR10 benchmark. The top row compares FeDLRT without variance correction to FedAvg; the middle and bottom rows compare FeDLRT with full and simplified variance correction to FedLin, respectively. In each row, the left two panels show the model compression ratio and the communication cost reduction from FeDLRT, and the right two panels show the validation accuracy for FeDLRT and the full-rank counterparts. In each plot, the results are reported for $C=1,\dots,16$ or $32$ clients with $240/C$ local iterations. FeDLRT matches the accuracy of FedAvg and FedLin well, while substantially reducing the server and client memory and communication costs. Variance correction leads to an up to $12\%$ increase in validation accuracy for large $C$, mitigating the client drift problem. The simplified variance correction (bottom row) gives results comparable to the full version (middle row) at lower communication and computation cost.
  • ...and 3 more figures
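The rank evolution shown in Figure 4 is driven by the augmentation-and-truncation step. Below is a hedged sketch of the truncation half, following the standard recipe from rank-adaptive dynamical low-rank training: take an SVD of the aggregated $2r \times 2r$ coefficient matrix, drop the trailing singular values whose combined norm falls below a relative tolerance, and rotate the augmented bases into the kept subspace. The specific tolerance rule and the `r_min` floor are assumptions for illustration, not necessarily the paper's exact criterion.

```python
import numpy as np

def truncate(U_aug, S_aug, V_aug, tol=1e-2, r_min=2):
    """Compress the augmented factorization back to the smallest rank within tolerance."""
    P, sigma, QT = np.linalg.svd(S_aug)                  # S_aug = P @ diag(sigma) @ QT
    budget = tol * np.linalg.norm(sigma)                 # relative truncation tolerance
    r_new = len(sigma)
    # Drop trailing singular values while the norm of the discarded tail stays within budget.
    while r_new > r_min and np.linalg.norm(sigma[r_new - 1:]) <= budget:
        r_new -= 1
    U_new = U_aug @ P[:, :r_new]                         # rotate bases into the kept subspace
    V_new = V_aug @ QT[:r_new, :].T
    S_new = np.diag(sigma[:r_new])
    return U_new, S_new, V_new

# Example use with the round sketch from the abstract section above:
# U_aug, S_agg, V_aug = fedlrt_round(U, S, V, client_data)
# U, S, V = truncate(U_aug, S_agg, V_aug, tol=1e-2)
```

Because truncation always starts from the augmented $2r$-dimensional space, the rank can grow when new gradient directions matter and shrink again once they stop contributing, consistent with the target rank never being underestimated in Figure 4.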

Theorems & Definitions (31)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Corollary 1
  • Theorem 5
  • Lemma 2
  • proof
  • proof
  • ...and 21 more