Stochastic Unrolled Federated Learning

Samar Hadou, Navid NaderiAlizadeh, Alejandro Ribeiro

TL;DR

Stochastic UnRolled Federated Learning (SURF) integrates algorithm unrolling with federated learning (FL) to accelerate convergence under decentralized topologies. It feeds stochastic mini-batches to each unrolled layer and imposes descent constraints to ensure stochastic descent, and it unrolls decentralized gradient descent in a GNN-based architecture (U-DGD) that handles both decentralized FL and classical star FL, with theoretical guarantees of near-optimal convergence and exponential rates. Empirical results on image-classification tasks show faster convergence and robustness to data heterogeneity and asynchronous updates, highlighting SURF’s potential to move heavy optimization offline on servers while preserving decentralized advantages. Overall, SURF offers a principled learning-to-learn (L2L) framework that improves FL efficiency without sacrificing the decentralized/centerless nature of modern federated systems.

Abstract

Algorithm unrolling has emerged as a learning-based optimization paradigm that unfolds truncated iterative algorithms in trainable neural-network optimizers. We introduce Stochastic UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning in order to expedite its convergence. Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers to find a descent direction and the decentralized nature of federated learning. We circumvent the former challenge by feeding stochastic mini-batches to each unrolled layer and imposing descent constraints to guarantee its convergence. We address the latter challenge by unfolding the distributed gradient descent (DGD) algorithm in a graph neural network (GNN)-based unrolled architecture, which preserves the decentralized nature of training in federated learning. We theoretically prove that our proposed unrolled optimizer converges to a near-optimal region infinitely often. Through extensive numerical experiments, we also demonstrate the effectiveness of the proposed framework in collaborative training of image classifiers.
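As a point of reference for the algorithm being unrolled, the following is a minimal sketch of standard (non-learned) DGD, in which each agent averages its neighbors' variables and then takes a local gradient step. The ring topology, the quadratic local losses, the step size, and the helper names `ring_mixing_matrix` and `dgd` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring of n >= 3 agents (assumed topology)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = 1.0 / 3.0
        A[i, (i - 1) % n] = 1.0 / 3.0
        A[i, (i + 1) % n] = 1.0 / 3.0
    return A

def dgd(A, local_grad, W0, step, num_iters):
    """Standard DGD: mix with neighbors, then descend along the local gradient.

    W has one row per agent; local_grad(W) returns the stacked local gradients.
    """
    W = W0.copy()
    for _ in range(num_iters):
        W = A @ W - step * local_grad(W)
    return W

# Toy problem (assumption): agent i minimizes 0.5 * ||w - c_i||^2,
# so the network-wide optimum is the mean of the targets c_i.
rng = np.random.default_rng(0)
n, d = 6, 3
targets = rng.normal(size=(n, d))
local_grad = lambda W: W - targets

W = dgd(ring_mixing_matrix(n), local_grad, np.zeros((n, d)), step=0.1, num_iters=200)
gap = np.abs(W.mean(axis=0) - targets.mean(axis=0)).max()
print("distance of network average from optimum:", gap)
```

With a constant step size, the network average reaches the optimum of this toy problem while individual agents stay within a small neighborhood of it. Unrolling replaces the fixed mixing weights and step size with parameters learned per layer.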

Paper Structure (22 sections, 4 theorems, 41 equations, 8 figures, 1 algorithm)

This paper contains 22 sections, 4 theorems, 41 equations, 8 figures, 1 algorithm.

Key Result

Theorem 4.1

A stationary point of eq:dual is a near-optimal and near-feasible solution to eq:surf under some mild assumptions. That is, the near-optimality and near-feasibility bounds hold for each layer $l$ with probability $1-\delta$, where $\zeta(Q, \delta)$ measures the sample complexity.

Figures (8)

  • Figure 1: Unrolled network $\boldsymbol{\Phi}({\bf u}; \boldsymbol{\theta})$. Each unrolled layer resembles an update rule $\phi$ of a standard algorithm whose hyperparameters $\boldsymbol{\theta} = \{\boldsymbol{\theta}_l\}_l$ are now set free to learn.
  • Figure 2: The formulation in eq:FedLess supports both (a) decentralized federated learning, where each agent $i\in\{1, \dots, n\}$ has a local variable ${\bf w}_i$, and (b) classical federated learning, where a central server node ensures that all local variables are equal across the network.
  • Figure 3: One iteration of alg:PD. One dataset ${\bf D}$ is chosen randomly from the meta-training dataset and divided into training and testing examples. Mini-batches of the training examples are randomly selected and fed to the unrolled layers (in gray) to predict ${\bf W}_L$. Given the latter, the loss function $\widehat{\cal L}$ is computed over the testing examples, averaged over all agents, and its gradients update the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\lambda}$.
  • Figure 4: An unrolled layer $\phi({\bf W}_{l-1}, {{\bf B}_l}; {\bf h}_l, {\bf M}_l, {\bf d}_l)$ of U-DGD at agent $i$. The block on top is a graph filter, parameterized by ${\bf h}_l$, which performs $K$ communication rounds, and the block underneath is a single-layer MLP, parameterized by ${\bf M}_l$ and ${\bf d}_l$. All agents share the same learnable parameters.
  • Figure 5: Convergence rate. Comparisons between the accuracy of U-DGD and state-of-the-art FL methods for both i) decentralized FL over $3$-degree regular graphs (left) and random graphs (middle), and ii) classical FL with a star graph (right). U-DGD converges faster in all settings, surpassing both decentralized and centralized FL methods.
  • ...and 3 more figures
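The layer described in Figure 4 can be sketched roughly as follows: a polynomial graph filter with coefficients ${\bf h}_l$ mixes the agents' variables over $K$ communication rounds, and a single-layer MLP with weights ${\bf M}_l$ and bias ${\bf d}_l$, shared by all agents, processes each agent's variable together with its mini-batch. How the two blocks are combined (here a subtraction, mirroring the gradient step in DGD), the ReLU nonlinearity, the flattened mini-batch representation `B`, and all shapes are assumptions made for illustration only.

```python
import numpy as np

def graph_filter(S, W, h):
    """Polynomial graph filter sum_k h[k] * S^k @ W, using len(h) - 1 neighbor exchanges."""
    out = h[0] * W
    Z = W
    for hk in h[1:]:
        Z = S @ Z              # one communication round with direct neighbors
        out = out + hk * Z
    return out

def udgd_layer(S, W_prev, B, h, M, d):
    """Sketch of one unrolled layer: graph filter minus a shared single-layer MLP.

    S: (n, n) graph shift operator, W_prev: (n, p) agent variables,
    B: (n, q) per-agent mini-batch features, h: (K + 1,), M: (p, p + q), d: (p,).
    """
    X = np.concatenate([W_prev, B], axis=1)    # each agent sees its own [w_i ; b_i]
    mlp = np.maximum(X @ M.T + d, 0.0)         # ReLU (assumed nonlinearity)
    return graph_filter(S, W_prev, h) - mlp

# Toy usage on a 5-agent ring with K = 2 communication rounds.
rng = np.random.default_rng(0)
n, p, q, K = 5, 4, 3, 2
S = np.zeros((n, n))
for i in range(n):
    S[i, (i - 1) % n] = S[i, (i + 1) % n] = 0.5
W0 = rng.normal(size=(n, p))
B = rng.normal(size=(n, q))
h = rng.normal(size=K + 1)
M = 0.1 * rng.normal(size=(p, p + q))
d = np.zeros(p)

W1 = udgd_layer(S, W0, B, h, M, d)
print(W1.shape)
```

Because the filter only multiplies by `S`, each agent needs values from its direct neighbors alone, which is what keeps the unrolled optimizer decentralized. Stacking several such layers, each with its own `h`, `M`, and `d`, gives the full unrolled network.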

Theorems & Definitions (7)

  • Remark 3.1
  • Theorem 4.1: CLT (informal)
  • Theorem 4.2
  • Theorem 4.3
  • Remark 5.1
  • Theorem 2.6: CLT (chamon2022constrained)
  • proof