Unlocking FedNL: Self-Contained Compute-Optimized Implementation

Konstantin Burlachenko, Peter Richtárik

TL;DR

This work tackles the gap between FedNL theory and practical deployment by delivering a self-contained, compute-optimized implementation of FedNL, FedNL-LS, and FedNL-PP for single- and multi-node Federated Learning. It introduces two practical compressors, adaptive TopLEK and cache-aware RandSeqK, and demonstrates that careful software engineering, beyond algorithmic advances, yields wall-clock improvements of up to roughly x1000 over the reference prototype on convex problems such as logistic regression. The authors show that FedNL outperforms CVXPY in the single-node setting and Apache Spark and Ray/Scikit-Learn in the multi-node setting, while both new compressors remain compatible with the FedNL theory. The study also highlights design choices around domain-specific compute architectures, memory hierarchy, and TCP/IP networking to achieve realistic performance on resource-constrained platforms. Overall, the work emphasizes that translating theory into practice requires integrated attention to algorithms, compilers, memory access patterns, and system-level optimizations to enable robust, autonomous FL tooling on diverse hardware.

Abstract

Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner, eliminating the need to share their local data. The recent work (arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. However, the reference FedNL prototype exhibits three serious practical drawbacks: (i) it requires 4.8 hours to launch a single experiment on a server-grade workstation; (ii) the prototype only simulates the multi-node setting; (iii) integrating the prototype into resource-constrained applications is challenging. To bridge the gap between theory and practice, we present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings. Our work resolves the aforementioned issues and reduces the wall-clock time by a factor of 1000. With this, FedNL outperforms alternatives for training logistic regression: CVXPY (arXiv:1603.00943) in the single-node setting, and Apache Spark (arXiv:1505.06807) and Ray/Scikit-Learn (arXiv:1712.05889) in the multi-node setting. Finally, we propose two practice-oriented compressors for FedNL, adaptive TopLEK and cache-aware RandSeqK, which fulfill the theory of FedNL.

Paper Structure

This paper contains 132 sections, 25 equations, 12 figures, 11 tables, 4 algorithms.

Figures (12)

  • Figure 1: FedNL-LS simulation in a single-node, $1000$ rounds, theoretical step-size, FP64. Line search parameters $c=0.49,\gamma=0.5$. Dataset W8A ($49749$ samples) augmented with intercept split to $n_i=350$ samples/client.
  • Figure 2: FedNL-LS simulation in a single-node, $1000$ rounds, theoretical step-size, FP64. Line search parameters $c=0.49,\gamma=0.5$. Dataset A9A ($32561$ samples) augmented with intercept split to $n_i=229$ samples/client.
  • Figure 3: FedNL-LS simulation in a single-node, $2000$ rounds, theoretical step-size, FP64. Line search parameters $c=0.49,\gamma=0.5$. Dataset PHISHING ($11055$ samples) augmented with intercept split to $n_i=77$ samples/client.
  • Figure 4: FedNL in multi-node setting, theoretical step-size, $n=50$, FP64 arithmetic, 1 CPU core per node and master, TCP/IPv4, dataset W8A reshuffled u.a.r. and augmented with intercept.
  • Figure 5: FedNL-LS in multi-node setting, $n=50$, FP64 arithmetic, 1 CPU core per node and master, TCP/IPv4, dataset W8A reshuffled u.a.r. and augmented with intercept. The line search parameters $c=0.49,\gamma=0.5$.
  • ...and 7 more figures
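
The FedNL-LS captions above quote line search parameters $c=0.49,\gamma=0.5$. As a rough illustration of what such parameters typically control, the sketch below implements a generic backtracking (Armijo) line search in plain Python. It assumes that $c$ is the sufficient-decrease constant and $\gamma$ the backtracking factor; the function name `backtracking_line_search`, the toy objective `quadratic`, and the `max_backtracks` cap are hypothetical and are not taken from the paper or its implementation.

```python
from typing import Callable, List, Sequence


def backtracking_line_search(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    grad: Sequence[float],
    direction: Sequence[float],
    c: float = 0.49,          # assumed role: sufficient-decrease (Armijo) constant
    gamma: float = 0.5,       # assumed role: factor by which the step is shrunk
    max_backtracks: int = 50, # hypothetical safety cap, not from the source
) -> float:
    """Return the first alpha in {1, gamma, gamma^2, ...} satisfying
    f(x + alpha * d) <= f(x) + c * alpha * <grad, d>."""
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad, direction))
    if slope >= 0:
        raise ValueError("direction is not a descent direction")
    alpha = 1.0
    for _ in range(max_backtracks):
        trial = [xi + alpha * di for xi, di in zip(x, direction)]
        if f(trial) <= fx + c * alpha * slope:
            return alpha
        alpha *= gamma
    return alpha  # last (smallest) step tried if the cap was reached


def quadratic(x: Sequence[float]) -> float:
    """Toy objective 0.5 * (x0^2 + 10 * x1^2) with Hessian diag(1, 10)."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


if __name__ == "__main__":
    x0: List[float] = [1.0, 1.0]
    g0: List[float] = [1.0, 10.0]            # gradient of quadratic at x0
    newton_dir = [-1.0, -1.0]                # -H^{-1} g for H = diag(1, 10)
    gradient_dir = [-g for g in g0]          # plain negative gradient
    print(backtracking_line_search(quadratic, x0, g0, newton_dir))
    print(backtracking_line_search(quadratic, x0, g0, gradient_dir))
```

On an exactly quadratic objective the full Newton step decreases the function by half of the predicted slope, so any $c<0.5$ accepts $\alpha=1$ immediately, whereas the negative-gradient direction on the same toy problem needs several halvings. Whether this is the reason for the choice $c=0.49$ in the experiments is not stated in the captions.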