Unlocking FedNL: Self-Contained Compute-Optimized Implementation
Konstantin Burlachenko, Peter Richtárik
TL;DR
This work tackles the gap between FedNL theory and practical deployment by delivering a self-contained, compute-optimized FedNL implementation for single-node and multi-node Federated Learning. It introduces two practical compressors, TopLEK and RandSeqK, and demonstrates that careful software engineering, beyond algorithmic advances, yields wall-clock improvements of up to roughly ×1000 for convex problems such as logistic regression. The authors show that FedNL outperforms industry baselines: CVXPY in the single-node setting, and Apache Spark and Ray/Scikit-Learn in the multi-node setting. Convergence guarantees are maintained through the FedNL-LS and FedNL-PP extensions. The study also highlights design choices around domain-specific compute architectures, the memory hierarchy, and TCP/IP networking that are needed to achieve realistic performance on resource-constrained platforms. Overall, the work emphasizes that translating theory into practice requires integrated attention to algorithms, compilers, memory access patterns, and system-level optimizations in order to enable robust, autonomous FL tooling on diverse hardware.
Abstract
Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner, eliminating the need to share their local data. Recent work (arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. However, the reference FedNL prototype exhibits three serious practical drawbacks: (i) it requires 4.8 hours to launch a single experiment on a server-grade workstation; (ii) it only simulates a multi-node setting; (iii) integrating it into resource-constrained applications is challenging. To bridge the gap between theory and practice, we present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings. Our work resolves the aforementioned issues and reduces the wall-clock time by ×1000. With this implementation, FedNL outperforms alternatives for training logistic regression both in the single-node setting, against CVXPY (arXiv:1603.00943), and in the multi-node setting, against Apache Spark (arXiv:1505.06807) and Ray/Scikit-Learn (arXiv:1712.05889). Finally, we propose two practically oriented compressors for FedNL, adaptive TopLEK and cache-aware RandSeqK, which fulfill the requirements of FedNL theory.
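The abstract does not define TopLEK or RandSeqK, so the following Python sketch only illustrates the two generic ideas their names suggest: a greedy top-K sparsifier and a random sparsifier that keeps K consecutive entries. The function names `top_k` and `rand_contiguous_k`, the wrap-around indexing, and the d/K scaling are assumptions made for illustration and should not be read as the authors' definitions.

```python
import random


def top_k(values, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    if not 0 < k <= len(values):
        raise ValueError("k must satisfy 0 < k <= len(values)")
    # Indices sorted by decreasing absolute value; keep the first k.
    keep = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)[:k]
    out = [0.0] * len(values)
    for i in keep:
        out[i] = values[i]
    return out


def rand_contiguous_k(values, k, rng=random):
    """Keep k consecutive entries starting at a uniformly random index.

    Indices wrap around the end of the vector. Each entry is kept with
    probability k/d, so scaling by d/k makes the output unbiased
    (an assumption of this sketch, not a statement about RandSeqK).
    """
    d = len(values)
    if not 0 < k <= d:
        raise ValueError("k must satisfy 0 < k <= len(values)")
    start = rng.randrange(d)
    scale = d / k
    out = [0.0] * d
    for offset in range(k):
        i = (start + offset) % d
        out[i] = scale * values[i]
    return out


if __name__ == "__main__":
    x = [1.0, -5.0, 3.0, 0.5]
    print(top_k(x, 2))
    print(rand_contiguous_k(x, 2, random.Random(0)))
```

In this sketch the contiguous variant reads and writes neighbouring memory locations, which is one plausible reading of "cache-aware"; in FedNL the compressed object would be a Hessian-related matrix, treated here as a flattened vector for simplicity.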
