Provably Doubly Accelerated Federated Learning: The First Theoretically Successful Combination of Local Training and Communication Compression
Laurent Condat, Ivan Agarský, Peter Richtárik
TL;DR
This work addresses the communication bottleneck in Federated Learning (FL) by proposing CompressedScaffnew, a novel algorithm that jointly leverages Local Training (LT) and Communication Compression (CC). It provides a theoretical framework showing linear convergence to the exact solution in the strongly convex setting with a doubly accelerated rate, and derives iteration and total communication complexities that improve on those of prior methods using LT or CC alone. The approach introduces two randomization mechanisms and a tailored compressor design to effectively merge LT and CC, and is supported by sublinear ergodic convergence results in the general convex case and by practical logistic regression experiments. In practice, the results enable faster, communication-efficient FL under asymmetric uplink/downlink costs and across a range of model dimensions. Future work may extend the analysis to stochastic gradients, partial participation, biased quantization, and nonconvex regimes.
Abstract
In federated learning, a large number of users are involved in a global learning task, in a collaborative way. They alternate local computations and two-way communication with a distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce the communication load and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose the first algorithm for distributed optimization and federated learning, which harnesses these two strategies jointly and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: our algorithm benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
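To make the second strategy concrete, here is a minimal sketch of an unbiased rand-k sparsifier, a standard compression operator from the literature. It is an illustrative assumption, not the tailored compressor used by the proposed algorithm; the function name `rand_k` and the test vector are hypothetical. A client sends only k of its d coordinates, scaled by d/k so that the compressed vector equals the original in expectation.

```python
import numpy as np

def rand_k(x, k, rng):
    """Keep k coordinates chosen uniformly at random and scale them by d/k.

    The scaling makes the operator unbiased: E[rand_k(x)] = x.
    Illustrative only; this is not the paper's compressor.
    """
    d = x.shape[0]
    idx = rng.choice(d, size=k, replace=False)  # coordinates to transmit
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)  # a full-dimensional vector with d = 10

# One compressed message carries only k = 2 nonzero coordinates.
compressed = rand_k(x, k=2, rng=rng)

# Averaging many independent compressions recovers x approximately,
# which illustrates unbiasedness empirically.
mean_est = np.mean([rand_k(x, 2, rng) for _ in range(20000)], axis=0)
```

The price of sending fewer coordinates is added variance in each message, which is why combining compression with local training while keeping linear convergence to the exact solution is nontrivial.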
