Federated Optimization: Distributed Optimization Beyond the Datacenter

Jakub Konečný, Brendan McMahan, Daniel Ramage

TL;DR

Federated optimization trains a centralized model on data distributed across $K$ nodes, where the local datasets are non-IID and unbalanced. The goal is to minimize $f(w)=\frac{1}{n}\sum_{i=1}^n f_i(w)$, making use of the local objectives $F_k(w)=\frac{1}{n_k}\sum_{i\in \mathcal{P}_k} f_i(w)$, where $\mathcal{P}_k$ is the set of data points held by node $k$ and $n_k=|\mathcal{P}_k|$. The paper introduces a distributed SVRG variant (DSVRG) that uses node-specific stepsizes $h_k$, diagonal scaling $S_k$, and an adaptive aggregation matrix $A$ to cope with sparsity and heterogeneity, bridging SVRG and DANE. In experiments on a large-scale, highly non-IID setting (e.g., $K=10^4$, $n\approx 2.17\times 10^6$), DSVRG converges rapidly in a few rounds of communication and outperforms existing methods, and it is robust to non-IID partitions. The work highlights practical implications, calls for public datasets with user-level structure and for theoretical convergence guarantees, and discusses privacy considerations for on-device learning.
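The two definitions above imply that the global objective is a size-weighted average of the local ones, $f(w)=\sum_{k=1}^{K}\frac{n_k}{n}F_k(w)$, provided the sets $\mathcal{P}_k$ partition the $n$ data points. The sketch below checks this identity numerically on a tiny synthetic problem. The squared loss, the data, the partition sizes, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def loss_i(w, x, y):
    """Per-example loss f_i(w); squared loss is an assumption for illustration."""
    pred = sum(wj * xj for wj, xj in zip(w, x))
    return 0.5 * (pred - y) ** 2


def global_objective(w, data):
    """f(w) = (1/n) * sum over all examples of f_i(w)."""
    return sum(loss_i(w, x, y) for x, y in data) / len(data)


def local_objective(w, data, part):
    """F_k(w) = (1/n_k) * sum over i in P_k of f_i(w)."""
    return sum(loss_i(w, *data[i]) for i in part) / len(part)


rng = random.Random(0)
n, d = 60, 3
data = [([rng.gauss(0, 1) for _ in range(d)], rng.gauss(0, 1)) for _ in range(n)]

# Sort by label so that each node sees a different label range (a crude
# stand-in for non-IID data), then cut into unbalanced contiguous blocks.
data.sort(key=lambda pair: pair[1])
sizes = [1, 4, 15, 40]  # hypothetical n_k values; they sum to n
partitions, start = [], 0
for size in sizes:
    partitions.append(list(range(start, start + size)))
    start += size

w = [0.1, -0.2, 0.3]
f_w = global_objective(w, data)
weighted = sum(len(p) / n * local_objective(w, data, p) for p in partitions)

print(f"f(w)                  = {f_w:.6f}")
print(f"sum_k (n_k/n) F_k(w)  = {weighted:.6f}")
for k, p in enumerate(partitions):
    print(f"F_{k}(w) with n_k={len(p):2d}: {local_objective(w, data, p):.6f}")
```

The individual $F_k(w)$ values printed at the end generally differ from $f(w)$, which is the sense in which no single node holds a representative sample of the overall distribution.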

Abstract

We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large number of nodes, but the goal remains to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of utmost importance. A motivating example for federated optimization arises when we keep the training data locally on users' mobile devices rather than logging it to a data center for training. Instead, the mobile devices are used as nodes performing computation on their local data in order to update a global model. We suppose that we have an extremely large number of devices in our network, each of which holds only a tiny fraction of the total data available; in particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, we assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm that shows encouraging experimental results. This work also sets a path for future research needed in the context of federated optimization.

Paper Structure

This paper contains 4 sections, 2 equations, 1 figure, 1 algorithm.

Figures (1)

  • Figure 1: Rounds of communication vs. objective function (left) and test prediction error (right).