Agnostic Federated Learning
Mehryar Mohri, Gary Sivek, Ananda Theertha Suresh
TL;DR
Agnostic Federated Learning (AFL) addresses the mismatch between training and test distributions in federated settings by optimizing a single central model for any target distribution formed by a mixture of client distributions. The authors develop data-dependent learning bounds based on a weighted Rademacher complexity with a skewness term. These bounds lead to a convex minimax optimization problem, which is solved by a scalable stochastic algorithm (Stochastic-AFL) with convergence guarantees. Empirical results on Adult, Fashion-MNIST, and language-modeling tasks show that AFL improves worst-domain performance compared with standard FL and domain-specific baselines. Extensions to domain clustering, priors over mixture weights, and personalization are also explored. Overall, AFL provides a principled, robust framework for learning under distributional shift across multiple clients, with practical applicability to cloud services and domain adaptation scenarios.
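The minimax problem mentioned in the summary can be written out as follows. This is a sketch in notation assumed here rather than quoted from the text: p clients with distributions D_1, ..., D_p, mixture weights λ restricted to a set Λ inside the probability simplex Δ_p, model parameters w in a set W, and empirical per-client losses.

```latex
% Notation is an assumption of this sketch, not taken from the summary above.
\begin{align*}
  % Target distribution: an unknown mixture of the client distributions
  \mathcal{D}_\lambda &= \sum_{k=1}^{p} \lambda_k \mathcal{D}_k,
    \qquad \lambda \in \Lambda \subseteq \Delta_p \\
  % Agnostic loss: the worst case over all admissible mixtures
  \mathcal{L}_{\mathcal{D}_\Lambda}(h) &= \max_{\lambda \in \Lambda}
    \mathcal{L}_{\mathcal{D}_\lambda}(h)
    = \max_{\lambda \in \Lambda} \sum_{k=1}^{p} \lambda_k \mathcal{L}_{\mathcal{D}_k}(h) \\
  % Empirical minimax problem over model parameters w and mixture weights \lambda
  \min_{w \in \mathcal{W}} \; \max_{\lambda \in \Lambda} \;
    & \sum_{k=1}^{p} \lambda_k \widehat{\mathcal{L}}_k(w)
\end{align*}
```

Because the expected loss is linear in the distribution, the mixture loss is a weighted sum of per-client losses. When Λ is the full simplex, the inner maximum is therefore the loss on the worst single client.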
Abstract
A key learning scenario in large-scale applications is that of federated learning, where a centralized model is trained based on data originating from a large number of clients. We argue that, with the existing training and inference, federated models can be biased towards different clients. Instead, we propose a new framework of agnostic federated learning, where the centralized model is optimized for any target distribution formed by a mixture of the client distributions. We further show that this framework naturally yields a notion of fairness. We present data-dependent Rademacher complexity guarantees for learning with this objective, which guide the definition of an algorithm for agnostic federated learning. We also give a fast stochastic optimization algorithm for solving the corresponding optimization problem, for which we prove convergence bounds, assuming a convex loss function and hypothesis set. We further empirically demonstrate the benefits of our approach on several datasets. Beyond federated learning, our framework and algorithm can be of interest to other learning scenarios such as cloud computing, domain adaptation, drifting, and other contexts where the training and test distributions do not coincide.
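To make the objective concrete, here is a minimal sketch of optimizing one model for the worst-case mixture of client distributions. It is not the authors' Stochastic-AFL algorithm: it uses full-batch gradients, a squared loss on a linear model, and plain projected gradient descent-ascent with iterate averaging. The function names (`project_to_simplex`, `domain_losses_and_grads`, `agnostic_gda`), the step sizes, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * ks > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)


def domain_losses_and_grads(w, domains):
    """Mean squared-error loss and gradient of a linear model on each domain."""
    losses, grads = [], []
    for X, y in domains:
        residual = X @ w - y
        losses.append(0.5 * np.mean(residual ** 2))
        grads.append(X.T @ residual / len(y))
    return np.array(losses), np.array(grads)


def agnostic_gda(domains, steps=2000, lr_w=0.05, lr_lam=0.05):
    """Projected gradient descent on w and ascent on mixture weights lam.

    Illustrative only: approximately solves
    min_w max_{lam in simplex} sum_k lam[k] * loss_k(w)
    and returns the averaged iterates.
    """
    p = len(domains)
    d = domains[0][0].shape[1]
    w = np.zeros(d)
    lam = np.full(p, 1.0 / p)                 # start from the uniform mixture
    w_sum, lam_sum = np.zeros(d), np.zeros(p)
    for _ in range(steps):
        losses, grads = domain_losses_and_grads(w, domains)
        w = w - lr_w * (lam @ grads)          # descent step on the model
        lam = project_to_simplex(lam + lr_lam * losses)  # ascent step on weights
        w_sum += w
        lam_sum += lam
    return w_sum / steps, lam_sum / steps


if __name__ == "__main__":
    # Hypothetical data: two clients with conflicting targets and unequal sizes.
    rng = np.random.default_rng(0)
    X_a, X_b = rng.normal(size=(900, 1)), rng.normal(size=(100, 1))
    domains = [(X_a, X_a[:, 0]), (X_b, -X_b[:, 0])]
    w, lam = agnostic_gda(domains)
    print("mixture weights:", lam)
    print("per-domain losses:", domain_losses_and_grads(w, domains)[0])
```

In this toy setup, a model fit to the pooled data would mostly serve the larger client. The descent-ascent loop instead shifts weight toward whichever client currently has the higher loss, which is the sense in which the resulting model is agnostic to the test-time mixture.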
