Algorithms for Collaborative Machine Learning under Statistical Heterogeneity

Seok-Ju Hahn

TL;DR

This doctoral work tackles statistical heterogeneity in federated learning (FL) from three angles: model parameterization, adaptive aggregation, and local data distributions. It introduces SuPerFed, which builds a connected low-loss subspace between global and local models for personalization, aided by proximity and orthogonality regularizers. It then presents AAggFF, an online decision-making framework that adaptively tunes mixing coefficients to promote client-level fairness with provable sublinear regret, with variants for the cross-silo and cross-device settings. Finally, FedEvg leverages energy-based models to generate synthetic data collaboratively without sharing raw data or full model parameters: the server refines synthetic samples through MCMC-like updates driven by energy signals from clients. Together, the methods make FL more practical under statistical heterogeneity by enabling personalized performance, equitable client outcomes, and data-efficient server-side augmentation, while preserving privacy and reducing communication overhead.

Abstract

Learning from distributed data without accessing them is undoubtedly a challenging and non-trivial task. Nevertheless, the need for distributed training of statistical models has been increasing, due to the privacy concerns of local data owners and the cost of centralizing massively distributed data. Federated learning (FL) is currently the de facto standard for training a machine learning model across heterogeneous data owners without the raw data leaving local silos. Still, several challenges must be addressed for FL to become more practical. Among these challenges, the statistical heterogeneity problem is the most significant and requires immediate attention. From the main objective of FL, three major factors can be considered as starting points -- \textit{parameter}, \textit{mixing coefficient}, and \textit{local data distributions}. In alignment with these components, this dissertation is organized into three parts. In Chapter II, a novel personalization method, \texttt{SuPerFed}, inspired by mode connectivity, is introduced. In Chapter III, an adaptive decision-making algorithm, \texttt{AAggFF}, is introduced for inducing uniform performance distributions across participating clients, realized through an online convex optimization framework. Finally, in Chapter IV, a collaborative synthetic data generation method, \texttt{FedEvg}, is introduced, leveraging the flexibility and compositionality of an energy-based modeling approach. Taken together, these approaches provide practical solutions to mitigate the statistical heterogeneity problem in data-decentralized settings, paving the way for distributed systems and applications using collaborative machine learning methods.
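
For orientation, the three factors named in the abstract can be read off the commonly used weighted form of the FL objective. The notation below is an assumption for illustration and is not taken from the dissertation: $\boldsymbol{\theta}$ is the model parameter, $w_i$ is the mixing coefficient of client $i$, $\mathcal{D}_i$ is that client's local data distribution, and $\mathcal{L}$ is a per-sample loss.

```latex
\min_{\boldsymbol{\theta}} \; F(\boldsymbol{\theta})
  = \sum_{i=1}^{K} w_i \, F_i(\boldsymbol{\theta}),
\qquad
F_i(\boldsymbol{\theta})
  = \mathbb{E}_{\xi \sim \mathcal{D}_i}\!\left[ \mathcal{L}(\xi; \boldsymbol{\theta}) \right],
\qquad
w_i \ge 0, \quad \sum_{i=1}^{K} w_i = 1 .
```

Under this reading, Chapter II acts on $\boldsymbol{\theta}$, Chapter III on the weights $w_i$, and Chapter IV on the distributions $\mathcal{D}_i$; statistical heterogeneity corresponds to the $\mathcal{D}_i$ differing across clients.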

Paper Structure

This paper contains 131 sections, 14 theorems, 120 equations, 4 figures, 26 tables, 9 algorithms.

Key Result

Lemma 3.1

For all $t\in[T]$, suppose each entry of a response vector $\boldsymbol{r}^{(t)}\in\mathbb{R}^K$ is bounded as $r_i^{(t)}\in[C_1,C_2]$ for some constants $C_1$ and $C_2$ satisfying $0<C_1<C_2$. Then, the decision loss $\ell^{(t)}$ defined in (eq:decision_loss) is $\frac{C_2}{1+C_1}$-Lipschitz contin
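
The constant in Lemma 3.1 can be probed numerically. The sketch below rests on two assumptions that this excerpt does not confirm: the decision loss is taken to be $\ell^{(t)}(\boldsymbol{p}) = -\log(1+\langle\boldsymbol{p},\boldsymbol{r}^{(t)}\rangle)$ over the probability simplex, and distances between decisions are measured in the $L_1$ norm. The function names are hypothetical.

```python
import numpy as np


def decision_loss(p, r):
    # ASSUMED form of the decision loss; the excerpt only cites it by reference.
    return -np.log1p(p @ r)


def lipschitz_bound(c1, c2):
    # Constant stated in Lemma 3.1: C2 / (1 + C1).
    return c2 / (1.0 + c1)


def max_observed_ratio(K, c1, c2, trials, seed=0):
    """Largest |loss(p) - loss(q)| / ||p - q||_1 over random simplex points."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        r = rng.uniform(c1, c2, size=K)   # response entries in [C1, C2]
        p = rng.dirichlet(np.ones(K))     # random point on the simplex
        q = rng.dirichlet(np.ones(K))
        gap = np.abs(p - q).sum()         # L1 distance (an assumption here)
        if gap == 0.0:
            continue
        ratio = abs(decision_loss(p, r) - decision_loss(q, r)) / gap
        worst = max(worst, ratio)
    return worst


if __name__ == "__main__":
    print(max_observed_ratio(K=5, c1=0.5, c2=2.0, trials=2000))
    print(lipschitz_bound(0.5, 2.0))
```

Under these assumptions, every gradient entry $-r_i/(1+\langle\boldsymbol{p},\boldsymbol{r}\rangle)$ has magnitude at most $C_2/(1+C_1)$, so the observed ratio should never exceed the stated constant. A random search of this kind only illustrates the bound and does not prove it.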

Figures (4)

  • Figure 1: Overview of one step of the federated learning procedure. The central server broadcasts a global model to a random subset of clients (left). Each client updates the global model using its own local dataset and computing power, and the local updates are then uploaded to the server to be aggregated into a new global model (right). This process, orchestrated by the server, is repeated until the global model converges.
  • Figure 2.2: Overview of the model mixture-based personalized federated learning method
  • Figure 2.3: Comparison of the personalization performance between APFL (left), SuPerFed-MM (middle), and SuPerFed-LM (right) by varying $\lambda\in[0,1]$, trained using MNIST with TwoNN under pathological non-IID setting with $K=500$ and $T=500$
  • Figure 3.1: Cumulative values of a global objective according to different CDFs (smaller is better): (Left) Berka dataset (cross-silo setting; $K=7, T=100$). (Right) Reddit dataset (cross-device setting; $K=817, T=300, C=0.00612$)
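
The sweep over $\lambda\in[0,1]$ mentioned in the caption of Figure 2.3 can be illustrated with a minimal sketch of linear interpolation between two parameter vectors, evaluating a loss at each mixing value. This is not the SuPerFed training procedure: the linear model, the synthetic data, the helper names, and the convention that $\lambda=0$ selects the global model are all assumptions made for illustration.

```python
import numpy as np


def interpolate(theta_global, theta_local, lam):
    # Convex combination of two parameter vectors.
    # ASSUMED convention: lam = 0 -> global model, lam = 1 -> local model.
    return (1.0 - lam) * theta_global + lam * theta_local


def mse(theta, X, y):
    # Mean squared error of a linear model with parameters theta.
    return float(np.mean((X @ theta - y) ** 2))


# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

# Two endpoints placed symmetrically around the true parameters.
theta_global = theta_true + 0.3
theta_local = theta_true - 0.3

# Evaluate the loss along the straight path between the endpoints.
lams = np.linspace(0.0, 1.0, 5)
losses = [mse(interpolate(theta_global, theta_local, lam), X, y) for lam in lams]

if __name__ == "__main__":
    for lam, loss in zip(lams, losses):
        print(f"lambda={lam:.2f}  loss={loss:.6f}")
```

In this toy setup the midpoint of the path coincides with the true parameters, so the loss dips in the middle of the sweep. A connected low-loss subspace would correspond instead to the loss staying low along the whole path.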

Theorems & Definitions (39)

  • Definition 1
  • Definition 3.2
  • Remark 3.1
  • Definition 3.3
  • Remark 3.2
  • Lemma 3.1
  • Definition 3.4
  • Lemma 3.2
  • Lemma 3.3
  • Remark 3.3
  • ...and 29 more