Asynchronous Multi-Model Dynamic Federated Learning over Wireless Networks: Theory, Modeling, and Optimization

Zhan-Lun Chang; Seyyedali Hosseinalipour; Mung Chiang; Christopher G. Brinton

TL;DR

This work addresses the challenge of performing asynchronous, multi-task federated learning over wireless networks with dynamic data statistics. It introduces DMA-FL, which uses scheduling tensors and rect functions to model device participation and data drift across tasks, and develops a convergence analysis linking these factors to learning performance. A joint resource allocation and device scheduling optimization is formulated and solved via relaxation and successive convex approximation to balance model quality and energy consumption, with convergence guarantees. Numerical experiments on MNIST, Fashion-MNIST, and SVHN demonstrate that DMA-FL achieves superior performance-energy tradeoffs compared to baseline asynchronous and synchronous FL methods, particularly under significant data drift and task heterogeneity. The approach offers a principled, scalable framework for deploying multi-task FL in practical edge networks, enabling responsive, energy-aware learning at scale.

Abstract

Federated learning (FL) has emerged as a key technique for distributed machine learning (ML). Most literature on FL has focused on ML model training for (i) a single task/model, with (ii) a synchronous scheme for updating model parameters, and (iii) a static data distribution across devices, assumptions that are often unrealistic in practical wireless environments. To address this, we develop DMA-FL, which considers dynamic FL with multiple downstream tasks/models over an asynchronous model update architecture. We first characterize convergence by introducing scheduling tensors and rectangular functions to capture the impact of system parameters on learning performance. Our analysis sheds light on the joint impact of device training variables (e.g., number of local gradient descent steps), asynchronous scheduling decisions (i.e., when a device trains a task), and dynamic data drifts on the performance of ML training for different tasks. Leveraging these results, we formulate an optimization for jointly configuring resource allocation and device scheduling to strike an efficient trade-off between energy consumption and ML performance. Our solver for the resulting non-convex mixed integer program employs constraint relaxations and successive convex approximations with convergence guarantees. Through numerical experiments, we reveal that DMA-FL substantially improves the performance-efficiency trade-off.
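The asynchronous, per-task aggregation described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the class `AsyncMultiTaskServer`, the task names, and the convex mixing rule `w_j <- (1 - alpha_j) * w_j + alpha_j * w_local` are assumptions made for illustration. The source only states that the server aggregates as soon as one trained local model for a task arrives, using a task-dependent weight.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncMultiTaskServer:
    """Toy server holding one global model per task (hypothetical helper)."""
    models: dict  # task id -> list of floats (global model for that task)
    alphas: dict  # task id -> aggregation weight alpha_j, assumed in (0, 1]
    counts: dict = field(default_factory=dict)  # task id -> aggregations so far

    def receive(self, task, local_model):
        # Aggregate immediately on arrival; only this task's model changes.
        # Assumed mixing rule: w_j <- (1 - alpha_j) * w_j + alpha_j * w_local.
        a = self.alphas[task]
        self.models[task] = [
            (1 - a) * w + a * v for w, v in zip(self.models[task], local_model)
        ]
        self.counts[task] = self.counts.get(task, 0) + 1
        return self.models[task]


server = AsyncMultiTaskServer(
    models={"task_a": [0.0, 0.0], "task_b": [1.0, 1.0]},
    alphas={"task_a": 0.5, "task_b": 0.25},
)
# One device uploads a trained local model for task_a; task_b is untouched.
updated = server.receive("task_a", [2.0, 4.0])
```

Keeping a separate aggregation counter per task reflects that each task advances through its own sequence of global aggregations, independently of the others.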

Paper Structure (43 sections, 3 theorems, 58 equations, 11 figures, 2 tables, 1 algorithm)

Introduction
Federated Learning (FL) and Practical Considerations
Multiple Tasks/Models
Asynchronous Aggregations
Dynamic Data Statistics
Related Work
Outline and Summary of Contributions
System Model
Setup and Overview
ML Model Training
Task Formulation
Local Updates and Global Aggregations
Convergence Analysis
Assumptions and Definitions
Data Evolution and Device Scheduling
...and 28 more sections

Key Result

Lemma 1

Let $\Psi_j^{g,g'}=\mathbf{w}_j^{(g)}-\mathbf{w}_j^{(g')}$ denote the difference between two global models obtained at aggregations $g$ and $g'$, $g \leq g'$, for task $j$. Under an arbitrary device scheduling $\bm{X}$, we have where $\mathbf{a}_{i,j}^{(g)} = \eta_{j}^{(g)}\sum_{\ell = 0}^{e_{i,j}^{(g)}-1} \nabla F_{i,j}^{\mathsf{R}}(\mathbf{w}_{i,j}^{\ell, (g)})$. If $g = g'$, $\Psi_j^{g,g'} =

Figures (11)

Figure 1: Difference in architectures between single-model synchronous FL and our proposed DMA-FL methodology. In single-model synchronous FL, the server must wait for all trained local models for a single task before it can perform model aggregation, after which the new global model is broadcast back to all devices. In contrast, in DMA-FL the server performs the global aggregation as soon as it receives one trained local model for any task. The new global model for that task can then be transmitted to one or more devices that train a model for that task.
Figure 2: Example timeline of local periods for task $j$ at two devices $i$ and $i'$. Since we consider asynchronous FL, the server updates the global model from $\bm{w}_j^{(g)}$ to $\bm{w}_j^{(g+1)}$ for any $g \in \mathcal{G}_j$ whenever it receives any trained local model. Each local period spans the time between two consecutive uplink transmissions and comprises four distinct periods: the idle period, the downlink transmission period, the local computation period, and the uplink transmission period.
Figure 3: Illustration of asynchronous local updates and the corresponding global aggregations at the server for multiple tasks in DMA-FL when dynamic data variations are present. The server updates the global model for task $j$, with task-dependent aggregation weight $\alpha_j$, only when it receives a trained local model for task $j$. The server's model for any other task remains unchanged until a trained local model for that task is transmitted to the server.
Figure 4: Illustration of how rectangular (rect) functions are used to capture the idle (ID) and active (AC) concept drift for different devices on a task. The aggregation index on the rect function corresponds to the index on the model received from the server.
Figure 5: ML training convergence of all schemes for all tasks, with labels distributed according to a Dirichlet distribution. DMA-FL and DMA-FL-NR outperform all other baselines on all tasks. As we will see in Figure \ref{fig:acc_energy_dirichlet}, DMA-FL-NR also incurs significantly higher energy consumption on each task. Thus, DMA-FL has the best trade-off between performance and energy consumption.
...and 6 more figures
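
The rect-function idea in the Figure 4 caption can be sketched in Python as an indicator that switches a drift term on only during a given window. This is an illustrative assumption, not the paper's Definition 5: the half-open window convention, the integer time slots, and the per-slot drift rates are all hypothetical.

```python
def rect(t, start, end):
    """Indicator of the half-open window [start, end): 1 inside, 0 outside."""
    return 1 if start <= t < end else 0


def accumulated_drift(windows, horizon):
    """Sum per-slot drift over integer time slots 0..horizon-1.

    windows: list of (start, end, drift_per_slot) tuples (hypothetical values).
    A window contributes its drift only in slots where its rect equals 1.
    """
    return sum(
        rate * rect(t, start, end)
        for t in range(horizon)
        for (start, end, rate) in windows
    )


# Idle window [0, 3) with drift 1 per slot, active window [3, 5) with drift 2.
total = accumulated_drift([(0, 3, 1), (3, 5, 2)], horizon=10)
```

Expressing idle and active periods as windows of the same indicator lets a single sum cover both kinds of drift without case analysis over which period a time slot falls in.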

Theorems & Definitions (11)

Definition 1: Weak Convexity
Definition 2
Definition 3: Device Scheduling
Definition 4: Staleness
Definition 5: Capturing Concept Drift via Rect Functions
Lemma 1: Recursive Relationship between two Global Models
proof
Theorem 1: Model Convergence
proof
Corollary 1: Special Cases of (\ref{weighted_sum_weight_local_number_updates_concept_drift_integration_y})
...and 1 more
