Decentralized Personalized Federated Learning based on a Conditional Sparse-to-Sparser Scheme

Qianyu Long; Qiyuan Wang; Christos Anagnostopoulos; Daning Bi

TL;DR

This work introduces DA-DPFL, a sparse-to-sparser training scheme that starts from a subset of model parameters and progressively shrinks that subset during training via dynamic aggregation. This substantially reduces energy consumption while retaining adequate information during critical learning periods.

Abstract

Decentralized Federated Learning (DFL) has become popular due to its robustness and avoidance of centralized coordination. In this paradigm, clients actively engage in training by exchanging models with their networked neighbors. However, DFL introduces increased costs in terms of training and communication. Existing methods focus on minimizing communication, often overlooking training efficiency and data heterogeneity. To address this gap, we propose a novel sparse-to-sparser training scheme: DA-DPFL. DA-DPFL initializes with a subset of model parameters, which is progressively reduced during training via dynamic aggregation, leading to substantial energy savings while retaining adequate information during critical learning periods. Our experiments show that DA-DPFL substantially outperforms DFL baselines in test accuracy, while achieving up to a $5$-fold reduction in energy costs. We also provide a theoretical analysis of DA-DPFL's convergence, solidifying its applicability in decentralized and personalized learning. The code is available at: https://github.com/EricLoong/da-dpfl
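The sparse-to-sparser idea can be illustrated with a minimal sketch: a client starts from an already sparse mask and later prunes a further fraction of the weights that are still active. The helper names (`magnitude_prune`, `density`), the plain magnitude criterion, and the fixed `extra_fraction` are assumptions made for illustration only; they are not taken from the DA-DPFL code and do not reproduce its conditional, time-optimized pruning policy.

```python
def magnitude_prune(weights, mask, extra_fraction):
    """Return a sparser mask by dropping the smallest-magnitude active weights.

    Hypothetical helper for illustration; the criterion and the fraction
    are assumptions, not the paper's exact pruning policy.
    """
    if not 0.0 <= extra_fraction <= 1.0:
        raise ValueError("extra_fraction must lie in [0, 1]")
    active = [i for i, keep in enumerate(mask) if keep]
    n_drop = int(len(active) * extra_fraction)  # rounds down
    # Active positions ordered from smallest to largest magnitude.
    weakest_first = sorted(active, key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in weakest_first[:n_drop]:
        new_mask[i] = 0
    return new_mask


def density(mask):
    """Fraction of parameters that are still active."""
    return sum(mask) / len(mask)


# Toy example: start sparse (5 of 6 weights active), then go sparser.
weights = [0.5, -0.1, 0.3, 0.0, -0.8, 0.05]
initial_mask = [1, 1, 1, 0, 1, 1]
sparser_mask = magnitude_prune(weights, initial_mask, extra_fraction=0.5)
```

In this toy run, two of the five active weights (the ones with magnitudes 0.05 and 0.1) are removed, so the density falls from 5/6 to 1/2. Positions that were already pruned stay pruned, which is the defining property of moving from a sparse model to a sparser one.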

Paper Structure (31 sections, 6 theorems, 44 equations, 9 figures, 3 tables, 3 algorithms)

Introduction
Related Work
Problem Fundamentals & Preliminaries
The DA-DPFL framework
Overview
Learning Scheduling Policy
Time-optimized Dynamic Pruning Policy
Masked-based Model Aggregation
Theoretical Analysis
Experiments
Experimental Setup
Datasets & Models
Baselines
System Configuration
Hyperparameters
...and 16 more sections

Key Result

Proposition 1

Assume a set of clients $\mathcal{K} = \{1, 2, \ldots, K\}$. Then the number of neighbors in $\mathcal{N}_{k}^{t}$ whose reuse index is less than $k$ follows a hypergeometric distribution. Here $m<M$ denotes a value of that count, and $\mathcal{N}_{(a)k}$ is the subset of $\mathcal{N}_{k}^{t}$ with index less than $k$, so the event of interest is $|\mathcal{N}_{(a)k}| = m$.
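The proposition's formula is not shown here, so the following is only a sketch of what a hypergeometric count looks like under one plausible reading. It assumes that client $k$ draws its $M$ neighbors uniformly without replacement from the other $K-1$ clients, of which $k-1$ have a smaller index. Both the sampling model and the function name `prob_lower_index_neighbors` are assumptions, and the paper's exact parameterization may differ.

```python
from math import comb


def prob_lower_index_neighbors(K, M, k, m):
    """P(exactly m of client k's M neighbors have an index below k).

    Assumed model (not confirmed by the source): the M neighbors are
    sampled uniformly without replacement from the other K - 1 clients,
    k - 1 of which have a smaller index than k.
    """
    if not (1 <= k <= K and 0 <= M <= K - 1 and 0 <= m <= M):
        raise ValueError("need 1 <= k <= K, 0 <= M <= K - 1, 0 <= m <= M")
    # math.comb returns 0 when asked to choose more items than exist,
    # so impossible outcomes get probability 0 automatically.
    return comb(k - 1, m) * comb(K - k, M - m) / comb(K - 1, M)


# Setting borrowed from Figure 1 (K = 6 clients, M = 2 neighbors), client k = 3.
pmf = [prob_lower_index_neighbors(6, 2, 3, m) for m in range(3)]
```

Under this assumed model, client 3 has two lower-indexed clients among its five possible neighbors, and the three probabilities in `pmf` sum to one.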

Figures (9)

Figure 1: (Top) Client network ($K=6, M=2, N=1$) with reuse indexes $\mathcal{N}_{k(a)}^{(*)t}$ and $\mathcal{N}_{k(b)}^{(*)t}$. Learning schedule: when $N=0$, all nodes train in parallel, i.e., $\mathcal{N}_{k(a)}^{(*)t}=\emptyset$; when $N=1$, node 3 waits for node 2, and nodes 5 and 6 wait for node 1, while nodes 1, 2, and 4 begin parallel training immediately; $N=2$ lets node 6 wait for nodes 1 and 5, marked in a different color. (Bottom) Training process at time $t$ for client $k$. The flow branches on $t\in \mathcal{T}$: 'no' leads to normal sparse training, 'yes' to the proposed sparser training. Steps: (1) Detection score calculation using $\omega_{k}^{t}$, determining $t^{*}$; (2) Magnitude-based weight pruning; (3) Gradient-flow-driven weight recovery; (4) PQI evaluation for NN compressibility; (5) Additional pruning based on compressibility level.
Figure 2: Test (top-1) accuracy of all baselines, including CFLs and DFLs, across various model architectures and datasets.
Figure 3: Total cost (energy and time cost, in USD) of DA-DPFL and all baselines evaluated on CIFAR10 against $\theta$.
Figure 4: Total cost (energy and time cost, in USD) of DA-DPFL and all baselines evaluated on CIFAR100 against $\theta$.
Figure 5: (Top) Relationship between sparsity and detection score; (Bottom) Impact of $M$ involved in each training round on accuracy (CIFAR10, Dir(0.3), $\delta_{pr}=0.03$).
...and 4 more figures
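Step (4) of the Figure 1 caption evaluates compressibility with PQI, which the caption does not define. The sketch below assumes PQI refers to the PQ Index sparsity measure, $1 - d^{1/q - 1/p}\,\|w\|_p / \|w\|_q$ with $0 < p \le 1 < q$, where $d$ is the number of weights. The function name `pq_index` and the defaults $p=1$, $q=2$ are illustrative choices, not values taken from the paper.

```python
def pq_index(weights, p=1.0, q=2.0):
    """PQ Index of a weight vector: 1 - d**(1/q - 1/p) * ||w||_p / ||w||_q.

    Assumed definition (the caption only names "PQI"). Values near 0 mean
    the magnitudes are spread evenly; larger values mean the mass sits in
    a few weights, i.e. the vector is more compressible.
    """
    if not (0 < p <= 1 < q):
        raise ValueError("requires 0 < p <= 1 < q")
    d = len(weights)
    norm_p = sum(abs(w) ** p for w in weights) ** (1.0 / p)
    norm_q = sum(abs(w) ** q for w in weights) ** (1.0 / q)
    if norm_q == 0:
        raise ValueError("PQ Index is undefined for an empty or all-zero vector")
    return 1.0 - d ** (1.0 / q - 1.0 / p) * norm_p / norm_q


dense_score = pq_index([1.0, 1.0, 1.0, 1.0])   # evenly spread magnitudes
sparse_score = pq_index([0.0, 0.0, 3.0, 0.0])  # all mass in one weight
```

A score like this could then drive step (5): the more compressible a layer looks, the more aggressively it might be pruned. How the paper maps the score to a pruning amount is not stated in the caption, so that mapping is left out of the sketch.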

Theorems & Definitions (16)

Remark 1
Remark 2
Proposition 1
proof
Theorem 3
proof
Remark 4
Remark 5
Lemma 1
proof
...and 6 more
