Internal Cross-layer Gradients for Extending Homogeneity to Heterogeneity in Federated Learning

Yun-Hin Chan, Rui Zhou, Running Zhao, Zhihan Jiang, Edith C.-H. Ngai

TL;DR

Federated learning must contend with system heterogeneity across clients, which hampers the performance of model-homogeneous FL methods. The authors introduce InCo Aggregation, a server-side strategy that leverages internal cross-layer gradients by mixing shallow and deep layer gradients, applying gradient normalization, and solving a convex optimization to align gradient directions, thereby enhancing deep-layer similarity without extra client communication. They establish non-convex convergence and rate guarantees and demonstrate broad empirical gains across CNNs (ResNets) and transformers (ViTs), improving both traditional homogeneous baselines and heterogeneous FL methods. The approach offers a practical, scalable pathway to robust FL under realistic heterogeneity, with minimal overhead and strong applicability to common architectures.

Abstract

Federated learning (FL) inevitably confronts system heterogeneity in practical scenarios. To help most model-homogeneous FL methods handle system heterogeneity, we propose a training scheme that extends their capabilities to this setting. We begin with a detailed exploration of homogeneous and heterogeneous FL settings and make three key observations: (1) client performance is positively correlated with layer similarity, (2) shallow layers exhibit higher similarity than deep layers, and (3) smoother gradient distributions indicate higher layer similarity. Building on these observations, we propose InCo Aggregation, which leverages internal cross-layer gradients, a mixture of gradients from shallow and deep layers within a server model, to increase similarity in the deep layers without requiring additional communication between clients. Furthermore, our method can be tailored to model-homogeneous FL methods such as FedAvg, FedProx, FedNova, Scaffold, and MOON, extending them to handle system heterogeneity. Extensive experimental results validate the effectiveness of InCo Aggregation, highlighting internal cross-layer gradients as a promising avenue for improving performance in heterogeneous FL.

Paper Structure

This paper contains 44 sections, 6 theorems, 44 equations, 25 figures, 9 tables, 1 algorithm.

Key Result

Theorem 3.1

(Divergence alleviation). If gradients are vectors, then for the layers that require cross-layer gradients, the updated gradients can be expressed in terms of $\theta^t=\frac{\beta}{\alpha}$, where $\alpha=(g_0^t)^\top g_0^t$ and $\beta=(g_0^t)^\top g_k^t$.
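
The closed-form expression for the updated gradient does not appear in the statement above, so the following is only an illustrative sketch and not the paper's formula. It assumes a projection-style rule: $g_0^t$ is treated as the reference (shallow-layer) gradient and $g_k^t$ as the deep-layer gradient being corrected. When $\beta < 0$ the two gradients conflict, and the component of $g_k^t$ along $g_0^t$ is removed using $\theta^t = \beta / \alpha$; otherwise $g_k^t$ is left unchanged. The function name `align_gradient` and the handling of a zero reference gradient are hypothetical choices made for this example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def align_gradient(g0, gk):
    """Return a copy of gk whose inner product with g0 is non-negative.

    Assumed rule (not taken from the theorem statement):
      alpha = g0 . g0, beta = g0 . gk, theta = beta / alpha
      if beta >= 0: keep gk
      if beta <  0: return gk - theta * g0 (drop the conflicting component)
    """
    alpha = dot(g0, g0)
    beta = dot(g0, gk)
    # A zero reference gradient gives no direction to align with.
    if alpha == 0.0 or beta >= 0.0:
        return list(gk)
    theta = beta / alpha
    return [b - theta * a for a, b in zip(g0, gk)]


g0 = [1.0, 0.0]   # reference gradient (flattened shallow layer)
gk = [-1.0, 2.0]  # deep-layer gradient that conflicts with g0 (beta = -1)
aligned = align_gradient(g0, gk)
print(aligned)  # [0.0, 2.0]
```

Under this assumed rule the corrected gradient is orthogonal to the reference whenever the two originally conflicted, which is the smallest change to $g_k^t$ that removes the conflict.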

Figures (25)

  • Figure 1: CKA similarity in different environments and the relation between accuracy and CKA similarity. (a) and (b): The CKA similarity of different federated settings. (c): The positive relation between CKA and accuracy during the training process.
  • Figure 2: Cross-environment similarity and gradients distributions. (a) and (b): Similarity from Stage 2 and Stage 3. (c) and (d): The gradient distributions of Non-IID with hetero and IID with homo.
  • Figure 3: The gradient distributions from round 40 to 50 in different environments.
  • Figure 4: Cross-layer gradients for the server model in InCo.
  • Figure 5: The system architecture of three different model splitting methods: (a) layer splitting, (b) stage splitting, and (c) heterogeneous (hetero) splitting. (a): Layer splitting divides the entire model layer by layer. (b): Stage splitting separates each stage layer by layer. (c): Hetero splitting partitions the whole model in different widths and depths depending on the available resources $R_i$ of client $i$.
  • ...and 20 more figures

Theorems & Definitions (7)

  • Theorem 3.1
  • Remark 3.2
  • Theorem 4.5
  • Theorem 4.6
  • Theorem 4.7
  • Lemma D.1
  • Lemma D.2