Federated Learning over Hierarchical Wireless Networks: Training Latency Minimization via Submodel Partitioning

Wenzhi Fang, Dong-Jun Han, Christopher G. Brinton

TL;DR

This paper tackles the scalability and latency challenges of hierarchical federated learning on resource-constrained wireless networks by introducing HIST, which partitions the global model into per-round submodels trained by distinct cell groups. It provides convergence guarantees for non-convex loss under non-i.i.d. data, derives a latency-aware submodel partitioning strategy, and extends the framework with AirComp to further reduce edge aggregation latency. The authors validate HIST on fully connected and convolutional networks, showing substantial reductions in training time and communication cost while maintaining accuracy, with AirComp-HIST offering additional latency gains under realistic wireless conditions. The work advances practical FL in multi-layer networks and opens paths to applying submodel partitioning to transformer-based fine-tuning with LoRA in edge settings.

Abstract

Hierarchical federated learning (HFL) has demonstrated promising scalability advantages over traditional "star-topology" federated learning (FL). However, HFL still imposes significant computation, communication, and storage burdens on the edge, especially when training a large-scale model over resource-constrained wireless devices. In this paper, we propose hierarchical independent submodel training (HIST), a new FL methodology that aims to address these issues in hierarchical cloud-edge-client networks. The key idea behind HIST is to divide the global model into disjoint partitions (or submodels) in each round so that each group of clients (i.e., each cell) is responsible for training only one partition of the model. We characterize the convergence behavior of HIST under mild assumptions, showing the impact of several key attributes (e.g., submodel sizes, number of cells, edge and global aggregation frequencies) on the convergence rate and stationarity gap. Building upon the theoretical results, we propose a submodel partitioning strategy that minimizes the training latency subject to network resource availability and a target learning performance guarantee. We then demonstrate how HIST can be augmented with over-the-air computation (AirComp) to further enhance the efficiency of model aggregation over the edge cells. Through numerical evaluations, we verify that HIST saves training time and communication costs by wide margins while achieving accuracy comparable to conventional HFL. Moreover, our experiments demonstrate that AirComp-assisted HIST provides further improvements in training latency.
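
The core idea of disjoint per-round submodels can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a flat parameter vector, a uniformly random coordinate partition (the paper instead derives a latency-aware partitioning), and a toy local update. The names `partition_masks`, `hist_round`, and `local_update` are hypothetical.

```python
import numpy as np

def partition_masks(dim, num_cells, rng):
    """Split `dim` parameter indices into `num_cells` disjoint binary masks.

    Assumption: a uniformly random, near-equal-size split; the paper's
    latency-aware partitioning strategy is not modeled here.
    """
    perm = rng.permutation(dim)
    chunks = np.array_split(perm, num_cells)
    masks = np.zeros((num_cells, dim), dtype=np.int64)
    for j, idx in enumerate(chunks):
        masks[j, idx] = 1
    return masks

def hist_round(global_model, masks, local_update):
    """One global round: each cell updates only its own masked coordinates.

    `local_update(submodel, mask)` stands in for all in-cell training and
    edge aggregation (hypothetical callback).
    """
    new_model = np.zeros_like(global_model)
    for mask in masks:
        submodel = global_model * mask        # the cell sees only its partition
        updated = local_update(submodel, mask)
        new_model += updated * mask           # disjoint masks, so sum = stitch
    return new_model

rng = np.random.default_rng(0)
model = np.arange(10, dtype=float)
masks = partition_masks(model.size, num_cells=3, rng=rng)

# Toy local update: shift every coordinate owned by the cell by -0.1.
stepped = hist_round(model, masks, lambda w, m: w - 0.1 * m)
```

Because the masks are disjoint and sum to the all-ones vector, the cloud can reassemble the full model by adding the masked submodels together. Drawing fresh masks each round lets every cell eventually touch every parameter.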

Paper Structure

This paper contains 41 sections, 12 theorems, 95 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Suppose that Assumptions assump_lowerbound through assump_gradient_dissimilarity_client hold, $N \geq 2$, and the step size satisfies [condition omitted]. Then, for an arbitrary mask partitioning satisfying mask_IST in each iteration, the HIST algorithm under full client participation satisfies [bound omitted], where $\tilde{N} = \sum_{j=1}^N \frac{1}{n_j}$ and [definition omitted].

Figures (9)

  • Figure 1: Example comparison of partition strategies for a fully connected neural network with multiple hidden layers.
  • Figure 2: Overview of the proposed HIST algorithm. Each cell is responsible for training only a specific partition of the full model in each global round, with the specific submodel partitioning changing over each round.
  • Figure 3: Visualization of the AirComp-assisted HIST algorithm. The mask partitioning is solved once per global round, while the beamforming optimization is solved once per edge round.
  • Figure 4: The impact of the number of cells $N$ on the convergence performance of HIST.
  • Figure 5: Communication cost for achieving the testing accuracy of $80\%$ in each scheme.
  • ...and 4 more figures

Theorems & Definitions (22)

  • Remark 1
  • Theorem 1
  • Proof
  • Corollary 1
  • Proof
  • Remark 2
  • Theorem 2
  • Proof
  • Remark 3
  • Remark 4
  • ...and 12 more