
Design and Optimization of Hierarchical Gradient Coding for Distributed Learning at Edge Devices

Weiheng Tang, Jingyi Li, Lin Chen, Xu Chen

TL;DR

The paper tackles straggler-induced delays in hierarchical edge-enabled distributed learning by deriving a fundamental load–redundancy trade-off $\frac{D}{K} \geq \frac{(s_e+1)(s_w+1)}{\sum_{i=1}^n m_i}$ and introducing a two-layer hierarchical gradient coding scheme that enables edge nodes and the master to collaboratively recover the full gradient. It then formulates and solves a runtime-optimization problem (JNCSS) to minimize per-iteration time in heterogeneous settings, providing a greedy algorithm with a performance bound. The proposed Hierarchical Gradient Coding (HGC) and HGC-JNCSS schemes show substantial runtime reductions (up to around 60% in some scenarios) and reduced master communication load, outperforming conventional baselines and uncoded schemes. Overall, the framework advances practical, scalable, and heterogeneity-aware gradient coding for edge-assisted distributed learning.
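To make the trade-off concrete, the short sketch below evaluates the lower bound $\frac{D}{K} \geq \frac{(s_e+1)(s_w+1)}{\sum_{i=1}^n m_i}$ for a small configuration. The function names and the choice of example (3 edge nodes with 3 workers each, $K=9$ sub-datasets, tolerating one edge straggler and one worker straggler) are illustrative assumptions, not code from the paper; the sketch only computes the bound and does not construct the coding scheme itself.

```python
from fractions import Fraction
from math import ceil


def min_load_fraction(s_e, s_w, workers_per_edge):
    """Lower bound on D/K: (s_e + 1)(s_w + 1) / sum_i m_i.

    workers_per_edge[i] is m_i, the number of workers under edge node i.
    (Hypothetical helper; the name is not from the paper.)
    """
    n = len(workers_per_edge)
    m = min(workers_per_edge)
    # Theorem 1 restricts s_e to [0, n) and s_w to [0, m).
    if not (0 <= s_e < n and 0 <= s_w < m):
        raise ValueError("need 0 <= s_e < n and 0 <= s_w < min_i m_i")
    return Fraction((s_e + 1) * (s_w + 1), sum(workers_per_edge))


def min_subdatasets_per_worker(s_e, s_w, workers_per_edge, K):
    """Smallest integer D that satisfies the bound for K sub-datasets."""
    return ceil(min_load_fraction(s_e, s_w, workers_per_edge) * K)


if __name__ == "__main__":
    layout = [3, 3, 3]  # assumed example: 3 edge nodes, 3 workers each
    K = 9               # 9 disjoint sub-datasets
    print(min_load_fraction(1, 1, layout))              # bound on D/K
    print(min_subdatasets_per_worker(1, 1, layout, K))  # bound on D
    print(min_subdatasets_per_worker(0, 0, layout, K))  # no redundancy
```

In this assumed setting the bound gives $D/K \geq 4/9$, so each worker must train at least 4 of the 9 sub-datasets, compared with 1 when no stragglers are tolerated.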

Abstract

Edge computing has recently emerged as a promising paradigm to boost the performance of distributed learning by leveraging the distributed resources at edge nodes. Architecturally, the introduction of edge nodes adds an intermediate layer between the master and the workers of the original distributed learning system, potentially leading to a more severe straggler effect. Recently, coding-theoretic approaches have been proposed for straggler mitigation in distributed learning, but the majority focus on the conventional worker-master architecture. In this paper, along a different line, we investigate the problem of mitigating the straggler effect in hierarchical distributed learning systems with an additional layer composed of edge nodes. Technically, we first derive the fundamental trade-off between the computational load of the workers and the straggler tolerance. Then, we propose a hierarchical gradient coding framework, which provides better straggler mitigation, to achieve the derived computational trade-off. To further improve the performance of our framework in heterogeneous scenarios, we formulate an optimization problem with the objective of minimizing the expected execution time of each iteration in the learning process. We develop an efficient algorithm that solves the problem and outputs the optimal strategy. Extensive simulation results demonstrate the superiority of our schemes compared with conventional solutions.

Paper Structure (19 sections, 6 theorems, 71 equations, 8 figures, 2 tables, 2 algorithms)

Key Result

Theorem 1

Given a hierarchical distributed learning system with $n$ edge nodes, the $i$-th edge node $E_{i}$ connects to $m_{i}$ workers and $m=\min_{i} m_{i}$. Every worker will train $D$ of $K$ disjoint sub-datasets. To tolerate any $s_{e}\in[0\colon n)$ edge stragglers and any $s_{w}\in[0\colon m)$ straggl

Figures (8)

  • Figure 1: Illustration of a hierarchical distributed learning system.
  • Figure 2: Illustration of the stragglers in workers and edge nodes.
  • Figure 3: A hierarchical distributed system with $3$ edge nodes $E_{1},E_{2},E_{3}$, each connected to its own $3$ workers and all interacting with the same master. The data are divided into $9$ disjoint sub-datasets.
  • Figure 4: A hierarchical distributed system with straggling edge node $E_{3}$, worker $W_{(1,3)}$, and worker $W_{(2,3)}$.
  • Figure 5: Test accuracy curves with respect to training iterations for different datasets and data non-IID levels.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Corollary 2
  • Theorem 2
  • Theorem 3
  • Lemma 1