Design and Optimization of Hierarchical Gradient Coding for Distributed Learning at Edge Devices
Weiheng Tang, Jingyi Li, Lin Chen, Xu Chen
TL;DR
The paper tackles straggler-induced delays in hierarchical edge-enabled distributed learning by deriving a fundamental load–redundancy trade-off $\frac{D}{K} \geq \frac{(s_e+1)(s_w+1)}{\sum_{i=1}^n m_i}$ and introducing a two-layer hierarchical gradient coding scheme that enables edge nodes and the master to collaboratively recover the full gradient. It then formulates and solves a runtime-optimization problem (JNCSS) to minimize per-iteration time in heterogeneous settings, providing a greedy algorithm with a performance bound. The proposed Hierarchical Gradient Coding (HGC) and HGC-JNCSS schemes show substantial runtime reductions (up to around 60% in some scenarios) and reduced master communication load, outperforming conventional baselines and uncoded schemes. Overall, the framework advances practical, scalable, and heterogeneity-aware gradient coding for edge-assisted distributed learning.
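As a rough numerical illustration of the trade-off above, the sketch below evaluates the right-hand side of the bound for a small configuration. The symbols are not defined in this section, so the reading used here is an assumption: $K$ is the total number of data partitions, $D$ the number of partitions assigned to each worker, $n$ the number of edge nodes, $m_i$ the number of workers attached to edge node $i$, and $s_e$ and $s_w$ the number of straggling edge nodes and straggling workers per edge node that are tolerated. The function names are hypothetical and do not come from the paper.

```python
from fractions import Fraction
from math import ceil


def min_load_fraction(s_e, s_w, m):
    """Right-hand side of the bound D/K >= (s_e + 1)(s_w + 1) / sum(m).

    Assumed meaning (not stated in this section): m[i] is the number of
    workers under edge node i, s_e is the number of tolerated edge
    stragglers, s_w the number of tolerated worker stragglers per edge.
    """
    n = len(m)
    if n == 0 or min(m) < 1:
        raise ValueError("each edge node needs at least one worker")
    if not 0 <= s_e < n:
        raise ValueError("s_e must satisfy 0 <= s_e < number of edge nodes")
    if not 0 <= s_w < min(m):
        raise ValueError("s_w must satisfy 0 <= s_w < min(m)")
    # Exact rational arithmetic avoids floating-point rounding in the bound.
    return Fraction((s_e + 1) * (s_w + 1), sum(m))


def min_partitions_per_worker(K, s_e, s_w, m):
    """Smallest integer D that satisfies the bound for K partitions."""
    return ceil(K * min_load_fraction(s_e, s_w, m))


if __name__ == "__main__":
    m = [3, 3]  # two edge nodes with three workers each
    print(min_load_fraction(1, 1, m))             # 2/3
    print(min_partitions_per_worker(6, 1, 1, m))  # 4
```

Under these assumptions, tolerating one straggling edge node and one straggling worker per edge node with six workers in total requires each worker to process at least two thirds of the partitions, whereas the uncoded case ($s_e = s_w = 0$) needs only one sixth.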
Abstract
Edge computing has recently emerged as a promising paradigm to boost the performance of distributed learning by leveraging the distributed resources at edge nodes. Architecturally, edge nodes introduce an intermediate layer between the master and the workers of the original distributed learning system, potentially leading to a more severe straggler effect. Recently, coding-theoretic approaches have been proposed for straggler mitigation in distributed learning, but the majority focus on the conventional worker-master architecture. In this paper, along a different line, we investigate the problem of mitigating the straggler effect in hierarchical distributed learning systems with an additional layer composed of edge nodes. Technically, we first derive the fundamental trade-off between the computational load of workers and the straggler tolerance. Then, we propose a hierarchical gradient coding framework that achieves the derived trade-off and provides better straggler mitigation. To further improve the performance of our framework in heterogeneous scenarios, we formulate an optimization problem with the objective of minimizing the expected execution time of each iteration of the learning process. We develop an efficient algorithm that solves the problem and outputs the optimal strategy. Extensive simulation results demonstrate the superiority of our schemes over conventional solutions.
