Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD

Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji

TL;DR

The paper addresses why local aggregation helps convergence in Hierarchical SGD (H-SGD) under non-IID data. It introduces upward and downward divergences to analyze how intra- and inter-group aggregations affect optimization, and derives convergence bounds for two-level and multi-level H-SGD, including random groupings. Key contributions include a novel divergence framework, sandwich-like convergence bounds that situate H-SGD between single-level local SGD bounds, and empirical validation across standard datasets. The results provide principled guidance for selecting local/global update periods and grouping strategies to balance communication efficiency with fast convergence in multi-level distributed learning systems.

Abstract

Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregation. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce a new notion of "upward" and "downward" divergences. We then use it to conduct a novel analysis that yields a worst-case convergence upper bound for two-level H-SGD with non-IID data, a non-convex objective function, and stochastic gradients. By extending this result to the case with random grouping, we observe that the convergence upper bound of H-SGD lies between the upper bounds of two single-level local SGD settings, whose numbers of local iterations equal the local and global update periods of H-SGD, respectively. We refer to this as the "sandwich behavior". Furthermore, we extend our analytical approach based on "upward" and "downward" divergences to study convergence in the general case of H-SGD with more than two levels, where the "sandwich behavior" still holds. Our theoretical results provide key insights into why local aggregation can improve the convergence of H-SGD.
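The two-level aggregation schedule described in the abstract can be sketched on a toy problem. The following is a minimal illustration, not the paper's Algorithm 1: the function name `h_sgd`, its parameters, and the scalar quadratic objectives $f_i(x) = \frac{1}{2}(x - c_i)^2$ are all assumptions chosen for brevity, with different optima $c_i$ per worker standing in for non-IID data.

```python
import random


def h_sgd(centers, local_period, global_period, steps, lr=0.1, noise=0.0, seed=0):
    """Toy two-level H-SGD on scalar quadratics f_i(x) = 0.5 * (x - c_i)^2.

    centers: list of groups; each group is a list of per-worker optima c_i
             (hypothetical toy data, not from the paper).
    local_period: iterations between intra-group (local) aggregations.
    global_period: iterations between global aggregations; assumed here to be
                   a multiple of local_period.
    Returns the average model over all workers after `steps` iterations.
    """
    assert global_period % local_period == 0
    rng = random.Random(seed)
    # Every worker starts from the same model x = 0.
    models = [[0.0 for _ in group] for group in centers]
    for t in range(1, steps + 1):
        # One local SGD step per worker, using a noisy gradient of f_i.
        for g, group in enumerate(centers):
            for w, c in enumerate(group):
                grad = (models[g][w] - c) + rng.gauss(0.0, noise)
                models[g][w] -= lr * grad
        if t % global_period == 0:
            # Global aggregation: average the models of all workers.
            flat = [x for group in models for x in group]
            avg = sum(flat) / len(flat)
            models = [[avg for _ in group] for group in models]
        elif t % local_period == 0:
            # Local aggregation: average only within each group.
            for g, group in enumerate(models):
                avg = sum(group) / len(group)
                models[g] = [avg] * len(group)
    flat = [x for group in models for x in group]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    # Two groups of two workers whose optima differ (non-IID toy data).
    print(h_sgd([[0.0, 2.0], [4.0, 6.0]], local_period=2, global_period=4, steps=400))
```

On this toy problem the gradients are linear, so the averaged model follows the same trajectory for any choice of periods; the sketch shows only the order of local steps, local aggregation, and global aggregation, not the convergence gap that the paper's bounds quantify.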

Paper Structure

This paper contains 32 sections, 8 theorems, 82 equations, 12 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Consider the problem in (eq:objective). For any fixed worker grouping that satisfies Assumption (assumption:hf-sgd), if the learning rate in Algorithm 1 satisfies $\gamma < \frac{1}{2\sqrt{6}GL}$, then for any $T \ge 1$ the theorem's convergence upper bound holds, where $C = 40/3$.

Figures (12)

  • Figure 1: Three-level example with $N_1=N_2=2, N_3=3$.
  • Figure 2: Test accuracy vs. communication time ($N=2$).
  • Figure 3: Results with CIFAR-10. Test accuracy vs. local iterations. By default, $N=2$.
  • Figure E.1: Results with FEMNIST; (a), (b) training loss vs. communication time and iterations, (c) test accuracy vs. iterations.
  • Figure E.2: Results with CelebA; (a), (b) training loss vs. communication time and iterations, (c) test accuracy vs. iterations.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Theorem 1
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Theorem 2
  • Lemma 3
  • Theorem 3
  • Theorem D.1