Towards a Better Theoretical Understanding of Independent Subnetwork Training

Egor Shulgin; Peter Richtárik

TL;DR

This work analyzes Independent Subnetwork Training (IST), a framework that combines data- and subnetwork-parallelism to train large networks by operating on sparse submodels. It provides a rigorous analysis of IST on a quadratic surrogate in both homogeneous and heterogeneous settings, revealing an irreducible bias in non-interpolation regimes and deriving non-asymptotic convergence bounds under flexible permutation-sketch preconditioning. The results show that, while IST can achieve descent rates similar to gradient methods under favorable preconditioning, heterogeneity introduces a persistent neighborhood around the optimum, with the neighborhood size governed by the bias and variance terms. Empirically, the study validates the theory and contrasts IST with standard distributed gradient methods, highlighting practical trade-offs in communication efficiency and convergence behavior for cross-device and federated learning contexts.

Abstract

Modern advancements in large-scale machine learning would be impossible without the paradigm of data-parallel distributed computing. Since distributed computing with large-scale models imparts excessive pressure on communication channels, significant recent research has been directed toward co-designing communication compression strategies and training algorithms with the goal of reducing communication costs. While pure data parallelism allows better data scaling, it suffers from poor model scaling properties. Indeed, compute nodes are severely limited by memory constraints, preventing further increases in model size. For this reason, the latest achievements in training giant neural network models also rely on some form of model parallelism. In this work, we take a closer theoretical look at Independent Subnetwork Training (IST), which is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication, and provide a precise analysis of its optimization performance on a quadratic model.

Paper Structure (42 sections, 9 theorems, 132 equations, 3 figures)

Introduction
The need for model parallelism
Summary of contributions
Formalism and setup
Issues with existing approaches
Simplifications taken
Results in the interpolation case
Homogeneous problem preconditioning
Heterogeneous sketch preconditioning
Irreducible bias in the general case
Bias of the method
Generic convergence analysis
Homogeneous case.
Comparison to previous works
Independent Subnetwork Training [yuan2022distributed].
...and 27 more sections

Key Result

Theorem 1

Consider the method (eq:SGD_generic) with estimator (eq:het_estimator) for a quadratic problem (eq:het_gen_quad_problem) with $\overline{\mathbf{L}} \succ 0$ and $\mathrm{b}_i \equiv 0$. Then if $\overline{\mathbf{W}} \coloneqq$ [...] and the step size is chosen as $0 < \gamma \leq \frac{1}{\theta}$, the iterates satisfy [...] and [...]
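To make the setting of Theorem 1 more concrete, the following is a minimal NumPy sketch of a sketched gradient iteration on a quadratic with $\mathrm{b}_i \equiv 0$. It is an illustration under stated assumptions, not the paper's exact method: it assumes local objectives $f_i(x) = \frac{1}{2} x^\top \mathbf{L}_i x$, a permutation sketch of the form $\mathbf{C}_i = n \cdot \mathrm{Diag}(1_{S_i})$ where the blocks $S_i$ partition a random permutation of the coordinates, and the update $x^{k+1} = x^k - \gamma \frac{1}{n} \sum_i \mathbf{C}_i \mathbf{L}_i \mathbf{C}_i x^k$. The function names (`perm_sketch_blocks`, `ist_step`, `run_ist`, `make_problem`) and the step size rule are hypothetical choices for this example; in particular, the step size is a conservative bound rather than the $1/\theta$ from the theorem.

```python
import numpy as np

def perm_sketch_blocks(d, n, rng):
    """Split a random permutation of range(d) into n disjoint index blocks."""
    perm = rng.permutation(d)
    return np.array_split(perm, n)

def ist_step(x, L_list, gamma, rng):
    """One sketched step: x - gamma * (1/n) * sum_i C_i L_i C_i x (assumed form)."""
    n = len(L_list)
    d = x.shape[0]
    blocks = perm_sketch_blocks(d, n, rng)
    g = np.zeros(d)
    for L_i, S in zip(L_list, blocks):
        # C_i x: keep only the coordinates in S, scaled by n (sparse submodel).
        sub = np.zeros(d)
        sub[S] = n * x[S]
        # Gradient of f_i(x) = 0.5 * x^T L_i x at the submodel (b_i = 0).
        grad_i = L_i @ sub
        # C_i grad_i: again keep only the coordinates in S, scaled by n.
        g[S] += n * grad_i[S]
    g /= n
    return x - gamma * g

def run_ist(L_list, x0, gamma, num_steps, seed=0):
    """Run the sketched iteration and record the Euclidean norm of each iterate."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    norms = [np.linalg.norm(x)]
    for _ in range(num_steps):
        x = ist_step(x, L_list, gamma, rng)
        norms.append(np.linalg.norm(x))
    return x, norms

def make_problem(d=12, n=3, seed=1):
    """Generate n symmetric positive definite matrices L_i (synthetic data)."""
    rng = np.random.default_rng(seed)
    L_list = []
    for _ in range(n):
        A = rng.standard_normal((d, d))
        L_list.append(A @ A.T / d + np.eye(d))
    return L_list

L_list = make_problem()
n = len(L_list)
# Conservative step size (an assumption of this sketch, not the theorem's 1/theta):
# the sketched matrix has largest eigenvalue at most n * max_i lambda_max(L_i).
gamma = 1.0 / (n * max(np.linalg.eigvalsh(L).max() for L in L_list))
x0 = np.ones(12)
x_final, norms = run_ist(L_list, x0, gamma, num_steps=200)
```

With $\mathrm{b}_i \equiv 0$ the minimizer is $x^\star = 0$, so `norms` tracks the distance to the optimum. In this interpolation setting each sketched step is a contraction for the chosen step size, which is why the iterates approach the optimum itself rather than a neighborhood of it.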

Figures (3)

Figure 1: Performance of simplified IST on quadratic problem for varying step size values.
Figure 2: Experimental study of IST on a neural network problem.
Figure 3: Schematic depiction of a neural network trained with IST across two nodes. Source: [yuan2022distributed].

Theorems & Definitions (19)

Definition 1: Unbiased compressor
Definition 2: Permutation sketch
Theorem 1
Remark 1
Theorem 2
Lemma 1: Fenchel–Young inequality
proof
Theorem 3
proof
Theorem 4
...and 9 more
