Tight Time Complexities in Parallel Stochastic Optimization with Arbitrary Computation Dynamics

Alexander Tyurin

TL;DR

This work introduces a universal computation model that captures arbitrary, time-varying computation dynamics across workers in distributed stochastic optimization. It derives tight time complexity lower bounds for both the homogeneous and heterogeneous settings and proves that Rennala SGD and Malenia SGD achieve these bounds up to constant factors, which establishes their optimality in their respective regimes. The results extend previous time-based analyses to a broad class of realistic HPC environments and give explicit time formulas in several scenarios, including the fixed computation model and nonlinear trends. The framework unifies and extends existing lower and upper bounds, offering a principled basis for designing asynchronous optimization systems that remain robust to outages and heterogeneity.

Abstract

In distributed stochastic optimization, where parallel and asynchronous methods are employed, we establish optimal time complexities under virtually any computation behavior of workers/devices/CPUs/GPUs, capturing potential disconnections due to hardware and network delays, time-varying computation powers, and any possible fluctuations and trends of computation speeds. These real-world scenarios are formalized by our new universal computation model. Leveraging this model and new proof techniques, we discover tight lower bounds that apply to virtually all synchronous and asynchronous methods, including Minibatch SGD, Asynchronous SGD (Recht et al., 2011), and Picky SGD (Cohen et al., 2021). We show that these lower bounds, up to constant factors, are matched by the optimal Rennala SGD and Malenia SGD methods (Tyurin & Richtárik, 2023).

Paper Structure

This paper contains 28 sections, 29 theorems, 228 equations, 3 figures, and 6 algorithms.

Introduction
Problem setup
Related Work
Contributions
Universal Computation Model
Preliminaries
Homogeneous Setup
Optimal algorithm
Heterogeneous Setup
Optimal method
Conclusion
Proof Techniques
Proof techniques in the homogeneous setup
Proof techniques in the heterogeneous setup
Proof of Theorem \ref{cor:max_time_params}
...and 13 more sections

Key Result

Theorem 3.3

For all $i \in [n]$, $V_i$ is continuous and non-decreasing on $\mathbb{R}_{+}$ if $v_i$ is non-negative and continuous almost everywhere (Assumption \ref{ass:speed}).
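The statement can be checked numerically on a toy example. The sketch below assumes that $V_i(t) = \int_0^t v_i(\tau)\, d\tau$ is the cumulative computation of worker $i$ (this definition is not reproduced in the text above), and it uses a hypothetical power function `v_outage` that switches between $2$ and $0$ every second, in the spirit of the outage scenario. The helper `cumulative_work` and the grid resolution are illustrative choices, not part of the paper.

```python
import numpy as np

def cumulative_work(v, t_grid):
    """Approximate V(t) = integral of v from 0 to t on t_grid (trapezoidal rule)."""
    vals = np.asarray([v(t) for t in t_grid], dtype=float)
    steps = np.diff(t_grid)
    increments = 0.5 * (vals[1:] + vals[:-1]) * steps
    return np.concatenate(([0.0], np.cumsum(increments)))

# Hypothetical computation power: 2 units/sec during even seconds, 0 during odd ones.
# It is non-negative and continuous except at integer times (a set of measure zero).
def v_outage(t):
    return 2.0 if int(t) % 2 == 0 else 0.0

t_grid = np.linspace(0.0, 10.0, 10001)
V = cumulative_work(v_outage, t_grid)

# V never decreases because every increment is an average of non-negative values.
is_non_decreasing = bool(np.all(np.diff(V) >= 0.0))
# V has no jumps: each increment is bounded by max(v) * step.
max_increment = float(np.max(np.diff(V)))
```

Even though `v_outage` jumps, the accumulated work `V` grows in increments no larger than the grid step times the peak power, which is the discrete counterpart of continuity.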

Figures (3)

Figure 1: Fixed Computation Model: The previous computation paradigm (Mishchenko et al., 2022) assumes that the performances/powers of the workers remain constant over time. Tyurin & Richtárik (2023) established the optimal time complexities (\ref{eq:NmWzjXHGl}) and (\ref{eq:opt_heter_fixed}) for this paradigm.
Figure 2: Universal Computation Model: A new computation paradigm that captures virtually all possible computation scenarios. The three subplots present illustrative and non-exhaustive examples of irregular $\{v_i\}$ (Fig. \ref{fig:1}), periodic noisy powers $\{v_i\}$ (Fig. \ref{fig:2}), and random outages of the workers, where $v_i$ equals $0$ periodically (Fig. \ref{fig:3}). For all possible scenarios, we establish optimal time complexities (see Theorems \ref{theorem:random_lower_bound}, \ref{cor:max_time_params}, \ref{theorem:random_lower_bound_heter_random}, and \ref{cor:max_time_params_heter}). It is possible to get interpretable and explicit formulas for the optimal time complexities in some scenarios (see Examples \ref{ex:simple}, \ref{ex:simple_compl}, \ref{eq:example_share}, \ref{eq:simple_heter}, and \ref{eq:example_share_heter}). However, for Fig. \ref{fig:1}, Fig. \ref{fig:2}, and Fig. \ref{fig:3}, it is arguably intractable to find $\bar{t}_{\left\lceil L \Delta / \varepsilon\right\rceil}$ analytically. Instead, we can easily do it numerically in Fig. \ref{fig:1} and get the optimal time complexities $6.57$ and $13.02$ sec with $L \Delta / \varepsilon = 10$ and $\sigma^2 / \varepsilon = 100$ in the homogeneous and heterogeneous settings, respectively (Fig. \ref{fig:2}: $2.34$ and $2.53$ sec; Fig. \ref{fig:3}: $77.04$ and $84.62$ sec).
Figure: Rennala SGD
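The caption of Figure 2 mentions finding $\bar{t}_{\lceil L \Delta / \varepsilon \rceil}$ numerically. A minimal sketch of such a computation follows, under a loudly stated assumption: it takes $\bar{t}_k$ to be the earliest time after $\bar{t}_{k-1}$ at which the workers have jointly completed a batch of `batch_size` stochastic gradients, with worker $i$ completing $\lfloor V_i(t) - V_i(\bar{t}_{k-1}) \rfloor$ of them. The exact recursion and constants are given in the paper's theorems and are not reproduced in the text above, so the function `batch_times`, its parameters, and the two-worker example are hypothetical illustrations rather than the paper's procedure.

```python
import math

def batch_times(V_list, batch_size, num_rounds, dt=1e-3, t_max=1e4):
    """Grid search for times t_1 < t_2 < ... (assumed form of the recursion).

    V_list: cumulative-work functions V_i(t), one per worker.
    Each round ends at the first grid time where the workers together have
    finished at least batch_size whole gradients since the previous round.
    """
    times = []
    prev = 0.0
    k = 0  # integer grid index, so t = k * dt avoids accumulated drift
    for _ in range(num_rounds):
        start = [V(prev) for V in V_list]
        while True:
            k += 1
            t = k * dt
            if t > t_max:
                raise RuntimeError("batch not completed before t_max")
            done = sum(math.floor(V(t) - s) for V, s in zip(V_list, start))
            if done >= batch_size:
                break
        times.append(t)
        prev = t
    return times

# Fixed computation model as a sanity check (assumed form V_i(t) = t / tau_i):
# worker 1 needs 1 sec per gradient, worker 2 needs 2 sec per gradient.
taus = [1.0, 2.0]
V_fixed = [lambda t, tau=tau: t / tau for tau in taus]
rounds = batch_times(V_fixed, batch_size=3, num_rounds=2)
```

With these two workers, a batch of three gradients is complete after about two seconds (two from the fast worker, one from the slow one), so the grid search should return times close to 2 and 4, up to the grid resolution `dt`.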

Theorems & Definitions (60)

Example 3.2: Fixed Computation Model
Theorem 3.3 (see, e.g., Bartle & Sherbert, 2000)
Definition 4.1: Algorithm Class $\mathcal{A}_{\textnormal{zr}}$
Theorem: Informal Formulation of Theorem \ref{theorem:random_lower_bound}
Theorem 5.1
Theorem 5.2
Theorem 5.3
Example 5.3
Example 5.3
Example 5.3
...and 50 more
