Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations

Alexander Tyurin, Kaja Gruntkowska, Peter Richtárik

TL;DR

We establish a lower bound for smooth nonconvex finite-sum problems in the asynchronous setup, which provides a fundamental time complexity limit and demonstrates the optimality of Freya PAGE in the large-scale regime.

Abstract

In practical distributed systems, workers are typically not homogeneous and, due to differences in hardware configurations and network conditions, can have highly varying processing times. We consider smooth nonconvex finite-sum (empirical risk minimization) problems in this setup and introduce a new parallel method, Freya PAGE, designed to handle arbitrarily heterogeneous and asynchronous computations. By being robust to "stragglers" and adaptively ignoring slow computations, Freya PAGE offers significantly improved time complexity guarantees compared to all previous methods, including Asynchronous SGD, Rennala SGD, SPIDER, and PAGE, while requiring weaker assumptions. The algorithm relies on novel generic stochastic gradient collection strategies with theoretical guarantees that can be of interest on their own and may be used in the design of future optimization methods. Furthermore, we establish a lower bound for smooth nonconvex finite-sum problems in the asynchronous setup, providing a fundamental time complexity limit. This lower bound is tight and demonstrates the optimality of Freya PAGE in the large-scale regime, i.e., when $\sqrt{m} \geq n$, where $n$ is the number of workers and $m$ is the number of data samples.

Paper Structure

This paper contains 36 sections, 32 theorems, 165 equations, 3 figures, 2 tables, 10 algorithms.

Key Result

Theorem 1

The expected time needed by Algorithm algorithm:ga_asynch_main_new to calculate $g = \frac{1}{m} \sum\limits_{i=1}^m \nabla f_i$ is at most seconds.
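
To give some intuition for why collecting $m$ gradients from heterogeneous workers need not be slowed down by stragglers, here is a minimal simulation sketch. It is not the paper's algorithm: it assumes a toy model in which worker $i$ needs a fixed, deterministic time `taus[i]` per gradient, all workers start at time 0, and any worker may compute any of the $m$ gradients. The function name `time_to_collect` and the parameter `taus` are hypothetical, introduced only for this illustration.

```python
import heapq


def time_to_collect(taus, m):
    """Earliest time at which workers with fixed per-gradient times `taus`
    have jointly finished `m` gradient computations (toy model, assumed).

    Each worker starts a new computation as soon as it finishes the
    previous one; we simply wait for the first `m` completions overall.
    """
    if m <= 0:
        return 0
    # Min-heap of (finish time of the worker's current computation, worker id).
    heap = [(tau, i) for i, tau in enumerate(taus)]
    heapq.heapify(heap)
    finish = 0
    for _ in range(m):
        # The next completion overall comes from the worker that finishes first.
        finish, i = heapq.heappop(heap)
        # That worker immediately starts its next computation.
        heapq.heappush(heap, (finish + taus[i], i))
    return finish


def time_even_split(taus, m):
    """Baseline: every worker must compute an equal share of the m gradients
    (assumes m is divisible by the number of workers)."""
    share = m // len(taus)
    return max(tau * share for tau in taus)
```

In this toy model, with `taus = [1, 2, 100]` and `m = 6`, `time_to_collect` returns 4 because the two fast workers finish all six gradients before the slow worker finishes its first, whereas `time_even_split` returns 200 because it waits for the straggler.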

Figures (3)

  • Figure 1: Experiments with nonconvex quadratic optimization tasks. We plot function suboptimality against elapsed time.
  • Figure 2: Experiments with the logistic regression problem on the MNIST dataset.
  • Figure : ComputeGradient($x$)

Theorems & Definitions (71)

  • Theorem 1
  • Theorem 2
  • Definition 3: Equilibrium time
  • Theorem 4: Iteration complexity
  • Theorem 5: Time complexity with free parameters $p$ and $S$
  • Theorem 6: Main result
  • Theorem 7: Main result in the large-scale regime
  • Theorem 8: Main result in the large-scale regime using the ratio $L_{\pm}/L$
  • Example 1
  • Example 2
  • ...and 61 more