Training Overparametrized Neural Networks in Sublinear Time

Yichuan Deng, Hang Hu, Zhao Song, Omri Weinstein, Danyang Zhuo

TL;DR

The paper tackles the prohibitive cost of training massively overparametrized neural networks by introducing a binary-search-tree representation of DNNs and a training algorithm whose per-iteration cost is sublinear in the network width $m$. By exploiting Jacobian sparsity under a carefully chosen shift parameter and showing that the nonzero structure of the Jacobian is stable, the authors design a threshold-search data structure and a fast Gauss-Newton-style training pipeline with amortized per-iteration cost $\widetilde{O}(m^{1-\alpha} n d + n^3)$, where $\alpha \in (0.01,1)$, while preserving fast convergence. The approach combines sketching, iterative regression, and implicit weight maintenance, and gives rigorous convergence guarantees under data-separability assumptions via a shifted NTK. The result applies to large two-layer networks; the authors suggest that the tree-based view could extend to broader architectures and activation functions.
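The sparsity that the summary above relies on can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes Gaussian-initialized weights $w_r \sim \mathcal{N}(0, I_d)$, a unit-norm input, and a hypothetical shift $b = \sqrt{0.4 \ln m}$ chosen only for illustration. A neuron "fires" when $\langle w_r, x \rangle > b$, and only firing neurons contribute nonzero Jacobian entries.

```python
import numpy as np

def active_fraction(m, d, b, seed=0):
    """Fraction of neurons r with <w_r, x> > b for Gaussian w_r and a unit-norm x."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))   # rows are the weights w_r ~ N(0, I_d)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)            # unit-norm input, so <w_r, x> ~ N(0, 1)
    return float(np.mean(W @ x > b))

m, d = 20_000, 16                     # hypothetical sizes
b = np.sqrt(0.4 * np.log(m))          # hypothetical shift, chosen for illustration
frac_shifted = active_fraction(m, d, b)
frac_plain = active_fraction(m, d, 0.0)   # unshifted ReLU: about half the neurons fire
tail_bound = np.exp(-b**2 / 2)        # Gaussian tail bound: P[N(0,1) > b] <= exp(-b^2/2)
print(frac_shifted, frac_plain, tail_bound)
```

Under these assumptions the tail bound equals $m^{-0.2}$, so the expected number of firing neurons per input is at most $m^{0.8}$, which is sublinear in $m$.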

Abstract

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, stochastic gradient descent (SGD) has a prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with a much faster convergence rate, albeit with higher cost-per-iteration. For a typical neural network with $m=\mathrm{poly}(n)$ parameters and an input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time in the same overparametrized regime, where $\alpha \in (0.01,1)$ is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of deep neural networks (DNNs).
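One way to picture the tree view is a complete binary tree whose leaves hold the inner products $\langle w_r, x \rangle$ for one datapoint $x$ and whose internal nodes hold the maximum of their children. The sketch below is an illustration under that assumption and is not taken from the paper; the class name `ThresholdTree` and its methods are hypothetical. A query reports every neuron above a threshold $b$ while skipping subtrees whose maximum is at most $b$, and changing one weight touches only one root-to-leaf path.

```python
class ThresholdTree:
    """Complete binary tree over m values; each internal node stores the max of its children."""

    def __init__(self, values):
        self.m = len(values)
        self.size = 1
        while self.size < self.m:
            self.size *= 2                      # pad the leaf level to a power of two
        self.tree = [float("-inf")] * (2 * self.size)
        for r, v in enumerate(values):
            self.tree[self.size + r] = v        # leaf for neuron r
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = max(self.tree[2 * node], self.tree[2 * node + 1])

    def update(self, r, value):
        """Change leaf r and repair the maxima on its root path: O(log m)."""
        node = self.size + r
        self.tree[node] = value
        node //= 2
        while node >= 1:
            self.tree[node] = max(self.tree[2 * node], self.tree[2 * node + 1])
            node //= 2

    def query(self, b):
        """Return all indices r whose value exceeds b, pruning subtrees whose max is <= b."""
        out, stack = [], [1]
        while stack:
            node = stack.pop()
            if self.tree[node] <= b:
                continue                        # nothing below this node can exceed b
            if node >= self.size:
                out.append(node - self.size)    # reached a leaf
            else:
                stack.extend((2 * node + 1, 2 * node))  # visit the left child first
        return out


inner_products = [0.3, 2.5, -1.0, 1.9, 0.1, 3.2]   # hypothetical values of <w_r, x>
tree = ThresholdTree(inner_products)
fired = tree.query(1.5)          # neurons with <w_r, x> > 1.5
tree.update(1, 0.0)              # neuron 1's weight changed and its inner product dropped
fired_after = tree.query(1.5)
print(fired, fired_after)
```

With $k$ neurons above the threshold, the query visits $O(k \log m)$ nodes rather than all $m$ leaves, which is the kind of output-sensitive cost a sublinear per-iteration bound needs.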

Paper Structure

This paper contains 51 sections, 39 theorems, 120 equations, 4 algorithms.

Key Result

Theorem 1.1

Suppose there are $n$ training data points in $\mathbb{R}^d$. Let $f_{m,n}$ be a sufficiently wide two-layer $\mathsf{ReLU}$ $\mathsf{NN}$ with $m = \mathrm{poly}(n)$ neurons. Let $\alpha \in (0.01,1)$ be some fixed constant. Let $\epsilon \in (0,0.1)$ be an accuracy parameter. Let […] in amortized cost-per-iteration ($\mathsf{CPI}$). The overall running time (including initialization) […]
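As a rough illustration of the gap between the two per-iteration costs quoted in the abstract, take the hypothetical values $n = 10^2$, $d = 10^2$, $m = n^4 = 10^8$, and $\alpha = 1/2$, ignoring constants and polylogarithmic factors:

$$
mnd + n^3 = 10^{8} \cdot 10^{2} \cdot 10^{2} + 10^{6} = 10^{12} + 10^{6},
\qquad
m^{1-\alpha} n d + n^3 = 10^{4} \cdot 10^{2} \cdot 10^{2} + 10^{6} = 10^{8} + 10^{6}.
$$

Under these assumed values the width-dependent term shrinks by a factor of $m^{\alpha} = 10^{4}$, while the $n^3$ term is unchanged.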

Theorems & Definitions (72)

  • Theorem 1.1: Main Result, Informal
  • Remark 1.2
  • Definition 2.1: Oblivious subspace embedding (OSE) [s06]
  • Lemma 2.2
  • Definition 2.3: 2-layer $\mathsf{ReLU}$ activated neural network
  • Definition 2.4: Loss function
  • Definition 2.5: Prediction function
  • Definition 2.6: Jacobi matrix and related definitions
  • Definition 2.7: Gram matrix
  • Remark 2.8
  • ...and 62 more