Structured and Fast Optimization: The Kronecker SGD Algorithm

Zhao Song, Song Yue

TL;DR

This work tackles the computational bottleneck of SGD for high-dimensional neural networks by exploiting structure in the input data. Assuming the inputs have Kronecker product structure, the authors design an asynchronous tree-based SGD whose per-iteration cost does not depend on the ambient dimension $d$; it depends only on the batch size, the number of samples $n$, and the hidden width $m$. They prove convergence under a data-dependent Gram matrix framework akin to NTK analyses and provide a complexity theorem: initialization costs $O(m n d^{(\omega-1)/2})$ and each iteration costs $O(S_{\mathrm{batch}}^2 \cdot o(m) \cdot n)$, where $\omega$ is the matrix multiplication exponent. The approach hinges on a Kronecker tensor trick and a forward/backward workflow built on the AsynchronousTree data structure, and it offers substantial efficiency gains for high-dimensional inputs when the Kronecker assumption holds. This gives a principled path to scalable training on structured data, where per-iteration work no longer grows with $d$.
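
To make the initialization bound more concrete, the exponent $(\omega-1)/2$ can be evaluated for particular values of $\omega$. The arithmetic below is only illustrative and is not taken from the paper: $\omega = 2$ is the best conceivable exponent, $\omega < 2.373$ is a known upper bound on the matrix multiplication exponent, and $d = 10^{6}$ is an arbitrary example value.

$$
\omega = 2 \;\Rightarrow\; d^{(\omega-1)/2} = d^{1/2},
\qquad
\omega < 2.373 \;\Rightarrow\; d^{(\omega-1)/2} < d^{0.6865}
$$

$$
d = 10^{6}: \qquad d^{1/2} = 10^{3}, \qquad d^{0.6865} = 10^{4.119} \approx 1.3 \times 10^{4}
$$

In both cases the factor multiplying $m n$ at initialization is far below $d = 10^{6}$, which is the factor a dense pass over all inputs would incur.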

Abstract

Stochastic gradient descent (SGD) is a fundamental part of optimization in modern machine learning. Meanwhile, deep learning architectures have shown outstanding performance in a wide range of fields, such as natural language processing, bioinformatics, and computer vision. Nevertheless, as the input dimension $d$ increases, these models encounter serious efficiency challenges. Previous studies show that the per-step computational cost scales linearly with the input size $d$. To mitigate this, our paper explores inherent patterns, such as Kronecker products, within the training examples. We consider input data points that can be represented as tensor products of lower-dimensional vectors. We introduce a novel stochastic optimization method where the computational load for every update scales sublinearly with $d$, assuming moderate structural properties of the inputs. We believe our research is the first work achieving this result, representing a significant step forward for efficient deep learning optimization. Our theoretical findings are supported by a formal theorem, demonstrating that the proposed algorithm can train a two-layer fully connected neural network with a per-iteration cost independent of $d$.
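
As a minimal sketch of what the tensor-product assumption buys, suppose an input has the form $x = a \otimes b$ with two short factors $a$ and $b$. Then the inner product with a weight vector $w$ satisfies $\langle w, a \otimes b \rangle = a^\top W b$, where $W$ is $w$ reshaped row-major into a matrix. The helper names `kron`, `inner_naive`, and `inner_structured` and the example values are hypothetical, chosen only for illustration; this is not the paper's algorithm.

```python
# Illustrative only: checks the identity <w, a (x) b> = a^T W b on a tiny example.
# All names and values here are assumptions, not taken from the paper.

def kron(a, b):
    """Kronecker product of two vectors, as a flat list of length len(a) * len(b)."""
    return [ai * bj for ai in a for bj in b]

def inner_naive(w, x):
    """Plain inner product over the full d-dimensional vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def inner_structured(w, a, b):
    """Same inner product using only the factors a and b.

    w is read as a row-major matrix W with W[i][j] = w[i * len(b) + j],
    so the result is a^T W b and x = kron(a, b) is never materialized.
    """
    q = len(b)
    return sum(
        a[i] * sum(w[i * q + j] * b[j] for j in range(q))
        for i in range(len(a))
    )

a = [1.0, 2.0]
b = [3.0, -1.0]
w = [0.5, 1.0, -2.0, 4.0]

x = kron(a, b)
naive = inner_naive(w, x)
structured = inner_structured(w, a, b)
print(x, naive, structured)
```

On its own this identity does not lower the cost of a single inner product, since $W$ still has $d$ entries. The relevance is that it turns many inner products over structured inputs into matrix products, which is where a fast matrix multiplication exponent $\omega$ can enter.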

Paper Structure

This paper contains 32 sections, 14 theorems, 78 equations, and 5 algorithms.

Key Result

Theorem 1.1

Given $n$ training samples $\{(x_i, y_i)\}_{i=1}^{n}$ such that for each $i\in[n]$, $x_i\in\mathbb{R}^d$ satisfies the Kronecker property, there exists a stochastic gradient descent algorithm that can train a two-layer neural network with shifted ReLU activation and $m$ neurons in the hidden layer such that

Theorems & Definitions (44)

  • Theorem 1.1: Informal version of Theorem \ref{thm:main_formal}
  • Definition 7.1: Data-dependent matrix $H$
  • Remark 7.2
  • Lemma 7.3: Lemma C.1 in syz21
  • Definition 7.4: Dynamic data-dependent matrix $H(t)$
  • Theorem 7.5
  • Definition 7.6: Fire set
  • Lemma 7.7: Lemma C.10 in syz21
  • Lemma 7.8
  • proof
  • ...and 34 more