Structured and Fast Optimization: The Kronecker SGD Algorithm

Zhao Song; Song Yue

TL;DR

This work tackles the computational bottleneck of SGD for high‑dimensional neural networks by exploiting structure in the input data. By assuming Kronecker product structure on the inputs, the authors design an SGD algorithm built on an asynchronous tree whose per‑iteration cost is independent of the ambient dimension $d$ (up to a factor depending on the hidden width $m$). They prove convergence under a data‑dependent Gram matrix framework akin to NTK analyses and provide a complexity theorem: initialization costs scale as $O(m n d^{(\omega-1)/2})$ while each iteration costs $O(S_{\mathrm{batch}}^2 \cdot o(m) \cdot n)$, with $\omega$ the matrix multiplication exponent. The approach hinges on a Kronecker tensor trick and a forward/backward workflow built on the AsynchronousTree data structure, offering substantial efficiency gains for high‑dimensional inputs when the Kronecker assumption holds. This provides a principled path toward scalable training of neural networks on structured data, where per‑iteration work no longer grows with $d$.

Abstract

Stochastic gradient descent (SGD) is a fundamental optimization method in modern machine learning. Meanwhile, deep learning architectures have shown outstanding performance in a wide range of fields, such as natural language processing, bioinformatics, and computer vision. Nevertheless, as the input dimension $d$ increases, training these models becomes computationally expensive: previous studies show that the per-step cost scales linearly with $d$. To mitigate this, our paper exploits inherent structure in the training examples, specifically Kronecker products. We consider input data points that can be represented as tensor products of lower-dimensional vectors. We introduce a novel stochastic optimization method whose per-update cost scales sublinearly with $d$, assuming moderate structural properties of the inputs. To the best of our knowledge, this is the first work to achieve this result, and it represents a significant step forward for efficient deep learning optimization. Our theoretical findings are supported by a formal theorem showing that the proposed algorithm can train a two-layer fully connected neural network with a per-iteration cost independent of $d$.
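To make the role of the tensor-product assumption concrete, here is a minimal sketch of the standard mixed-product identity $\langle a_1 \otimes b_1, a_2 \otimes b_2 \rangle = \langle a_1, a_2 \rangle \langle b_1, b_2 \rangle$. It is not the paper's algorithm; it only illustrates why an inner product between two Kronecker-structured inputs of dimension $d = pq$ can be computed from the factors in $O(p + q)$ time instead of $O(d)$. The helper names `kron`, `dot`, and `kron_dot` and the factor sizes are assumptions introduced for this illustration.

```python
import random


def kron(a, b):
    """Explicit Kronecker product of two vectors; the result has len(a) * len(b) entries."""
    return [ai * bj for ai in a for bj in b]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def kron_dot(a1, b1, a2, b2):
    """Inner product <a1 (x) b1, a2 (x) b2> computed from the factors only.

    Uses the mixed-product identity, so the d-dimensional vectors are never formed.
    """
    return dot(a1, a2) * dot(b1, b2)


# Small integer example: both routes give the same value.
naive = dot(kron([1, 2], [3, 4]), kron([5, 6], [7, 8]))  # works in dimension d = 4
fast = kron_dot([1, 2], [3, 4], [5, 6], [7, 8])          # works in dimensions 2 and 2
print(naive, fast)  # both print 901

# Random check with larger (hypothetical) factor sizes p = 8, q = 16, so d = 128.
random.seed(0)
p, q = 8, 16
a1 = [random.gauss(0, 1) for _ in range(p)]
b1 = [random.gauss(0, 1) for _ in range(q)]
a2 = [random.gauss(0, 1) for _ in range(p)]
b2 = [random.gauss(0, 1) for _ in range(q)]
slow = dot(kron(a1, b1), kron(a2, b2))
quick = kron_dot(a1, b1, a2, b2)
assert abs(slow - quick) <= 1e-9 * max(1.0, abs(slow))
```

With balanced factors ($p = q = \sqrt{d}$), the factored route touches about $2\sqrt{d}$ numbers per inner product, which is the kind of saving a structured-input method can build on.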

Paper Structure (32 sections, 14 theorems, 78 equations, 5 algorithms)

Introduction
Our Results
Related Work
Kernel Matrix.
Roadmap
Notation
Problem Formulation
Gradient Descent (GD).
Stochastic Gradient Descent (SGD).
Asynchronous Tree and Kronecker Structured Data
Efficient SGD Algorithm
Convergence and Complexity Analysis
Convergence Analysis
Preliminary for Complexity Analysis
Properties of Kronecker Structure and Related Computation Facts
...and 17 more sections

Key Result

Theorem 1.1

Given $n$ training samples $\{(x_i, y_i)\}_{i=1}^{n}$ such that, for each $i\in[n]$, $x_i\in\mathbb{R}^d$ satisfies the Kronecker property, there exists a stochastic gradient descent algorithm that can train a two-layer shifted-ReLU-activated neural network with $m$ neurons in the hidden layer such that

Theorems & Definitions (44)

Theorem 1.1: Informal version of Theorem \ref{thm:main_formal}
Definition 7.1: Data-dependent matrix $H$
Remark 7.2
Lemma 7.3: Lemma C.1 in syz21
Definition 7.4: Dynamic data-dependent matrix $H(t)$
Theorem 7.5
Definition 7.6: Fire set
Lemma 7.7: Lemma C.10 in syz21
Lemma 7.8
proof
...and 34 more
