Structured and Fast Optimization: The Kronecker SGD Algorithm
Zhao Song, Song Yue
TL;DR
This work tackles the computational bottleneck of SGD for neural networks with high-dimensional inputs by exploiting structure in the input data. Assuming the inputs have Kronecker product structure, the authors design an asynchronous tree-based SGD whose per-iteration cost is effectively independent of the ambient dimension $d$ (up to a factor depending on the hidden width $m$). They prove convergence under a data-dependent Gram matrix framework akin to NTK analyses and provide a complexity theorem: initialization costs $O(m n d^{(\omega-1)/2})$, while each iteration costs $O(S_{\mathrm{batch}}^2 \cdot o(m) \cdot n)$, where $\omega$ is the matrix multiplication exponent. The approach hinges on a Kronecker tensor trick and a forward/backward workflow built on the AsynchronousTree data structure, and it offers substantial efficiency gains for high-dimensional inputs when the Kronecker assumption holds. The result is a principled path toward scalable training on structured data, where per-iteration work no longer grows with $d$.
Abstract
Stochastic gradient descent (SGD) is a fundamental optimization tool in modern machine learning. Meanwhile, deep learning architectures have shown outstanding performance in a wide range of fields, such as natural language processing, bioinformatics, and computer vision. Nevertheless, as the input dimension $d$ increases, these models encounter serious efficiency challenges. Previous studies show that the per-step computational cost scales linearly with $d$. To mitigate this, our paper exploits inherent structure in the training examples, such as Kronecker products. We consider input data points that can be represented as tensor products of lower-dimensional vectors. We introduce a novel stochastic optimization method whose computational cost per update scales sublinearly with $d$, under mild structural assumptions on the inputs. To our knowledge, this is the first work to achieve this result, and it represents a significant step forward for efficient deep learning optimization. Our theoretical findings are supported by a formal theorem showing that the proposed algorithm can train a two-layer fully connected neural network with a per-iteration cost independent of $d$.
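To make the structural assumption concrete, the following is a minimal sketch of the standard Kronecker inner-product identity, $\langle u_1 \otimes v_1, u_2 \otimes v_2 \rangle = \langle u_1, u_2 \rangle \langle v_1, v_2 \rangle$. It is not the paper's algorithm and does not reproduce the AsynchronousTree data structure. The function names `kron_inner` and `kron_gram`, and the factor dimensions `p` and `q` (with $d = pq$), are hypothetical and chosen only for illustration.

```python
import numpy as np

def kron_inner(u1, v1, u2, v2):
    """Inner product of kron(u1, v1) and kron(u2, v2) from the factors only.

    Costs O(p + q) instead of O(p * q), because the length-(p*q)
    Kronecker vectors are never formed.
    """
    return float(np.dot(u1, u2) * np.dot(v1, v2))

def kron_gram(U, V):
    """Matrix of pairwise input inner products for x_i = kron(U[i], V[i]).

    U has shape (n, p) and V has shape (n, q). The result equals X @ X.T,
    where row i of X is kron(U[i], V[i]), computed as an elementwise
    product of two small Gram matrices.
    """
    return (U @ U.T) * (V @ V.T)

# Illustrative check against the explicit d-dimensional vectors.
rng = np.random.default_rng(0)
n, p, q = 3, 4, 5          # assumed sizes; d = p * q = 20
U = rng.standard_normal((n, p))
V = rng.standard_normal((n, q))

X = np.stack([np.kron(U[i], V[i]) for i in range(n)])  # explicit inputs
assert np.allclose(kron_gram(U, V), X @ X.T)
assert np.isclose(kron_inner(U[0], V[0], U[1], V[1]), X[0] @ X[1])
```

When both factors have dimension about $\sqrt{d}$, each inner product in this sketch touches $O(\sqrt{d})$ numbers rather than $d$, which illustrates why working with the factors can avoid a linear dependence on $d$.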
