Data-induced multiscale losses and efficient multirate gradient descent schemes

Juncai He, Liangchen Liu, Yen-Hsi Richard Tsai

TL;DR

This work investigates how multiscale data imprint scale-dependent structure on loss landscapes, gradients, and Hessians, and develops a data-informed optimization strategy. It derives a multiscale gradient expansion for logistic regression and deep networks, showing that gradients decompose along scales and that small-scale coordinates contribute less to the gradient, while the Hessian spectrum mirrors that of the data. To exploit this structure, the authors propose Multirate Gradient Descent (MrGD), a gradient scheme with multiple learning rates aligned to eigenvalue groups, and prove convergence for quadratic and convex problems with explicit iteration-complexity bounds. The results provide theoretical justification for learning-rate warm-up and offer a principled approach to accelerating training on ill-conditioned, multiscale problems, with practical implications for large-scale neural network optimization. Overall, the paper connects multiscale data properties to gradient dynamics and proposes a scalable, theoretically grounded method to leverage that structure in training.
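The idea of aligning learning rates to eigenvalue groups can be illustrated on a toy problem. The sketch below is not the authors' MrGD algorithm, whose details are not given in this summary. It is a hypothetical two-rate scheme on a diagonal quadratic whose eigenvalues fall into two clusters (near 1 and near `eps`). All names (`two_rate_gd`, `inner_steps`, `cycles`) and all constants are assumptions made for illustration. Several small steps first damp the large-eigenvalue components, then one large step contracts the small-eigenvalue components.

```python
import numpy as np

def quadratic_loss(x, eigs):
    # f(x) = 0.5 * sum_j eigs[j] * x[j]^2 (diagonal quadratic)
    return 0.5 * np.sum(eigs * x**2)

def gradient(x, eigs):
    return eigs * x

def plain_gd(x0, eigs, lr, steps):
    # Standard gradient descent with one fixed learning rate.
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * gradient(x, eigs)
    return x

def two_rate_gd(x0, eigs, lr_small, lr_large, inner_steps, cycles):
    # Hypothetical two-rate scheme (an illustration, not the paper's MrGD):
    # each cycle takes `inner_steps` small steps, then one large step.
    x = x0.copy()
    for _ in range(cycles):
        for _ in range(inner_steps):
            x = x - lr_small * gradient(x, eigs)
        x = x - lr_large * gradient(x, eigs)
    return x

rng = np.random.default_rng(0)
eps = 1e-3  # assumed scale separation between the two eigenvalue groups
eigs = np.concatenate([
    rng.uniform(0.5, 1.0, size=5),        # large-eigenvalue group
    rng.uniform(0.5 * eps, eps, size=5),  # small-eigenvalue group
])
x0 = np.ones(eigs.size)

inner_steps, cycles = 12, 20
total_steps = cycles * (inner_steps + 1)  # same gradient budget for both runs

x_plain = plain_gd(x0, eigs, lr=1.0, steps=total_steps)
x_multi = two_rate_gd(x0, eigs, lr_small=1.0, lr_large=1.0 / eps,
                      inner_steps=inner_steps, cycles=cycles)

loss_plain = quadratic_loss(x_plain, eigs)
loss_multi = quadratic_loss(x_multi, eigs)
print(f"plain GD loss:    {loss_plain:.3e}")
print(f"two-rate GD loss: {loss_multi:.3e}")
```

The choice `inner_steps = 12` is deliberate in this toy setting: the large step multiplies each large-eigenvalue component by a factor of magnitude up to about 1000, so the preceding small steps must shrink those components by more than that for the cycle as a whole to contract.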

Abstract

This paper investigates the impact of multiscale data on machine learning algorithms, particularly in the context of deep learning. A dataset is multiscale if its distribution shows large variations in scale across different directions. This paper reveals multiscale structures in the loss landscape, including its gradients and Hessians inherited from the data. Correspondingly, it introduces a novel gradient descent approach, drawing inspiration from multiscale algorithms used in scientific computing. This approach seeks to transcend empirical learning rate selection, offering a more systematic, data-informed strategy to enhance training efficiency, especially in the later stages.
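As a concrete picture of the definition above, the following sketch builds a synthetic multiscale dataset and inspects the Hessian of a least-squares loss, which for a linear model equals $X^\top X / N$ and therefore has the same spectrum as the data's second-moment matrix. The dataset, the scale values in `scales`, and the variable names are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, eps = 2000, 1e-2
scales = np.array([1.0, 1.0, eps, eps])  # assumed per-direction scales

# Latent coordinates: unit Gaussians stretched by the per-direction scales.
Z = rng.standard_normal((n_samples, scales.size)) * scales

# Random rotation so the scales are not aligned with the coordinate axes.
Q, _ = np.linalg.qr(rng.standard_normal((scales.size, scales.size)))
X = Z @ Q.T

# Hessian of the least-squares loss (1 / 2N) * ||X w - y||^2 with respect to w.
hessian = X.T @ X / n_samples
eigvals = np.linalg.eigvalsh(hessian)  # ascending order

# Ratio between the smallest "large" eigenvalue and the largest "small" one.
scale_gap = eigvals[2] / eigvals[1]
print("Hessian eigenvalues:", eigvals)
print("gap between eigenvalue groups:", scale_gap)
```

In this construction the eigenvalues cluster near 1 and near `eps**2`, so the condition number of the Hessian is governed by the scale separation in the data rather than by the model.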

Paper Structure

This paper contains 33 sections, 23 theorems, 189 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Proposition 1.2

When the data $\{\mathbf{x}_i\}_{i=1}^N$ satisfy Assumption 1, the cross-entropy loss (eqn:cross_entropy) has a multiscale gradient component [display equation missing from this extract], where $\mathcal{O}(\varepsilon_j)$ denotes a group of vectors in $\mathbb{R}^k$ whose magnitudes are bounded above by a constant multiple of $\varepsilon_j$.
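The scaling stated in the proposition can be checked numerically in a simplified setting. The sketch below does not reproduce the paper's assumption or its gradient expansion. It assumes a linear softmax classifier with weights $W \in \mathbb{R}^{k \times d}$ and data whose $j$-th coordinate is bounded by a scale $\varepsilon_j$ (the array `scales`, a hypothetical choice). For the mean cross-entropy loss, the gradient is $\frac{1}{N}\sum_i (\mathbf{p}_i - \mathbf{y}_i)\mathbf{x}_i^\top$, so its $j$-th column is a vector in $\mathbb{R}^k$ with norm at most $\sqrt{2}\,\varepsilon_j$.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max-subtraction for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_samples, n_classes = 500, 3
scales = np.array([1.0, 1.0, 1e-2, 1e-2, 1e-4])  # assumed eps_j per coordinate

# Coordinate j of every sample lies in [-eps_j, eps_j].
X = rng.uniform(-1.0, 1.0, size=(n_samples, scales.size)) * scales
labels = rng.integers(0, n_classes, size=n_samples)
Y = np.eye(n_classes)[labels]                      # one-hot targets
W = rng.standard_normal((n_classes, scales.size))  # arbitrary weights

P = softmax(X @ W.T)
grad_W = (P - Y).T @ X / n_samples  # gradient of mean cross-entropy, shape (k, d)

# Column j of the gradient is the R^k vector attached to input coordinate j.
col_norms = np.linalg.norm(grad_W, axis=0)
bound = np.sqrt(2.0) * scales
for eps_j, norm_j in zip(scales, col_norms):
    print(f"eps_j = {eps_j:.0e}   ||grad column|| = {norm_j:.3e}")
```

The bound holds for any weights `W` because $\|\mathbf{p}_i - \mathbf{y}_i\|_2 \le \sqrt{2}$ for a probability vector and a one-hot target, which is why the small-scale coordinates contribute proportionally less to the gradient in this toy model.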

Figures (4)

  • Figure 1: The magnitudes of the loss gradient with respect to $W^1_{1,:}$ and $W^1_{i,:}$ for a randomly chosen $i>0$ on the natural log scale; the quantities are labelled w1_1 and w1_2 respectively. Two models with test accuracy $30\%$ (blue) and $50\%$ (yellow) are investigated. The bottom row has gradient values sorted by their magnitudes. The red curve represents the principal values of the training data, scaled to align with the gradient magnitude for better comparison. The red curve is truncated in the bottom-right.
  • Figure 2: Eigenvalues of the Hessians, $\nabla^2_{W^1_{i,:}}\mathcal{L}$ and $\nabla^2_{W^2_{i,:}}\mathcal{L}$, on the natural log scale, under the same setup as in Figure 1. The eigenvalues are sorted by magnitude. The original eigenvalues (not shown) exhibit only slight jittering toward the last dimensions, with overall trends closely mirroring the ordered ones.
  • Figure 3: Numerical examples for a randomly generated 100-dimensional linear regression problem with two scales ($m=1$, $r = 0.001$) and three scales ($m=2$, $r=0.1$). More details of this example can be found in the appendix.
  • Figure 4: Comparison of eigenvalues (sorted by magnitude) of the Hessian in the second hidden layer, on the natural log scale. The CIFAR models are those from the section on multiscale structure in neural networks (sec:NN_multiscale). The MNIST model is also a $3$-layer MLP, of sizes $784$-$1024$-$128$-$10$, identical to the CIFAR model except for the input dimension. It is trained under the same setting: full gradient descent, fixed learning rate $0.1$, and cross-entropy loss. Three training stages of the MNIST model are presented, with test accuracies of $20\%$, $50\%$, and $80\%$. All CIFAR and MNIST models exhibit a similar decay property.

Theorems & Definitions (56)

  • Proposition 1.2
  • proof
  • Proposition 1.3
  • proof
  • Theorem 1.4
  • proof
  • Proposition 1.5
  • proof
  • Definition 2.1
  • Remark 2.2
  • ...and 46 more