Data-induced multiscale losses and efficient multirate gradient descent schemes
Juncai He, Liangchen Liu, Yen-Hsi Richard Tsai
TL;DR
This work investigates how multiscale data imprint scale-dependent structure on loss landscapes, gradients, and Hessians, and develops a data-informed optimization strategy. It derives a multiscale gradient expansion for logistic regression and deep nets, showing that gradients decompose along scales, that small-scale coordinates contribute less to the gradient, and that the Hessian spectrum mirrors the scales of the data. To exploit this structure, the authors propose Multirate Gradient Descent (MrGD), a gradient scheme with multiple learning rates aligned to eigenvalue groups, and prove convergence for quadratic and convex problems with explicit iteration-complexity bounds. The results provide theoretical justification for learning-rate warm-up and offer a principled approach to accelerating training on ill-conditioned, multiscale problems, with practical implications for large-scale neural network optimization. Overall, the paper connects the multiscale properties of data to gradient dynamics and proposes a scalable, theoretically grounded method that leverages this structure in training.
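The idea of learning rates aligned to eigenvalue groups can be illustrated on a toy diagonal quadratic. The following is a minimal sketch and not the paper's MrGD algorithm: the eigenvalues, the cycle length m, the number of cycles, and the helper gd_quadratic are hypothetical choices made for illustration. Each cycle takes one large step sized for the small-eigenvalue group, followed by m small steps that damp the amplification the large step causes along the large-eigenvalue directions.

```python
import numpy as np

def gd_quadratic(eigs, x0, rates):
    """Gradient descent on f(x) = 0.5 * sum(eigs * x**2), one step per rate."""
    x = np.array(x0, dtype=float)
    for eta in rates:
        x = x - eta * eigs * x  # the gradient of f at x is eigs * x
    return x

# Two well-separated eigenvalue groups (hypothetical values).
eigs = np.array([1.0, 0.8, 0.5, 1e-3, 8e-4, 5e-4])
x0 = np.ones_like(eigs)

L = eigs.max()   # largest eigenvalue; 1/L is the usual stable rate
eps = 1e-3       # top of the small-eigenvalue group; 1/eps is the large rate
m = 12           # small steps per cycle (assumed; chosen so 0.5**m * (L/eps) < 1)
cycles = 20

# One large step followed by m small steps, repeated.
multirate = ([1.0 / eps] + [1.0 / L] * m) * cycles
# Baseline: the same number of steps, all at the small rate.
single = [1.0 / L] * len(multirate)

err_multi = np.linalg.norm(gd_quadratic(eigs, x0, multirate))
err_single = np.linalg.norm(gd_quadratic(eigs, x0, single))
print(f"single rate: {err_single:.3e}, multirate: {err_multi:.3e}")
```

In this toy setting the single-rate run barely moves the small-eigenvalue components, because each step contracts them by a factor close to 1, while the cycled schedule contracts both groups in every cycle. The large step alone would diverge along the large-eigenvalue directions, which is why it is paired with the damping steps.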
Abstract
This paper investigates the impact of multiscale data on machine learning algorithms, particularly in the context of deep learning. A dataset is multiscale if its distribution shows large variations in scale across different directions. This paper reveals multiscale structures in the loss landscape, including in its gradients and Hessians, that are inherited from the data. Correspondingly, it introduces a novel gradient descent approach, drawing inspiration from multiscale algorithms used in scientific computing. This approach seeks to move beyond empirical learning rate selection, offering a more systematic, data-informed strategy to enhance training efficiency, especially in the later stages.
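As a concrete picture of how a loss can inherit the scales of its data, consider the simplest case of linear least squares, where the Hessian equals the empirical second-moment matrix of the inputs. The sketch below uses this simplified model as an assumption; the paper itself treats logistic regression and deep networks. The per-direction standard deviations in scales and the sample size n are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Hypothetical multiscale data: two O(1) directions and two O(1e-2) directions.
scales = np.array([1.0, 1.0, 1e-2, 1e-2])
X = rng.standard_normal((n, scales.size)) * scales

# The least-squares loss (0.5 / n) * ||X w - y||^2 has Hessian X^T X / n,
# independent of the targets y and of the weights w.
hessian = X.T @ X / n
hess_eigs = np.sort(np.linalg.eigvalsh(hessian))[::-1]  # descending order

condition_number = hess_eigs[0] / hess_eigs[-1]
print("Hessian eigenvalues:", hess_eigs)
print("condition number:", condition_number)
```

The eigenvalues fall into two clusters near the squared data scales, so a scale ratio of 100 in the data becomes a condition number on the order of 10,000 in the loss. This is the kind of ill-conditioning that a single learning rate handles poorly.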
