Gathering and Exploiting Higher-Order Information when Training Large Structured Models
Pierre Wolinski
TL;DR
This work introduces a projection-based framework for extracting higher-order information from the loss landscape of very large models without computing full higher-order derivative tensors. The parameters are represented as a tuple of tensors and partitioned into S subsets; the author then defines a projected derivative Dθ whose order-d version has only S^d entries, which keeps storage and compute costs manageable as long as S is small. The main contributions are (i) per-subset learning rates derived from projected first- and second-order information, (ii) a second-order optimization method that uses the reduced matrix Ḣ and gradient ĝ, and (iii) an anisotropic cubic regularization that uses order-3 information to stabilize updates and makes the method invariant under affine reparameterization of each subset. Experiments on standard CNNs and deep MLPs indicate that the approach captures inter-layer interactions, can scale to larger models through the choice of partition, and behaves competitively with Adam and K-FAC, while highlighting the need for further stochastic extensions and partition-aware design. Overall, the paper shows that higher-order derivatives can be exploited through targeted projections to improve optimization in deep networks without intractable Hessian computations.
Abstract
When training large models, such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible, due to their computational cost. Therefore, among the second-order optimization methods, it is common to bypass the computation of the Hessian by using first-order information, such as the gradient with respect to the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives on well-chosen subspaces relevant for optimization. Namely, for a given partition of the set of parameters, we compute tensors that can be seen as "higher-order derivatives according to the partition", at a reasonable cost as long as the number of subsets of the partition remains small. Then, we give some examples of how these tensors can be used. First, we show how to compute a learning rate per subset of parameters, which can be used for hyperparameter tuning. Second, we show how to use these tensors at order 2 to construct an optimization method that uses information contained in the Hessian. Third, we show how to use these tensors at order 3 (information contained in the third derivative of the loss) to regularize this optimization method. The resulting training step has several interesting properties. In particular, it takes into account long-range interactions between the layers of the trained neural network, which is usually not the case in similar methods (e.g., K-FAC), and the trajectory of the optimization is invariant under affine layer-wise reparameterization.
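
To make the idea of "derivatives according to the partition" concrete, the following is a minimal NumPy sketch on a toy quadratic loss. It is an illustration under stated assumptions, not the paper's implementation: the projection direction is assumed to be the gradient restricted to each subset, the names `projected_derivatives`, `subset_step`, `g_proj`, and `H_proj` are hypothetical, the Hessian-vector product is written analytically (a real network would obtain it by automatic differentiation), and the order-3 regularization mentioned above is omitted. For S subsets, the sketch needs S Hessian-vector products and produces an S-vector and an S x S matrix.

```python
import numpy as np

def projected_derivatives(grad_fn, hvp_fn, theta, partition):
    """Project the gradient and Hessian onto one direction per subset.

    partition: list of S index arrays covering the parameter vector.
    Returns g_proj (S,), H_proj (S, S) and the directions u_parts (S, P).
    """
    g = grad_fn(theta)
    S = len(partition)
    # u_parts[s] equals the gradient on subset s and zero elsewhere
    # (assumed choice of direction for this sketch).
    u_parts = np.zeros((S, theta.size))
    for s, idx in enumerate(partition):
        u_parts[s, idx] = g[idx]
    g_proj = u_parts @ g                     # g_proj[s] = g . u_s
    # One Hessian-vector product per subset; this function never
    # builds the full P x P Hessian itself.
    Hu = np.stack([hvp_fn(theta, u_parts[s]) for s in range(S)])
    H_proj = u_parts @ Hu.T                  # H_proj[s, t] = u_s . (H u_t)
    return g_proj, H_proj, u_parts

def subset_step(theta, g_proj, H_proj, u_parts):
    """Newton step in the S-dimensional subspace: one learning rate per subset."""
    eta = np.linalg.solve(H_proj, g_proj)
    return theta - eta @ u_parts, eta

# Toy problem: L(theta) = 0.5 * theta^T A theta - b^T theta, with A positive definite.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + 6.0 * np.eye(6)
b = rng.normal(size=6)

def loss(th):
    return 0.5 * th @ A @ th - b @ th

def grad_fn(th):
    return A @ th - b

def hvp_fn(th, v):
    return A @ v  # analytic Hessian-vector product of the quadratic

theta0 = rng.normal(size=6)
partition = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]  # S = 3 "layers"

g_proj, H_proj, u_parts = projected_derivatives(grad_fn, hvp_fn, theta0, partition)
theta1, eta = subset_step(theta0, g_proj, H_proj, u_parts)
```

Two limiting cases help check the sketch. With a single subset (S = 1), the learning rate reduces to the exact line-search step along the gradient; with one subset per parameter, the step coincides with the full Newton step on this quadratic. The off-diagonal entries of `H_proj` are what carry interactions between distinct subsets.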
