Gathering and Exploiting Higher-Order Information when Training Large Structured Models

Pierre Wolinski

TL;DR

This work introduces a projection-based framework for extracting higher-order information from the loss landscape of very large models without computing full higher-order tensors. The parameter set is partitioned into $S$ subsets, so that the parameters are represented as a tuple of tensors, and the author defines and computes a tensor $\mathbf{D}_{\boldsymbol{\theta}}$ that projects the order-$d$ derivative onto per-subset subspaces. The result has $S^d$ entries, instead of the $P^d$ entries of the full tensor for a model with $P$ parameters, which keeps storage and compute tractable as long as $S$ is small. The main contributions are (i) per-subset learning rates derived from projected first- and second-order information, (ii) a second-order optimization method that uses the reduced Hessian $\bar{\mathbf{H}}$ and the reduced gradient $\bar{\mathbf{g}}$, and (iii) an anisotropic cubic regularization that uses order-3 information to stabilize the updates and makes the optimization trajectory invariant under affine reparameterization of each subset. Experiments on standard CNNs and deep MLPs indicate that the approach captures inter-layer interactions, scales to larger models through the choice of partition, and optimizes competitively with Adam and K-FAC. They also point to the need for stochastic extensions and for partition-aware design. Overall, the paper shows that higher-order derivatives can be exploited through targeted projections to improve optimization in deep networks without intractable Hessian computations.
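
To make the size reduction concrete, the projected quantities can be written out as follows. This is a sketch under assumed notation rather than the paper's exact definitions: $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_S)$ is the partitioned parameter, $\mathbf{u} = (\mathbf{u}_1, \dots, \mathbf{u}_S)$ is a chosen direction with one block per subset, and $\tilde{\mathbf{u}}_s$ denotes $\mathbf{u}_s$ zero-padded to the full parameter size. The symbols $\mathbf{u}$, $\tilde{\mathbf{u}}_s$ and $\bar{\mathbf{D}}^{(d)}$ are introduced here for illustration only.

```latex
% Assumed notation: one direction block u_s per subset s of the partition.
\bar{\mathbf{g}}_s = \nabla_{\boldsymbol{\theta}_s} \mathcal{L}(\boldsymbol{\theta})^{\top} \mathbf{u}_s,
\qquad
\bar{\mathbf{H}}_{s s'} = \mathbf{u}_s^{\top} \, \nabla^2_{\boldsymbol{\theta}_s \boldsymbol{\theta}_{s'}} \mathcal{L}(\boldsymbol{\theta}) \, \mathbf{u}_{s'},
\qquad
\bar{\mathbf{D}}^{(d)}_{s_1 \cdots s_d} = \nabla^d \mathcal{L}(\boldsymbol{\theta}) \bigl[ \tilde{\mathbf{u}}_{s_1}, \dots, \tilde{\mathbf{u}}_{s_d} \bigr],
\quad 1 \le s_1, \dots, s_d \le S .
```

With these definitions, $\bar{\mathbf{g}}$ has $S$ entries, $\bar{\mathbf{H}}$ has $S^2$, and the order-$d$ tensor has $S^d$, independently of the number of parameters in each subset.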

Abstract

When training large models, such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible, due to their computational cost. Therefore, among the second-order optimization methods, it is common to bypass the computation of the Hessian by using first-order information, such as the gradient of the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives on well-chosen subspaces relevant for optimization. Namely, for a given partition of the set of parameters, we compute tensors that can be seen as "higher-order derivatives according to the partition", at a reasonable cost as long as the number of subsets of the partition remains small. Then, we give some examples of how these tensors can be used. First, we show how to compute a learning rate per subset of parameters, which can be used for hyperparameter tuning. Second, we show how to use these tensors at order 2 to construct an optimization method that uses information contained in the Hessian. Third, we show how to use these tensors at order 3 (information contained in the third derivative of the loss) to regularize this optimization method. The resulting training step has several interesting properties, including: it takes into account long-range interactions between the layers of the trained neural network, which is usually not the case in similar methods (e.g., K-FAC); the trajectory of the optimization is invariant under affine layer-wise reparameterization.
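
The per-subset learning rates and the second-order step described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a quadratic loss whose Hessian `H` is formed explicitly (which the paper's approach is designed to avoid on real models), it takes the gradient itself as the per-subset direction `u`, and the helper names `reduced_quantities` and `per_subset_step` are hypothetical.

```python
import numpy as np

def reduced_quantities(H, g, u, subsets):
    """Project the gradient g and Hessian H onto one direction block per subset.

    subsets is a list of index arrays partitioning the parameter indices.
    Returns g_bar (length S) and H_bar (S x S).
    """
    S = len(subsets)
    g_bar = np.array([g[idx] @ u[idx] for idx in subsets])
    H_bar = np.empty((S, S))
    for a, ia in enumerate(subsets):
        for b, ib in enumerate(subsets):
            # Off-diagonal entries carry the interactions between subsets.
            H_bar[a, b] = u[ia] @ H[np.ix_(ia, ib)] @ u[ib]
    return g_bar, H_bar

def per_subset_step(theta, H, g, subsets):
    """One step theta_s <- theta_s - eta_s * u_s with eta = H_bar^{-1} g_bar."""
    u = g.copy()  # assumption: the direction is the gradient itself
    g_bar, H_bar = reduced_quantities(H, g, u, subsets)
    eta = np.linalg.solve(H_bar, g_bar)  # one learning rate per subset
    new_theta = theta.copy()
    for s, idx in enumerate(subsets):
        new_theta[idx] -= eta[s] * u[idx]
    return new_theta, eta, H_bar

# Toy quadratic loss L(theta) = 0.5 * theta^T H theta, with 6 parameters in 3 subsets.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
H = A @ A.T + 6.0 * np.eye(6)  # symmetric positive definite
theta = rng.normal(size=6)
subsets = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

def loss(t):
    return 0.5 * t @ H @ t

new_theta, eta, H_bar = per_subset_step(theta, H, H @ theta, subsets)
print("per-subset learning rates:", eta)
print("loss before / after:", loss(theta), loss(new_theta))
```

On a quadratic with a positive definite Hessian, this step minimizes the loss exactly over the $S$-dimensional space of per-subset learning rates, so it is never worse than gradient descent with the best single learning rate. On a real network, the entries of `H_bar` would instead come from automatic differentiation (for instance Hessian-vector products), never from an explicit Hessian.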

Paper Structure (89 sections, 1 theorem, 91 equations, 6 figures, 5 tables, 2 algorithms)

Introduction
Main contribution: extracting higher-order information.
Application: computing per-layer learning rates.
Application: second-order optimization method.
Structure of the paper.
Context and motivation
Higher-order information
Using and estimating the Hessian in optimization
Quasi-Newton methods.
Applications to deep learning.
Summarizing the Hessian.
Invariance by affine reparameterization.
Methods based on the moments of the gradients.
Motivation
What are we really looking for?
...and 74 more sections

Key Result

Theorem

The method has a linear rate of convergence. For any $\boldsymbol{\theta}_{t} \neq 0$: where $a_s = \min \mathrm{Sp}(\mathbf{H}_s)$ and $A_s = \max \mathrm{Sp}(\mathbf{H}_s)$. Moreover, this rate is optimal, since it is possible to build $\boldsymbol{\theta}_t$ such that:

Figures (6)

Figure 1: Setup: models trained by SGD on CIFAR-10. Submatrices of $\bar{\mathbf{H}}$ (1st row) and $\bar{\mathbf{H}}^{-1}$ (2nd row), where focus is on interactions: weight-weight, weight-bias, bias-bias of the different layers, at initialization and before best validation loss.
Figure 2: Training curves: the second-order optimization method (solid lines) versus its diagonal approximation (dotted lines) with various hyperparameters.
Figure 3: Matrices $\bar{\mathbf{H}}$ and $\bar{\mathbf{H}}^{-1}$ and per-subset-of-parameters learning rates obtained with VBigMLP. Legend for the figure on the right: solid lines: weights; dotted lines: biases. For each epoch $k \in \{20, 60, 100, 140, 180\}$, the reported value has been averaged over the epochs $\lbrack k - 20, k + 19 \rbrack$ to remove the noise.
Figure 4: VBigMLP + CIFAR-10.
Figure 5: Test metrics in various setups.
...and 1 more figure

Theorems & Definitions (4)

Theorem
Remark
Remark
Proof
