CAO: Curvature-Adaptive Optimization via Periodic Low-Rank Hessian Sketching

Wenzhang Du

TL;DR

CAO targets slow optimization in sharp, anisotropic regions with curvature-adaptive preconditioning based on a periodically refreshed, low-rank Hessian sketch. The method builds a rank-$k$ sketch $B$ from Hessian-vector products and applies the damped inverse $P = (B + \eta I)^{-1}$ to the gradient $g$, updating $\theta \leftarrow \theta - \alpha P g$ and reusing $B$ between refreshes. In theory, CAO attains the standard $O(1/T)$ stationarity rate under $L$-smoothness with a widened stable stepsize range, and the loss contracts at refresh steps under a PL condition with bounded residual curvature. Empirically, on CIFAR-10/100 with ResNet-18/34, CAO with $k = 1$ reaches the pre-declared train-loss threshold of 0.75 2.95x faster than Adam, measured in epochs on CIFAR-100/ResNet-18, while matching final test accuracy; results are robust to the exact rank choice. Anonymized logs and scripts are released to regenerate the figures and tables.

Abstract

First-order optimizers are reliable but slow in sharp, anisotropic regions. We study a curvature-adaptive method that periodically sketches a low-rank Hessian subspace via Hessian-vector products and preconditions gradients only in that subspace, leaving the orthogonal complement first-order. For $L$-smooth non-convex objectives, we recover the standard $O(1/T)$ stationarity guarantee with a widened stable stepsize range; under a Polyak-Łojasiewicz (PL) condition with bounded residual curvature outside the sketch, the loss contracts at refresh steps. On CIFAR-10/100 with ResNet-18/34, the method enters the low-loss region substantially earlier: measured by epochs to a pre-declared train-loss threshold (0.75), it reaches the threshold 2.95x faster than Adam on CIFAR-100/ResNet-18, while matching final test accuracy. The approach is one-knob: performance is insensitive to the sketch rank $k$ across $\{1,3,5\}$, and $k=0$ yields a principled curvature-free ablation. We release anonymized logs and scripts that regenerate all figures and tables.
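The update described above can be illustrated with a minimal pure-Python sketch on a toy problem. Everything here is an assumption made for illustration rather than the paper's implementation: the function names (`grad`, `hvp`, `sketch`, `precondition`, `cao`), the anisotropic quadratic objective, the finite-difference Hessian-vector product (a stand-in for automatic differentiation), the use of power iteration with deflation to build the rank-$k$ sketch, and the clamping of negative curvature to zero. The source states only that $B$ is built from Hessian-vector products and reused between refreshes.

```python
import math
import random


def loss(theta):
    # Toy anisotropic quadratic (assumed): curvature 100 along x, 1 along y.
    return 0.5 * (100.0 * theta[0] ** 2 + theta[1] ** 2)


def grad(theta):
    return [100.0 * theta[0], 1.0 * theta[1]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hvp(theta, v, eps=1e-4):
    # Finite-difference Hessian-vector product: (grad(theta + eps*v) - grad(theta)) / eps.
    g0 = grad(theta)
    g1 = grad([t + eps * vi for t, vi in zip(theta, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]


def sketch(theta, k, iters=50, seed=0):
    # Rank-k sketch B = sum_i lam_i * v_i v_i^T, stored as (lam_i, v_i) pairs.
    # Assumed construction: power iteration on HVPs, deflating earlier directions.
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in theta]
        for _ in range(iters):
            w = hvp(theta, v)
            for _, u in pairs:
                c = dot(u, w)
                w = [wi - c * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [wi / norm for wi in w]
        lam = dot(v, hvp(theta, v))
        pairs.append((max(lam, 0.0), v))  # clamp negative curvature (assumption)
    return pairs


def precondition(g, pairs, eta):
    # Apply (B + eta*I)^{-1} to g: scale by 1/(lam_i + eta) along each sketched
    # direction and by 1/eta on the orthogonal complement.
    out = [gi / eta for gi in g]
    for lam, v in pairs:
        coeff = (1.0 / (lam + eta) - 1.0 / eta) * dot(v, g)
        out = [oi + coeff * vi for oi, vi in zip(out, v)]
    return out


def cao(theta, alpha=0.5, eta=1.0, k=1, refresh=10, steps=50):
    pairs = []
    for t in range(steps):
        if t % refresh == 0:  # periodic refresh; the sketch is reused in between
            pairs = sketch(theta, k)
        d = precondition(grad(theta), pairs, eta)
        theta = [ti - alpha * di for ti, di in zip(theta, d)]
    return theta


if __name__ == "__main__":
    print("k=1 final loss:", loss(cao([1.0, 1.0], k=1)))
    print("k=0 final loss:", loss(cao([1.0, 1.0], k=0)))
```

On this toy quadratic, the stepsize `alpha=0.5` is far too large for the curvature-free variant (`k=0`), which diverges along the sharp direction, while the `k=1` variant converges. This illustrates the idea of a wider stable stepsize range in a toy setting only and says nothing about the paper's experiments.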

Paper Structure

This paper contains 27 sections, 4 theorems, 5 equations, 7 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

For any $\theta$, $d$, and $\alpha>0$, if $f$ is $L$-smooth, then $f(\theta-\alpha d)\le f(\theta)-\alpha\langle\nabla f(\theta),d\rangle+\tfrac{L\alpha^2}{2}\|d\|^2$.
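As a worked illustration of how Lemma 1 applies to a preconditioned step, one can substitute $d = P\nabla f(\theta)$ with $P = (B+\eta I)^{-1}$. The derivation below assumes, beyond what this excerpt states, that $B$ is symmetric positive semidefinite with $\lambda_{\max}(B)\le\Lambda$; the symbol $\Lambda$ and the resulting constants are illustrative and need not match the paper's Lemma 2.

```latex
% Assumption: B is symmetric PSD with \lambda_{\max}(B) \le \Lambda, so that
%   \tfrac{1}{\Lambda+\eta} I \preceq P \preceq \tfrac{1}{\eta} I .
\begin{align*}
f(\theta-\alpha P g)
  &\le f(\theta) - \alpha\langle g, P g\rangle + \tfrac{L\alpha^2}{2}\|P g\|^2
     && \text{(Lemma 1 with } d = P g,\ g=\nabla f(\theta)\text{)} \\
  &\le f(\theta) - \frac{\alpha}{\Lambda+\eta}\|g\|^2 + \frac{L\alpha^2}{2\eta^2}\|g\|^2
     && \text{(eigenvalue bounds on } P\text{)} \\
  &= f(\theta) - \alpha\left(\frac{1}{\Lambda+\eta} - \frac{L\alpha}{2\eta^2}\right)\|g\|^2 .
\end{align*}
% Under these assumptions the step decreases f whenever
%   0 < \alpha < \frac{2\eta^2}{L(\Lambda+\eta)} .
```

This is a crude worst-case bound; a sharper analysis that tracks the sketched subspace separately from its complement may give different constants.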

Figures (7)

  • Figure 1: Train loss on CIFAR-100/ResNet-18 (C100-R18; mean over 3 seeds). A fixed horizontal threshold (0.75) and first-hit markers visualize time-to-threshold differences (see Table \ref{tab:time-to-threshold}).
  • Figure 2: Early-phase train loss (0-30 epochs) on C100-R18 (mean over 3 seeds). CAO descends faster with lower variance in early training.
  • Figure 3: Rank-$k$ ablation on C100-R18 ($k\in\{0,1,3,5\}$). Disabling curvature ($k{=}0$) slows early convergence; $k{\ge}1$ curves are nearly aligned, indicating rank-insensitivity.
  • Figure 4: Appendix A: Train loss on CIFAR-10 / ResNet-18 (3 seeds), no threshold.
  • Figure 5: Appendix B: Train loss on CIFAR-10 / ResNet-34 (3 seeds), no threshold.
  • ...and 2 more figures
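
The first-hit markers in Figure 1 correspond to a time-to-threshold measurement. A minimal sketch of such a metric follows; the function names `epochs_to_threshold` and `speedup`, the 1-indexed epoch convention, and the "at or below threshold" rule are assumptions, and the curves in the usage lines are synthetic placeholders rather than the paper's data.

```python
def epochs_to_threshold(train_loss, threshold=0.75):
    """Return the 1-indexed epoch at which the loss first reaches the threshold.

    Returns None if the curve never reaches it.
    """
    for epoch, value in enumerate(train_loss, start=1):
        if value <= threshold:
            return epoch
    return None


def speedup(baseline_curve, method_curve, threshold=0.75):
    """Ratio of baseline to method epochs-to-threshold; None if either never hits."""
    base = epochs_to_threshold(baseline_curve, threshold)
    ours = epochs_to_threshold(method_curve, threshold)
    if base is None or ours is None:
        return None
    return base / ours


if __name__ == "__main__":
    # Synthetic per-epoch train-loss curves, for illustration only.
    baseline = [2.0, 1.5, 1.0, 0.9, 0.8, 0.7]
    method = [1.0, 0.7]
    print(epochs_to_threshold(baseline), epochs_to_threshold(method))
    print(speedup(baseline, method))
```

Whether the paper applies the threshold to the seed-averaged curve or per seed is not stated in this excerpt; the sketch takes a single curve and leaves that choice to the caller.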

Theorems & Definitions (4)

  • Lemma 1: Descent under $L$-smoothness
  • Lemma 2: Bounded preconditioner
  • Proposition 1: Stationarity
  • Theorem 1: PL contraction