Strong convexity-guided hyper-parameter optimization for flatter losses

Rahul Yedida, Snehanshu Saha

TL;DR

This paper tackles hyper-parameter optimization (HPO) by linking loss flatness to generalization through strong convexity. It proposes AHSC, a white-box HPO method that minimizes a strong convexity measure estimated from mini-batch Hessian information after a short initial training phase, pruning poor configurations before full training. The approach yields competitive performance across 14 datasets while substantially reducing runtime compared to traditional HPO methods, and it provides a theoretical connection between flatness and strong convexity that underpins the pruning step. The work includes practical algorithmic details, empirical validation, and public code, offering a scalable path to faster, landscape-aware hyper-parameter tuning.

Abstract

We propose a novel white-box approach to hyper-parameter optimization. Motivated by recent work establishing a relationship between flat minima and generalization, we first establish a relationship between the strong convexity of the loss and its flatness. Based on this, we seek to find hyper-parameter configurations that improve flatness by minimizing the strong convexity of the loss. By using the structure of the underlying neural network, we derive closed-form equations to approximate the strong convexity parameter, and attempt to find hyper-parameters that minimize it in a randomized fashion. Through experiments on 14 classification datasets, we show that our method achieves strong performance at a fraction of the runtime.
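
The randomized search described above can be illustrated with a minimal sketch. It is not the paper's AHSC algorithm: the closed-form strong convexity estimate is not reproduced here, so the score below is a stand-in (the smallest Hessian eigenvalue of a ridge-regularized least-squares loss on 2-D inputs), and no initial training or pruning step is modeled. The names `toy_strong_convexity`, `random_search_min_mu`, `input_scale`, and `weight_decay` are hypothetical and chosen only for this illustration.

```python
import math
import random


def min_eig_2x2(a, b, d):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius


def toy_strong_convexity(config, data):
    """Stand-in score (an assumption, not the paper's formula).

    Returns the smallest Hessian eigenvalue of
    (1/2n) * sum_i (w . (s * x_i) - y_i)^2 + (wd / 2) * |w|^2,
    where s is the input scale and wd the weight decay.
    """
    s, wd = config["input_scale"], config["weight_decay"]
    n = len(data)
    a = sum((s * x1) ** 2 for x1, _ in data) / n + wd
    d = sum((s * x2) ** 2 for _, x2 in data) / n + wd
    b = sum((s * x1) * (s * x2) for x1, x2 in data) / n
    return min_eig_2x2(a, b, d)


def random_search_min_mu(data, n_trials=20, seed=0):
    """Sample configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = {
            "input_scale": rng.uniform(0.5, 2.0),
            "weight_decay": 10 ** rng.uniform(-4, -1),  # log-uniform sample
        }
        trials.append((toy_strong_convexity(config, data), config))
    best = min(trials, key=lambda t: t[0])
    return best, trials


if __name__ == "__main__":
    points = [(1.0, 0.5), (0.2, 1.5), (-1.0, 0.3), (0.7, -0.8)]
    (mu, cfg), _ = random_search_min_mu(points)
    print("lowest score:", mu, "config:", cfg)
```

In a real setting the scoring function would be replaced by an estimate computed from the network after a few epochs, and only the lowest-scoring configurations would be trained to completion.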

Paper Structure

This paper contains 11 sections, 7 theorems, 38 equations, 2 figures, 2 tables, and 1 algorithm.

Key Result

Lemma 1

Let $A, B \in \mathbb{S}^n$ and suppose $A \succeq B$. Then $\lVert A \rVert_2 \geq \lVert B \rVert_2$.
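
The inequality can be checked numerically on small matrices. The sketch below works with symmetric $2 \times 2$ matrices and additionally assumes $B \succeq 0$, which the statement as shown here does not spell out (without it, $A = 0$ and $B = -I$ would violate the inequality). The helper names are hypothetical.

```python
import math
import random


def eigs_sym_2x2(m):
    """Eigenvalues (low, high) of a symmetric 2x2 matrix given as nested lists."""
    (a, b), (_, d) = m
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius


def spectral_norm(m):
    """For a symmetric matrix, the 2-norm is the largest absolute eigenvalue."""
    low, high = eigs_sym_2x2(m)
    return max(abs(low), abs(high))


def random_psd(rng):
    """Random positive semidefinite matrix built as G^T G."""
    g11, g12, g21, g22 = (rng.uniform(-1.0, 1.0) for _ in range(4))
    off = g11 * g12 + g21 * g22
    return [[g11 * g11 + g21 * g21, off], [off, g12 * g12 + g22 * g22]]


def add(m, n):
    """Entry-wise sum of two 2x2 matrices."""
    return [[m[i][j] + n[i][j] for j in range(2)] for i in range(2)]


def check_lemma(n_trials=1000, seed=0):
    """Return True if ||A||_2 >= ||B||_2 held in every sampled case."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        b = random_psd(rng)
        a = add(b, random_psd(rng))  # A - B is PSD, so A >= B in the Loewner order
        if spectral_norm(a) < spectral_norm(b) - 1e-12:
            return False
    return True
```

A random check of this kind is not a proof; it only confirms that the inequality holds on the sampled matrices under the stated assumption.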

Figures (2)

  • Figure 1: Landscapes (plotted using the method of Li et al., 2018) with their corresponding metrics, on the Australian (binary classification, imbalanced) dataset. Left: a landscape with lower strong convexity (0.112) and, consequently, a wider minimum. Middle: a landscape with high strong convexity (1.133), which leads to a sharp minimum. Right: Test metrics and generalization error for the two hyper-parameter configurations. Although the sharper configuration converged faster to a training error of 0, it generalizes poorly and performs worse on the test set.
  • Figure 2: Algorithm runtimes on the vehicle dataset.
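
To see why a smaller strong convexity parameter goes with a wider minimum, as in Figure 1, consider a one-dimensional quadratic model. This is an illustrative assumption, not the paper's derivation; the symbols $f_\mu$, $\varepsilon$, and $W$ are introduced only for this example.

$$
f_\mu(w) = \frac{\mu}{2} w^2, \qquad
\{\, w : f_\mu(w) \leq \varepsilon \,\} = \left[ -\sqrt{\tfrac{2\varepsilon}{\mu}},\ \sqrt{\tfrac{2\varepsilon}{\mu}} \right], \qquad
W(\mu) = 2\sqrt{\tfrac{2\varepsilon}{\mu}}
$$

$$
\frac{W(0.112)}{W(1.133)} = \sqrt{\frac{1.133}{0.112}} \approx 3.18
$$

Under this model, the region where the loss stays within $\varepsilon$ of its minimum is roughly three times wider for the configuration with strong convexity 0.112 than for the one with 1.133.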

Theorems & Definitions (18)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1: $\mu$-strong convexity for neural classifiers
  • proof
  • Definition 1: Fenchel conjugate
  • Definition 2: Dual norm
  • Definition 3: Strong convexity
  • Definition 4: Smoothness
  • ...and 8 more