Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles

Mohammed Djameleddine Belgoumri, Mohamed Reda Bouadjenek, Hakim Hacid, Imran Razzak, Sunil Aryal

TL;DR

The rolling ball optimizer (RBO) reframes optimization as a non-local procedure by rolling a ball of radius $\rho$ over the loss surface, thereby smoothing fine-grained geometry and reducing sensitivity to data noise. By alternating a descent step with a projection onto the $\rho$-offset manifold, RBO achieves non-local updates that are less prone to sharp minima and ill-conditioned valleys. Theoretical results formalize the ironing property, showing that large $\rho$ preserves only large-scale structure, and that unreachable points emerge for sufficiently large radius, enhancing robustness. Empirically, RBO demonstrates faster convergence and competitive or superior generalization compared to SGD, Entropy-SGD, and SAM on MNIST and CIFAR benchmarks, with performance depending on the chosen radius and learning rate. The work highlights promising directions for robust training with non-local optimization while noting computational cost and the need for further validation in realistic, large-scale settings.

Abstract

Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. The optimization landscape of these functions is often highly complex and textured, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. Complicating things further, these landscape characteristics are a function of the data, meaning that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are, therefore, vulnerable to being derailed by noisy data. In practice, this translates to a strong dependence of the optimization dynamics on the noise in the data, i.e., poor generalization performance. To remedy this problem, we propose a new optimization procedure, the Rolling Ball Optimizer (RBO), which breaks this spatial locality by incorporating information from a larger region of the loss landscape in its updates. We achieve this by simulating the motion of a rigid sphere of finite radius rolling on the loss landscape, a straightforward generalization of Gradient Descent (GD) that reduces to it in the infinitesimal limit. The radius serves as a hyperparameter that determines the scale at which RBO sees the loss landscape, allowing control over the granularity of its interaction with the landscape. We are motivated by the intuition that the large-scale geometry of the loss landscape is less data-specific than its fine-grained structure, and that it is easier to optimize. We support this intuition by proving that our algorithm has a smoothing effect on the loss function. Evaluation against SGD, SAM, and Entropy-SGD on MNIST and CIFAR-10/100 demonstrates promising results in terms of convergence speed, training accuracy, and generalization performance.
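The smoothing intuition in the abstract can be illustrated with a small one-dimensional sketch. This is not the authors' implementation of RBO: it only traces the height at which the center of a ball of radius `rho` can rest above a sampled loss curve, and compares how many local minima the raw curve and the ball-center curve have. The toy loss, the grid, the radius value, and all function names (`loss`, `ball_center_height`, `count_local_minima`) are assumptions made for illustration.

```python
import numpy as np


def loss(theta):
    # Hypothetical 1-D loss: a smooth bowl plus small high-frequency wrinkles.
    return theta**2 + 0.05 * np.sin(40.0 * theta)


def ball_center_height(grid, values, rho):
    """Lowest admissible height of the ball's center above each grid point.

    A ball of radius rho centered above grid[i] must stay on or above every
    sampled graph point within horizontal reach, which gives
    h[i] = max_j values[j] + sqrt(rho^2 - (grid[i] - grid[j])^2).
    """
    dx = grid[:, None] - grid[None, :]        # pairwise horizontal offsets
    cap = rho**2 - dx**2                      # negative where out of reach
    lift = np.sqrt(np.clip(cap, 0.0, None))   # vertical extent of the ball
    candidates = np.where(cap >= 0.0, values[None, :] + lift, -np.inf)
    return candidates.max(axis=1)


def count_local_minima(y):
    # Strict interior local minima of a sampled curve.
    return int(np.sum((y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])))


grid = np.linspace(-1.0, 1.0, 1001)
values = loss(grid)
rho = 0.2                                     # illustrative radius (assumption)
centers = ball_center_height(grid, values, rho)

n_raw = count_local_minima(values)
n_ironed = count_local_minima(centers)
print(f"local minima: raw loss = {n_raw}, ball-center curve = {n_ironed}")
```

In this toy setting the ball is too large to enter the narrow wrinkles, so its center follows a curve with fewer local minima than the raw loss. Shrinking `rho` toward zero makes the center curve approach the loss itself shifted up by `rho`, which mirrors the abstract's remark that the method reduces to GD in the infinitesimal limit.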

Paper Structure

This paper contains 13 sections, 5 theorems, 28 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Proposition 2

Let $f: \mathbb{R}^d \to \mathbb{R},\ \theta \mapsto \langle a, \theta\rangle + b$ for some $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$ be an affine function, and let $\varphi: \mathbb{R}^d \to \mathbb{R}$ be a bounded continuous function. Furthermore, let $\Gamma$ and $\Gamma^\prime$ be the graphs of $f$ and $f + \varphi$ ... as $\rho \to +\infty$, where $K^\perp = K + \nu\mathbb{R}$ and $\nu = (a, -1)$. In particular, for ...

Figures (7)

  • Figure 1: One update step of the RBO.
  • Figure 2: Center trajectory of RBO with different radii on the Riemann function (100-term partial sum).
  • Figure 3: Training curves for the MNIST dataset.
  • Figure 4: Training curves for the CIFAR-10 and CIFAR-100 datasets.
  • Figure 5: Final validation accuracy of a network trained with RBO on the MNIST dataset as a function of $\rho$ and $\eta$.
  • ...and 2 more figures

Theorems & Definitions (13)

  • Proposition 2: Linear ironing
  • Conjecture 3: Strong ironing property
  • Definition 1: Unreachable point
  • Proposition 4: Unreachable sharp minima
  • proof
  • Proposition 2: Linear ironing
  • proof
  • Corollary 3
  • proof
  • Conjecture 4: Strong ironing property
  • ...and 3 more