Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles
Mohammed Djameleddine Belgoumri, Mohamed Reda Bouadjenek, Hakim Hacid, Imran Razzak, Sunil Aryal
TL;DR
The rolling ball optimizer (RBO) reframes optimization as a non-local procedure by rolling a ball of radius $\rho$ over the loss surface, thereby smoothing fine-grained geometry and reducing sensitivity to data noise. By alternating a descent step with a projection onto the $\rho$-offset manifold, RBO achieves non-local updates that are less prone to sharp minima and ill-conditioned valleys. Theoretical results formalize the ironing property, showing that large $\rho$ preserves only large-scale structure, and that unreachable points emerge for sufficiently large radius, enhancing robustness. Empirically, RBO demonstrates faster convergence and competitive or superior generalization compared to SGD, Entropy-SGD, and SAM on MNIST and CIFAR benchmarks, with performance depending on the chosen radius and learning rate. The work highlights promising directions for robust training with non-local optimization while noting computational cost and the need for further validation in realistic, large-scale settings.
Abstract
Training large neural networks (NNs) requires optimizing high-dimensional, data-dependent loss functions. The optimization landscape of these functions is often highly complex and textured, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. To complicate matters further, these landscape characteristics depend on the data, meaning that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are therefore vulnerable to being derailed by noisy data. In practice, this translates to a strong dependence of the optimization dynamics on the noise in the data, i.e., poor generalization performance. To remedy this problem, we propose a new optimization procedure, the Rolling Ball Optimizer (RBO), which breaks this spatial locality by incorporating information from a larger region of the loss landscape in its updates. We achieve this by simulating the motion of a rigid sphere of finite radius rolling on the loss landscape, a straightforward generalization of Gradient Descent (GD) that reduces to GD in the limit of infinitesimal radius. The radius serves as a hyperparameter that determines the scale at which RBO sees the loss landscape, allowing control over the granularity of its interaction with the landscape. We are motivated by the intuition that the large-scale geometry of the loss landscape is less data-specific than its fine-grained structure, and that it is easier to optimize. We support this intuition by proving that our algorithm has a smoothing effect on the loss function. Evaluation against SGD, SAM, and Entropy-SGD on MNIST and CIFAR-10/100 demonstrates promising results in terms of convergence speed, training accuracy, and generalization performance.
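The geometric intuition behind the radius hyperparameter can be illustrated with a small one-dimensional sketch. This is not the RBO update rule from the paper; it is a toy, grid-based computation under our own assumptions. The loss `loss`, the helper `ball_center_height`, the grid, and the two radii are all hypothetical choices made for illustration. The sketch computes the height at which the centre of a ball of radius `rho` rests on a loss curve and compares the lowest resting position for a small and a large radius.

```python
import math

def loss(x):
    # Hypothetical 1D loss (an assumption for illustration): a wide bowl
    # centred at x = 2 plus a narrow, deep notch at x = -1.
    return 0.05 * (x - 2.0) ** 2 - 1.5 * math.exp(-((x + 1.0) ** 2) / (2 * 0.05 ** 2))

def ball_center_height(xs, fs, rho):
    """Height of the centre of a ball of radius rho resting on the curve from above.

    A ball centred at (c, y) stays above the curve iff
    y >= f(x) + sqrt(rho^2 - (x - c)^2) for every x with |x - c| <= rho,
    so the resting height is the maximum of that expression over the window.
    """
    heights = []
    for c in xs:
        best = -math.inf
        for x, f in zip(xs, fs):
            d = x - c
            if abs(d) <= rho:
                best = max(best, f + math.sqrt(max(rho * rho - d * d, 0.0)))
        heights.append(best)
    return heights

def argmin(xs, values):
    # Grid point at which `values` is smallest.
    return min(zip(values, xs))[1]

# Uniform grid on [-4, 6] with spacing 0.02.
xs = [-4.0 + 0.02 * i for i in range(501)]
fs = [loss(x) for x in xs]

x_raw = argmin(xs, fs)                                   # minimiser of the raw loss
x_small = argmin(xs, ball_center_height(xs, fs, 0.02))   # tiny ball
x_large = argmin(xs, ball_center_height(xs, fs, 1.0))    # ball wider than the notch

print("raw loss minimiser:        ", x_raw)
print("ball centre, rho = 0.02:   ", x_small)
print("ball centre, rho = 1.0:    ", x_large)
```

In this toy setting, the tiny ball still settles in the narrow notch, as the raw loss minimiser does, whereas the ball with radius 1 cannot fit into the notch and settles in the wide bowl instead. This mirrors, in one dimension only, the claim that a larger radius retains large-scale structure and leaves sharp minima unreachable.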
