A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks
Tomas Hrycej, Bernhard Bermeitinger, Massimo Pavone, Götz-Henrik Wiegand, Siegfried Handschuh
TL;DR
The paper investigates optimizing deep neural networks by exploiting a hypothesized convexity structure of loss landscapes: non-convex near initialization and convex near the optimum. It introduces a two-phase algorithm that switches from the first-order Adam optimizer to the second-order Conjugate Gradient (CG) method at a swap point identified from the gradient norm $\|\nabla L(x)\|$, using a threshold of $0.9$ times the observed maximum gradient norm. Empirical results on Vision Transformer (ViT) variants and a VGG5 model across MNIST, CIFAR-10, and CIFAR-100 show that the gradient-norm peak pattern occurs consistently and that Adam+CG converges faster and reaches a lower training loss than pure Adam, with CG yielding a rapid loss decrease in the convex region. The study argues that this convexity-driven switching strategy yields practical gains across architectures while remaining robust to deviations from the assumed convexity structure, and it outlines future validation on larger text-based models.
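The threshold rule mentioned above can be illustrated with a minimal sketch. It assumes the swap is triggered the first time the gradient norm falls below $0.9$ times the largest norm observed so far; the function name `find_swap_step`, the `ratio` parameter, and the sample values are hypothetical and are not taken from the paper, whose exact detection procedure may differ (for example, by smoothing noisy mini-batch norms first).

```python
from typing import Optional, Sequence


def find_swap_step(grad_norms: Sequence[float], ratio: float = 0.9) -> Optional[int]:
    """Return the first step whose gradient norm is below ratio * running maximum.

    Assumption: the norm rises to a peak and then decays, so dropping below
    the threshold signals that the peak has been passed.
    Returns None if the norm never drops below the threshold.
    """
    running_max = 0.0
    for step, norm in enumerate(grad_norms):
        running_max = max(running_max, norm)  # largest norm observed so far
        if norm < ratio * running_max:
            return step
    return None


# Illustrative (made-up) gradient norms recorded during a first-order phase.
norms = [1.0, 1.8, 2.5, 2.4, 2.3, 2.2, 1.5]
swap = find_swap_step(norms)
print("switch optimizers at step:", swap)
```

In a training loop, the returned step would mark the point at which the first-order optimizer hands over to CG. A monotonically rising sequence yields `None`, meaning the first phase simply continues.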
Abstract
The key task of machine learning is to minimize the loss function that measures the model's fit to the training data. Which numerical methods do this efficiently depends on the properties of the loss function, the most decisive of which is whether it is convex or non-convex. Because the loss function can have, and frequently has, non-convex regions, non-convex methods such as Adam have been widely adopted. However, a local minimum implies that the function is convex in some neighborhood around it. In this neighborhood, second-order minimization methods such as Conjugate Gradient (CG) offer guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. We leverage this property to design a two-phase optimization algorithm. The algorithm detects the swap point by observing how the gradient norm depends on the loss. A non-convex algorithm (Adam) is used before the swap point and a convex algorithm (CG) after it. Computational experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.
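A one-dimensional toy example may help show why a gradient-norm peak can mark the boundary between the non-convex and convex regions. The sketch below uses the function $f(x) = 1 - e^{-x^2}$, which is an assumption chosen for illustration and does not come from the paper. It is non-convex for $|x| > 1/\sqrt{2}$ and convex closer to its minimum at $x = 0$. In one dimension, $|f'(x)|$ peaks exactly where $f''(x) = 0$; for high-dimensional network losses, the analogous relationship is the hypothesis the paper tests, not a guarantee.

```python
import math


def loss(x: float) -> float:
    return 1.0 - math.exp(-x * x)


def grad(x: float) -> float:
    # f'(x) = 2x * exp(-x^2)
    return 2.0 * x * math.exp(-x * x)


def curvature(x: float) -> float:
    # f''(x) = (2 - 4x^2) * exp(-x^2); negative means locally non-convex
    return (2.0 - 4.0 * x * x) * math.exp(-x * x)


def descend(x0: float = 1.5, lr: float = 0.1, steps: int = 200):
    """Plain gradient descent, recording (x, loss, |grad|, curvature) per step."""
    x = x0
    trace = []
    for _ in range(steps):
        g = grad(x)
        trace.append((x, loss(x), abs(g), curvature(x)))
        x -= lr * g
    return trace


trace = descend()
# Index of the step with the largest gradient magnitude.
peak = max(range(len(trace)), key=lambda i: trace[i][2])
x_peak = trace[peak][0]
print(f"gradient norm peaks at step {peak}, x = {x_peak:.3f}")
print(f"inflection point of the toy loss: x = {1.0 / math.sqrt(2.0):.3f}")
```

Before the peak the recorded curvature is negative and after it the curvature is positive, so in this toy setting the peak separates the region suited to a non-convex method from the one suited to a convex method.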
