Revisiting "Qualitatively Characterizing Neural Network Optimization Problems"
Jonathan Frankle
TL;DR
The paper revisits the observation that the loss along the line segment from initialization to the final trained weights is approximately convex, testing it on modern networks and datasets. Using four image-classification settings and 100 interpolation points parameterized by $x \in [0,1]$, it examines the loss and test error along the path. Contrary to the MNIST-era finding, it observes that in large-scale settings the training loss stays close to its value at initialization for much of the path and falls only near the optimum, and that loss barriers appear on the linear path when interpolating from weights partway through training. These results suggest that the simple convex-path picture does not generalize to modern architectures, though linear interpolation still serves as a qualitative diagnostic of optimization dynamics on current tasks.
Abstract
We revisit and extend the experiments of Goodfellow et al. (2014), who showed that, for then state-of-the-art networks, "the objective function has a simple, approximately convex shape" along the linear path between initialization and the trained weights. We do not find this to be the case for modern networks on CIFAR-10 and ImageNet. Instead, although loss is roughly monotonically non-increasing along this path, it remains high until close to the optimum. In addition, training quickly becomes linearly separated from the optimum by loss barriers. We conclude that, although Goodfellow et al.'s findings describe the "relatively easy to optimize" MNIST setting, behavior is qualitatively different in modern settings.
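The interpolation experiment described above can be sketched in a few lines. This is a minimal illustration and not the paper's code. It assumes the weights are flattened into a single vector, evaluates a loss at evenly spaced points $\theta(x) = (1 - x)\,\theta_{\text{start}} + x\,\theta_{\text{end}}$ for $x \in [0,1]$, and then checks whether the loss is monotonically non-increasing and whether a barrier appears. The function names, the toy quadratic loss, and the barrier definition (maximum loss on the path minus the larger endpoint loss) are all assumptions made for illustration.

```python
import numpy as np


def interpolate_losses(theta_start, theta_end, loss_fn, num_points=100):
    """Evaluate loss_fn at evenly spaced points on the segment between two weight vectors."""
    xs = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - x) * theta_start + x * theta_end) for x in xs]
    )
    return xs, losses


def is_monotone_non_increasing(losses, tol=1e-8):
    """True if the loss never rises by more than tol between adjacent points."""
    return bool(np.all(np.diff(losses) <= tol))


def barrier_height(losses):
    """Assumed definition: max loss on the path minus the larger endpoint loss."""
    return float(max(0.0, losses.max() - max(losses[0], losses[-1])))


def toy_loss(theta):
    """Hypothetical stand-in for a network's loss: a convex quadratic."""
    return float(np.sum(theta ** 2))


rng = np.random.default_rng(0)
theta_init = rng.normal(size=10)   # stand-in for the weights at initialization
theta_final = np.zeros(10)         # stand-in for the trained weights (toy optimum)

xs, losses = interpolate_losses(theta_init, theta_final, toy_loss)
print("monotone non-increasing:", is_monotone_non_increasing(losses))
print("barrier height:", barrier_height(losses))
```

On this convex toy loss the path decreases smoothly and has no barrier, which is the "simple" picture. For a real network, `loss_fn` would load the interpolated vector into the model and evaluate it on a dataset, and `theta_start` could be a checkpoint from partway through training rather than the initialization.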
