Extended convexity and smoothness and their applications in deep learning
Binchuan Qi, Wei Gong, Li Li
TL;DR
This work develops a generalized optimization framework for non-convex, non-smooth deep learning by extending strong convexity and Lipschitz smoothness to norm-power-based notions, namely $\mathcal{H}(\phi,c_\phi)$-convexity and $\mathcal{H}(\Phi,c_\Phi)$-smoothness. Under these assumptions, empirical risk minimization can be interpreted as jointly minimizing a local gradient norm and a structural error, which together bound the empirical risk from above and below. The authors show that SGD effectively reduces the local gradient norm and that skip connections, over-parameterization, and random initialization help control the structural error; experiments on MNIST, CIFAR-10, and Fashion-MNIST support these claims. The results offer a principled account of non-convex optimization in deep learning and practical guidance for designing architectures and training regimes that converge toward favorable solutions.
Abstract
Classical assumptions such as strong convexity and Lipschitz smoothness often fail to capture deep learning optimization problems, which are typically non-convex and non-smooth, so traditional analyses are of limited use. This study aims to elucidate the mechanisms of non-convex optimization in deep learning by extending the conventional notions of strong convexity and Lipschitz smoothness. Using these extended notions, we prove that, under the stated conditions, empirical risk minimization is equivalent to optimizing the local gradient norm and the structural error, which together constitute upper and lower bounds on the empirical risk. Our analysis further shows that stochastic gradient descent (SGD) can effectively minimize the local gradient norm, and that skip connections, over-parameterization, and random parameter initialization help control the structural error. Finally, we validate these core conclusions through extensive experiments. Together, the theoretical and experimental results provide new insight into the mechanisms of non-convex optimization in deep learning.
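For orientation, the classical notions being extended can be written out as follows. This is the standard textbook setting, not the paper's generalized definitions, and the symbols $f$, $\mu$, $L$, and $f^\ast$ are introduced here only for illustration. Assume $f$ is differentiable on $\mathbb{R}^d$ and attains its minimum value $f^\ast$. Then $\mu$-strong convexity and $L$-Lipschitz smoothness give, for all $x, y$:

$$
f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{\mu}{2}\,\|y - x\|^2 \;\le\; f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\,\|y - x\|^2 .
$$

Minimizing each side over $y$ sandwiches the suboptimality gap between two multiples of the squared gradient norm:

$$
\frac{1}{2L}\,\|\nabla f(x)\|^2 \;\le\; f(x) - f^\ast \;\le\; \frac{1}{2\mu}\,\|\nabla f(x)\|^2 .
$$

In this classical case the gradient norm alone controls the objective from above and below. The upper and lower bounds on the empirical risk described in the abstract can be read as an analogue of this sandwich for the non-convex, non-smooth setting, with the structural error entering as an additional term; the exact form of those bounds is given in the paper itself and is not reproduced here.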
