Learning Halfspaces and Neural Networks with Random Initialization
Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, Michael I. Jordan
TL;DR
The paper analyzes non-convex empirical risk minimization (ERM) for learning halfspaces and multilayer neural networks under L-Lipschitz losses. It shows that repeated random initialization followed by simple optimization steps achieves excess risk ε in time polynomial in the sample size n and the input dimension d, but exponential in (L/ε^2) log(L/ε). Hardness results show that, in general, this dependence on ε cannot be improved to polynomial time. On the positive side, the paper gives agnostic learning guarantees for networks of modest complexity, and a boosting-based method, BoostNet, that efficiently learns networks when the data is separable with constant margin γ (with exponential dependence on 1/γ). A simulation study on parity functions shows BoostNet outperforming standard backpropagation on noisy data. Together, these results delineate what non-convex ERM can and cannot achieve for halfspaces and deep networks, connect complexity-theoretic barriers to margin-based learnability, and provide concrete, implementable algorithms for structured data.
Abstract
We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon>0$. The time complexity is polynomial in the input dimension $d$ and the sample size $n$, but exponential in the quantity $(L/\epsilon^2)\log(L/\epsilon)$. These algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin $\gamma>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$. As a consequence, the algorithm achieves arbitrary generalization error $\epsilon>0$ with ${\rm poly}(d,1/\epsilon)$ sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability $\eta<1/2$.
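The "multiple rounds of random initialization followed by optimization steps" scheme can be illustrated with a minimal sketch for a halfspace. This is not the paper's algorithm and carries none of its guarantees: the sigmoid-type loss, the projected gradient step standing in for the "arbitrary optimization steps", and all function names and default parameters (`random_restart_erm`, `rounds`, `steps`, `step`) are assumptions made purely for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid_loss(margin):
    """Non-convex, (1/4)-Lipschitz loss 1 / (1 + exp(margin)), computed stably."""
    if margin >= 0:
        e = math.exp(-margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(margin))


def empirical_risk(w, xs, ys):
    """Average loss of the halfspace x -> sign(w . x) on the sample."""
    return sum(sigmoid_loss(y * dot(w, x)) for x, y in zip(xs, ys)) / len(xs)


def project_to_unit_ball(w):
    norm = math.sqrt(dot(w, w))
    return [a / norm for a in w] if norm > 1.0 else list(w)


def random_unit_vector(d, rng):
    # A normalized Gaussian vector is uniformly distributed on the unit sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [a / norm for a in g]


def gradient_step(w, xs, ys, step):
    """One projected gradient step on the empirical risk (illustrative choice)."""
    d, n = len(w), len(xs)
    grad = [0.0] * d
    for x, y in zip(xs, ys):
        s = sigmoid_loss(y * dot(w, x))
        coeff = -s * (1.0 - s) * y  # derivative of the loss w.r.t. (w . x)
        for j in range(d):
            grad[j] += coeff * x[j]
    return project_to_unit_ball([w[j] - step * grad[j] / n for j in range(d)])


def random_restart_erm(xs, ys, rounds=20, steps=50, step=1.0, seed=0):
    """Run several random initializations and keep the lowest-risk result."""
    rng = random.Random(seed)
    d = len(xs[0])
    best_w, best_risk = None, float("inf")
    for _ in range(rounds):
        w = random_unit_vector(d, rng)      # random initialization
        for _ in range(steps):              # local optimization steps
            w = gradient_step(w, xs, ys, step)
        risk = empirical_risk(w, xs, ys)
        if risk < best_risk:                # keep the best round so far
            best_w, best_risk = w, risk
    return best_w, best_risk
```

Calling `random_restart_erm(xs, ys)` on a list of feature tuples `xs` and labels `ys` in {-1, +1} returns the best weight vector found and its empirical risk. Keeping the best of several rounds is what makes the procedure insensitive to any single bad initialization; how many rounds are needed for a given ε is the subject of the paper's analysis and is not captured by this sketch.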
