Learning Halfspaces and Neural Networks with Random Initialization

Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, Michael I. Jordan

TL;DR

The paper analyzes non-convex empirical risk minimization (ERM) for learning halfspaces and multilayer neural networks with L-Lipschitz losses. It shows that repeated random initialization followed by simple optimization steps achieves excess risk ε in time polynomial in n and d but exponential in (L/ε^2) log(L/ε). Hardness results indicate that this dependence on ε cannot be made polynomial in general. On the positive side, the paper gives agnostic learning guarantees for networks of modest complexity, and a boosting-based method, BoostNet, that efficiently learns networks when the data is separable with constant margin γ (with exponential dependence on 1/γ). A simulation study on parity functions with noise shows BoostNet outperforming standard backpropagation. Overall, the work delineates both the capabilities and the limitations of non-convex ERM for halfspaces and deep networks, connecting complexity-theoretic barriers to margin-based learnability.

Abstract

We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $ε>0$. The time complexity is polynomial in the input dimension $d$ and the sample size $n$, but exponential in the quantity $(L/ε^2)\log(L/ε)$. These algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin $γ>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $Ω(γ)$. As a consequence, the algorithm achieves arbitrary generalization error $ε>0$ with ${\rm poly}(d,1/ε)$ sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability $η<1/2$.
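The "multiple rounds of random initialization followed by arbitrary optimization steps" pattern in the abstract can be illustrated for the halfspace case with a minimal sketch. This is not the paper's algorithm. The sigmoid loss, the unit-ball constraint on the weights, the projected gradient steps, and every name and parameter below (`random_restart_erm`, `rounds`, `steps`, `lr`, `scale`) are assumptions chosen for illustration. The sketch only shows the outer structure: draw a random start, run some optimizer, and keep the candidate with the lowest empirical risk.

```python
import numpy as np

def sigmoid_loss(margins, scale=0.25):
    # Non-convex loss of the margin y * <w, x>; Lipschitz with constant 1 / (4 * scale).
    return 1.0 / (1.0 + np.exp(margins / scale))

def empirical_risk(w, X, y, scale=0.25):
    # Average loss of the halfspace x -> <w, x> on the sample (X, y), labels in {-1, +1}.
    return float(np.mean(sigmoid_loss(y * (X @ w), scale)))

def project_to_unit_ball(w):
    # Keep the weight vector inside the Euclidean unit ball (an assumed constraint).
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def random_restart_erm(X, y, rounds=20, steps=100, lr=0.5, scale=0.25, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Start from the zero vector as a baseline candidate.
    best_w = np.zeros(d)
    best_risk = empirical_risk(best_w, X, y, scale)
    for _ in range(rounds):
        # Random initialization: a uniformly random direction on the unit sphere.
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        # "Arbitrary optimization steps": here, projected gradient descent.
        for _ in range(steps):
            loss = sigmoid_loss(y * (X @ w), scale)
            # d(loss)/d(margin) = -loss * (1 - loss) / scale; chain rule through y * <w, x>.
            coeff = -loss * (1.0 - loss) / scale
            grad = (X * (coeff * y)[:, None]).mean(axis=0)
            w = project_to_unit_ball(w - lr * grad)
        risk = empirical_risk(w, X, y, scale)
        # Keep the best candidate seen across all rounds.
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk
```

Calling `random_restart_erm(X, y)` on an `(n, d)` array `X` with labels `y` in {-1, +1} returns the best weight vector found and its empirical risk. The number of rounds needed for the paper's guarantee is what drives the exponential dependence on $(L/ε^2)\log(L/ε)$; the small default here is only for demonstration.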

Paper Structure

This paper contains 33 sections, 14 theorems, 97 equations, 2 figures, 4 algorithms.

Key Result

Lemma 1

Assume that $\mathcal{F}$ contains the constant zero function $f(x) \equiv 0$, then we have

Figures (2)

  • Figure 1: Comparing the step function (panel (a)) with its two continuous approximations (panels (b) and (c)).
  • Figure 2: Performance of BoostNet and BackProp on the problem of learning parity function with noise.

Theorems & Definitions (15)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1
  • Theorem 2
  • Definition: MAX-2-SAT
  • Proposition 1
  • Theorem 3
  • Theorem 4
  • Corollary 1
  • ...and 5 more