Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time

Sungyoon Kim, Mert Pilanci

TL;DR

The paper analyzes the training of regularized two-layer ReLU networks through a Gaussian-relaxed convex reformulation based on random hyperplane arrangements. It proves that the relative gap between the non-convex objective $p^{*}$ and its relaxation $\tilde{p}^{*}$ scales as $O(\sqrt{\log n})$ under Gaussian data and mild conditions. It introduces a polynomial-time randomized algorithm with complexity $O(d^{3}m^{3})$ that achieves this approximation, and shows that local gradient methods converge to high-quality stationary points with high probability, which helps explain their empirical effectiveness. The authors develop a framework based on duality and Gordon's comparison, connect the analysis to cone sharpness $\mathcal{C}$, and extend the guarantees from unconstrained to constrained relaxations, with a MAX-CUT interpretation providing additional insight. Collectively, the work yields principled, scalable guarantees for convex relaxations of ReLU networks, offering a theoretical explanation for the success of SGD-like methods and a path toward tractable polynomial-time approximation of the global optimum.

Abstract

In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations. We show that when the training data is random, the relative optimality gap between the original problem and its relaxation can be bounded by a factor of $O(\sqrt{\log n})$, where $n$ is the number of training samples. A simple application leads to a tractable polynomial-time algorithm that is guaranteed to solve the original non-convex problem up to a logarithmic factor. Moreover, under mild assumptions, we show that local gradient methods converge to a point with low training loss with high probability. Our result is an exponential improvement compared to existing results and sheds new light on understanding why local gradient methods work well.
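
For orientation, the weight-decay-regularized training problem the abstract refers to is commonly written as below in the convex-reformulation literature. This is a sketch under assumptions: the symbols $X \in \mathbb{R}^{n \times d}$ (data), $y \in \mathbb{R}^{n}$ (labels), $\beta > 0$ (regularization strength), $u_j$ and $\alpha_j$ (first- and second-layer weights), and the squared loss are notation chosen here for illustration, and the paper's exact normalization may differ.

```latex
p^{*} \;=\; \min_{\{u_j,\,\alpha_j\}_{j=1}^{m}}
\frac{1}{2}\Big\| \sum_{j=1}^{m} (X u_j)_{+}\,\alpha_j - y \Big\|_2^2
\;+\; \frac{\beta}{2}\sum_{j=1}^{m}\big(\|u_j\|_2^2 + \alpha_j^2\big)
```

Here $(\cdot)_{+}$ denotes the ReLU and $m$ the network width. The relaxation studied in the paper replaces this non-convex problem with a convex one, and the stated result bounds the relative gap between the two optimal values by a factor of order $\sqrt{\log n}$.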

Paper Structure (22 sections, 41 theorems, 180 equations, 3 figures)

This paper contains 22 sections, 41 theorems, 180 equations, and 3 figures.

Key Result

Theorem 2.1

(Informal) Consider the two-layer ReLU network training problem and its convex relaxation. Here, $|\mathcal{D}| = m/2$ and the elements of $\mathcal{D}$ are sampled as the hyperplane arrangement patterns of random Gaussian vectors. Assume (A1), $d$ is sufficiently large, and suppose $m = \kappa \max\{m^{*}, 320(\sqrt{c}+1)^2 \log(\frac{n}{\delta})\}$ for some fixed $\kappa \geq 1$. Then, with high probability, … for some …
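
The sampling step mentioned in the theorem (arrangement patterns induced by random Gaussian vectors) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_arrangement_patterns`, the synthetic data, and the choice of 15 patterns (that is, $m/2$ for an illustrative $m = 30$) are all assumptions made here. Each pattern is the 0/1 vector $\mathbb{1}[Xg \geq 0]$ for a Gaussian vector $g$, which corresponds to a diagonal matrix $D = \mathrm{diag}(\mathbb{1}[Xg \geq 0])$.

```python
import numpy as np


def sample_arrangement_patterns(X, num_patterns, seed=0):
    """Sample hyperplane arrangement patterns of X from Gaussian directions.

    Hypothetical helper for illustration. Returns a boolean array of shape
    (num_patterns, n); row i is the indicator 1[X @ g_i >= 0] for an
    independent standard Gaussian vector g_i in R^d.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    G = rng.standard_normal((d, num_patterns))  # one Gaussian direction per column
    return (X @ G >= 0).T


# Synthetic Gaussian data with n = 300, d = 10 (sizes chosen for illustration).
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((300, 10))

# |D| = m / 2 patterns for an assumed width m = 30.
patterns = sample_arrangement_patterns(X, num_patterns=15)

# Masked feature matrices D_i X, computed by broadcasting instead of
# materializing each n-by-n diagonal matrix D_i.
masked = patterns[:, :, None] * X[None, :, :]  # shape (15, 300, 10)
```

The stacked matrices `masked[i]` are the blocks $D_i X$ that a convex program over the sampled patterns would use. Broadcasting avoids building each $n \times n$ diagonal matrix explicitly.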

Figures (3)

  • Figure 1: Convex relaxations of different widths. Here, we show the optimal value of the relaxed problem for different numbers of subsampled hyperplane arrangement patterns and regularization. Note that $n = 300, d = 10$, hence there are $\approx 30^{10}$ variables in the convex reformulation. However, using $\approx 30$ neurons in the relaxed problem optimizes the objective well.
  • Figure 2: Verification of (A2) for gradient descent. The upper figure shows how many hyperplane arrangement patterns change for random data, and the lower figure shows how many of them change for MNIST.
  • Figure 3: Effect of regularization on test performance. The left and right panels show the test performance of a trained model on synthetic data and MNIST, respectively, for different model sizes and regularization strengths. Across the choices of regularization, test performance changes by at most 5% for synthetic data and 10% for MNIST.

Theorems & Definitions (63)

  • Theorem 2.1
  • Remark 2.2
  • Theorem 2.3
  • Lemma 2.4
  • Theorem 2.5
  • Remark 2.6
  • Proposition 3.1
  • Proposition 3.2
  • Corollary 3.3
  • Theorem 3.4
  • ...and 53 more