Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time

Sungyoon Kim; Mert Pilanci

TL;DR

The paper analyzes the training of regularized two-layer ReLU networks through a Gaussian-relaxed convex reformulation based on random hyperplane arrangements, proving that the relative gap between the non-convex objective $p^{*}$ and its relaxation $\tilde{p}^{*}$ scales as $O(\sqrt{\log n})$ under Gaussian data and mild conditions. It introduces a polynomial-time randomized algorithm with complexity $O(d^{3}m^{3})$ that achieves this approximation, and it shows that local gradient methods converge to high-quality stationary points with high probability, shedding light on their empirical effectiveness. The authors develop a framework based on duality and Gordon’s comparison, connect the analysis to cone sharpness $\mathcal{C}$, and extend the guarantees from unconstrained to constrained relaxations, with a MAX-CUT interpretation providing additional insight. Collectively, the work yields principled, scalable guarantees for convex relaxations of ReLU networks, offering a theoretical explanation for the success of SGD-like methods and a path toward tractable polynomial-time approximations of the global optimum.

Abstract

In this paper, we study the optimality gap between two-layer ReLU networks regularized with weight decay and their convex relaxations. We show that when the training data is random, the relative optimality gap between the original problem and its relaxation can be bounded by a factor of $O(\sqrt{\log n})$, where $n$ is the number of training samples. A simple application leads to a tractable polynomial-time algorithm that is guaranteed to solve the original non-convex problem up to a logarithmic factor. Moreover, under mild assumptions, we show that local gradient methods converge to a point with low training loss with high probability. Our result is an exponential improvement compared to existing results and sheds new light on understanding why local gradient methods work well.
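To make the idea of a subsampled convex relaxation concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes the unconstrained relaxation takes the group-$\ell_1$ form $\min_{w_i} \frac{1}{2}\|\sum_i D_i X w_i - y\|_2^2 + \beta \sum_i \|w_i\|_2$ with $D_i = \mathrm{diag}(\mathbb{1}[X g_i \geq 0])$ for Gaussian $g_i$, and it swaps in plain proximal gradient descent as the solver. The function names, the solver choice, and the synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_patterns(X, num_patterns, rng):
    """Sample arrangement patterns 1[X g >= 0] for Gaussian g.

    Returns an (n, P) 0/1 matrix whose column i is the diagonal of D_i.
    """
    G = rng.standard_normal((X.shape[1], num_patterns))
    return (X @ G >= 0).astype(X.dtype)

def relaxed_objective(X, y, masks, W, beta):
    """Assumed objective: 0.5*||sum_i D_i X w_i - y||^2 + beta*sum_i ||w_i||_2."""
    pred = np.sum(masks * (X @ W), axis=1)
    return 0.5 * np.sum((pred - y) ** 2) + beta * np.sum(np.linalg.norm(W, axis=0))

def solve_relaxation(X, y, masks, beta, num_iters=500):
    """Proximal gradient descent (a stand-in solver, not the paper's method)."""
    n, d = X.shape
    P = masks.shape[1]
    # Stacked design matrix [D_1 X, ..., D_P X]; its squared spectral norm
    # is the Lipschitz constant of the gradient of the smooth term.
    A = np.concatenate([masks[:, [i]] * X for i in range(P)], axis=1)
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    W = np.zeros((d, P))  # column i holds w_i
    for _ in range(num_iters):
        resid = np.sum(masks * (X @ W), axis=1) - y
        grad = X.T @ (masks * resid[:, None])  # column i is X^T D_i resid
        V = W - step * grad
        # Group soft-thresholding: shrink each column toward zero.
        norms = np.linalg.norm(V, axis=0)
        shrink = np.maximum(0.0, 1.0 - step * beta / np.maximum(norms, 1e-12))
        W = V * shrink
    return W

# Illustrative usage on synthetic Gaussian data with a planted ReLU neuron.
rng = np.random.default_rng(0)
n, d, P, beta = 60, 5, 20, 0.1
X = rng.standard_normal((n, d))
y = np.maximum(X @ rng.standard_normal(d), 0.0)
masks = sample_patterns(X, P, rng)
W = solve_relaxation(X, y, masks, beta)
print(relaxed_objective(X, y, masks, np.zeros((d, P)), beta),
      relaxed_objective(X, y, masks, W, beta))
```

Because no sign constraints are imposed on the $w_i$, the problem is a standard group lasso, which is why a generic first-order solver suffices for the sketch; the constrained relaxation discussed in the paper would need a different solver.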

Paper Structure (22 sections, 41 theorems, 180 equations, 3 figures)

Introduction
Prior and Related Work
Main Results
Preliminaries
Overview of Theoretical Results
Notations for Proof
Overall Proof Strategy
Guarantees for the Unconstrained Convex Relaxation
Warmup: Unconstrained Relaxation Without Regularization
Unconstrained Relaxation with $l_2$ Regularization
Unconstrained Relaxation With Group $l_1$ Regularization
Connection to the MAX-CUT Problem
Scale of $\kappa$ for Random Data
Extension to the Constrained Problem
Cone Sharpness $\mathcal{C}$
...and 7 more sections

Key Result

Theorem 2.1

(Informal) Consider the two-layer ReLU network training problem and its convex relaxation. Here, $|\mathcal{D}| = m/2$ and the elements of $\mathcal{D}$ are sampled as the hyperplane arrangement patterns of random Gaussian vectors. Assume (A1), $d$ is sufficiently large, and suppose $m = \kappa \max\{m^{*}, 320(\sqrt{c}+1)^2 \log(\frac{n}{\delta})\}$ for some fixed $\kappa \geq 1$. Then, with high probability, for some …
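As a small worked example of the width condition in the theorem, the sketch below evaluates $m = \kappa \max\{m^{*}, 320(\sqrt{c}+1)^2 \log(n/\delta)\}$ for user-supplied inputs. The quantities $c$ and $m^{*}$ are defined in the paper's full statement and are not reproduced here, so the function name and any values passed for them are placeholders.

```python
import math

def required_width(n, delta, c, m_star, kappa=1.0):
    """Evaluate m = kappa * max(m_star, 320 * (sqrt(c) + 1)^2 * log(n / delta)).

    c and m_star must come from the paper's definitions; they are plain
    inputs here because the summary does not state them.
    """
    if kappa < 1:
        raise ValueError("the theorem assumes kappa >= 1")
    log_term = 320.0 * (math.sqrt(c) + 1.0) ** 2 * math.log(n / delta)
    return kappa * max(m_star, log_term)

# Placeholder inputs: n = 300 samples, failure probability delta = 0.05.
m = required_width(n=300, delta=0.05, c=1.0, m_star=10.0)
print(m, "neurons, i.e.", m / 2, "sampled arrangement patterns")
```

Since $|\mathcal{D}| = m/2$, half of the returned width corresponds to the number of sampled arrangement patterns.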

Figures (3)

Figure 1: Convex relaxations of different widths. Here, we show the optimal value of the relaxed problem for different numbers of subsampled hyperplane arrangement patterns and regularization. Note that $n = 300, d = 10$, hence there are $\approx 30^{10}$ variables in the convex reformulation. However, using $\approx 30$ neurons in the relaxed problem optimizes the objective well.
Figure 2: Verification of (A2) for gradient descent. The upper figure shows how many hyperplane arrangement patterns change for random data, and the lower figure shows how many of them change for MNIST.
Figure 3: Effect of regularization on test performance. The left and right figures show the test performance of a trained model on synthetic data and MNIST, respectively, for different model sizes and regularization strengths. Across the choices of regularization, the test performance changes by at most 5% for synthetic data and 10% for MNIST.

Theorems & Definitions (63)

Theorem 2.1
Remark 2.2
Theorem 2.3
Lemma 2.4
Theorem 2.5
Remark 2.6
Proposition 3.1
Proposition 3.2
Corollary 3.3
Theorem 3.4
...and 53 more
