A Convex Relaxation Approach to Generalization Analysis for Parallel Positively Homogeneous Networks

Uday Kiran Reddy Tadipatri, Benjamin D. Haeffele, Joshua Agterberg, René Vidal

TL;DR

This work presents a convex-relaxation framework for generalization analysis of parallel positively homogeneous networks, linking the non-convex ERM problem to a convex surrogate over prediction functions. Its master theorem decomposes the generalization error into optimization-like and statistical components, yielding a sample complexity that is near-linear in the network width $R$ and the parameter dimension $\mathrm{dim}(\mathcal{W})$ (up to logarithmic factors). The framework covers low-rank matrix sensing, two-layer linear and ReLU networks, and single-layer multi-head attention, and gives distribution-aware guarantees for these non-convex architectures. The results offer a unifying view of generalization through convex analysis and may guide design choices in depth-two and attention-based models.

Abstract

We propose a general framework for deriving generalization bounds for parallel positively homogeneous neural networks, a class of neural networks whose input-output map decomposes as the sum of positively homogeneous maps. Examples of such networks include matrix factorization and sensing, single-layer multi-head attention mechanisms, tensor factorization, deep linear and ReLU networks, and more. Our general framework is based on linking the non-convex empirical risk minimization (ERM) problem to a closely related convex optimization problem over prediction functions, which provides a global, achievable lower bound to the ERM problem. We exploit this convex lower bound to perform generalization analysis in the convex space while controlling the discrepancy between the convex model and its non-convex counterpart. We apply our general framework to a wide variety of models ranging from low-rank matrix sensing, to structured matrix sensing, two-layer linear networks, two-layer ReLU networks, and single-layer multi-head attention mechanisms, achieving generalization bounds with a sample complexity that scales almost linearly with the network width.
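To make the phrase "sum of positively homogeneous maps" concrete, here is a minimal sketch of one of the listed examples, a two-layer ReLU network written as a sum of $R$ parallel units. It assumes, for illustration only, that homogeneity is taken with respect to each unit's parameters $(u_j, v_j)$, so scaling all parameters by $\alpha \ge 0$ scales the output by $\alpha^2$. The names `unit`, `network`, and `scale` are hypothetical and do not come from the paper.

```python
import random

def relu(z):
    # Rectified linear unit on a scalar.
    return z if z > 0.0 else 0.0

def unit(x, u, v):
    # One parallel branch: v * relu(<u, x>). Assumed toy parameterization.
    return v * relu(sum(ui * xi for ui, xi in zip(u, x)))

def network(x, params):
    # Parallel network: the output is the sum of the R branch outputs.
    return sum(unit(x, u, v) for u, v in params)

def scale(params, alpha):
    # Multiply every parameter of every branch by alpha.
    return [([alpha * ui for ui in u], alpha * v) for u, v in params]

rng = random.Random(0)
d, R = 3, 4  # input dimension and width, chosen arbitrarily
params = [([rng.gauss(0, 1) for _ in range(d)], rng.gauss(0, 1)) for _ in range(R)]
x = [rng.gauss(0, 1) for _ in range(d)]

alpha = 2.5
base = network(x, params)
scaled = network(x, scale(params, alpha))

# Degree-2 positive homogeneity in the parameters: scaled == alpha**2 * base
# up to floating-point rounding.
homogeneity_gap = abs(scaled - alpha ** 2 * base)
print(homogeneity_gap)
```

The check relies on $\mathrm{relu}(\alpha z) = \alpha\,\mathrm{relu}(z)$ for $\alpha \ge 0$; the identity fails for negative $\alpha$, which is why the homogeneity is only positive.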

Paper Structure

This paper contains 28 sections, 33 theorems, 513 equations, 1 figure, 4 tables.

Key Result

Theorem 1

Under Assumptions (a1)-(a4), let $f^*_{\mu_N}$ (or $f^*_{\mu}$) be the global minimizer of ${\sf C}_{\mu_N}(\cdot)$ (or ${\sf C}_{\mu}(\cdot)$). For any stationary point $(r, \{W_j\})$ of the function ${\sf NC}_{\mu_{N}}(\cdot)$ and any $f \in L^2(\mu) \cap L^2(\mu_N)$ the following are true: where $\Omega_{q}^{\circ}(\cdot)$ is referred to as the polar in the measure $q$, defined as

Figures (1)

  • Figure 1: Numerical simulations of the Lipschitz constant (or upper bound thereof) obtained for different model widths $(r)$.

Theorems & Definitions (69)

  • Theorem 1: Convex Bounds for Learning
  • Theorem 2: Master Theorem
  • Corollary 1: Low-Rank Matrix Sensing
  • Corollary 2: Two-Layer ReLU Neural Network
  • Corollary 3: Transformers
  • Proposition 1: Convexity of induced regularizer
  • proof
  • Proposition 2
  • proof
  • Proposition 3: Stationary Points
  • ...and 59 more