Geometry and Optimization of Shallow Polynomial Networks

Yossi Arjevani; Joan Bruna; Joe Kileel; Elzbieta Polak; Matthew Trager

TL;DR

This work studies shallow polynomial networks with monomial activations, showing that their function spaces correspond to symmetric tensors of bounded rank and fall into three width regimes (low-dimensional, thick, filling) governed by the Alexander-Hirschowitz theorem. It connects optimization landscapes to tensor decompositions, analyzes teacher-student problems as low-rank approximation under data-induced inner products, and introduces a data discriminant capturing how training data and metrics qualitatively alter the loss surface. For quadratic activations, it gives Eckart-Young-type characterizations of all critical points and Hessian signatures under Gaussian and Frobenius norms, while showing that non-Gaussian data can induce exponentially many critical points. The results provide a rigorous, geometry-grounded framework for understanding non-convex optimization in shallow networks and highlight how data distributions shape learning dynamics and landscape topology.

Abstract

We study shallow neural networks with monomial activations and output dimension one. The function space for these models can be identified with a set of symmetric tensors with bounded rank. We describe general features of these networks, focusing on the relationship between width and optimization. We then consider teacher-student problems, which can be viewed as problems of low-rank tensor approximation with respect to non-standard inner products that are induced by the data distribution. In this setting, we introduce a teacher-metric data discriminant which encodes the qualitative behavior of the optimization as a function of the training data distribution. Finally, we focus on networks with quadratic activations, presenting an in-depth analysis of the optimization landscape. In particular, we present a variation of the Eckart-Young Theorem characterizing all critical points and their Hessian signatures for teacher-student problems with quadratic networks and Gaussian training data.
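The identification of the function space with bounded-rank symmetric tensors can be checked numerically in the quadratic case. The sketch below assumes the parameterization $f(x) = \sum_{i=1}^{r} \alpha_i (w_i \cdot x)^2$ for a width-$r$ network; the names `network`, `symmetric_matrix`, `alpha`, and `W` are illustrative and not taken from the paper. Under that assumption, the network equals the quadratic form $x^\top T x$ with $T = \sum_i \alpha_i w_i w_i^\top$, a symmetric matrix of rank at most $r$.

```python
import numpy as np


def network(alpha, W, x):
    """Width-r shallow network with quadratic activation (assumed form):
    f(x) = sum_i alpha_i * (w_i . x)^2, where w_i is the i-th row of W."""
    return float(alpha @ (W @ x) ** 2)


def symmetric_matrix(alpha, W):
    """Symmetric matrix T = sum_i alpha_i * w_i w_i^T with f(x) = x^T T x."""
    return W.T @ (alpha[:, None] * W)


rng = np.random.default_rng(0)
n, r = 5, 2                       # input dimension and width (illustrative)
alpha = rng.standard_normal(r)    # output weights
W = rng.standard_normal((r, n))   # hidden-layer weights, one row per neuron
x = rng.standard_normal(n)        # a test input

T = symmetric_matrix(alpha, W)
print(np.isclose(network(alpha, W, x), x @ T @ x))  # the two forms agree
print(np.linalg.matrix_rank(T) <= r)                # rank is bounded by the width
```

For general degree $d$ the same construction yields a symmetric tensor of order $d$ in place of the matrix `T`, with the width again bounding the rank.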

Paper Structure (24 sections, 31 theorems, 85 equations, 3 figures)

This paper contains 24 sections, 31 theorems, 85 equations, 3 figures.

Introduction
Summary of main results.
Notation and preliminaries.
Shallow Polynomial Networks
The function space
Thick and filling spaces.
The parameterizing map.
Critical parameters and branch functions.
Optimization landscapes
Favorable landscapes.
Landscape and width.
Bad minima for wide networks.
Teacher-Student Problems
Functional norms
Critical points of the distance function
...and 9 more sections

Key Result

Theorem 1

If $d = 2$, we have $r_{\textup{thick}}(2,n) = r_{\textup{fill}}(2,n) = n$. If $d \ge 3$, then $r_{\textup{thick}}(d,n) = \left\lceil \frac{1}{n}\binom{n+d-1}{d} \right\rceil$, except for $(d,n) = (4,3),(4,4),(4,5),(3,5)$, where this bound needs to be increased by one. For all $(d,n)$, it holds that $r_{\textup{fill}}(d,n) \le 2 r_{\textup{thick}}(d,n)$.
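As a rough illustration of the threshold in Theorem 1, the sketch below evaluates the classical Alexander-Hirschowitz count $\lceil \binom{n+d-1}{d} / n \rceil$ for the generic rank of degree-$d$ forms in $n$ variables, adds one in the four exceptional cases listed in the theorem, and returns $n$ when $d = 2$. It assumes that this count is the bound the theorem refers to; the function name `r_thick` is chosen here for illustration only.

```python
from math import comb

# Exceptional (d, n) pairs from Theorem 1, where the count is increased by one.
EXCEPTIONS = {(4, 3), (4, 4), (4, 5), (3, 5)}


def r_thick(d: int, n: int) -> int:
    """Smallest width at which the function space is thick, assuming the
    classical Alexander-Hirschowitz count applies (d = degree, n = inputs)."""
    if d < 2 or n < 1:
        raise ValueError("expected d >= 2 and n >= 1")
    if d == 2:
        return n
    expected = -(-comb(n + d - 1, d) // n)  # ceiling of binom(n+d-1, d) / n
    return expected + 1 if (d, n) in EXCEPTIONS else expected


for d, n in [(2, 7), (3, 3), (3, 4), (3, 5), (4, 3)]:
    print((d, n), r_thick(d, n))
```

The final inequality of the theorem then gives `2 * r_thick(d, n)` as an upper bound on the filling width.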

Figures (3)

Figure 1: Landscapes of non-convex functions: (a) no bad minima and no spurious valleys; (b) bad minima and spurious valleys; (c) bad minima and no spurious valleys (all depicted points are local minima); (d) no bad minima and spurious valleys.
Figure 2: Visualizations of discriminants. Left: The focal locus of an ellipse with two 'teacher' points and the corresponding critical points for the associated teacher-student problem. Center: Focal curves encoded by the teacher-metric discriminant $P(x,y,m_{11}, m_{12}, m_{22})$. The top figure illustrates curves for $m_{22}=1, m_{01} = 0$ and $m_{11} = t$ (varying $t$), while the bottom figure illustrates curves for $m_{11}=m_{22} = 1$ and $m_{10} = t$ (varying $t$). Right: A 3D plot of the determinantal variety $X = \{(t_{00}, t_{01}, t_{11}) \colon t_{00}t_{11} - t_{01}^2 = 0\}$ with three distinct focal surfaces corresponding to $m_{0000}=m_{1111}=1$, $m_{0001}=m_{0011}=m_{0111}=0$ and $m_{0101}=1,2,4$ (three surfaces).
Figure 3: (a)–(b) Eigenvalue histograms of student models after training, overlaid with the predicted minima. Both sample-based and norm-based optimization yield nearly identical spectra that closely match the eigenvalues of the predicted minima. (c) Frequencies of minima reached (sample-based setting; results are nearly identical for the norm-based setting).

Theorems & Definitions (69)

Theorem 1
Example 2
Remark 3
Proposition 4
proof
Lemma 5
proof
Proposition 6
proof
Proposition 7
...and 59 more
