
Topological trivialization in non-convex empirical risk minimization

Andrea Montanari, Basil Saeed

TL;DR

The paper proves a finite-dimensional variational formula for the exponential growth rate of the expected number of local minima of the empirical risk, and gives sufficient conditions under which this growth rate vanishes and all empirical risk minimizers share the same asymptotic properties.

Abstract

Given data $\{({\boldsymbol x}_i,y_i): i\le n\}$, with ${\boldsymbol x}_i$ standard $d$-dimensional Gaussian feature vectors, and $y_i\in{\mathbb R}$ response variables, we study the general problem of learning a model parametrized by ${\boldsymbol \theta}\in{\mathbb R}^d$, by minimizing a loss function that depends on ${\boldsymbol \theta}$ via the one-dimensional projections ${\boldsymbol \theta}^{\sf T}{\boldsymbol x}_i$. While previous work mostly dealt with convex losses, our approach assumes general (non-convex) losses, hence covering classical, yet poorly understood examples such as the perceptron and non-convex robust regression. We use the Kac-Rice formula to control the asymptotics of the expected number of local minima of the empirical risk, under the proportional asymptotics $n,d\to\infty$, $n/d\to\alpha>1$. Specifically, we prove a finite-dimensional variational formula for the exponential growth rate of the expected number of local minima. Further, we provide sufficient conditions under which the exponential growth rate vanishes and all empirical risk minimizers have the same asymptotic properties (in fact, we expect the minimizer to be unique in these circumstances). We refer to this phenomenon as `rate trivialization.' If the population risk has a unique minimizer, our sufficient condition for rate trivialization is typically verified when the samples/parameters ratio $\alpha$ is larger than a suitable constant $\alpha_{\star}$. Previous general results of this type required $n\ge Cd \log d$. We illustrate our results in the case of non-convex robust regression. Based on heuristic arguments and numerical simulations, we present a conjecture for the exact location of the trivialization phase transition $\alpha_{\text{tr}}$.
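
As a reading aid, the two objects the abstract refers to can be written out as below. This is a sketch under stated assumptions: the symbol $\ell$ for the per-sample loss, the notation $\mathcal{N}_n$ for the number of local minima, and the $1/n$ normalization of the logarithm are introduced here for illustration, and the paper's own definitions may differ (for instance by adding a regularizer or by restricting the count to a subset of parameter space).

```latex
% Empirical risk: depends on theta only through the projections theta^T x_i
\hat{R}_n({\boldsymbol \theta})
  = \frac{1}{n}\sum_{i=1}^{n} \ell\big({\boldsymbol \theta}^{\sf T}{\boldsymbol x}_i,\, y_i\big),
\qquad {\boldsymbol \theta}\in{\mathbb R}^d .

% Exponential growth rate of the expected number of local minima
% (N_n = number of local minima of \hat{R}_n; normalization by n is an assumption)
\lim_{\substack{n,d\to\infty \\ n/d\to\alpha}}
  \frac{1}{n}\log {\mathbb E}\big[\mathcal{N}_n\big].
```

In this notation, "rate trivialization" corresponds to the limit above being zero, so that the expected number of local minima does not grow exponentially in $n$.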

Paper Structure (40 sections, 15 theorems, 139 equations, 6 figures)


Key Result

Theorem 1

For $\delta>0$, define the event [...]. Let Assumptions \ref{ass:regime} to \ref{ass:theta_0} of Section \ref{sec:assumptions} hold. Then for any $\textsf{A}_R,\textsf{a}_L>0$, the following hold.

Figures (6)

  • Figure 1: Tukey regression: Output of gradient descent from $M=30$ random initializations, on the same empirical risk landscape (same data $\{y_i,{\boldsymbol x}_i\}_{i\le n}$), to produce estimates $\hat{{\boldsymbol \theta}}^{(1)},\dots,\hat{{\boldsymbol \theta}}^{(30)}$. Here $d=200$ and results are averaged over $N=30$ different [...]. Left: Maximum distance $\max_{i,j}\|\hat{{\boldsymbol \theta}}^{(i)}-\hat{{\boldsymbol \theta}}^{(j)}\|_2$ as a function of $\alpha$ and the $\textrm{SNR}$. Right: Number of clusters formed by the point estimates $\hat{{\boldsymbol \theta}}^{(1)},\dots,\hat{{\boldsymbol \theta}}^{(30)}$ (clusters are constructed by thresholding the normalized distance $\|\hat{{\boldsymbol \theta}}^{(i)} - \hat{{\boldsymbol \theta}}^{(j)}\|_2/(\|\hat{{\boldsymbol \theta}}^{(i)}\|_2 \|\hat{{\boldsymbol \theta}}^{(j)}\|_2)^{1/2}$ at $\varepsilon = 10^{-3}$).
  • Figure 2: Solving the optimality conditions of Definition \ref{def:opt_FP_conds} for the Tukey loss.
  • Figure 3: Spectrum of the Hessian of the empirical risk $\nabla^2 \hat{R}_n({\boldsymbol \theta})$ for Tukey regression, compared with the theoretical prediction. The Hessian eigenvalues $\lambda^{(j)}_1\le \lambda^{(j)}_2\le \cdots\le \lambda^{(j)}_d$ are computed using $n=10000$, for each of $50$ trials: $j\in\{1,\dots,50\}$. For each $i\le d$, we plot the average and standard deviation of the $i$-th eigenvalue, $\lambda^{(j)}_i$, over the $50$ trials. Both plots are produced with the same data, with the right plot zoomed in on the left edge of the support. Here $\textrm{SNR} = \textrm{SNR}_\star$ and $\kappa = 1.0$. The continuous line reports the theoretical prediction for the lower edge of the asymptotic spectral distribution.
  • Figure 4: Negative exponential growth rate of the number of local minimizers of the empirical risk $\hat{R}_n({\boldsymbol \theta})$ for Tukey regression. Left: $\Phi_{\infty}(\rho)$ of Eq. \ref{eq:phi_infty_def} vs. $\rho$ for various values of $\alpha$. Center: zoom for values of $\alpha$ close to $\alpha =5$. Right: $\Phi_{0}(\rho)$ of Eq. \ref{eq:phi_0_def} vs. $\rho$ for several values of $\alpha$.
  • Figure 5: Predictions for the estimation error (left) and train loss (right) at the global minimum, compared with empirical results obtained with GD for the same quantities, namely $\|\hat{{\boldsymbol \theta}}_{\mathrm{GD}}-{\boldsymbol \theta}_0\|_2$ and $\hat{R}_n(\hat{{\boldsymbol \theta}}_\mathrm{GD})$. For GD, we used $d=500$ and an average of $100$ trials, with error bars indicating the standard deviation. For $\alpha > \alpha_{\hbox{\tiny\rm tr}}(\textrm{SNR}_\star) = 5$, the predictions are produced by solving the equations (FP Eqs). The dashed lines correspond to the stability threshold. Theoretical predictions are proven to be asymptotically exact for $\alpha$ sufficiently large, but are only a heuristic approximation for $\alpha< \alpha_{\hbox{\tiny\rm tr}}(\textrm{SNR}_\star)$.
  • ...and 1 more figure
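
The experiment described in the Figure 1 caption (gradient descent from several random initializations on one fixed dataset, followed by a maximum pairwise distance and a cluster count) can be sketched roughly as follows. Several choices here are assumptions and not taken from the paper: the Tukey biweight loss with cutoff `c`, the linear model with Gaussian noise used to generate `y`, the initialization scale, the step size, and the reading of "thresholding the normalized distance" as connected components of the thresholded graph. All function names are hypothetical, and the sizes are much smaller than those in the caption so that the sketch runs quickly.

```python
import numpy as np

def tukey_loss(r, c=1.0):
    """Tukey biweight loss; constant (= c^2/6) once |r| >= c."""
    u = np.minimum(np.abs(r) / c, 1.0)
    return (c**2 / 6.0) * (1.0 - (1.0 - u**2) ** 3)

def tukey_psi(r, c=1.0):
    """Derivative of tukey_loss: r * (1 - (r/c)^2)^2 for |r| <= c, else 0."""
    u = r / c
    return np.where(np.abs(u) <= 1.0, r * (1.0 - u**2) ** 2, 0.0)

def empirical_risk(theta, X, y, c=1.0):
    """Average Tukey loss of the residuals y_i - theta^T x_i."""
    return tukey_loss(y - X @ theta, c).mean()

def risk_gradient(theta, X, y, c=1.0):
    """Gradient of empirical_risk with respect to theta."""
    return -X.T @ tukey_psi(y - X @ theta, c) / len(y)

def gradient_descent(theta_init, X, y, step=0.3, n_steps=300, c=1.0):
    """Plain gradient descent with a fixed step size (assumed, not tuned)."""
    theta = theta_init.copy()
    for _ in range(n_steps):
        theta = theta - step * risk_gradient(theta, X, y, c)
    return theta

def max_pairwise_distance(thetas):
    """max_{i,j} ||theta_i - theta_j||_2 over the rows of thetas."""
    diffs = thetas[:, None, :] - thetas[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

def count_clusters(thetas, eps=1e-3):
    """Connected components of the graph linking rows whose normalized
    distance ||a - b||_2 / sqrt(||a||_2 ||b||_2) is at most eps.
    Assumes every row is nonzero."""
    norms = np.linalg.norm(thetas, axis=1)
    dist = np.linalg.norm(thetas[:, None, :] - thetas[None, :, :], axis=-1)
    close = dist / np.sqrt(np.outer(norms, norms)) <= eps
    labels = np.arange(len(thetas))
    changed = True
    while changed:  # propagate the smallest label until nothing changes
        new = np.array([labels[close[i]].min() for i in range(len(thetas))])
        changed = not np.array_equal(new, labels)
        labels = new
    return len(np.unique(labels))

# Small illustrative run (sizes and noise level are assumptions).
rng = np.random.default_rng(0)
d, alpha, M = 20, 6.0, 5
n = int(alpha * d)
theta0 = rng.standard_normal(d) / np.sqrt(d)    # hypothetical ground truth
X = rng.standard_normal((n, d))                 # standard Gaussian features
y = X @ theta0 + 0.1 * rng.standard_normal(n)   # assumed linear model + noise
inits = rng.standard_normal((M, d)) / np.sqrt(d)
estimates = np.stack([gradient_descent(t, X, y) for t in inits])

print("max pairwise distance:", max_pairwise_distance(estimates))
print("number of clusters:", count_clusters(estimates))
```

Repeating this over a grid of `alpha` values and noise levels, and averaging over independent datasets, would give quantities of the kind plotted in the two panels; whether the estimates collapse to a single cluster depends on those parameters.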

Theorems & Definitions (30)

  • Theorem 1
  • Remark 2.1
  • Definition 1
  • Remark 2.2
  • Remark 2.3
  • Theorem 2
  • Remark 2.4
  • Remark 2.5
  • Theorem 3
  • Corollary 1: Tukey regression
  • ...and 20 more