Linear Recursive Feature Machines provably recover low-rank matrices

Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

TL;DR

This work explicitly connects the mechanism of neural feature learning to a widely used class of algorithms for sparse recovery called iteratively reweighted least squares (IRLS), and suggests that nonlinear neural network training can be viewed as a nonlinear extension of the IRLS algorithm.

Abstract

A fundamental problem in machine learning is to understand how neural networks make accurate predictions while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction, a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm because it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
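For readers unfamiliar with IRLS, which the abstract says lin-RFM generalizes, the following is a minimal sketch of the classical algorithm for sparse linear regression. It is the textbook reweighted minimum-norm update, not the authors' lin-RFM implementation; the function name `irls_sparse`, the smoothing schedule for `eps`, and the problem sizes are illustrative assumptions.

```python
import numpy as np


def irls_sparse(A, y, p=1.0, n_iters=60, eps0=1.0, eps_min=1e-8):
    """Classical IRLS sketch for: minimize sum_i |w_i|^p subject to A w = y.

    Each step solves a weighted minimum-norm problem in closed form:
        w = D A^T (A D A^T)^{-1} y,  with D = diag((w_prev^2 + eps)^(1 - p/2)).
    """
    n, d = A.shape
    weights = np.ones(d)  # D = I, so the first iterate is the min-l2-norm solution
    eps = eps0
    w = np.zeros(d)
    for _ in range(n_iters):
        AD = A * weights                      # equals A @ diag(weights)
        lam = np.linalg.solve(AD @ A.T, y)    # solve (A D A^T) lam = y
        w = weights * (A.T @ lam)             # w = D A^T lam, so A w = y
        eps = max(0.5 * eps, eps_min)         # shrink the smoothing term (assumed schedule)
        weights = (w ** 2 + eps) ** (1.0 - p / 2.0)
    return w


# Small synthetic problem: 25 Gaussian measurements of a 3-sparse vector in R^40.
rng = np.random.default_rng(0)
n, d, k = 25, 40, 3
A = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = [1.5, -2.0, 1.0]
y = A @ w_true

w_hat = irls_sparse(A, y)
w_l2 = A.T @ np.linalg.solve(A @ A.T, y)  # plain minimum-l2-norm interpolant

print("l1 norm, IRLS   :", np.abs(w_hat).sum())
print("l1 norm, min-l2 :", np.abs(w_l2).sum())
print("relative error  :", np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true))
```

Both `w_hat` and `w_l2` interpolate the measurements; the reweighting is what pushes the iterate toward a sparse interpolant. Setting `p` below 1 gives a more aggressive, non-convex penalty.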

Paper Structure (29 sections, 11 theorems, 60 equations, 7 figures)


Key Result

Theorem 1

The fixed points $Z$ of eqn:simplerewrite are first-order critical points of the optimization problem: where we define the function $\psi(r) = \int_{0}^{r} \frac{s}{[\phi(s^2)]^2} ds$ for all $r \in \mathbb{R}$.
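As a worked instance of the definition of $\psi$, suppose $\phi$ is a power function, $\phi(s) = s^{\alpha}$ with $\alpha \geq 0$, and take $r \geq 0$. This choice of $\phi$ is an assumption made here for illustration, prompted by the "matrix powers $\alpha$" mentioned in the Figure 2 caption; the theorem itself is stated for general $\phi$.

```latex
\begin{align*}
\psi(r) &= \int_{0}^{r} \frac{s}{[\phi(s^2)]^2}\, ds
         = \int_{0}^{r} \frac{s}{s^{4\alpha}}\, ds
         = \int_{0}^{r} s^{1 - 4\alpha}\, ds
         = \frac{r^{2 - 4\alpha}}{2 - 4\alpha},
         \qquad 0 \leq \alpha < \tfrac{1}{2}, \\[4pt]
\alpha = 0 &: \quad \psi(r) = \tfrac{1}{2} r^{2}, \\
\alpha = \tfrac{1}{4} &: \quad \psi(r) = r.
\end{align*}
```

Under this assumption, $\alpha = 0$ gives a quadratic penalty and $\alpha = \frac{1}{4}$ gives a penalty linear in $r$, the kind of penalty associated with $\ell_1$-type objectives. At $\alpha = \frac{1}{2}$ the integrand becomes $1/s$, which is not integrable at $0$, so the closed form above does not apply there.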

Figures (7)

  • Figure 1: Overview of our results. The functions in (B) are normalized so that $\psi(0) = 0$.
  • Figure 2: Performance of lin-RFM with various matrix powers $\alpha$, deep linear networks, and $\ell_1$-norm minimization in sparse linear regression. All models are trained and tested on the same data, and all results are averaged over 5 random draws of data.
  • Figure 3: Performance of lin-RFM, deep linear networks, and nuclear norm minimization in low-rank matrix completion. All models are trained and tested on the same data, and all results are averaged over 5 random draws of data.
  • Figure 4: Performance of lin-RFM, deep linear networks, and nuclear norm minimization in matrix completion as a function of the rank of the ground truth matrix. The dashed black line represents the number of degrees of freedom, $2dr - r^2$, of a rank $r$ matrix of size $d \times d$. Note that all curves must converge as $r \to d$ since there is only one rank $d$ solution requiring $d^2$ observations. Each point represents the number of observations at which the model was able to achieve under $10^{-3}$ test MSE across 5 random draws of data. Overall, we observe that lin-RFM with $\alpha = \frac{1}{2}$ requires the fewest samples to achieve consistently low test MSE. Lin-RFM requires up to thousands fewer examples than deep linear networks and directly minimizing the nuclear norm.
  • Figure 5: Examples of basins of attraction for lin-RFM with $\phi$ as the identity map for matrices $Y \in \mathbb{R}^{3 \times 2}$ with $Y_{11} = a, Y_{12} = b, Y_{21} = c, Y_{32} = d$ with $a, b, c, d > 0$. Such matrices are a subset of those analyzed in Proposition \ref{prop: convergence any obs}, for which we proved the rank 1 solution is a fixed point attractor. In this setting, lin-RFM is governed by a two-dimensional dynamical system involving variables $x(t), y(t)$. We plot the vector field $f(x, y)$ governing the evolution of $x(t), y(t)$ for various values of $a, b, c, d$ and plot in red the attractor $\mathbf{u} := \left(\frac{b}{a} , \frac{a}{b}\right)$ corresponding to the rank 1 solution. Note that $x(0) = y(0) = 0$ when $M_0 = I$ in lin-RFM. The strength of the attractor is given by the maximum eigenvalue of the Jacobian of $f$ at $\mathbf{u}$, which is $\frac{a^2 d^2 + b^2 c^2}{a^2 b^2 + a^2 d^2 + b^2 c^2}$. This eigenvalue shrinks as $a$ and $b$ increase, which yields a stronger attractor for the rank 1 solution. This behavior is shown in the right-hand plot, where we scale the values of $a, b$ without altering $c, d$.
  • ...and 2 more figures
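Two formulas quoted in the captions above are easy to evaluate directly: the degrees-of-freedom count $2dr - r^2$ from Figure 4 and the Jacobian eigenvalue from Figure 5. The sketch below only plugs numbers into those formulas; the function names and sample values are illustrative assumptions, not taken from the paper.

```python
def degrees_of_freedom(dim, rank):
    """Degrees of freedom 2*d*r - r^2 of a rank-r matrix of size d x d (Figure 4)."""
    return 2 * dim * rank - rank ** 2


def attractor_eigenvalue(a, b, c, d):
    """Maximum Jacobian eigenvalue at the rank 1 attractor, as quoted in Figure 5."""
    num = a ** 2 * d ** 2 + b ** 2 * c ** 2
    return num / (a ** 2 * b ** 2 + num)


# At full rank the count equals the number of entries, d^2.
print(degrees_of_freedom(100, 5), degrees_of_freedom(100, 100))

# Scale a and b together while holding c = d = 1 fixed.
for scale in (1.0, 2.0, 10.0):
    print(scale, attractor_eigenvalue(scale, scale, 1.0, 1.0))
```

With $a = b = k$ and $c = d = 1$ the eigenvalue simplifies to $\frac{2}{k^2 + 2}$, so it decreases toward $0$ as $k$ grows, consistent with the caption's statement that larger $a, b$ give a stronger attractor.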

Theorems & Definitions (21)

  • Remark
  • Theorem 1
  • proof
  • Corollary 1
  • Theorem 2
  • proof
  • Theorem 3
  • Corollary 2
  • Theorem : Balancedness from Balancedness
  • proof
  • ...and 11 more