Implicit Regularization in Deep Matrix Factorization

Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo

TL;DR

The paper analyzes implicit regularization in deep linear networks for matrix completion and sensing (deep matrix factorization) under gradient descent. It shows that greater depth strengthens the bias toward low-rank solutions and can outperform nuclear-norm minimization in data-sparse regimes, while challenging the sufficiency of Schatten-$p$ or nuclear norms as universal descriptors. Through end-to-end gradient-flow analysis, it derives depth-dependent singular-value dynamics, revealing that depth accelerates growth of large singular values and suppresses small ones, effectively imposing a deeper low-rank bias. The findings emphasize the importance of optimization trajectories and dynamics for generalization in deep linear (and potentially non-linear) networks, motivating further study beyond traditional norm-based regularizers.
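The depth-dependent singular-value dynamics mentioned above can be made concrete with one formula. The following is a sketch of the form the result takes under gradient flow; the notation is assumed here rather than taken from this summary: $N$ is the depth, $W(t)$ the product matrix, $\sigma_r(t)$ its $r$-th singular value with left and right singular vectors $u_r(t)$ and $v_r(t)$, and $\ell$ the loss.

```latex
\dot{\sigma}_r(t) = -N \cdot \bigl(\sigma_r^2(t)\bigr)^{1 - 1/N} \cdot \bigl\langle \nabla \ell(W(t)),\, u_r(t)\, v_r^\top(t) \bigr\rangle
```

For $N = 1$ the factor $(\sigma_r^2(t))^{1 - 1/N}$ equals $1$. For $N \geq 2$ it is small when $\sigma_r(t)$ is small and large when $\sigma_r(t)$ is large, which is the sense in which depth slows the growth of small singular values and speeds up that of large ones.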

Abstract

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.

Paper Structure

This paper contains 23 sections, 11 theorems, 83 equations, 6 figures.

Key Result

Theorem 1

Assume the measurement matrices $A_1, \ldots, A_m$ commute. Then, if $\bar{W}_\mathrm{sha} := \lim_{\alpha \to 0} W_{\mathrm{sha}, \infty}(\alpha)$ exists and is a global optimum for Equation \ref{eq:psd_recover} with $\ell(\bar{W}_\mathrm{sha}) = 0$, it holds that $\bar{W}_\mathrm{sha} \in \mathop{\mathr

Figures (6)

  • Figure 1: Matrix completion via gradient descent over deep matrix factorizations. Left (respectively, right) plot shows reconstruction errors for matrix factorizations of depths $2$, $3$ and $4$, when applied to the completion of a random rank-$5$ (respectively, rank-$10$) matrix of size $100 \times 100$. The $x$-axis shows the number of observed entries (randomly chosen), the $y$-axis shows reconstruction error, and error bars (indiscernible) mark standard deviations of the results over multiple trials. All matrix factorizations are full-dimensional, i.e., have hidden dimensions $100$. Both the learning rate and the standard deviation of the (random, zero-centered) initialization for gradient descent were set to the small value $10^{-3}$. Note that with few observed entries, factorizations of depths $3$ and $4$ significantly outperform that of depth $2$, whereas with more entries all factorizations perform well. For further details, and a similar experiment on matrix sensing tasks, see Appendix \ref{app:exper}.
  • Figure 2: Evaluation of nuclear norm as the implicit regularization in deep matrix factorization. Each plot compares gradient descent over matrix factorizations of depths $2$ and $3$ (results for depth $4$ were indistinguishable from those of depth $3$; we omit them to reduce clutter) against the minimum nuclear norm solution and the ground truth in matrix completion tasks. Top (respectively, bottom) row corresponds to completion of a random rank-$5$ (respectively, rank-$10$) matrix of size $100 \times 100$. Left, middle and right columns display (on the $y$-axis) reconstruction error, nuclear norm and effective rank (cf. roy2007effective), respectively. In each plot, the $x$-axis shows the number of observed entries (randomly chosen), and error bars (indiscernible) mark standard deviations of the results over multiple trials. All matrix factorizations are full-dimensional, i.e., have hidden dimensions $100$. Both the learning rate and the standard deviation of the (random, zero-centered) initialization for gradient descent were initially set to $10^{-3}$. Running with a smaller learning rate did not yield a noticeable change in final results. Initializing with a smaller standard deviation had no observable effect on the results of depth $3$ (and $4$), but did impact those of depth $2$; the outcomes of dividing the standard deviation by $2$ and by $4$ are included in the plots. Note that with many observed entries, the minimum nuclear norm solution coincides with the ground truth (minimum rank solution), and matrix factorizations of all depths converge to these. On the other hand, when there are fewer observed entries, the minimum nuclear norm solution does not coincide with the ground truth, and matrix factorizations prefer to lower the effective rank at the expense of a higher nuclear norm, in a manner that is more potent for deeper factorizations. For further details, and a similar experiment on matrix sensing tasks, see Appendix \ref{app:exper}.
  • Figure 3: Dynamics of gradient descent over deep matrix factorizations, specifically the evolution of singular values and singular vectors of the product matrix during training for matrix completion. Top row corresponds to the task of completing a random rank-$5$ matrix of size $100 \times 100$ based on $2000$ randomly chosen observed entries; bottom row corresponds to training on $10000$ entries chosen randomly from the MovieLens 100K dataset (completion of a $943 \times 1682$ matrix, cf. harper2016movielens). First (left) three columns show top singular values for, respectively, depths $1$ (no matrix factorization), $2$ (shallow matrix factorization) and $3$ (deep matrix factorization). Last (right) column shows singular vectors for a depth-$2$ factorization, by comparing on- vs. off-diagonal entries in the matrix $U^\top(t) \nabla\ell(W(t)) V(t)$ (see Corollary \ref{cor:sing_vecs_station}): for each group of entries, the mean of absolute values is plotted, along with a shaded area marking the standard deviation. All matrix factorizations are full-dimensional (hidden dimensions $100$ in top row plots, $943$ in bottom row plots). Note that increasing depth makes singular values move slower when small and faster when large (in accordance with Theorem 3), which results in solutions with effectively lower rank. Note also that $U^\top(t) \nabla\ell(W(t)) V(t)$ is diagonally dominant so long as there is movement, showing that singular vectors of the product matrix align with those of the gradient (in accordance with Corollary \ref{cor:sing_vecs_station}). For further details, and a similar experiment on matrix sensing, see Appendix \ref{app:exper}.
  • Figure 4: Matrix sensing via gradient descent over deep matrix factorizations. This figure is identical to Figure 1, except that reconstruction of a ground truth matrix is based not on a randomly chosen subset of entries, but on a set of random projections (i.e., on $\{ \left\langle{A_i},{W^*}\right\rangle \}_{i = 1}^m$ where $W^*$ is the ground truth and $A_1, \ldots, A_m$ are measurement matrices drawn independently from a Gaussian distribution). For further details on this experiment see Appendix \ref{app:exper:imple}.
  • Figure 5: Evaluation of nuclear norm as the implicit regularization in deep matrix factorization on matrix sensing tasks. This figure is identical to Figure 2, except that reconstruction of a ground truth matrix is based not on a randomly chosen subset of entries, but on a set of random projections (i.e., on $\{ \left\langle{A_i},{W^*}\right\rangle \}_{i = 1}^m$ where $W^*$ is the ground truth and $A_1, \ldots, A_m$ are measurement matrices drawn independently from a Gaussian distribution). For further details on this experiment see Appendix \ref{app:exper:imple}.
  • ...and 1 more figure
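
The experimental setup described in the Figure 1 caption (gradient descent over a full-dimensional depth-$N$ factorization, with loss measured on observed entries only) can be sketched as below. This is a minimal illustration, not the authors' code: the function names `end_to_end`, `completion_loss`, `make_problem` and `train_dmf` are hypothetical, the choice of hidden dimension as the smaller matrix side is an assumption, and the toy problem is far smaller than the $100 \times 100$ matrices in the figures so that it runs quickly.

```python
import numpy as np


def end_to_end(factors):
    """Return the product W_N ... W_1 for factors listed as [W_1, ..., W_N]."""
    W = factors[0]
    for Wj in factors[1:]:
        W = Wj @ W
    return W


def completion_loss(W, target, mask):
    """Half the squared error over observed entries (mask is 1 where observed)."""
    return 0.5 * np.sum((mask * (W - target)) ** 2)


def make_problem(d=10, rank=2, n_obs=60, seed=0):
    """Random rank-`rank` d x d target and a random observation mask (toy sizes)."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))
    target /= np.linalg.norm(target)  # illustrative scaling: unit Frobenius norm
    mask = np.zeros(d * d)
    mask[rng.choice(d * d, size=n_obs, replace=False)] = 1.0
    return target, mask.reshape(d, d)


def train_dmf(target, mask, depth=3, init_std=1e-3, lr=1e-3, steps=1000, seed=0):
    """Gradient descent over a full-dimensional factorization; returns the product."""
    rng = np.random.default_rng(seed)
    d_out, d_in = target.shape
    hidden = min(d_out, d_in)  # assumption: "full-dimensional" hidden size
    dims = [d_in] + [hidden] * (depth - 1) + [d_out]
    # Zero-centered Gaussian initialization with standard deviation init_std.
    factors = [init_std * rng.standard_normal((dims[j + 1], dims[j]))
               for j in range(depth)]
    for _ in range(steps):
        # Gradient of the loss with respect to the product matrix.
        grad_W = mask * (end_to_end(factors) - target)
        grads = []
        for j in range(depth):
            # Chain rule: W = left @ W_j @ right, so dL/dW_j = left^T grad_W right^T.
            left = end_to_end(factors[j + 1:]) if j + 1 < depth else np.eye(dims[-1])
            right = end_to_end(factors[:j]) if j > 0 else np.eye(dims[0])
            grads.append(left.T @ grad_W @ right.T)
        factors = [Wj - lr * g for Wj, g in zip(factors, grads)]
    return end_to_end(factors)


if __name__ == "__main__":
    target, mask = make_problem()
    for depth in (2, 3, 4):
        W = train_dmf(target, mask, depth=depth, init_std=0.1, lr=0.05, steps=2000)
        unobserved_err = np.linalg.norm((1 - mask) * (W - target))
        print(depth, completion_loss(W, target, mask), unobserved_err)
```

To approach the setting in the caption, one would use `d=100`, `init_std=1e-3` and `lr=1e-3`. With such a small initialization, many more steps are needed before the product matrix leaves the neighborhood of zero, so the toy values above are chosen for speed only and are not expected to reproduce the reported curves.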

Theorems & Definitions (21)

  • Conjecture 1: from gunasekar2017implicit, informally stated
  • Theorem 1: adaptation of Theorem 1 in gunasekar2017implicit
  • Theorem 2
  • proof: Proof sketch (for complete proof see Appendix \ref{app:proofs:nuclear_dmf})
  • Proposition 1
  • proof: Proof sketch (for complete proof see Appendix \ref{app:proofs:schatten_disq})
  • Lemma 1
  • proof: Proof sketch (for complete proof see Appendix \ref{app:proofs:asvd})
  • Theorem 3
  • proof: Proof sketch (for complete proof see Appendix \ref{app:proofs:sing_vals_evolve})
  • ...and 11 more