Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness

Baekrok Shin, Chulhee Yun

TL;DR

Deep models avoid plasticity loss thanks to their low-rank bias, whereas depth-2 networks pre-trained under decoupled dynamics fail to converge to a low-rank solution, even when resumed training (with additional data) satisfies the coupling condition. This sheds light on the mechanism behind the loss of plasticity phenomenon.

Abstract

We study matrix completion via deep matrix factorization (a.k.a. deep linear neural networks) as a simplified testbed to examine how network depth influences training dynamics. Despite the simplicity and importance of the problem, prior theory largely focuses on shallow (depth-2) models and does not fully explain the implicit low-rank bias observed in deeper networks. We identify coupled dynamics as a key mechanism behind this bias and show that it intensifies with increasing depth. Focusing on gradient flow under block-diagonal observations, we prove: (a) networks of depth $\geq 3$ exhibit coupling unless initialized diagonally, and (b) convergence to rank-1 occurs if and only if the dynamics is coupled -- resolving an open question by Menon (2024) for a family of initializations. We also revisit the loss of plasticity phenomenon in matrix completion (Kleinman et al., 2024), where pre-training on few observations and resuming with more degrades performance. We show that deep models avoid plasticity loss due to their low-rank bias, whereas depth-2 networks pre-trained under decoupled dynamics fail to converge to low-rank, even when resumed training (with additional data) satisfies the coupling condition -- shedding light on the mechanism behind this phenomenon.
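The setup described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes square layers, a half-squared-error loss on the observed entries, and plain gradient descent with a small step size as a discrete stand-in for gradient flow. The names `product`, `observed_loss`, `layer_gradients`, and `train`, and all hyperparameter values, are hypothetical choices made for this example.

```python
import numpy as np


def product(layers):
    """Return W_L @ ... @ W_1 for layers = [W_1, ..., W_L]."""
    W = layers[0]
    for W_l in layers[1:]:
        W = W_l @ W
    return W


def observed_loss(layers, target, mask):
    """Half the squared error on observed entries (mask == 1). Assumed loss."""
    residual = mask * (product(layers) - target)
    return 0.5 * float(np.sum(residual ** 2))


def layer_gradients(layers, target, mask):
    """Gradient of observed_loss with respect to each (square) layer."""
    d = target.shape[0]
    G = mask * (product(layers) - target)  # gradient w.r.t. the product matrix
    grads = []
    for l in range(len(layers)):
        left = np.eye(d)   # product of the layers applied after layers[l]
        for W_l in layers[l + 1:]:
            left = W_l @ left
        right = np.eye(d)  # product of the layers applied before layers[l]
        for W_l in layers[:l]:
            right = W_l @ right
        grads.append(left.T @ G @ right.T)  # chain rule through left @ W @ right
    return grads


def train(target, mask, depth=3, alpha=0.3, lr=0.02, steps=5000, seed=0):
    """Gradient descent from Gaussian init with std alpha (illustrative values)."""
    rng = np.random.default_rng(seed)
    d = target.shape[0]
    layers = [alpha * rng.standard_normal((d, d)) for _ in range(depth)]
    for _ in range(steps):
        grads = layer_gradients(layers, target, mask)
        layers = [W_l - lr * g for W_l, g in zip(layers, grads)]
    return layers


if __name__ == "__main__":
    # Rank-1 ground truth with three observed entries; entry (2, 2) is hidden.
    target = np.outer([1.0, 2.0], [1.0, 0.5])
    mask = np.array([[1.0, 1.0], [1.0, 0.0]])
    layers = train(target, mask)
    print("completed matrix:\n", product(layers))
    print("observed loss:", observed_loss(layers, target, mask))
```

Varying `depth` and `alpha` in this sketch is one way to probe how the completed matrix's rank depends on depth and initialization scale.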

Paper Structure

This paper contains 58 sections, 41 theorems, 336 equations, and 33 figures.

Key Result

Theorem 3.1

For the product model ${\bm{W}}_{{\bm{A}}, {\bm{B}}}(t) = {\bm{A}}(t) {\bm{B}}(t) \in \mathbb{R}^{2 \times 2}$, we consider the gradient flow dynamics (eqn: coupled dynamics), where the observations are $w_{11}^*(\neq 0)$ and $w_{21}^*(\neq 0)$. We assume convergence to the zero-loss solution (i.e.,

Figures (33)

  • Figure 1: (a) Examples of bipartite graphs corresponding to observation patterns of ${\bm{M}}_{\rm D}$ (disconnected) and ${\bm{M}}_{\rm C}$ (connected). (b-c) Training results showing effective rank (cf. Roy and Vetterli, 2007) for completing rank-1 matrices ${\bm{M}}_{\rm D}$ and ${\bm{M}}_{\rm C}$, respectively. The rank-1 ground truth matrices were generated via ${\bm{u}}{\bm{v}}^\top$, where ${\bm{u}}, {\bm{v}} \in \mathbb{R}^2$ with entries sampled i.i.d. from a standard normal distribution. We initialized each layer's entries by sampling from a Gaussian distribution with mean zero and standard deviation $\alpha$, chosen to ensure the initial scale of the product matrix ${\bm{W}}_{L:1}(0)$ is approximately invariant to depth $L$. Each result shows an average of 300 independent random trials.
  • Figure 2: Singular values of ${\bm{W}}_{L:1}(\infty)$ (numerically obtained from the theorem labeled "thm: block-diagonal") against initialization scale $\alpha^L$, for the diagonal observation task where $s=1$. Solid lines represent the largest singular value $\sigma_1$; dashed lines denote the other (identical) singular values $\sigma_r$ for $r \ge 2$. For finite $m$, these results illustrate that both greater depth $L$ and a smaller initial scale $\alpha$ strengthen the low-rank bias, in contrast to the $L=2$ case. Conversely, a very large $m$ ($m=10^{10}$), approximating an $\alpha{\bm{I}}_d$ (rank-$d$) initialization, leads to decoupled dynamics and a full-rank solution, independent of both $L$ and $\alpha$.
  • Figure 3: Experiments use a $100 \times 100$ rank-5 ground-truth matrix. Pre-training utilizes $2000$ randomly sampled entries ($\Omega_{\mathrm{pre}}$; $\lvert \Omega_{\mathrm{pre}} \rvert = 2000$), while post-training adds $1000$ more, forming $\Omega_{\mathrm{post}}$ ($\Omega_{\mathrm{pre}} \subset \Omega_{\mathrm{post}}$; $\lvert \Omega_{\mathrm{post}} \rvert = 3000$). The top row of panels displays effective rank, and the bottom row shows reconstruction error, both measured at convergence. The leftmost panels depict training on $\Omega_{\mathrm{pre}}$, and the rightmost on $\Omega_{\mathrm{post}}$, both starting from random Gaussian initialization. The middle panels show warm-start training on $\Omega_{\mathrm{post}}$, initialized from converged pre-trained models with $\Omega_{\mathrm{pre}}$.
  • Figure 4: The left panel shows the averaged effective rank of all possible connected patterns as a function of the initial scale $\alpha^L$. The right panel displays the averaged effective rank of all possible disconnected patterns.
  • Figure 5: Singular values of ${\bm{W}}_{L:1}(\infty)$ (numerically obtained from the theorem labeled "thm: block-diagonal") against initialization scale $\alpha^L$ for the block-diagonal observation task. Solid lines represent the largest singular value $\sigma_1$; dashed lines denote the identical singular values $\sigma_i$ for $i \in \{2, \dots, n\}$. Note that $\sigma_j$ for $j \in \{n+1, \dots, d\}$ are all zero. For finite $m$, these results show that both greater depth $L$ and a smaller initial scale $\alpha$ strengthen the low-rank bias, in contrast to the $L=2$ case. Conversely, when $m$ is extremely large (e.g., $m = 10^{10}$), approximating an $\alpha {\bm{I}}_d$ (rank-$d$) initialization, the dynamics decouple and cannot achieve the minimal low-rank solution, regardless of $L$ or $\alpha$.
  • ...and 28 more figures
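
Several of the captions above report effective rank. A minimal sketch of that quantity follows, assuming the definition of Roy and Vetterli (2007): the exponential of the Shannon entropy of the singular values normalized to sum to one. The function name `effective_rank`, the `eps` threshold, and the convention of returning 0 for an all-zero matrix are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def effective_rank(W, eps=1e-12):
    """Effective rank: exp of the entropy of the normalized singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0  # assumed convention for the zero matrix
    p = s / total
    p = p[p > eps]  # drop numerically zero entries so that log(p) is finite
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


if __name__ == "__main__":
    rank_one = np.outer([1.0, 2.0], [3.0, 4.0])
    print("rank-1 matrix:", effective_rank(rank_one))
    print("identity (3x3):", effective_rank(np.eye(3)))
```

Under this definition the value lies between 1 and the ordinary rank for any nonzero matrix, so it gives a continuous measure of how concentrated the spectrum is.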

Theorems & Definitions (71)

  • Definition 1: Connectivity from baiconnectivity
  • Theorem 3.1
  • Definition 2: Coupled/Decoupled Dynamics
  • Proposition 3.2
  • Theorem 3.3
  • Corollary 3.3
  • Proposition 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Lemma B.1
  • ...and 61 more