Implicit Bias in Matrix Factorization and its Explicit Realization in a New Architecture

Yikun Hou, Suvrit Sra, Alp Yurtsever

TL;DR

This work investigates implicit bias in gradient-based matrix factorization and introduces a stable, explicit realization via $X = UDU^\top$, with $U$ restricted to a Frobenius-norm ball and $D$ diagonal and nonnegative. The proposed UDU formulation yields truly low-rank solutions in matrix completion and Fourier ptychography, addressing a limitation of the classical Burer--Monteiro (BM) factorization, which yields only approximately low-rank solutions. The authors extend the idea to neural networks with norm-constrained layers and a diagonal layer (UDV), achieving competitive performance while exhibiting a pronounced low-rank bias and enabling SVD-based pruning to produce compact models. A fixed-point analysis relates the new method to BM and clarifies the mechanisms that promote low-rank structure during training, and experiments indicate robustness across datasets, optimizers, and LoRA-based fine-tuning. Overall, the approach offers a principled route to structured, memory-efficient representations with strong implicit regularization, relevant to both theory and practical model compression.

Abstract

Gradient descent for matrix factorization exhibits an implicit bias toward approximately low-rank solutions. While existing theories often assume the boundedness of iterates, empirically the bias persists even with unbounded sequences. This reflects a dynamic where factors develop low-rank structure while their magnitudes increase, tending to align with certain directions. To capture this behavior in a stable way, we introduce a new factorization model: $X\approx UDV^\top$, where $U$ and $V$ are constrained within norm balls, while $D$ is a diagonal factor allowing the model to span the entire search space. Experiments show that this model consistently exhibits a strong implicit bias, yielding truly (rather than approximately) low-rank solutions. Extending the idea to neural networks, we introduce a new model featuring constrained layers and diagonal components that achieves competitive performance on various regression and classification tasks while producing lightweight, low-rank representations.
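The claim that the diagonal factor lets the model span the entire search space can be illustrated with a short worked construction. This is a sketch under assumptions not stated in the abstract: both norm balls are Frobenius-norm balls of a common radius $\alpha > 0$, the factors have $r$ columns, and the target $X$ has rank at most $r$. Take a thin SVD $X = P \Sigma Q^\top$, where $P$ and $Q$ have $r$ orthonormal columns, and set

$$
U = \frac{\alpha}{\sqrt{r}}\, P, \qquad V = \frac{\alpha}{\sqrt{r}}\, Q, \qquad D = \frac{r}{\alpha^{2}}\, \Sigma
\quad\Longrightarrow\quad
\|U\|_F = \|V\|_F = \alpha, \qquad U D V^\top = P \Sigma Q^\top = X .
$$

Both factors then sit on the boundary of their norm balls and the scale of $X$ is carried entirely by $D$. Without $D$, the product would be limited to $\|UV^\top\|_F \le \|U\|_F \|V\|_F \le \alpha^{2}$.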

Paper Structure

This paper contains 32 sections, 2 theorems, 19 equations, 37 figures, 7 tables.

Key Result

Lemma 1

Define the update variables before projection as $\bar{U} = U - 2 \eta \nabla f(X) U D$ and $\bar{D} = D - \eta U^\top \nabla f(X) U$, with $X = UDU^\top$. Suppose $(U, D)$ is a fixed point of the algorithm in (alg:udu), and let $u_j$ denote the $j^{\text{th}}$ column of $U$ and $\lambda_j$ the $j$...

The proof is given in sec:appendix-fixed-point-analysis.
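The pre-projection updates in Lemma 1 can be sketched as a single projected-gradient step. This is a minimal NumPy illustration, not the authors' implementation: the function name `udu_step`, the ball radius `radius`, and the representation of $D$ by its diagonal vector `d` are assumptions made here, and the projections (rescaling $\bar{U}$ into a Frobenius-norm ball, keeping the nonnegative diagonal of $\bar{D}$) follow the constraints described in the TL;DR rather than the algorithm listing, which is not reproduced in this summary.

```python
import numpy as np

def udu_step(U, d, grad_f, eta, radius=1.0):
    """One projected-gradient step on f(U D U^T), with D = diag(d).

    grad_f: callable returning the (symmetric) gradient of f at X.
    radius: assumed Frobenius-ball radius for U (hypothetical parameter).
    """
    X = (U * d) @ U.T                        # X = U D U^T
    G = grad_f(X)                            # gradient of f at X

    # Pre-projection updates from Lemma 1.
    U_bar = U - 2.0 * eta * (G @ U) * d      # U - 2*eta*grad_f(X) U D
    D_bar = np.diag(d) - eta * U.T @ G @ U   # D - eta*U^T grad_f(X) U

    # Assumed projections: Frobenius-norm ball for U,
    # nonnegative diagonal matrices for D.
    norm = np.linalg.norm(U_bar)
    U_new = U_bar * min(1.0, radius / max(norm, 1e-12))
    d_new = np.maximum(np.diag(D_bar), 0.0)
    return U_new, d_new
```

For a quick experiment, one can pass `grad_f = lambda X: X - M` for a symmetric target `M`, which corresponds to the illustrative objective $f(X) = \tfrac{1}{2}\|X - M\|_F^2$ (also an assumption, chosen only because its gradient is simple).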

Figures (37)

  • Figure 1: Impact of step-size and initialization on implicit bias. Solid lines represent our UDU factorization, while dashed lines denote the classical BM factorization. [Left] Objective residual vs. iterations. [Right] Singular value spectrum after $10^{6}$ iterations. In all cases, UDU produces truly low-rank solutions, whereas the classical approach results in approximate low-rank structures.
  • Figure 2: [Left] Singular value spectra of the solutions obtained by BM and UDU after $10^4$ iterations with step size $0.1$. [Middle] The corresponding reconstructed image from the BM factorization exhibits artifacts. [Right] The UDU factorization produces a clean recovery.
  • Figure 3: UDV structure. The weights in diagonal layer $D$ are denoted as $w_j$.
  • Figure 4: Singular value spectra of the solutions obtained with UV and UDV formulations on the HPART regression task under different optimization algorithms.
  • Figure 5: Singular value spectra and post-training SVD-based pruning results for RegNetX-32GF on CIFAR-100 under different optimization algorithms. [Top] UDV induces faster spectral decay than the baseline UV model. [Bottom] Test accuracy as a function of remaining neurons after pruning (full model has 2520 neurons, x-axis cropped for clarity). UDV-based networks retain high accuracy under aggressive pruning.
  • ...and 32 more figures
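The post-training SVD-based pruning mentioned in the Figure 5 caption can be illustrated with a generic rank-truncation sketch. The paper's exact pruning procedure is not given in this summary, so the following is an assumption: it forms the product $W = U D V^\top$ of one UDV block, takes its SVD, and keeps the `k` largest singular directions as a smaller UDV block. The name `svd_prune` and the parameter `k` are hypothetical.

```python
import numpy as np

def svd_prune(U, d, V, k):
    """Truncate the block W = U diag(d) V^T to its top-k singular directions.

    Returns (U_k, d_k, V_k) such that U_k diag(d_k) V_k^T is the best
    rank-k approximation of W in Frobenius norm.
    """
    W = (U * d) @ V.T                                  # effective weight matrix
    P, s, Qt = np.linalg.svd(W, full_matrices=False)   # s is sorted descending
    return P[:, :k], s[:k], Qt[:k].T
```

When the trained spectrum decays quickly, as the caption reports for UDV, a small `k` already reproduces `W` closely, which is why the hidden width of the block can be cut with little loss.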

Theorems & Definitions (2)

  • Lemma 1: Fixed-point characterization
  • Proposition 1: Exclusion of spurious fixed points