Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation

Can Yaras, Peng Wang, Laura Balzano, Qing Qu

TL;DR

The paper develops a theory and practical framework for compressing deep overparameterized low-rank learning by exploiting invariant, low-dimensional subspaces in the learning dynamics of weight matrices. It proves that, for deep matrix factorization, gradient descent dynamics stay confined to a subspace whose dimension is tied to the target rank $r^*$, enabling a compressed factorization that preserves the end-to-end trajectory with substantially fewer parameters. Leveraging this, the authors build a compression scheme applicable to deep matrix completion and introduce Deep LoRA, a three-layer overparameterized adaptation for language-model fine-tuning that reduces overfitting and hyperparameter sensitivity while maintaining efficiency. The approach yields significant training efficiency gains and improved generalization in limited-data regimes, and the provided code enables practitioners to adopt compressed, low-rank dynamics in practice. Overall, the work offers a principled path to retain the benefits of overparameterization through adaptively compressible dynamics with concrete theoretical guarantees and practical gains.

Abstract

While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens. In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. Our approach is grounded in theoretical findings for deep overparameterized low-rank matrix recovery, where we show that the learning dynamics of each weight matrix are confined to an invariant low-dimensional subspace. Consequently, we can construct and train compact, highly compressed factorizations possessing the same benefits as their overparameterized counterparts. In the context of deep matrix completion, our technique substantially improves training efficiency while retaining the advantages of overparameterization. For language model fine-tuning, we propose a method called "Deep LoRA", which improves the existing low-rank adaptation (LoRA) technique, leading to reduced overfitting and a simplified hyperparameter setup, while maintaining comparable efficiency. We validate the effectiveness of Deep LoRA on natural language tasks, particularly when fine-tuning with limited data. Our code is available at https://github.com/cjyaras/deep-lora-transformers.
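
To make the contrast between LoRA and a three-layer adaptation concrete, here is a minimal NumPy sketch of the two weight updates. It is an illustration under stated assumptions rather than the authors' implementation: the dimensions, the initialization scale `eps`, the choice of an `r x r` middle factor, and all function and variable names are hypothetical and do not come from the abstract.

```python
import numpy as np

def lora_delta(B, A):
    """Two-factor low-rank update used by standard LoRA: delta_W = B @ A."""
    return B @ A

def deep_lora_delta(W3, W2, W1):
    """Three-factor update (assumed form of a three-layer adaptation)."""
    return W3 @ W2 @ W1

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8   # hypothetical layer sizes and adapter rank
eps = 1e-3                   # hypothetical small initialization scale

# Standard LoRA initialization: B starts at zero, so the update starts at zero.
B = np.zeros((d_out, r))
A = 0.01 * rng.standard_normal((r, d_in))

# Assumed three-factor initialization: every factor starts small but nonzero.
W3 = eps * rng.standard_normal((d_out, r))
W2 = eps * np.eye(r)
W1 = eps * rng.standard_normal((r, d_in))

W0 = rng.standard_normal((d_out, d_in))  # frozen pretrained weight (stand-in)
x = rng.standard_normal(d_in)

y_lora = (W0 + lora_delta(B, A)) @ x
y_deep = (W0 + deep_lora_delta(W3, W2, W1)) @ x

# With an r x r middle factor, the extra trainable parameters number r * r.
extra_params = W2.size
print("extra trainable parameters:", extra_params)
```

In this sketch both updates have rank at most `r`, and the third factor adds only `r * r` trainable parameters per adapted matrix, which is one way to read the abstract's claim of comparable efficiency.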

Paper Structure

This paper contains 40 sections, 4 theorems, 53 equations, 11 figures, 3 tables.

Key Result

Theorem 2.1

Let $\bm W_l(t)$ satisfy the initialization scheme (eq:init) and updates (eq:gd), and suppose $\bm \Phi \in \mathbb{R}^{d \times d}$ has rank at most $r$, with $m := d - 2r > 0$. Then there exist orthogonal matrices $(\bm U_l)_{l=1}^L \subset \mathcal{O}^{d\times d}$ and $(\bm V_l)_{l=1}^L$ […] for all $l \in [L]$ and $t \geq 0$, where $\widetilde{\bm W}_l(t) \in \mathbb{R}^{2r \times 2r}$ […]
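
The invariance can be checked numerically. The sketch below is a rough illustration, not the paper's code: since the summary does not reproduce eq:init or eq:gd, it assumes scaled orthogonal initialization $\bm W_l(0) = \epsilon \bm Q_l$, plain gradient descent without weight decay, and the loss $\tfrac12 \|\bm W_L \cdots \bm W_1 - \bm \Phi\|_F^2$. Under those assumptions, at least $m = d - 2r$ singular values of each $\bm W_l(t)$ should remain equal to one another throughout training. All names and hyperparameters are hypothetical.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix from a QR factorization."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def train_deep_factorization(phi, depth=3, eps=1.0, lr=0.02, steps=200, seed=0):
    """Gradient descent on 0.5 * ||W_L ... W_1 - phi||_F^2 (assumed setup)."""
    rng = np.random.default_rng(seed)
    d = phi.shape[0]
    # weights[i] holds W_{i+1}; assumed scaled orthogonal initialization.
    weights = [eps * random_orthogonal(d, rng) for _ in range(depth)]
    for _ in range(steps):
        # below[i] = product of the i lowest layers (below[0] = I).
        below = [np.eye(d)]
        for w in weights:
            below.append(w @ below[-1])
        # above[j] = product of the j highest layers (above[0] = I).
        above = [np.eye(d)]
        for w in reversed(weights):
            above.append(above[-1] @ w)
        residual = below[depth] - phi
        # Gradient for layer i: (layers above)^T @ residual @ (layers below)^T.
        grads = [above[depth - 1 - i].T @ residual @ below[i].T
                 for i in range(depth)]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

d, r = 30, 3
rng = np.random.default_rng(1)
u = random_orthogonal(d, rng)[:, :r]
v = random_orthogonal(d, rng)[:, :r]
phi = u @ np.diag([2.0, 1.5, 1.0]) @ v.T   # rank-r target

weights = train_deep_factorization(phi)
sv = np.linalg.svd(weights[0], compute_uv=False)
rho = float(np.median(sv))                  # the shared singular value
num_shared = int(np.sum(np.abs(sv - rho) < 1e-8))
print(f"{num_shared} of {d} singular values coincide (expected at least {d - 2 * r})")
```

Only the remaining (at most) $2r$ singular values move independently, which is what makes a compressed factorization with $2r \times 2r$ factors plausible.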

Figures (11)

  • Figure 1: Invariant low-dimensional subspaces in deep overparameterized adaptation of language models. Fine-tuning BERT (Devlin et al., 2019) with deep overparameterized adaptation on the STS-B dataset (Cer et al., 2017). Left: Singular value spectra across all adapted layers at the end of fine-tuning. Middle: Alignment of subspaces formed by the top 8 right singular vectors between current adapted weights and final adapted weights throughout training. Right: Training loss continues to decrease in iterations after subspace alignment with final adapted weights. See \ref{sec:dclora} for more details.
  • Figure 2: Benefits of depth & width in overparameterized matrix completion with $d=100$, $r^*=5$, $\epsilon_l = 10^{-3}$ and 30% of entries observed. Left: Recovery error vs. width for shallow and deep factorizations. Right: Number of GD iterations to converge to $10^{-10}$ error vs. width. We observe that depth prevents overfitting, while width improves convergence.
  • Figure 3: Evolution of SVD of weight matrices. We visualize the SVD dynamics of the first layer weight matrix of an $L=3$ layer deep matrix factorization for a random matrix with $d = 30$, $r^*=3$, $\epsilon_l = 1$ throughout GD without weight decay. Left: Magnitude of the $i$-th singular value $\sigma_i(t)$ at iteration $t$. Middle: Angle $\angle(\bm v_i(t), \bm v_i(0))$ between the $i$-th right singular vector at iteration $t$ and initialization. Right: Angle $\angle(\bm u_i(t), \bm u_i(0))$ between the $i$-th left singular vector at iteration $t$ and initialization.
  • Figure 4: Network compression for deep matrix factorization. Comparison of trajectories for optimizing the original problem \ref{eq:l2_loss} vs. the compressed problem \ref{eq:compression_loss} with $L=3$, $d=1000$, $r = r^* = 5$, and $\epsilon_l = 10^{-3}$. Left: Principal components of end-to-end GD trajectories. Right: Training loss vs. wall-time comparison.
  • Figure 5: Network compression for deep matrix completion. Comparison of trajectories for optimizing the original problem \ref{eq:mc_loss} vs. the compressed problem \ref{eq:compression_loss_mc} with $\gamma$ discrepant updates ($\gamma = 0.01$) and ablating $\gamma$ ($\gamma = 0$) with $L=3$, $d=1000$, $r=r^*=5$, $\epsilon_l=10^{-3}$ and 20% of entries observed. Left: Principal components of end-to-end trajectories of each factorization. Middle: Recovery error vs. iteration comparison. Right: Recovery error vs. wall-time comparison.
  • ...and 6 more figures
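
The Figure 1 caption refers to the alignment between subspaces spanned by the top 8 right singular vectors, but it does not say which alignment measure is plotted. One common choice, shown here purely as an assumption, is the normalized squared Frobenius norm of the overlap between the two orthonormal bases, which equals 1 when the subspaces coincide and 0 when they are orthogonal. Function names and the toy matrices are hypothetical.

```python
import numpy as np

def top_right_singular_vectors(w, k):
    """Return an (n, k) orthonormal basis of the top-k right singular subspace."""
    _, _, vt = np.linalg.svd(w, full_matrices=False)
    return vt[:k].T

def subspace_alignment(w_a, w_b, k=8):
    """Overlap ||V_a^T V_b||_F^2 / k, a value in [0, 1] (assumed metric)."""
    v_a = top_right_singular_vectors(w_a, k)
    v_b = top_right_singular_vectors(w_b, k)
    return float(np.linalg.norm(v_a.T @ v_b, "fro") ** 2 / k)

rng = np.random.default_rng(0)
w_final = rng.standard_normal((64, 64))     # stand-in for final adapted weights
w_other = rng.standard_normal((64, 64))     # unrelated matrix for comparison

same = subspace_alignment(w_final, 3.0 * w_final)   # rescaling keeps the subspace
other = subspace_alignment(w_final, w_other)
print(f"aligned: {same:.3f}, unrelated: {other:.3f}")
```

Tracking this quantity between the current and final weights over training iterations would produce a curve of the kind described in the middle panel.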

Theorems & Definitions (9)

  • Theorem 2.1
  • Proof sketch
  • Proposition 2.2
  • Lemma E.1
  • Proof
  • Proof of \ref{thm:1}
  • Lemma E.2
  • Proof of \ref{lem:learning_rate}
  • Proof