Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics

Soo Min Kwon; Zekai Zhang; Dogyoon Song; Laura Balzano; Qing Qu

TL;DR

This work investigates the prohibitive compute and memory costs of overparameterized models and uncovers that training updates concentrate in a small invariant subspace of dimension $r$. It introduces compressed deep linear networks (C-DLNs) with a rank-$\hat{r}$ factorization, spectral initialization from a surrogate matrix, and a two-timescale gradient update to exploit low-dimensional learning dynamics, reducing parameters from $Ld^2$ to $2d\hat{r}+(L-2)\hat{r}^2$. Theoretical results show that, under incremental learning, the compressed network achieves a lower recovery error than the original network at all iterations in deep matrix factorization; empirical results demonstrate up to $2\times$ faster training and strong transfer to nonlinear architectures, including MLPs and ViTs, with negligible loss in accuracy. Together, these findings offer a principled approach to efficient training and deployment of large models by exploiting their intrinsic low-rank learning dynamics.
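The parameter reduction quoted above can be checked with a few lines of arithmetic. The sketch below assumes a layout in which the two outer factors are $d \times \hat{r}$ and $\hat{r} \times d$ and the $L-2$ middle factors are $\hat{r} \times \hat{r}$; this layout is inferred from the count $2d\hat{r}+(L-2)\hat{r}^2$, and the function names and example sizes are hypothetical.

```python
def dln_param_count(d: int, L: int) -> int:
    """Parameters of an L-layer deep linear network with d x d weights."""
    return L * d * d


def cdln_shapes(d: int, L: int, r_hat: int) -> list:
    """Assumed factor shapes of the compressed network (illustrative layout)."""
    return [(r_hat, d)] + [(r_hat, r_hat)] * (L - 2) + [(d, r_hat)]


def cdln_param_count(d: int, L: int, r_hat: int) -> int:
    """Closed-form count from the text: 2*d*r_hat + (L-2)*r_hat^2."""
    return 2 * d * r_hat + (L - 2) * r_hat ** 2


# Example sizes chosen only for illustration.
d, L, r_hat = 1000, 3, 20
full = dln_param_count(d, L)
compressed = cdln_param_count(d, L, r_hat)
from_shapes = sum(rows * cols for rows, cols in cdln_shapes(d, L, r_hat))

print(f"original:   {full}")
print(f"compressed: {compressed}")
print(f"ratio:      {full / compressed:.1f}x")
```

For these example sizes the original network has 3,000,000 parameters and the compressed one has 40,400, which is roughly a 74-fold reduction.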

Abstract

Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involves decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves the training efficiency by more than 2x, without compromising generalization.
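The observation that weight updates stay in a low-dimensional subspace can be reproduced on a toy problem. The following is a minimal sketch, not the paper's experiment: it assumes a 3-layer deep linear network trained by plain gradient descent on a fully observed rank-$r$ target, with each weight initialized as a small multiple (`eps`) of a random orthogonal matrix. All sizes, the step size, and the initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, L = 20, 2, 3                 # width, target rank, depth (assumed values)
eps, lr, steps = 0.1, 0.05, 5000   # init scale, step size, iterations (assumed)


def random_orthogonal(n):
    """Random n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def chain(mats):
    """Product mats[-1] @ ... @ mats[0]; the identity for an empty list."""
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out


# Rank-r target with singular values 2 and 1.
U = random_orthogonal(d)[:, :r]
V = random_orthogonal(d)[:, :r]
M_star = U @ np.diag([2.0, 1.0]) @ V.T

# W[0] plays the role of W_1 (closest to the input), W[-1] of W_L.
W0 = [eps * random_orthogonal(d) for _ in range(L)]
W = [w.copy() for w in W0]

for _ in range(steps):
    E = chain(W) - M_star  # residual of the end-to-end matrix
    # Gradient of 0.5 * ||W_L ... W_1 - M_star||_F^2 with respect to each layer.
    grads = [chain(W[l + 1:]).T @ E @ chain(W[:l]).T for l in range(L)]
    W = [w - lr * g for w, g in zip(W, grads)]

rel_err = np.linalg.norm(chain(W) - M_star) / np.linalg.norm(M_star)
update_svals = [np.linalg.svd(w - w0, compute_uv=False) for w, w0 in zip(W, W0)]

print(f"relative recovery error: {rel_err:.2e}")
for l, sv in enumerate(update_svals, start=1):
    print(f"layer {l}: top singular value {sv[0]:.3f}, "
          f"singular value {2 * r + 1} is {sv[2 * r]:.2e}")
```

In this toy setting, each update $\bm{W}_l(t) - \bm{W}_l(0)$ has a few dominant singular values, and every singular value beyond the first $2r$ stays below the initialization scale. This is the kind of structure that motivates training a narrower network directly.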

Paper Structure (48 sections, 8 theorems, 48 equations, 22 figures, 3 tables, 1 algorithm)

Introduction
Notation and Organization.
Efficient Network Compression Method
A Basic Problem Setup
Examples of Deep Matrix Recovery.
Efficient Low-Rank Network Compression Methods
Extension to Compression in Linear Layers of Nonlinear Networks
Theoretical Investigations
The Benefits of Spectral Initialization
Remarks.
The Benefits of Incremental Learning
Remarks.
Remarks.
Experiments
Matrix Recovery Problems
...and 33 more sections

Key Result

Theorem 1

Let $\boldsymbol{M}^* \in \mathbb{R}^{d\times d}$ be a matrix of rank $r$ and $\boldsymbol{M}^* = \bm{U}^* \bm{\Sigma}^* \bm{V}^{*\top}$ be an SVD of $\boldsymbol{M}^*$. Suppose we run Algorithm \ref{alg:alg} to update all weights $\left( \widetilde{\bm{W}}_l \right)_{l=1}^L$ of Equation \eqref{eq:comp_deep_mf}, where $\bm{\Lambda}(t) \in \mathbb{R}^{r\times r}$ is a diagonal matrix with entries $\lambda_i(t)^

Figures (22)

Figure 1: Prevalence of low-rank weight updates in various deep networks. Each plot visualizes the singular values of the weight updates from initialization for the penultimate layer weight matrix across different network architectures: a deep linear network (DLN), a multi-layer perceptron (MLP), VGG [DBLP:journals/corr/SimonyanZ14a], and ViT-B [DBLP:conf/iclr/DosovitskiyB0WZ21] (from left to right). The first two networks (DLN and MLP) are trained on MNIST, while the latter two (VGG and ViT-B) are trained on CIFAR-10. The results show a phenomenon that is prevalent across linear and nonlinear networks: gradient descent updates only a small portion of the singular values, while the others remain small and almost unchanged. We provide plots for the respective singular vectors and discuss the training details in Appendix \ref{sec:training_dets}.
Figure 2: Motivating the benefits of our compressed DLN while showcasing the incremental learning phenomenon. Left: plot of the change in singular values of the end-to-end DLN for matrix completion with $r=10$. This shows incremental learning of singular values and implies that we can perform low-rank training within a small subspace without having to overparameterize. Right: recovery error for the original and compressed DLN across iterations and time, respectively.
Figure 3: Motivating the use of spectral initialization for DLNs. These plots measure the similarity between the first principal component of $\bm{W}_L(t)$ and $\bm{U}^*$ (and, respectively, $\bm{W}_1(t)$ and $\bm{V}^*$). This result shows that the leftmost and rightmost factors of the DLN fit the left and right singular vectors of the target matrix $\boldsymbol{M}^*$, respectively.
Figure 4: Occurrence of the incremental learning phenomenon in deep matrix factorization. We observe that the first $r=5$ singular values are fitted incrementally, along with their respective singular subspaces, corroborating Assumption \ref{ass:incremental}.
Figure 5: Empirical results on deep linear matrix factorization. Left: our compressed network achieves a lower recovery error than the original network, corroborating Theorem \ref{thm:recovery_mf}. Right: the speedup over the original network.
...and 17 more figures
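
The caption of Figure 3 suggests initializing the outer factors from singular vectors of a matrix built from the data. The following is a minimal sketch of one such spectral initialization, under loudly stated assumptions: the surrogate is taken to be the fully observed target (as in matrix factorization), the outer factors are set to its top-$\hat{r}$ singular vectors, and the middle factors are set to `eps` times the identity. The helper `spectral_init` and these choices are hypothetical and are not taken from the paper's algorithm.

```python
import numpy as np


def spectral_init(surrogate, r_hat, L, eps=1e-3):
    """Hypothetical spectral initialization for a compressed linear network.

    Returns (W_L, middle, W_1) with shapes d x r_hat, (L - 2) matrices of
    size r_hat x r_hat, and r_hat x d.
    """
    U, _, Vt = np.linalg.svd(surrogate, full_matrices=False)
    W_L = U[:, :r_hat]    # top-r_hat left singular vectors
    W_1 = Vt[:r_hat, :]   # top-r_hat right singular vectors (as rows)
    middle = [eps * np.eye(r_hat) for _ in range(L - 2)]  # assumed small init
    return W_L, middle, W_1


rng = np.random.default_rng(0)
d, r, r_hat, L = 50, 3, 6, 3   # illustrative sizes with r_hat >= r
M_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

W_L, middle, W_1 = spectral_init(M_star, r_hat, L)

# If the outer factors span the singular subspaces of M_star, projecting
# M_star onto them leaves it unchanged up to floating-point error.
projected = W_L @ W_L.T @ M_star @ W_1.T @ W_1
proj_err = np.linalg.norm(projected - M_star) / np.linalg.norm(M_star)
print(f"relative projection error: {proj_err:.2e}")
```

When only partial or noisy measurements are available, the surrogate would no longer equal the target. In that case the projection error above would not vanish and the outer factors would need further updates during training.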

Theorems & Definitions (12)

Theorem 1
Corollary 1
Theorem 2
Theorem 1
proof
Corollary 1
proof
Proposition 1: Arora et al. arora2019implicit
Theorem 2
proof
...and 2 more
