Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics

Soo Min Kwon, Zekai Zhang, Dogyoon Song, Laura Balzano, Qing Qu

TL;DR

This work investigates the prohibitive compute and memory costs of overparameterized models and uncovers that training updates concentrate in a small invariant subspace of dimension $r$. It introduces compressed deep linear networks (C-DLNs) with a rank-$\hat{r}$ factorization, spectral initialization from a surrogate matrix, and a two-timescale gradient update to exploit low-dimensional learning dynamics, reducing parameters from $Ld^2$ to $2d\hat{r}+(L-2)\hat{r}^2$. Theoretical results show that, under incremental learning, the compressed network achieves a lower recovery error than the original network at all iterations in deep matrix factorization; empirical results demonstrate up to $2\times$ faster training and strong transfer to nonlinear architectures, including MLPs and ViTs, with negligible loss in accuracy. Together, these findings offer a principled approach to efficient training and deployment of large models by exploiting their intrinsic low-rank learning dynamics.
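As a quick check on the parameter counts quoted above, the following sketch evaluates $Ld^2$ and $2d\hat{r}+(L-2)\hat{r}^2$ for one illustrative configuration. The function names and the values of `d`, `L`, and `r_hat` are assumptions chosen for illustration; they are not taken from the paper.

```python
def dln_param_count(d: int, L: int) -> int:
    """Parameters in an L-layer deep linear network with d x d weights: L * d^2."""
    return L * d * d


def cdln_param_count(d: int, L: int, r_hat: int) -> int:
    """Parameters in the compressed network: 2 * d * r_hat + (L - 2) * r_hat^2.

    The two outer factors are d x r_hat and r_hat x d; the (L - 2) inner
    factors are r_hat x r_hat.
    """
    return 2 * d * r_hat + (L - 2) * r_hat * r_hat


# Illustrative (hypothetical) sizes, not values reported in the paper.
d, L, r_hat = 1000, 3, 20
original = dln_param_count(d, L)
compressed = cdln_param_count(d, L, r_hat)
print(f"original:   {original:,} parameters")
print(f"compressed: {compressed:,} parameters")
print(f"reduction:  {original / compressed:.1f}x")
```

Because the compressed count grows linearly in $d$ rather than quadratically, the saving widens as the width increases while $\hat{r}$ stays fixed.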

Abstract

Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involves decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves the training efficiency by more than 2x, without compromising generalization.

Paper Structure

This paper contains 48 sections, 8 theorems, 48 equations, 22 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $\boldsymbol{M}^* \in \mathbb{R}^{d\times d}$ be a matrix of rank $r$ and $\boldsymbol{M}^* = \bm{U}^* \bm{\Sigma}^* \bm{V}^{*\top}$ be an SVD of $\boldsymbol{M}^*$. Suppose we run Algorithm 1 to update all weights $\left( \widetilde{\bm{W}}_l \right)_{l=1}^L$ of Equation (eq:comp_deep_mf), where $\bm{\Lambda}(t) \in \mathbb{R}^{r\times r}$ is a diagonal matrix with entries $\lambda_i(t)^

Figures (22)

  • Figure 1: Prevalence of low-rank weight updates in various deep networks. Each plot visualizes the singular values of the weight updates from initialization for the penultimate layer weight matrix, for different network architectures: deep linear network (DLN), multilayer perceptron (MLP), VGG (Simonyan and Zisserman), and ViT-B (Dosovitskiy et al.), from left to right. The first two networks (DLN and MLP) are trained on MNIST, while the latter two (VGG and ViT-B) are trained on CIFAR-10. The result shows a phenomenon that is prevalent across linear and nonlinear networks: gradient descent only updates a small portion of the singular values, while the others remain small and almost unchanged. We provide plots for the respective singular vectors and discuss the training details in Appendix \ref{sec:training_dets}.
  • Figure 2: Motivating the benefits of our compressed DLN while showcasing the incremental learning phenomenon. Left: plot of the change in singular values of the end-to-end DLN for matrix completion with $r=10$. This shows incremental learning of singular values and implies that we can perform low-rank training within a small subspace without having to overparameterize. Right: recovery error for the original and compressed DLN across iterations and time, respectively.
  • Figure 3: Motivating the use of spectral initialization for DLNs. These plots measure the similarity between the first principal component of $\bm{W}_L(t)$ and $\bm{U}^*$ (and, respectively, $\bm{W}_1(t)$ and $\bm{V}^*$). This result shows that the leftmost and rightmost factors of the DLN fit the left and right singular vectors of the target matrix $\boldsymbol{M}^*$, respectively.
  • Figure 4: Occurrence of the incremental learning phenomenon in deep matrix factorization. We observe that the first $r=5$ singular values are fitted incrementally, along with their respective singular subspaces, corroborating Assumption \ref{ass:incremental}.
  • Figure 5: Empirical results on deep linear matrix factorization. Left: shows that our compressed network achieves a lower recovery error than the original network, corroborating Theorem \ref{thm:recovery_mf}. Right: demonstrates the speedup over the original network.
  • ...and 17 more figures
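
The low-rank update behaviour described in the Figure 1 caption can be reproduced on a toy problem. The sketch below is a minimal NumPy experiment, not the authors' code: it trains a 3-layer DLN on a synthetic rank-$r$ deep matrix factorization problem and inspects the singular values of the penultimate layer's update from initialization. The small scaled-orthogonal initialization, the step size, the iteration count, the target spectrum, and the `1e-2` relative threshold are all assumptions made for illustration.

```python
import numpy as np


def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix from the QR factorization of a Gaussian."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q


def train_dln(M, L=3, eps=0.1, lr=0.02, steps=3000, seed=0):
    """Gradient descent on 0.5 * ||W_L ... W_1 - M||_F^2.

    Assumed setup (hypothetical, for illustration): every layer is d x d and
    starts at eps times a random orthogonal matrix.
    Returns the initial and final lists of weights.
    """
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    W0 = [eps * random_orthogonal(d, rng) for _ in range(L)]
    W = [w.copy() for w in W0]
    for _ in range(steps):
        # prefix[l] is the product of the layers before layer l (identity for l = 0).
        prefix = [np.eye(d)]
        for l in range(L):
            prefix.append(W[l] @ prefix[-1])
        # suffix[l + 1] is the product of the layers after layer l.
        suffix = [np.eye(d)]
        for l in reversed(range(L)):
            suffix.append(suffix[-1] @ W[l])
        suffix = suffix[::-1]
        E = prefix[L] - M  # end-to-end residual
        grads = [suffix[l + 1].T @ E @ prefix[l].T for l in range(L)]
        W = [w - lr * g for w, g in zip(W, grads)]
    return W0, W


# Synthetic rank-r target with singular values 3, 2, 1 (illustrative values).
d, r = 20, 3
rng = np.random.default_rng(1)
U, V = random_orthogonal(d, rng), random_orthogonal(d, rng)
M_star = U[:, :r] @ np.diag([3.0, 2.0, 1.0]) @ V[:, :r].T

W0, W = train_dln(M_star)
update = W[-2] - W0[-2]  # penultimate layer, as in the Figure 1 caption
sv = np.linalg.svd(update, compute_uv=False)
num_significant = int(np.sum(sv > 1e-2 * sv[0]))
print("singular values of the update:", np.round(sv, 4))
print(f"{num_significant} of {d} exceed 1% of the largest")
```

In this toy setting only a handful of the $d$ singular values of the update should stand out, with the rest staying near zero, which is the pattern the caption describes for the DLN panel.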

Theorems & Definitions (12)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Proposition 1: Arora et al. arora2019implicit
  • Theorem 2
  • proof
  • ...and 2 more