Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics
Soo Min Kwon, Zekai Zhang, Dogyoon Song, Laura Balzano, Qing Qu
TL;DR
This work addresses the prohibitive compute and memory costs of overparameterized models, building on the observation that training updates concentrate in a small invariant subspace of dimension $r$. It introduces compressed deep linear networks (C-DLNs), which combine a rank-$\hat{r}$ factorization, spectral initialization from a surrogate matrix, and a two-timescale gradient update to exploit these low-dimensional learning dynamics, reducing the parameter count from $Ld^2$ to $2d\hat{r}+(L-2)\hat{r}^2$. Theoretical results show that, under incremental learning, the compressed network achieves a lower recovery error than the original network at all iterations in deep matrix factorization. Empirical results demonstrate more than $2\times$ faster training and strong transfer to nonlinear architectures, including MLPs and ViTs, with negligible loss in accuracy. Together, these findings offer a principled approach to efficient training and deployment of large models by exploiting their intrinsic low-rank learning dynamics.
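To make the parameter reduction concrete, the two counts quoted above can be evaluated directly. The following is a minimal sketch, not code from the paper; the function names and the example values of $L$, $d$, and $\hat{r}$ are illustrative assumptions.

```python
def dln_param_count(L, d):
    """Parameters of an L-layer deep linear network with d x d weights: L * d^2."""
    return L * d * d


def cdln_param_count(L, d, r_hat):
    """Parameters of the compressed network: 2 * d * r_hat + (L - 2) * r_hat^2.

    The two outer factors are d x r_hat and r_hat x d; the remaining
    L - 2 inner factors are r_hat x r_hat.
    """
    return 2 * d * r_hat + (L - 2) * r_hat ** 2


# Illustrative values (assumptions, not taken from the paper's experiments).
L, d, r_hat = 3, 100, 10
original = dln_param_count(L, d)
compressed = cdln_param_count(L, d, r_hat)
print(f"original: {original}, compressed: {compressed}")
```

For these example values the count drops from 30000 to 2100. The saving grows with $d$ whenever $\hat{r} \ll d$, since the compressed count is linear rather than quadratic in $d$.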
Abstract
Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we present a novel approach for compressing overparameterized models, developed through studying their learning dynamics. We observe that for many deep models, updates to the weight matrices occur within a low-dimensional invariant subspace. For deep linear models, we demonstrate that their principal components are fitted incrementally within a small subspace, and use these insights to propose a compression algorithm for deep linear networks that involves decreasing the width of their intermediate layers. We empirically evaluate the effectiveness of our compression technique on matrix recovery problems. Remarkably, by using an initialization that exploits the structure of the problem, we observe that our compressed network converges faster than the original network, consistently yielding smaller recovery errors. We substantiate this observation by developing a theory focused on deep matrix factorization. Finally, we empirically demonstrate how our compressed model has the potential to improve the utility of deep nonlinear models. Overall, our algorithm improves training efficiency by more than $2\times$, without compromising generalization.
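The idea of decreasing the width of the intermediate layers can be illustrated with a small NumPy sketch. It only shows the shapes of a width-reduced deep linear network and the resulting rank constraint on the end-to-end matrix. The helper names, the dimensions, and the random Gaussian initialization are assumptions made for illustration; in particular, this is not the structure-exploiting initialization or the training procedure described above.

```python
import numpy as np


def build_compressed_factors(d, r_hat, L, scale=0.1, seed=0):
    """Build L factors of a width-r_hat deep linear network (requires L >= 2).

    Random Gaussian initialization is an assumption for this sketch only.
    """
    rng = np.random.default_rng(seed)
    first = scale * rng.standard_normal((r_hat, d))       # input layer: r_hat x d
    middle = [scale * rng.standard_normal((r_hat, r_hat))  # inner layers: r_hat x r_hat
              for _ in range(L - 2)]
    last = scale * rng.standard_normal((d, r_hat))         # output layer: d x r_hat
    return [first, *middle, last]


def end_to_end(factors):
    """Multiply the factors into the d x d end-to-end matrix W_L ... W_1."""
    product = factors[0]
    for W in factors[1:]:
        product = W @ product
    return product


factors = build_compressed_factors(d=50, r_hat=5, L=3)
W = end_to_end(factors)
num_params = sum(f.size for f in factors)
print(W.shape, np.linalg.matrix_rank(W), num_params)
```

The end-to-end matrix is still $d \times d$, but its rank cannot exceed the intermediate width, which is why the width must be at least the rank of the matrix being recovered.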
