CURing Large Models: Compression via CUR Decomposition
Sanghyeon Park, Soo-Mook Moon
TL;DR
CURing is a CUR decomposition-based framework that compresses transformer weights by approximating each selected weight matrix $W$ with the product $C U R$, optionally followed by a healing step that trains a small correction $\Delta U$. It combines WANDA activation-informed importance scores with DEIM-based row/column selection, which reduces parameter count while preserving the layer's input/output dimensions and keeping the factors interpretable as actual columns and rows of $W$. Performance can be recovered through layer-wise knowledge distillation (KD) without large-scale retraining. Experiments on Llama3.1-8B and other models show fast compression (often minutes) with competitive accuracy, and healing can significantly restore or even improve performance on held-out tasks. Because updates are confined to a controllable subspace, the approach mitigates forgetting while maintaining task performance, offering a practical, memory-efficient alternative to pruning and full retraining for resource-constrained deployment.
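To make the healing idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a single linear layer, a plain mean-squared error between the original layer's outputs and the compressed layer's outputs as a stand-in for layer-wise KD, and an arbitrary choice of column/row indices in place of the real selection step. The names `X`, `output_mse`, and the sizes are hypothetical. Only $\Delta U$ is updated; $C$, $U$, and $R$ stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, r = 256, 32, 24, 6   # hypothetical sizes

W = rng.standard_normal((d_out, d_in))   # original (teacher) weight
# Hypothetical calibration activations with uneven per-feature scale.
X = rng.standard_normal((n, d_in)) * rng.uniform(0.2, 2.0, size=d_in)

# Frozen factors. The first r columns/rows are an arbitrary stand-in
# for an importance-based selection step.
C = W[:, :r].copy()                            # (d_out, r)
R = W[:r, :].copy()                            # (r, d_in)
U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)  # (r, r)

Y = X @ W.T   # teacher outputs of the uncompressed layer
A = X @ R.T   # activations projected through R, shape (n, r)

def output_mse(delta_U):
    """Half mean squared error between teacher and compressed outputs."""
    pred = A @ (U + delta_U).T @ C.T
    return 0.5 * np.sum((pred - Y) ** 2) / n

# Gradient descent on delta_U only, with step size 1/L, where L is the
# Lipschitz constant of the gradient of this quadratic loss.
delta_U = np.zeros_like(U)
lr = n / (np.linalg.norm(A, 2) ** 2 * np.linalg.norm(C, 2) ** 2)
loss_before = output_mse(delta_U)
for _ in range(200):
    E = A @ (U + delta_U).T @ C.T - Y   # output residual, (n, d_out)
    grad = C.T @ E.T @ A / n            # dLoss/d(delta_U), (r, r)
    delta_U -= lr * grad
loss_after = output_mse(delta_U)

print(f"output loss before healing: {loss_before:.4f}, after: {loss_after:.4f}")
```

Since the trainable part has only $r \times r$ entries, the update is confined to the subspace spanned by the chosen columns and rows, which is one way to read the "controllable subspace" mentioned above.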
Abstract
Large deep learning models have achieved remarkable success but are resource-intensive, posing challenges such as high memory usage. We introduce CURing, a novel model compression method based on CUR matrix decomposition, which approximates a weight matrix as the product of selected columns (C), a small linking matrix (U), and selected rows (R). We apply this decomposition to weights chosen based on the combined influence of their magnitudes and activations. By identifying and retaining informative rows and columns, CURing significantly reduces model size with minimal performance loss. For example, it reduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20 times faster than prior compression methods.
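The decomposition described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's algorithm: it scores entries by weight magnitude times a per-input activation norm (a WANDA-style score), then simply keeps the top-scoring columns and rows, whereas the actual selection procedure may differ. The function name `cur_compress` and the argument `act_norm` are hypothetical. The linking matrix is computed as $U = C^{+} W R^{+}$, the standard least-squares choice for fixed $C$ and $R$.

```python
import numpy as np

def cur_compress(W, act_norm, rank):
    """Approximate W (out x in) as C @ U @ R using `rank` columns and rows.

    act_norm: per-input-feature activation norms, shape (in,).
    Selection is plain top-k on summed scores (an assumption of this sketch).
    """
    score = np.abs(W) * act_norm[None, :]   # magnitude x activation importance
    col_idx = np.sort(np.argsort(score.sum(axis=0))[-rank:])
    row_idx = np.sort(np.argsort(score.sum(axis=1))[-rank:])
    C = W[:, col_idx]                              # actual columns of W
    R = W[row_idx, :]                              # actual rows of W
    U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)  # small linking matrix
    return C, U, R

rng = np.random.default_rng(0)
# A 64 x 48 matrix of exact rank 8, so a rank-8 CUR can reproduce it.
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
act_norm = rng.uniform(0.5, 2.0, size=48)   # hypothetical activation norms

C, U, R = cur_compress(W, act_norm, rank=8)
rel_err = np.linalg.norm(W - C @ U @ R) / np.linalg.norm(W)
params_before = W.size                      # 64 * 48 = 3072
params_after = C.size + U.size + R.size     # 512 + 64 + 384 = 960

print(f"relative error: {rel_err:.2e}")
print(f"parameters: {params_before} -> {params_after}")
```

The product `C @ U @ R` has the same shape as `W`, so the layer's input and output dimensions are unchanged. The parameter saving comes from storing three thin factors instead of one dense matrix, and it only pays off when the chosen rank is well below both dimensions of `W`.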
