CURing Large Models: Compression via CUR Decomposition

Sanghyeon Park; Soo-Mook Moon

TL;DR

CURing is a CUR decomposition-based framework that compresses transformer weights by approximating $W$ with $C U R$ and optionally heals the result through a trainable $\Delta U$. It combines WANDA activation-informed importance scoring with DEIM-based row and column selection, which reduces parameters while preserving input/output dimensions and interpretability, and it can recover performance through layer-wise knowledge distillation (KD) without large-scale retraining. Empirical results on Llama3.1-8B and other models show fast compression (often minutes) with competitive accuracy, and healing can significantly restore or even improve performance on held-out tasks. The approach offers a practical, memory-efficient alternative to pruning and full retraining, with a controllable subspace that mitigates forgetting, making it suitable for deployment in resource-constrained environments.

Abstract

Large deep learning models have achieved remarkable success but are resource-intensive, posing challenges such as memory usage. We introduce CURing, a novel model compression method based on CUR matrix decomposition, which approximates weight matrices as the product of selected columns (C) and rows (R), and a small linking matrix (U). We apply this decomposition to weights chosen based on the combined influence of their magnitudes and activations. By identifying and retaining informative rows and columns, CURing significantly reduces model size with minimal performance loss. For example, it reduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20 times faster than prior compression methods.
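As a rough illustration of the decomposition described in the abstract, the following NumPy sketch builds a rank-$r$ CUR approximation. It assumes DEIM index selection on the leading singular vectors (the "DEIM-CUR" variant named in the paper outline) and the common choice $U = C^{+} W R^{+}$. The function names `deim_indices` and `cur_decompose`, the matrix sizes, and the random test matrix are hypothetical and are not taken from the authors' code; the activation-informed weighting is omitted for brevity.

```python
import numpy as np

def deim_indices(V):
    """Greedy DEIM selection: one index per column of V (shape n x r)."""
    n, r = V.shape
    idx = [int(np.argmax(np.abs(V[:, 0])))]
    for j in range(1, r):
        # Interpolate column j at the already selected indices,
        # then pick the entry where the interpolation residual is largest.
        coeffs = np.linalg.solve(V[idx, :j], V[idx, j])
        residual = V[:, j] - V[:, :j] @ coeffs
        idx.append(int(np.argmax(np.abs(residual))))
    return np.array(idx)

def cur_decompose(W, r):
    """Return C (m x r), U (r x r), R (r x n) with W ~= C @ U @ R."""
    P, s, Qt = np.linalg.svd(W, full_matrices=False)
    rows = deim_indices(P[:, :r])      # row indices from left singular vectors
    cols = deim_indices(Qt[:r, :].T)   # column indices from right singular vectors
    C = W[:, cols]                     # actual columns of W
    R = W[rows, :]                     # actual rows of W
    U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)  # small linking matrix
    return C, U, R, rows, cols

# Illustrative input: a 64 x 48 matrix of exact rank 8 (made-up data).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
C, U, R, rows, cols = cur_decompose(W, r=8)

rel_err = np.linalg.norm(W - C @ U @ R) / np.linalg.norm(W)
params_before = W.size                       # m * n
params_after = C.size + U.size + R.size      # r * (m + n + r)
print(rel_err, params_before, params_after)
```

Because $C$ and $R$ are actual columns and rows of $W$, the factors keep the original input and output dimensions, and the stored parameter count drops from $mn$ to $r(m + n + r)$ whenever $r$ is small relative to $m$ and $n$.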

Paper Structure (35 sections, 4 theorems, 47 equations, 11 figures, 6 tables)

Introduction
Related Work
Pruning
Model Compression
Parameter-Efficient Fine-Tuning
Knowledge Distillation
CUR Matrix Decomposition
DEIM-CUR
Parameter Reduction
CURing
Layer Selection
CUR Decomposition on Weights
Decomposing Multi-Head Attentions
Decomposing Feed-Forward Networks
Layer-wise Knowledge Distillation
...and 20 more sections

Key Result

Theorem 3.1

Let $W \in \mathbb{R}^{m \times n}$ and $1 \le r \le \min(m, n)$. The rank-$r$ singular value decomposition of $W$ is expressed as $W \approx P \Sigma Q^T$, where $P \in \mathbb{R}^{m \times r}$ and $Q \in \mathbb{R}^{n \times r}$ consist of the leading $r$ left and right singular vectors, respectively [...] where $\sigma_{r+1}$ is the first neglected singular value of $W$, and the finite error constants [...]
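The theorem statement is cut off in this extract, so the sketch below does not reproduce its error bound. It only illustrates the standard fact the theorem builds on: for the rank-$r$ truncated SVD $P \Sigma Q^T$, the spectral-norm error equals the first neglected singular value $\sigma_{r+1}$. The matrix sizes and random data are arbitrary assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((30, 20))   # made-up matrix, m = 30, n = 20
r = 5

# Thin SVD, then keep the leading r singular triplets.
P_full, s, Qt_full = np.linalg.svd(W, full_matrices=False)
P = P_full[:, :r]          # m x r, leading left singular vectors
Sigma = np.diag(s[:r])     # r x r
Q = Qt_full[:r, :].T       # n x r, leading right singular vectors

W_r = P @ Sigma @ Q.T
spec_err = np.linalg.norm(W - W_r, 2)   # spectral norm of the residual
sigma_next = s[r]                        # sigma_{r+1} in 1-based notation
print(spec_err, sigma_next)
```

Any CUR approximation of the same rank can at best match this error, which is why bounds of this kind are typically stated as a multiple of $\sigma_{r+1}$.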

Figures (11)

Figure 1: Comparison of compression-and-adaptation methods: LoRA, MoRA, and our proposed CURing. Trainable parameters are in red, with $r$ denoting rank. MoRA and CURing can use a larger $r$ than LoRA without losing parameter efficiency. Figures 1(a) and 1(b) use a compressed model $\widehat{W}$ (e.g., from pruning) with accuracy recovered by retraining low-rank matrices. However, CURing (Figure 1(c)) avoids retraining by using the low-parameter approximation $W \approx C U_0 R$. For further healing, we simply add a trainable matrix $\Delta U$ to $U_0$, without incurring additional inference overhead.
Figure 2: Process of rank-$r$ CUR decomposition in CURing.
Figure 3: CURing process illustrated based on the Llama3.1 architecture. (a) selecting target layers by angular distance, (b--c) decomposing their weights, and optionally (d) healing compression damage. The square multiplication symbol represents matrix multiplication, while the circular one denotes element-wise multiplication.
Figure 4: Performance comparison between compressed models and the original model (at $x=0$). The x-axis represents the number of compressed layers. We measure perplexity on C4 and WikiText2, and accuracy on BoolQ (two-choice) and MMLU (four-choice). The dashed lines are the baselines for random guessing, set at $0.5$ for BoolQ and $0.25$ for MMLU.
Figure 5: Training curves for the healing of CURing compared to LoRA and MoRA. All methods are applied after 10-layer compression. The x-axis represents steps.
...and 6 more figures
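The Figure 1 caption states that healing adds a trainable $\Delta U$ to $U_0$ without extra inference cost. A minimal sketch of why that can hold, assuming $C$ and $R$ stay frozen and the update is simply folded into the linking matrix after training: all shapes, the random factors, and the stand-in "trained" update below are hypothetical, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 32, 24, 6
C = rng.standard_normal((m, r))           # frozen selected columns
R = rng.standard_normal((r, n))           # frozen selected rows
U0 = rng.standard_normal((r, r))          # frozen linking matrix from the decomposition
dU = 0.01 * rng.standard_normal((r, r))   # stand-in for a trained Delta U

x = rng.standard_normal((4, n))           # batch of 4 input vectors

# Training-time view: base path plus a separate correction path.
y_train = x @ (C @ U0 @ R).T + x @ (C @ dU @ R).T

# Inference-time view: fold the update into U once, then apply the
# same three factors as the unhealed model (R, then U, then C).
U = U0 + dU
y_infer = ((x @ R.T) @ U.T) @ C.T

print(np.allclose(y_train, y_infer))
```

Since $C (U_0 + \Delta U) R = C U_0 R + C\,\Delta U\,R$, the merged model has the same factor shapes and the same number of matrix multiplications as the compressed model before healing.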

Theorems & Definitions (7)

Theorem 3.1
Theorem 4.1
Theorem 4.2
Theorem 4.3
proof
proof
proof
