HASSLE-free: A unified Framework for Sparse plus Low-Rank Matrix Decomposition for LLMs

Mehdi Makni, Kayhan Behdin, Zheng Xu, Natalia Ponomareva, Rahul Mazumder

TL;DR

This work tackles the resource-intensive deployment of large language models by introducing HASSLE-free, a unified one-shot sparse-plus-low-rank decomposition framework that directly minimizes a local layer-wise reconstruction objective. It alternates minimization between a sparse component and a low-rank component, using full-Hessian information and a UV-factorization of the low-rank term to scale to models with billions of parameters. Empirically, the approach yields substantial perplexity and zero-shot gains over prior one-shot methods, particularly for Llama-3-8B with 2:4 sparsity and a 64-rank low-rank component, illustrating hardware-aware compression without retraining. The resulting sparse-plus-low-rank format reduces storage requirements and is one for which recent work reports inference acceleration on GPUs. Future directions include integration with quantization and exploration of additional sparsity patterns.
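
The alternating-minimization idea can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it ignores calibration data and Hessian information and minimizes only the weight-space error $\|W - S - L\|_F^2$, alternating an exact 2:4 projection with a truncated SVD. The function names, the grouping of four consecutive entries along each row, and the toy matrix sizes are all assumptions made for illustration.

```python
import numpy as np

def project_2_4(M):
    """Keep the 2 largest-magnitude entries in every group of 4 along each row."""
    rows, cols = M.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = M.reshape(rows, cols // 4, 4).copy()
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def alternating_minimization(W, rank, steps=10):
    """Alternate exact updates of S (2:4 sparse) and L (low rank) for W ~ S + L."""
    L = np.zeros_like(W)
    errors = []
    for _ in range(steps):
        S = project_2_4(W - L)          # best 2:4 sparse fit to the residual
        L = truncated_svd(W - S, rank)  # best low-rank fit to the residual
        errors.append(np.linalg.norm(W - S - L) ** 2)
    return S, L, errors

# Toy example (hypothetical sizes, random weights).
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))
S, L, errors = alternating_minimization(W, rank=4, steps=10)
```

Because each update is an exact minimizer over its own block, the recorded error never increases from one step to the next. The data-aware objective described in the TL;DR replaces the plain weight-space norm with one weighted by calibration inputs, which is where Hessian information enters.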

Abstract

The impressive capabilities of large foundation models come at the cost of substantial computing resources to serve them. Compressing these pre-trained models is of practical interest, as it can democratize their deployment across the machine learning community at large by lowering the costs associated with inference. A promising compression scheme is to decompose foundation models' dense weights into a sum of sparse plus low-rank matrices. In this paper, we design a unified framework coined HASSLE-free for (semi-structured) sparse plus low-rank matrix decomposition of foundation models. Our framework introduces the local layer-wise reconstruction error objective for this decomposition; we demonstrate that prior work solves a relaxation of this optimization problem; and we provide efficient and scalable methods to minimize the exact introduced optimization problem. HASSLE-free substantially outperforms state-of-the-art methods in terms of the introduced objective and a wide range of LLM evaluation benchmarks. For the Llama-3-8B model with a 2:4 sparsity component plus a 64-rank component decomposition, a compression scheme for which recent work shows important inference acceleration on GPUs, HASSLE-free reduces the test perplexity by 12% for the WikiText-2 dataset and reduces the gap (compared to the dense model) of the average of eight popular zero-shot tasks by 15% compared to existing methods.
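
For orientation, a local layer-wise reconstruction objective of this kind is commonly written as below. The notation is an assumption and may differ from the paper's: $X$ denotes calibration inputs to the layer, $W$ the dense weights, $S$ the sparse component, $L$ the low-rank component, $\mathcal{S}$ the set of matrices with the allowed sparsity pattern (for example 2:4), and $r$ the target rank.

$$
\min_{S,\,L}\ \big\| XW - X(S + L) \big\|_F^2 \quad \text{subject to} \quad S \in \mathcal{S}, \quad \operatorname{rank}(L) \le r .
$$

Writing $\Delta = W - S - L$ and $H = X^\top X$, the same quantity expands as

$$
\big\| X \Delta \big\|_F^2 = \operatorname{Tr}\!\big( \Delta^\top X^\top X\, \Delta \big) = \operatorname{Tr}\!\big( \Delta^\top H\, \Delta \big),
$$

so the objective depends on the calibration data only through the matrix $H$.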

Paper Structure

This paper contains 22 sections, 2 theorems, 11 equations, 1 figure, 4 tables, 2 algorithms.

Key Result

Theorem 4.1

If the full-rank-diagonal assumption (ass:full-rank-diagonal) holds, then the closed-form minimizer of the general low-rank problem (eq:general-low-rank) is given by

Figures (1)

  • Figure 1: Local layer-wise reconstruction error $\downarrow$ (lower values are preferred) for the decomposition of the layers of the first transformer block in Llama-3-8B into a 2:4 sparse component plus a 64-rank low-rank component. All methods use the same number of alternating-minimization steps ($80$).
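
The quantity reported in a plot like Figure 1 can be computed as in the following minimal sketch. It assumes the convention that calibration inputs `X` have shape (samples, input features) and weights `W` have shape (input features, output features); the function names, the random data, and the relative-error normalization are illustrative choices, not taken from the paper.

```python
import numpy as np

def layerwise_error(X, W, W_hat):
    """Squared Frobenius error between dense and compressed layer outputs."""
    return np.linalg.norm(X @ W - X @ W_hat, ord="fro") ** 2

def layerwise_error_hessian(X, W, W_hat):
    """Same quantity expressed through H = X^T X, i.e. Tr(D^T H D) with D = W - W_hat."""
    H = X.T @ X
    D = W - W_hat
    return np.trace(D.T @ H @ D)

# Toy data (hypothetical sizes): 128 calibration samples, a 16 -> 8 layer.
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 16))
W = rng.standard_normal((16, 8))
W_hat = W + 0.01 * rng.standard_normal(W.shape)  # stand-in for S + L

err = layerwise_error(X, W, W_hat)
err_h = layerwise_error_hessian(X, W, W_hat)
rel = err / np.linalg.norm(X @ W) ** 2  # error relative to the dense output
```

The two functions agree up to floating-point error, which is why the objective can be evaluated from the accumulated matrix `H` without keeping the calibration inputs in memory.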

Theorems & Definitions (2)

  • Theorem 4.1
  • Corollary 4.2