Adaptive Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Kehai Chen, Min Zhang

TL;DR

Bolaco compresses large language models by combining feature-based low-rank decomposition, pooled covariance estimation, and sample-efficient Bayesian optimization (BO) that allocates low-rank dimensions across layers. The method adds a post-training refinement step, a LoRA variant with a fixed subspace and diagonal tuning, to recover residual performance. On LLaMA-v2-7b/13b, Bolaco outperforms strong baselines on zero-shot tasks and language modeling at the same compression ratio, preserving up to 96–98% of original performance at 20% compression. Rank allocations also transfer well across related models, and the results highlight how much calibration data, objective design, and validation-data selection matter when BO is used for robust LLM compression.

Abstract

In recent years, large language models (LLMs) have driven advances in natural language processing. Their growing scale, however, has increased the computational burden, making it necessary to balance efficiency against performance. Low-rank compression is a promising technique that reduces non-essential parameters by decomposing each weight matrix into the product of two low-rank matrices, yet its application to LLMs has not been extensively studied. The key to low-rank compression lies in two choices: how to perform the low-rank factorization and how to allocate low-rank dimensions. To address the challenges of low-rank compression in LLMs, we conduct empirical research on the low-rank characteristics of large models and propose a low-rank compression method suitable for LLMs. The approach combines precise estimation of feature distributions through pooled covariance matrices with a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
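
To make the factorization step concrete, here is a minimal NumPy sketch of one common feature-based formulation: estimate a pooled covariance of a layer's output features over several calibration batches, keep the top-r eigenvectors, and rewrite the weight as a product of two smaller matrices. The function names (pooled_covariance, compress_linear), the classical pooled-covariance weighting, the omission of bias terms, and the toy dimensions are all assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np


def pooled_covariance(batches):
    """Classical pooled covariance: average the per-batch covariances,
    weighting each by its degrees of freedom (n_i - 1)."""
    dim = batches[0].shape[1]
    weighted_sum = np.zeros((dim, dim))
    dof = 0
    for batch in batches:
        n = batch.shape[0]
        # np.cov with rowvar=False treats rows as samples, columns as features.
        weighted_sum += (n - 1) * np.cov(batch, rowvar=False)
        dof += n - 1
    return weighted_sum / dof


def compress_linear(W, input_batches, rank):
    """Approximate W (d_out x d_in) as A @ B with A: (d_out, rank), B: (rank, d_in).

    The subspace comes from the output features y = W x rather than from W
    itself, which is what makes the decomposition feature-based.
    """
    outputs = [x @ W.T for x in input_batches]    # each (n_i, d_out)
    cov = pooled_covariance(outputs)              # (d_out, d_out)
    _, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    U = eigvecs[:, -rank:]                        # top-`rank` principal directions
    A = U                                         # (d_out, rank)
    B = U.T @ W                                   # (rank, d_in)
    return A, B


# Toy usage: a 16 x 32 weight whose true rank is 4 (hypothetical sizes).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4)) @ rng.standard_normal((4, 32))
batches = [rng.standard_normal((64, 32)) for _ in range(3)]
A, B = compress_linear(W, batches, rank=4)

params_before = W.size            # d_out * d_in
params_after = A.size + B.size    # rank * (d_out + d_in)
```

In this toy case the two factors hold 4 × (16 + 32) = 192 parameters instead of 16 × 32 = 512. In general the factorization only saves parameters when the rank is below d_out · d_in / (d_out + d_in), which is why the per-layer rank allocation matters.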

Paper Structure

This paper contains 27 sections, 9 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Sensitivity of different types of layers to low-rank compression. Each curve represents the compression of only that parameter type, with the horizontal axis indicating the compression ratio for that specific parameter type.
  • Figure 2: Illustration of our Bolaco. It initializes a low-rank dimension allocation and compresses the model via feature-based low-rank compression. Then, it evaluates the compression performance and optimizes the low-rank dimension allocation through Gaussian process-based Bayesian optimization.
  • Figure 3: Perplexity of LLaMA 2-7b on WikiText2 at different compression ratios.
  • Figure 4: Average zero-shot task performance when rank allocations are transferred across models.
  • Figure 5: More results on low-rank sensitivity.
  • ...and 1 more figure