Layer-wise dynamic rank for compressing large language models

Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang

TL;DR

This work tackles the memory and compute bottlenecks of large language models by introducing D-Rank, a layer-wise dynamic rank allocation framework for SVD-based compression that uses an effective rank metric to quantify information density. Through a Lagrange-multiplier optimization, D-Rank allocates more ranks to higher-density groups under a fixed budget, and it rebalances capacity across attention components by shifting some budget from W^Q and W^K to W^V. The approach is extended to models with grouped-query attention and validated across LLaMA and Mistral families, showing consistent improvements in perplexity and zero-shot reasoning, along with higher throughput, and robustness across seeds and calibration data. The results demonstrate a practical, scalable path to deploy compressed LLMs with preserved or enhanced performance, and the method can be combined with LoRA fine-tuning for additional gains.

Abstract

Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges that hinder their deployment. Singular Value Decomposition (SVD)-based compression has emerged as an appealing post-training compression technique for LLMs, yet most existing methods apply a uniform compression ratio across all layers, implicitly assuming that every layer carries a similar amount of information. This overlooks the substantial heterogeneity across layers observed in LLMs, where middle layers tend to encode richer information while early and late layers are more redundant. In this work, we revisit the existing SVD-based compression method and propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLM compression. We first introduce effective rank as a principled metric to measure the information density of weight matrices, and then allocate ranks via a Lagrange multiplier-based optimization scheme that adaptively assigns more capacity to groups with higher information density under a fixed compression ratio. Moreover, we rebalance the allocated ranks across attention layers to account for their varying importance and extend D-Rank to the latest LLMs with grouped-query attention. Extensive experiments on LLMs of different scales across multiple compression ratios demonstrate that D-Rank consistently outperforms SVD-LLM, ASVD, and Basis Sharing, lowering perplexity by more than 15 for the LLaMA-3-8B model on the C4 dataset at a 20% compression ratio and improving zero-shot reasoning accuracy by up to 5% for the LLaMA-7B model at a 40% compression ratio, while achieving even higher throughput.
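
To make the effective rank metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and applies it to a plain weight matrix; any calibration-based scaling used in the paper is not reproduced here. The function names `effective_rank` and `rank_for_ratio` are hypothetical, and `rank_for_ratio` only illustrates the standard parameter count of a rank-r factorization, r(m + n) versus mn, under a uniform compression ratio.

```python
import numpy as np


def effective_rank(weight: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank (assumed definition): exp(H(p)),
    where p_i = sigma_i / sum_j sigma_j over the singular values."""
    sigma = np.linalg.svd(weight, compute_uv=False)
    p = sigma / (sigma.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries so log() stays finite
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


def rank_for_ratio(rows: int, cols: int, compression_ratio: float) -> int:
    """Largest rank r such that the two factors, with r * (rows + cols)
    parameters, fit within (1 - compression_ratio) * rows * cols."""
    budget = (1.0 - compression_ratio) * rows * cols
    return int(budget // (rows + cols))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dense = rng.standard_normal((64, 64))                    # spectrum spread out
    low = np.outer(rng.standard_normal(64), rng.standard_normal(64))  # rank 1
    print("effective rank (dense):", effective_rank(dense))
    print("effective rank (rank-1):", effective_rank(low))
    print("uniform rank at 20%:", rank_for_ratio(4096, 4096, 0.20))
```

A matrix whose singular values are nearly equal scores close to its full dimension, while a matrix dominated by a few singular values scores much lower. This is the sense in which the metric can indicate which groups merit a larger share of a fixed rank budget.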

Paper Structure

This paper contains 17 sections, 16 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The overall pipeline of our proposed D-Rank
  • Figure 2: Effective ranks of grouped $W^Q,W^K,W^V$ matrices for LLaMA-7B model on Wikitext-2 (two layers as a group)
  • Figure 3: LoRA fine-tuning PPL ($\downarrow$) results of compressed LLaMA-7B
  • Figure 4: Throughput of dense LLaMA-7B model and the compressed model with Basis Sharing baseline and D-Rank under compression ratios from 20% to 50%.
  • Figure 5: Comparison of PPL with baselines on LLaMA-7B model when selecting the calibration data from Wikitext-2 with different seeds to compute $\mathcal{S}$