Layer-wise dynamic rank for compressing large language models
Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang
TL;DR
This work tackles the memory and compute bottlenecks of large language models by introducing D-Rank, a layer-wise dynamic rank allocation framework for SVD-based compression that uses effective rank as a measure of information density. Through a Lagrange-multiplier optimization, D-Rank allocates more ranks to groups with higher information density under a fixed budget, and it rebalances capacity across attention components by shifting some budget from W^Q and W^K to W^V. The approach is extended to models with grouped-query attention and validated on the LLaMA and Mistral families, where it consistently improves perplexity and zero-shot reasoning accuracy, delivers higher throughput, and remains robust across seeds and calibration data. The results point to a practical, scalable path for deploying compressed LLMs with preserved or improved performance, and the method can be combined with LoRA fine-tuning for additional gains.
Abstract
Large language models (LLMs) have rapidly scaled in size, bringing severe memory and computational challenges that hinder their deployment. Singular Value Decomposition (SVD)-based compression has emerged as an appealing post-training compression technique for LLMs, yet most existing methods apply a uniform compression ratio across all layers, implicitly assuming that every layer carries a similar amount of information. This overlooks the substantial inter-layer heterogeneity observed in LLMs, where middle layers tend to encode richer information while early and late layers are more redundant. In this work, we revisit existing SVD-based compression methods and propose D-Rank, a framework with layer-wise balanced Dynamic Rank allocation for LLM compression. We first introduce effective rank as a principled metric for the information density of weight matrices, and then allocate ranks via a Lagrange multiplier-based optimization scheme that adaptively assigns more capacity to groups with higher information density under a fixed compression ratio. Moreover, we rebalance the allocated ranks across attention layers to account for their varying importance and extend D-Rank to the latest LLMs with grouped-query attention. Extensive experiments on LLMs of different scales across multiple compression ratios demonstrate that D-Rank consistently outperforms SVD-LLM, ASVD, and Basis Sharing. It achieves a perplexity more than 15 points lower with the LLaMA-3-8B model on the C4 dataset at a 20% compression ratio and up to 5% higher zero-shot reasoning accuracy with the LLaMA-7B model at a 40% compression ratio, while also delivering higher throughput.
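
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); the abstract does not state which definition D-Rank uses. The allocation step is likewise an illustrative stand-in: it assumes a hypothetical objective, maximizing the sum of e_i * log(r_i) subject to the parameter budget sum of c_i * r_i = B, whose Lagrange-multiplier solution is r_i = e_i * B / (c_i * sum_j e_j). The function names `effective_rank` and `allocate_ranks` are invented for this example.

```python
import numpy as np


def effective_rank(weight: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank (assumed definition, see lead-in)."""
    # Singular values only; the singular vectors are not needed.
    s = np.linalg.svd(weight, compute_uv=False)
    # Normalize the singular values into a probability distribution.
    p = s / (s.sum() + eps)
    # Drop numerically zero entries so that log(p) stays finite.
    p = p[p > eps]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def allocate_ranks(densities, costs, budget):
    """Closed-form ranks for the assumed objective sum_i e_i * log(r_i).

    densities: information density e_i of each group (e.g. effective rank).
    costs:     parameters added per unit of rank, c_i. For an m x n matrix
               factorized at rank r the cost is r * (m + n), so c_i = m + n.
    budget:    total number of parameters B allowed after compression.
    Returns real-valued ranks; rounding and clipping to each matrix's
    maximum rank are left out of this sketch.
    """
    e = np.asarray(densities, dtype=float)
    c = np.asarray(costs, dtype=float)
    # Stationarity gives e_i / r_i = lambda * c_i; the budget constraint
    # then fixes lambda = sum(e) / B.
    return e * budget / (c * e.sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two toy "layers": one full-rank, one built from a rank-2 product.
    dense = rng.standard_normal((64, 64))
    redundant = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 64))
    e = [effective_rank(dense), effective_rank(redundant)]
    ranks = allocate_ranks(e, costs=[128, 128], budget=128 * 40)
    print("effective ranks:", e)
    print("allocated ranks:", ranks)
```

Under this assumed objective, groups with equal per-rank cost receive ranks in proportion to their effective rank, which matches the qualitative behavior described above: denser groups get more capacity while the total parameter count stays at the budget.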
