BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments

Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu

TL;DR

BitStack tackles the challenge of deploying large language models under variable memory on local devices by introducing a training-free, decomposition-based weight compression strategy. It uses activation-aware scaling and iterative absolute value decomposition to generate small residual blocks that can be loaded from storage to dynamically reconstruct the full model, achieving approximately $1$ bit per parameter per iteration. By universally sorting residual blocks into a single stack and loading them based on available memory, BitStack enables megabyte-level memory–performance trade-offs and avoids re-compression for different memory budgets. Empirical results on Llama 2/3/3.1 show BitStack matches or surpasses practical baselines like GPTQ and AWQ, especially at extreme compression, and extends to instruction-tuned models with MT-Bench, highlighting its practical impact for offline and edge deployments.
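The iterative decomposition mentioned above can be illustrated with a minimal NumPy sketch. It assumes each iteration splits the current residual into a sign matrix and a rank-k SVD of its absolute values; the function names (`avd_block`, `decompose`, `reconstruct`), the random test matrix, and the choice k = 4 are illustrative assumptions rather than the authors' implementation, and activation-aware scaling is omitted for brevity.

```python
import numpy as np

def avd_block(residual, k):
    """One decomposition step: sign matrix plus rank-k SVD of |residual|."""
    sign = np.where(residual >= 0, 1.0, -1.0)
    u, s, vt = np.linalg.svd(np.abs(residual), full_matrices=False)
    u_k = u[:, :k] * s[:k]   # fold the top-k singular values into the left vectors
    vt_k = vt[:k, :]
    return sign, u_k, vt_k

def decompose(weight, num_blocks, k):
    """Repeatedly decompose what the previous blocks failed to capture."""
    blocks = []
    residual = weight.copy()
    for _ in range(num_blocks):
        sign, u_k, vt_k = avd_block(residual, k)
        blocks.append((sign, u_k, vt_k))
        residual = residual - sign * (u_k @ vt_k)
    return blocks

def reconstruct(blocks, shape):
    """Sum the contributions of however many blocks are currently loaded."""
    approx = np.zeros(shape)
    for sign, u_k, vt_k in blocks:
        approx += sign * (u_k @ vt_k)
    return approx

# Illustrative data only: a random matrix stands in for a weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))
blocks = decompose(w, num_blocks=4, k=4)
errors = [np.linalg.norm(w - reconstruct(blocks[:n], w.shape)) for n in range(1, 5)]
```

In this sketch the reconstruction error shrinks as more blocks are summed, which is the property that lets the number of loaded blocks act as a size knob.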

Abstract

Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in a residual block of approximately 1 bit per parameter in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.
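The idea of loading a variable number of sorted blocks can be sketched as follows. This is a hedged illustration, not the paper's procedure: the `ResidualBlock` fields, the `importance` score, and the block sizes are hypothetical placeholders, and the actual sorting criterion is not reproduced here. The sketch assumes importance decreases with the iteration index within each weight, so a weight's earlier blocks always load before its later ones.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResidualBlock:
    weight_name: str   # hypothetical identifier of the owning weight matrix
    index: int         # decomposition iteration that produced the block
    size_bytes: int
    importance: float  # placeholder score, used only to order the stack

def build_stack(blocks):
    """Order blocks once, most important first; the order is reused for every budget."""
    return sorted(blocks, key=lambda b: (-b.importance, b.index))

def blocks_to_load(stack, budget_bytes):
    """Take the longest prefix of the stack that fits in the memory budget."""
    loaded, used = [], 0
    for block in stack:
        if used + block.size_bytes > budget_bytes:
            break
        loaded.append(block)
        used += block.size_bytes
    return loaded, used

# Illustrative blocks only; names, sizes, and scores are made up.
stack = build_stack([
    ResidualBlock("q_proj", 0, 100, 9.0),
    ResidualBlock("q_proj", 1, 100, 4.0),
    ResidualBlock("up_proj", 0, 300, 8.0),
    ResidualBlock("up_proj", 1, 300, 3.0),
])
small, small_used = blocks_to_load(stack, budget_bytes=450)
large, large_used = blocks_to_load(stack, budget_bytes=520)
```

Because every budget selects a prefix of the same stack, growing the budget only appends blocks and shrinking it only drops blocks from the end, so no re-compression is needed when available memory changes.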

Paper Structure

This paper contains 37 sections, 8 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: BitStack enables dynamic compression of LLMs in variable memory environments (a), while still matching or surpassing the performance of practical compression methods such as GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2024) with the same memory footprint (b).
  • Figure 2: Overview of BitStack. BitStack dynamically loads and offloads residual blocks (Figure 3) between RAM and storage devices based on current memory availability. We can load more weight residuals from storage when available memory increases (a), or offload them otherwise (b). The residual blocks for all weights across all layers are universally stored in the same stack on the storage device (grey blocks denote residual blocks for weights in other layers). Note that we omit positional embeddings, normalization layers, and residual connections in the figure for clarity.
  • Figure 3: Illustration of a residual block in BitStack. A residual block consists of a sign matrix and singular vectors obtained through absolute value decomposition. The sign matrix can be packed into GPU-supported data types to minimize memory usage. The figure uses separate symbols for the sign matrix and the packed sign matrix.
  • Figure 4: Evaluation results of BitStack Llama 3.1 Instruct 8B/70B models on MT-Bench, assessed by gpt-4o. Panel (a) shows the single-answer grading results across various sizes of the 8B model loaded by BitStack, while panel (b) shows the pairwise comparison results against AWQ at different compression ratios for both the 8B and 70B models.
  • Figure 5: Perplexity and average zero-shot performance of BitStack Llama 3.1 8B with or without activation-aware scaling and absolute value decomposition (AVD). In the "w/o scaling" experiments, the activation-aware scaling of the scaled-inference equation is not applied; in the "w/o AVD" experiments, vanilla SVD is used in place of the AVD of the absolute-value decomposition equation. For vanilla SVD, we set $k' = k + \frac{m \times n}{16 \times (m + n)}$ (for $\mathbf{W} \in \mathbb{R}^{m \times n}$) to ensure the size of each residual block matches that of the main experiments. Solid lines represent average zero-shot performance, while dotted lines represent perplexity scores.
  • ...and 7 more figures
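
The sign packing described in the Figure 3 caption and the size-matching rule for $k'$ in the Figure 5 caption can be checked with a small pure-Python sketch. It assumes that singular vectors are stored as 16-bit values (which is what the factor 16 in the formula suggests) and that each sign costs one bit; the helper names and the example shape m = n = 4096 with k = 16 are illustrative assumptions, not values confirmed by the text above.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 signs into bytes, eight signs per byte."""
    out = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def unpack_signs(packed, count):
    """Recover the first `count` signs from the packed bytes."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(count)]

def avd_block_bits(m, n, k, value_bits=16):
    """Size of one AVD block: one bit per sign plus k left and right vectors."""
    return m * n + value_bits * k * (m + n)

def svd_block_bits(m, n, k_prime, value_bits=16):
    """Size of one vanilla SVD block: k' left and right vectors, no sign matrix."""
    return value_bits * k_prime * (m + n)

def matched_rank(m, n, k):
    """k' from the Figure 5 caption: the rank at which both block sizes agree."""
    return k + (m * n) / (16 * (m + n))

signs = [1, -1, -1, 1, 1, 1, -1, 1, -1, 1]
packed = pack_signs(signs)

m, n, k = 4096, 4096, 16          # illustrative shape and rank
k_prime = matched_rank(m, n, k)
bits_per_param = avd_block_bits(m, n, k) / (m * n)
```

Under these assumptions the sign matrix dominates the block size, which is why each block costs only slightly more than 1 bit per parameter, and the vanilla SVD baseline needs a much larger rank to occupy the same space.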