FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang

TL;DR

FLAT-LLM tackles the challenge of deploying large language models in constrained environments by introducing a training-free, fine-grained structural compression that operates in the activation space. It uses head-wise PCA to compress the value and output projections within multi-head attention and a greedy, importance-aware rank selection to allocate ranks across decoder layers, enabling substantial model-size reduction without recovery fine-tuning. The approach achieves strong language modeling and downstream task performance across multiple models with meaningful inference speedups and compatibility with post-training quantization. A theoretical analysis links truncation loss to discarded eigenvalues, and extensive experiments demonstrate FLAT-LLM’s superior generalization and practical deployability compared to prior low-rank and pruning baselines.

Abstract

Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation and expensive calibration procedures, and they yield inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast, accurate, and training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis (PCA), and employ a greedy budget redistribution strategy to adaptively allocate ranks across decoder layers. FLAT-LLM compresses weights efficiently and effectively without recovery fine-tuning, and its calibration can complete within a few minutes. Evaluated across 5 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
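
To make the head-wise PCA step more concrete, here is a minimal NumPy sketch of one way the truncated eigenvectors could be absorbed into the value and output projections. It is an illustration under assumptions, not the authors' implementation: the function name `headwise_pca_compress`, the weight layouts (`W_v` stored as `(n_heads * d_h, d_hid)`, `W_o` as `(d_hid, n_heads * d_h)`), and the use of one shared rank for every head are all hypothetical choices made for this example.

```python
import numpy as np

def headwise_pca_compress(X, W_v, W_o, n_heads, rank):
    """Hypothetical sketch: truncate value/output projections head by head.

    X   : (n_tokens, d_hid) calibration activations entering the value layer
    W_v : (n_heads * d_h, d_hid) value projection (assumed layout)
    W_o : (d_hid, n_heads * d_h) output projection (assumed layout)
    Returns compressed weights of shape (n_heads * rank, d_hid) and
    (d_hid, n_heads * rank).
    """
    d_h = W_v.shape[0] // n_heads
    W_v_small, W_o_small = [], []
    for h in range(n_heads):
        rows = slice(h * d_h, (h + 1) * d_h)
        W_v_h = W_v[rows, :]                      # (d_h, d_hid)
        W_o_h = W_o[:, rows]                      # (d_hid, d_h)
        Y = X @ W_v_h.T                           # per-head value output
        # eigh returns eigenvalues in ascending order, so reverse the columns
        _, eigvecs = np.linalg.eigh(Y.T @ Y)
        Q = eigvecs[:, ::-1][:, :rank]            # top-`rank` eigenvectors
        W_v_small.append(Q.T @ W_v_h)             # (rank, d_hid)
        W_o_small.append(W_o_h @ Q)               # (d_hid, rank)
    return np.concatenate(W_v_small, axis=0), np.concatenate(W_o_small, axis=1)

# Toy usage with random data (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))
W_v = rng.standard_normal((16, 16))
W_o = rng.standard_normal((16, 16))
W_v_c, W_o_c = headwise_pca_compress(X, W_v, W_o, n_heads=4, rank=2)
print(W_v_c.shape, W_o_c.shape)
```

Because the attention weights multiply the per-head value output from the left, inserting the projection between the value and output projections commutes with the attention mixing. That is why, in this sketch, the projection can be folded into the two weight matrices with no extra layer at inference time.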

Paper Structure

This paper contains 35 sections, 2 theorems, 23 equations, 10 figures, 7 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $\mathbf{Y}_v^h = \mathbf{X} {\mathbf{W}_v^h}^\top$, and let $\tilde{\mathbf{Y}}_v^h = \mathbf{Y}_v^h \tilde{\mathbf{Q}}_v^h {\tilde{\mathbf{Q}}_v^h}^\top$ be the rank-$r$ approximation obtained by projecting $\mathbf{Y}_v^h$ onto its top-$r$ principal components. Then the squared Frobenius norm of the reconstruction error equals the sum of the discarded eigenvalues: $\|\mathbf{Y}_v^h - \tilde{\mathbf{Y}}_v^h\|_F^2 = \sum_{i > r} \lambda_i^h$, where $\{ \lambda_i^h \}$ are the eigenvalues of ${\mathbf{Y}_v^h}^\top \mathbf{Y}_v^h$, sorted in descending order.
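
The relationship between truncation loss and discarded eigenvalues can be checked numerically. The snippet below is a small sanity check on random data, not part of the paper: the helper name `pca_truncation_error` and the matrix sizes are assumptions made for illustration. It projects a matrix onto the top-$r$ eigenvectors of its uncentered Gram matrix and compares the squared Frobenius error with the sum of the remaining eigenvalues.

```python
import numpy as np

def pca_truncation_error(Y, rank):
    """Return (squared Frobenius error, sum of discarded eigenvalues).

    Hypothetical helper: uses the eigendecomposition of Y^T Y with no
    mean-centering, matching the Gram matrix named in the theorem.
    """
    eigvals, eigvecs = np.linalg.eigh(Y.T @ Y)
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Q = eigvecs[:, :rank]                        # top-`rank` principal directions
    Y_tilde = Y @ Q @ Q.T                        # rank-`rank` approximation
    err = np.linalg.norm(Y - Y_tilde, "fro") ** 2
    discarded = eigvals[rank:].sum()
    return err, discarded

# Toy check: X and W stand in for calibration activations and one head's
# value weights; the values are random and purely illustrative.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
W = rng.standard_normal((8, 32))
err, discarded = pca_truncation_error(X @ W.T, rank=3)
print(np.isclose(err, discarded))
```

The two quantities should agree up to floating-point error for any choice of rank, because the projection removes exactly the components along the discarded eigenvectors.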

Figures (10)

  • Figure 1: Comparison of WikiText-2 perplexity against various baselines on Llama-2 13B model.
  • Figure 2: Decoder structure before (left) and after (right) weight truncation. Orange blocks indicate truncated weights; hatched areas show removed weights; blue boxes denote non-linear functions.
  • Figure 3: Fine-grained head-wise PCA in value layer.
  • Figure 4: Remaining rank ratio versus layer index, computed with Algorithm 1. The average remaining ratio ranges from 30% (lowest solid line) to 90% (highest solid line).
  • Figure 5: Comparison of inference throughput and memory usage with prior low-rank-based methods.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Theorem 4.1: Reconstruction Error of Single-head Output PCA Projection
  • Corollary 4.2: Reconstruction Error of Multi-head Output PCA Projection
  • proof
  • proof