FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang
TL;DR
FLAT-LLM tackles the challenge of deploying large language models in constrained environments by introducing a training-free, fine-grained structural compression that operates in the activation space. It uses head-wise PCA to compress the value and output projections within multi-head attention and a greedy, importance-aware rank selection to allocate ranks across decoder layers, enabling substantial model-size reduction without recovery fine-tuning. The approach achieves strong language modeling and downstream task performance across multiple models with meaningful inference speedups and compatibility with post-training quantization. A theoretical analysis links truncation loss to discarded eigenvalues, and extensive experiments demonstrate FLAT-LLM’s superior generalization and practical deployability compared to prior low-rank and pruning baselines.
Abstract
Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation and expensive calibration procedures, and they produce inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast, accurate, and training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis, and employ a greedy budget redistribution strategy to adaptively allocate ranks across decoder layers. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, and its calibration completes within a few minutes. Evaluated across 5 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
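
As a rough illustration of the head-wise PCA idea, the NumPy sketch below compresses the value and output projections of a single attention head. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `compress_head`, the random calibration data, the toy dimensions, and the use of an uncentered second-moment matrix are all hypothetical choices, and the attention weights are omitted for simplicity. The final check illustrates the standard PCA identity that the squared truncation error equals the sum of the discarded eigenvalues.

```python
import numpy as np


def compress_head(X, W_v, W_o, rank):
    """Hypothetical helper: truncate one head's value/output projections via PCA.

    X   : (n_tokens, d_model) calibration activations (assumed input)
    W_v : (d_model, d_head)   value projection of one head
    W_o : (d_head, d_model)   matching slice of the output projection
    """
    V = X @ W_v                               # head activations, (n_tokens, d_head)
    second_moment = V.T @ V                   # uncentered (an assumption), (d_head, d_head)
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Q_r = eigvecs[:, :rank]                   # top-`rank` principal directions
    # Absorb the truncated basis into both weights: V @ W_o ~= (X @ W_v_c) @ W_o_c
    W_v_c = W_v @ Q_r                         # (d_model, rank)
    W_o_c = Q_r.T @ W_o                       # (rank, d_model)
    return W_v_c, W_o_c, Q_r, eigvals


# Toy, randomly generated data: illustrative only, not from the paper.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head, rank = 256, 32, 8, 4
X = rng.standard_normal((n_tokens, d_model))
W_v = rng.standard_normal((d_model, d_head))
W_o = rng.standard_normal((d_head, d_model))

W_v_c, W_o_c, Q_r, eigvals = compress_head(X, W_v, W_o, rank)

# Squared Frobenius error of projecting V onto the kept subspace ...
V = X @ W_v
V_hat = V @ Q_r @ Q_r.T
truncation_loss = np.linalg.norm(V - V_hat) ** 2
# ... matches the sum of the discarded eigenvalues.
discarded = eigvals[rank:].sum()
print(np.isclose(truncation_loss, discarded, rtol=1e-6))
```

In this sketch the head's inner dimension drops from `d_head` to `rank`, so both projections shrink and no extra adapter matrices are introduced. How the rank is chosen per layer (the greedy budget redistribution) is not shown here.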
