FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu
TL;DR
This paper addresses the inference cost of large language models with FLRC, a fine-grained low-rank compression framework. It combines Fisher-based layer-wise rank allocation with progressive low-rank decoding, so that model capacity is allocated per layer and reduced progressively during generation. Empirically, FLRC improves generation quality (up to 17% in ROUGE-L on summarization relative to prior low-rank compression methods) and keeps understanding performance robust, while reducing the cost of rank search and retaining speedups during decoding. The approach offers a practical path to deploying LLMs on resource-constrained hardware, including on-device and edge scenarios.
Abstract
Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC: it achieves up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework for LLM inference.
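To make the idea of per-layer low-rank compression concrete, the following is a minimal sketch, not FLRC itself. It assumes plain truncated SVD as the factorization and uses hand-picked ranks; the names `low_rank_factors`, `param_ratio`, `layers`, and `ranks` are hypothetical, and the rank values stand in for whatever a rank-allocation procedure would produce.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Factor `weight` into (a, b) so that a @ b is its best rank-`rank`
    approximation in Frobenius norm (truncated SVD)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_dim, rank), singular values folded in
    b = vt[:rank, :]            # shape (rank, in_dim)
    return a, b


def param_ratio(shape, rank):
    """Parameters kept by the two factors, relative to the dense matrix."""
    out_dim, in_dim = shape
    return rank * (out_dim + in_dim) / (out_dim * in_dim)


# Toy "model": two layers of equal shape (random weights, illustration only).
rng = np.random.default_rng(0)
layers = {
    "layer_0": rng.standard_normal((64, 48)),
    "layer_1": rng.standard_normal((64, 48)),
}

# Hypothetical non-uniform allocation: layer_0 keeps more rank than layer_1.
ranks = {"layer_0": 24, "layer_1": 8}

compressed = {name: low_rank_factors(w, ranks[name]) for name, w in layers.items()}

for name, (a, b) in compressed.items():
    w = layers[name]
    rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
    print(name, "rank", ranks[name],
          "param ratio", round(param_ratio(w.shape, ranks[name]), 3),
          "relative error", round(float(rel_err), 3))
```

A factored layer stores `rank * (out_dim + in_dim)` values instead of `out_dim * in_dim`, so it only saves memory and compute when the rank is below `out_dim * in_dim / (out_dim + in_dim)`. A non-uniform allocation spends that budget unevenly, giving more rank to some layers than to others.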
