FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu

TL;DR

This paper tackles the efficiency bottlenecks of large language models by introducing FLRC, a fine-grained, rank-aware compression framework. It pairs Fisher-based layer-wise rank allocation with progressive low-rank decoding, which shrinks the model's activated rank as generation proceeds. Empirically, FLRC improves generation quality (up to 17% higher ROUGE-L on summarization than prior low-rank compression methods) and holds up on understanding tasks, while reducing the cost of rank search and preserving decoding speedups. The approach offers a practical path to deploying high-performing LLMs on resource-constrained hardware, including on-device and edge scenarios.

Abstract

Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
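As a rough illustration of why low-rank compression reduces memory and compute, the sketch below counts parameters for a dense weight matrix versus a rank-r factorization into two thinner matrices. The function names and the 4096 x 4096 shape are illustrative assumptions, not taken from the paper; this shows only the generic arithmetic behind low-rank factorization, not FLRC itself.

```python
def dense_params(rows, cols):
    """Parameter count of a dense rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameter count after factorizing W (rows x cols) into
    A (rows x rank) and B (rank x cols)."""
    return rank * (rows + cols)


def max_rank_for_ratio(rows, cols, keep_ratio):
    """Largest rank whose factorized size stays within keep_ratio
    of the dense parameter count (hypothetical helper)."""
    return int(keep_ratio * rows * cols / (rows + cols))


# Illustrative shape only (assumption): a square 4096 x 4096 projection.
rows, cols = 4096, 4096
rank = max_rank_for_ratio(rows, cols, keep_ratio=0.8)

print("dense parameters:   ", dense_params(rows, cols))
print("retained rank:      ", rank)
print("low-rank parameters:", low_rank_params(rows, cols, rank))
```

Note that a factorization only saves parameters when the rank is below rows * cols / (rows + cols), which is 2048 for this shape; above that threshold the two factors are larger than the original matrix.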

Paper Structure

This paper contains 21 sections, 5 equations, 2 figures, 14 tables, 1 algorithm.

Figures (2)

  • Figure 1: The differences between FLRC and traditional low-rank compression. As shown on the left side of the figure, we can determine the optimal number of ranks to preserve for each layer. On the right side, during the decoding stage, our approach gradually reduces the model's overall activated rank as more tokens are generated, unlike previous static methods, thereby decreasing the parameter usage and computational requirements while maintaining the quality of the generated output.
  • Figure 2: The importance score of various projections in Llama-3-8B across different layer indices. Each point represents a projection's score; higher scores (e.g., "down_proj") indicate that less compression should be applied, while lower scores allow for more aggressive compression.
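The two ideas in these captions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the proportional allocation rule, the linear decay schedule, the example scores, and all function and parameter names are hypothetical. The captions only establish that higher-scoring projections keep more rank and that the activated rank shrinks as more tokens are generated.

```python
def allocate_ranks(scores, full_ranks, budget):
    """Split a total rank budget across projections in proportion to
    their importance scores, capped at each projection's full rank.
    Assumption: a simple proportional rule, chosen for illustration.
    Budget left over by capping or rounding is not redistributed."""
    total = sum(scores.values())
    return {
        name: min(full_ranks[name], int(budget * score / total))
        for name, score in scores.items()
    }


def progressive_rank(base_rank, tokens_generated, max_tokens, final_fraction):
    """Linearly shrink the activated rank from base_rank down to
    final_fraction * base_rank as decoding proceeds.
    Assumption: the linear schedule is illustrative only."""
    progress = min(tokens_generated, max_tokens) / max_tokens
    fraction = 1.0 - (1.0 - final_fraction) * progress
    return max(1, int(base_rank * fraction))


# Made-up importance scores (assumption); down_proj scores highest,
# matching the qualitative trend described for Figure 2.
scores = {"q_proj": 1.0, "up_proj": 2.0, "down_proj": 5.0}
full_ranks = {"q_proj": 4096, "up_proj": 4096, "down_proj": 4096}

ranks = allocate_ranks(scores, full_ranks, budget=6000)
print(ranks)

# Activated rank of down_proj at the start, middle, and end of decoding.
for t in (0, 50, 100):
    print(t, progressive_rank(ranks["down_proj"], t, max_tokens=100, final_fraction=0.5))
```

With these made-up inputs, down_proj keeps the most rank and q_proj the least, and the activated rank for down_proj halves by the end of a 100-token generation.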