KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models

Bo Lv, Quan Zhou, Xuanang Ding, Yan Wang, Zeming Ma

TL;DR

This paper proposes KVPruner, a method to improve model efficiency while maintaining performance by using global perplexity-based analysis to determine the importance ratio for each block and providing multiple strategies to prune non-essential KV channels within blocks.

Abstract

The key-value (KV) cache is a significant bottleneck during large language model inference. Depth pruning accelerates inference but requires extensive recovery training, which can take up to two weeks. Width pruning, on the other hand, retains much of the performance but offers only slight speed gains. To tackle these challenges, we propose KVPruner to improve model efficiency while maintaining performance. Our method uses global perplexity-based analysis to determine the importance ratio for each block and provides multiple strategies to prune non-essential KV channels within blocks. Compared to the original model, KVPruner reduces runtime memory usage by 50% and boosts throughput by over 35%. Additionally, our method requires only two hours of LoRA fine-tuning on small datasets to recover most of the performance.
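
To make the KV cache bottleneck concrete, the following minimal sketch estimates cache size from the number of KV channels kept in each block. The shape values (32 blocks, 4096 KV channels per block, fp16) correspond to the standard LLaMA-7B configuration; the function name `kv_cache_bytes` and the uniform 50% channel reduction are assumptions made for illustration only, since the method described above assigns a separate ratio to each block.

```python
def kv_cache_bytes(kv_channels_per_block, seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values across all blocks.

    kv_channels_per_block: number of retained KV channels (heads * head_dim)
    in each transformer block. Hypothetical helper, not from the paper.
    """
    # The factor 2 accounts for storing both K and V per token.
    return sum(
        2 * channels * seq_len * batch_size * bytes_per_elem
        for channels in kv_channels_per_block
    )


# LLaMA-7B shape: 32 blocks, 32 heads x 128 dims = 4096 KV channels each, fp16.
dense = kv_cache_bytes([4096] * 32, seq_len=2048)

# Illustrative assumption: every block keeps half of its KV channels.
pruned = kv_cache_bytes([2048] * 32, seq_len=2048)

print(f"dense:  {dense / 2**20:.0f} MiB")
print(f"pruned: {pruned / 2**20:.0f} MiB")
```

Because the cache grows linearly with the retained channel count, removing KV channels shrinks it in direct proportion, for every cached token and every sequence in the batch.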

Paper Structure (15 sections, 6 equations, 3 figures, 4 tables)

This paper contains 15 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Inference results of the pruned LLaMA-7B model on an NVIDIA A100 GPU, measuring computation latency and throughput with caching disabled. Left: the top panel compares perplexity (PPL) across different strategies under the same pruning ratio and fine-tuning steps, where our method demonstrates superior performance. The bottom panel shows key-value (KV) cache usage, where our approach achieves more significant KV memory pruning at both strategy-level and model parameter-level pruning ratios. Right: under the same model parameter settings, KVPruner achieves faster inference than the Shortened-LLM [kim2024shortened] and LLM-Pruner [ma2023llm] pruning methods.
  • Figure 2: Simplified workflow of KVPruner in LLMs. The pruning process consists of two main steps. First, global sensitivity analysis assigns the optimal pruning ratio to each block. Second, local channel sensitivity analysis aggregates the importance of the Q, K, V, and O channels and removes the less important ones. After these steps, LoRA is applied to quickly recover performance.
  • Figure 3: The first three graphs show the range of Q, K, and V weight changes in a specific block. The fourth graph, estimated on the evaluation set, shows how sensitivity changes after the KV is removed from the block.
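
The two-step workflow in the Figure 2 caption can be sketched roughly as follows. This is not the authors' implementation. The allocation rule (pruning ratio inversely proportional to a block's perplexity increase), the weight-norm importance score, and the function names `allocate_block_ratios` and `select_kv_channels` are all assumptions chosen for illustration. The sketch also scores Q, K, V, and O on a shared channel index, which simplifies away constraints a real attention layer would impose.

```python
import numpy as np


def allocate_block_ratios(ppl_increase, target_ratio, max_ratio=0.9):
    """Step 1 (assumed rule): less sensitive blocks get larger pruning ratios.

    ppl_increase: per-block perplexity increase measured when that block's
    KV is removed. The mean ratio equals target_ratio unless clipping applies.
    """
    sensitivity = np.asarray(ppl_increase, dtype=float)
    inverse = 1.0 / (sensitivity + 1e-8)
    ratios = target_ratio * inverse / inverse.mean()
    return np.clip(ratios, 0.0, max_ratio)


def select_kv_channels(wq, wk, wv, wo, ratio):
    """Step 2 (assumed score): keep the channels with the largest summed norms.

    wq, wk, wv have shape (channels, hidden); wo has shape (hidden, channels).
    Returns the sorted indices of the channels to keep.
    """
    score = (
        np.linalg.norm(wq, axis=1)
        + np.linalg.norm(wk, axis=1)
        + np.linalg.norm(wv, axis=1)
        + np.linalg.norm(wo, axis=0)
    )
    n_keep = max(1, int(round(score.size * (1.0 - ratio))))
    keep = np.argsort(score)[-n_keep:]
    return np.sort(keep)


# Toy example: three blocks with made-up sensitivities, 30% average pruning.
ratios = allocate_block_ratios([1.0, 2.0, 4.0], target_ratio=0.3)

rng = np.random.default_rng(0)
wq, wk, wv = (rng.normal(size=(8, 16)) for _ in range(3))
wo = rng.normal(size=(16, 8))
keep = select_kv_channels(wq, wk, wv, wo, ratio=0.5)
```

After selection, the kept indices would be used to slice the rows of the Q, K, and V projections and the columns of the O projection, followed by the LoRA recovery step the caption mentions.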