LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models

Guangyan Li, Yongqiang Tang, Wensheng Zhang

TL;DR

LoRAP compresses large language models by treating the two Transformer sub-layers differently: it exploits the low-rank structure of multi-head self-attention with Activation Weighted SVD (AWSVD), applies gradient-free structured channel pruning to the FFN, and pairs both with LoRA-based knowledge recovery. The method is validated on LLaMA-1/2 and Vicuna models at the 7B–13B scale, where it outperforms prior structured approaches at multiple compression ratios and yields notable reductions in parameters, MACs, and latency. Key contributions include the discovery of sub-layer rank disparities, the AWSVD technique, a gradient-free FFN pruning strategy that retains the least important weights, and a LoRA-based recovery pipeline. The results highlight the practical potential of differentiated structured compression for efficient, task-agnostic deployment of large language models.

Abstract

Large language models (LLMs) show excellent performance on difficult tasks, but they often require massive memory and computational resources. How to reduce the parameter scale of LLMs has become a research hotspot. In this study, we make the important observation that the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a noticeable low-rank structure, while the feed-forward network (FFN) sub-layer does not. In this regard, we design a mixed compression model that organically combines Low-Rank matrix approximation And structured Pruning (LoRAP). For the MHA sub-layer, we propose an input activation weighted singular value decomposition method to strengthen the low-rank characteristic. Furthermore, we discover that the weight matrices in the MHA sub-layer have different low-rank degrees. Thus, we devise a novel parameter allocation scheme according to the discrepancy in low-rank degrees. For the FFN sub-layer, we propose a gradient-free structured channel pruning method. During pruning, we make the interesting finding that the least important 1% of parameters actually play a vital role in model performance. Extensive evaluations on zero-shot perplexity and zero-shot task classification indicate that our proposal is superior to previous structured compression rivals under multiple compression ratios.
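
To make the idea of an input activation weighted SVD concrete, here is a minimal NumPy sketch. The abstract does not give the exact weighting, so the per-input-channel L2 norm of calibration activations used below is an assumption, and the function name `activation_weighted_svd` and its parameters are hypothetical. The sketch scales each input column of the weight matrix by its activation norm, truncates the SVD of the scaled matrix, and then undoes the scaling on the right factor.

```python
import numpy as np


def activation_weighted_svd(W, X, rank, eps=1e-6):
    """Rank-`rank` factorization W ~= A @ B, weighted by input activations.

    W: (d_out, d_in) weight matrix.
    X: (n_tokens, d_in) calibration activations fed into W.
    The per-channel L2 norm used as the weight is an assumption of this sketch.
    """
    # One positive scale per input channel; eps avoids division by zero.
    s = np.linalg.norm(X, axis=0) + eps              # shape (d_in,)

    # W * s broadcasts over columns, i.e. it equals W @ diag(s).
    U, sigma, Vt = np.linalg.svd(W * s, full_matrices=False)

    # Keep the top-`rank` components and fold the singular values into A.
    A = U[:, :rank] * sigma[:rank]                   # (d_out, rank)
    # Undo the column scaling on the right factor: Vt_r @ diag(1 / s).
    B = Vt[:rank] / s                                # (rank, d_in)
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 6))
    # Input channels with very different activation magnitudes.
    X = rng.standard_normal((32, 6)) * np.array([10.0, 5.0, 1.0, 1.0, 0.1, 0.1])
    A, B = activation_weighted_svd(W, X, rank=2)
    print("factor shapes:", A.shape, B.shape)
    print("parameters before/after:", W.size, A.size + B.size)
```

Storing the two factors costs `rank * (d_out + d_in)` parameters instead of `d_out * d_in`, so the factorization only saves memory when the rank is small enough. By the Eckart–Young theorem, the truncation above is optimal for the column-weighted Frobenius error, which is why channels with large activations are approximated more accurately than under a plain SVD.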

Paper Structure (26 sections, 15 equations, 6 figures, 17 tables, 1 algorithm)

Figures (6)

  • Figure 1: The compression of the transformer layer. For the FFN sub-layer, we prune the neurons in the intermediate layer. For the MHA sub-layer, we employ weighted SVD to obtain two low-rank matrices as an approximation to the original matrix.
  • Figure 2: Visualization of the $\mathbf{W}_{q}$, $\mathbf{W}_{k}$, $\mathbf{W}_{v}$, and $\mathbf{W}_{o}$ matrices in the first MHA sub-layer at 50% sparsity. Black areas represent pruned weights, while white areas indicate retained weights.
  • Figure 3: The three images on the left are visualizations of the $\mathbf{W}_{down}$, $\mathbf{W}_{gate}$, and $\mathbf{W}_{up}$ matrices (from left to right). The three images on the right are areas of size $800\times420$.
  • Figure 4: Different proportions are reserved for different weight matrices in the MHA sub-layer. The parameters are increased by 0.5 each time from left to right. The perplexity (PPL) of the resulting model on WikiText2 (left) and PTB (right) is shown.
  • Figure 5: Evaluation results of the compressed model on WikiText2 and PTB as the quantity and length of the calibration data increase.
  • ...and 1 more figure
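
The Figure 1 caption describes pruning neurons in the FFN intermediate layer. The sketch below illustrates what such structured channel pruning could look like for a gated FFN with `W_gate`, `W_up`, and `W_down` matrices. The importance score (intermediate activation norm times the norm of the matching `W_down` column) is an assumption made for illustration and is not taken from the paper; the function names are hypothetical, and the retention of the least important 1% of parameters mentioned in the abstract is not modeled.

```python
import numpy as np


def silu(z):
    """SiLU activation, z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def ffn(x, W_gate, W_up, W_down):
    """Gated FFN: down(silu(gate(x)) * up(x)), weights stored as (out, in)."""
    return (silu(x @ W_gate.T) * (x @ W_up.T)) @ W_down.T


def prune_ffn_channels(W_gate, W_up, W_down, X, keep_ratio):
    """Drop whole intermediate neurons without using gradients.

    W_gate, W_up: (d_ff, d_model); W_down: (d_model, d_ff).
    X: (n_tokens, d_model) calibration inputs.
    The score below is an assumption of this sketch, not the paper's metric.
    """
    # Intermediate activations on the calibration data, shape (n_tokens, d_ff).
    H = silu(X @ W_gate.T) * (X @ W_up.T)
    # One score per intermediate neuron.
    score = np.linalg.norm(H, axis=0) * np.linalg.norm(W_down, axis=0)

    k = max(1, int(round(keep_ratio * score.size)))
    keep = np.sort(np.argsort(score)[-k:])   # indices of the k highest scores

    # Removing neuron j deletes row j of W_gate and W_up and column j of W_down.
    return W_gate[keep], W_up[keep], W_down[:, keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff = 4, 8
    W_gate = rng.standard_normal((d_ff, d_model))
    W_up = rng.standard_normal((d_ff, d_model))
    W_down = rng.standard_normal((d_model, d_ff))
    X = rng.standard_normal((16, d_model))
    Wg, Wu, Wd, keep = prune_ffn_channels(W_gate, W_up, W_down, X, keep_ratio=0.5)
    print("kept neurons:", keep)
    print("pruned shapes:", Wg.shape, Wu.shape, Wd.shape)
```

Because entire rows and columns are removed, the pruned matrices stay dense and smaller, so the savings carry over to parameters and MACs without sparse kernels.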