CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning

Fanxu Meng, Pingzhi Tang, Fan Jiang, Muhan Zhang

TL;DR

CLOVER targets memory-bound inference in decoder-only transformers. Within each attention head, it merges the Q-K and V-O weight pairs and orthogonalizes them across layers with SVD, yielding a small set of singular values that either guide pruning or serve as the trainable parameters for fine-tuning. Because the orthogonal bases stay frozen and only the singular values are tuned, CLOVER achieves full-rank updates with a parameter footprint similar to LoRA. Empirically, CLOVER supports large pruning ratios with minimal performance loss and, when used for full-rank fine-tuning, outperforms state-of-the-art PEFT methods across multiple LLaMA variants on commonsense tasks. It also prunes Whisper and other models robustly, with visualizations showing reduced redundancy in attention heads. In practice, the method reduces the memory and compute cost of large KV caches and enables fine-tuning with limited resources. Its acknowledged limitations concern nonlinear position encodings and compatibility with certain architectures.

Abstract

Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this issue, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel approach that treats pairs of attention layers as a set of low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the \( Q \)-\( K \) and \( V \)-\( O \) pairs within each attention head. The resulting singular values can either guide pruning or serve as trainable parameters for efficient fine-tuning of all orthogonal vectors. After pruning or fine-tuning, these values are reintegrated into the model without increasing its parameter count. We apply CLOVER to various models, including GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision. Our results demonstrate that CLOVER significantly improves pruning efficiency. For instance, the perplexity of pruning 70% of the \( Q \)-\( K \) pairs in GPT-2 XL is similar to that of pruning just 8% with vanilla methods. Fine-tuning the singular values further results in a full-rank update, outperforming state-of-the-art methods (LoRA, DoRA, HiRA, and PiSSA) by 7.6%, 5.5%, 3.8%, and 0.7%, respectively, on eight commonsense tasks for LLaMA-2 7B.
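The per-head decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sizes `D`, `d`, `r`, the random weights `W_q` and `W_k`, and the choice to fold the singular values into the query side are all assumptions made for illustration, and position encodings are ignored entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 64, 8   # hypothetical model width and per-head dimension
r = 4          # hypothetical number of directions kept after pruning

# Hypothetical query/key projection weights for a single attention head.
W_q = rng.standard_normal((D, d))
W_k = rng.standard_normal((D, d))

# Treat the Q-K pair as one low-rank matrix (rank <= d): the attention
# logit between tokens x_i and x_j is x_i @ W_q @ W_k.T @ x_j.
W_qk = W_q @ W_k.T

# SVD gives orthonormal bases U, Vt and singular values S; only the
# first d singular values can be nonzero.
U, S, Vt = np.linalg.svd(W_qk, full_matrices=False)
U, S, Vt = U[:, :d], S[:d], Vt[:d, :]

# Re-initialize the head from the orthogonal vectors. Folding S into the
# query side is an assumption of this sketch.
W_q_new = U * S        # shape (D, d)
W_k_new = Vt.T         # shape (D, d)
full_err = np.linalg.norm(W_qk - W_q_new @ W_k_new.T)

# Prune by keeping the r directions with the largest singular values.
W_q_pruned = W_q_new[:, :r]
W_k_pruned = W_k_new[:, :r]
pruned_err = np.linalg.norm(W_qk - W_q_pruned @ W_k_pruned.T)

print("reconstruction error (no pruning):", full_err)
print("reconstruction error (keep r):", pruned_err)
```

In this toy setting the pruning error equals the root sum of squares of the discarded singular values, which is why small singular values are natural candidates for removal. The V-O pair could be handled the same way under the same assumptions.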

Paper Structure

This paper contains 25 sections, 9 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (a) We treat the Query-Key and Value-Output layers within a single attention head as a unified structure. (b) We apply SVD to obtain two sets of singular vectors for initializing the Q-K and V-O layers, along with singular values that guide pruning or enable efficient full-rank fine-tuning. (c) This cross-layer orthogonalization strategy allows for higher pruning rates. (d) The pruned model maintains strong performance after fine-tuning.
  • Figure 2: CLOVER (orange) uses fewer orthogonal basis vectors than Vanilla Pruning (blue) to span the attention head space. The first row shows the importance of Q-K dimensions, and the second row shows V-O dimensions. After the red dot, CLOVER's importance is lower, and pruning these vectors results in less performance loss.
  • Figure 3: An audio waveform from the LibriSpeech dataset.
  • Figure 4: Proportion of data projections across different components in random directions (LoRA) versus orthogonal directions (PiSSA), as well as all orthogonal directions (CLOVER).
  • Figure 5: $\Delta W$ is low rank in LoRA, while full rank for Full-Fine-Tuning and CLOVER.
  • ...and 3 more figures
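
The rank contrast in the Figure 5 caption can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the paper's procedure: the sizes `D`, `H`, `r` are hypothetical, the weights are random, and the CLOVER-style update is assumed to take the form of per-head frozen orthonormal bases scaled by a change in singular values, concatenated across heads.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 32, 4    # hypothetical model width and number of heads
d = D // H      # per-head dimension
r = 2           # hypothetical LoRA rank

# LoRA-style update: a product of two thin matrices, so rank <= r.
B = rng.standard_normal((D, r))
A = rng.standard_normal((r, D))
delta_lora = B @ A

# Assumed CLOVER-style update: for each head, keep the orthonormal basis
# U_h frozen and change only the d singular values.
blocks = []
for _ in range(H):
    W_q = rng.standard_normal((D, d))
    W_k = rng.standard_normal((D, d))
    U, S, Vt = np.linalg.svd(W_q @ W_k.T, full_matrices=False)
    U_h = U[:, :d]
    # Hypothetical nonzero change to each singular value.
    delta_S = 0.01 * rng.uniform(0.5, 1.5, size=d)
    blocks.append(U_h * delta_S)   # (D, d) block for this head

# Concatenating the heads gives the update to the whole D x D projection.
delta_clover = np.concatenate(blocks, axis=1)

rank_lora = np.linalg.matrix_rank(delta_lora)
rank_clover = np.linalg.matrix_rank(delta_clover)
print("LoRA update rank:", rank_lora)
print("CLOVER-style update rank:", rank_clover)
```

Each head contributes d independent directions, so with generic weights the concatenated update spans all D dimensions, whereas the LoRA update is confined to an r-dimensional subspace.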