CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning
Fanxu Meng, Pingzhi Tang, Fan Jiang, Muhan Zhang
TL;DR
CLOVER targets memory-bound inference in decoder-only transformers. Within each attention head, it merges the Q-K and V-O weight pairs across layers and factorizes each merged matrix with SVD, yielding orthogonal bases and a small set of singular values that either guide pruning or serve as the trainable parameters for parameter-efficient fine-tuning. Freezing the orthogonal bases and tuning only the singular values produces a full-rank update with a parameter footprint similar to LoRA. Empirically, CLOVER sustains large pruning ratios with minimal performance loss and, when used for fine-tuning, outperforms state-of-the-art PEFT methods across multiple LLaMA variants on commonsense tasks; it also prunes Whisper and other models robustly, with visual evidence of reduced redundancy in attention heads. In practice, the method reduces the memory and compute cost of large KV caches and enables fine-tuning with limited resources. Acknowledged limitations concern nonlinear position encodings and compatibility with certain architectures.
Abstract
Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this issue, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel approach that treats pairs of attention layers as a set of low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the \( Q \)-\( K \) and \( V \)-\( O \) pairs within each attention head. The resulting singular values can either guide pruning or serve as trainable parameters for efficient fine-tuning of all orthogonal vectors. After pruning or fine-tuning, these values are reintegrated into the model without increasing its parameter count. We apply CLOVER to various models, including GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision. Our results demonstrate that CLOVER significantly improves pruning efficiency. For instance, the perplexity after pruning 70\% of the \( Q \)-\( K \) pairs in GPT-2 XL is similar to that after pruning just 8\% with vanilla methods. Fine-tuning the singular values, in turn, yields a full-rank update that outperforms state-of-the-art methods (LoRA, DoRA, HiRA, and PiSSA) by 7.6\%, 5.5\%, 3.8\%, and 0.7\%, respectively, on eight commonsense tasks for LLaMA-2 7B.
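To make the per-head decomposition described in the abstract concrete, here is a minimal NumPy sketch for a single Q-K pair. It is an illustration under stated assumptions, not the authors' implementation: the function names (`orthogonalize_pair`, `prune_pair`, `vanilla_prune`), the toy dimensions, the random weights, and the norm-product baseline standing in for "vanilla" pruning are all hypothetical. It also assumes nothing nonlinear (such as a position encoding) sits between the two projections, so that they can be merged into one matrix.

```python
import numpy as np


def orthogonalize_pair(w_a, w_b):
    """Factor the merged per-head matrix w_a @ w_b.T as U diag(s) V^T.

    w_a, w_b: (D, d) projections of one head (e.g. Q and K).
    Returns U (D, d), s (d,), V (D, d); U and V have orthonormal columns.
    """
    d = w_a.shape[1]
    merged = w_a @ w_b.T                      # (D, D), rank at most d
    u, s, vt = np.linalg.svd(merged, full_matrices=False)
    return u[:, :d], s[:d], vt[:d].T


def prune_pair(u, s, v, keep):
    """Keep the `keep` largest singular directions and fold s into one factor."""
    return u[:, :keep] * s[:keep], v[:, :keep]


def vanilla_prune(w_a, w_b, keep):
    """Assumed baseline: keep the column pairs with the largest norm product."""
    score = np.linalg.norm(w_a, axis=0) * np.linalg.norm(w_b, axis=0)
    idx = np.argsort(score)[::-1][:keep]
    return w_a[:, idx], w_b[:, idx]


rng = np.random.default_rng(0)
D, d, keep = 64, 16, 8                        # toy sizes, not taken from the paper
w_q = rng.standard_normal((D, d))
w_k = rng.standard_normal((D, d))
merged = w_q @ w_k.T

u, s, v = orthogonalize_pair(w_q, w_k)
q_svd, k_svd = prune_pair(u, s, v, keep)
q_van, k_van = vanilla_prune(w_q, w_k, keep)

# Frobenius-norm error of each pruned pair against the unpruned merged matrix.
err_svd = np.linalg.norm(merged - q_svd @ k_svd.T)
err_van = np.linalg.norm(merged - q_van @ k_van.T)
print(f"SVD-guided pruning error: {err_svd:.3f}")
print(f"vanilla pruning error:    {err_van:.3f}")
```

Because truncated SVD is the best rank-`keep` approximation in Frobenius norm (Eckart-Young), the SVD-guided error can never exceed that of any baseline that keeps `keep` of the original column pairs. The same factors suggest how fine-tuning could work under this sketch: hold `u` and `v` fixed, train only `s`, then fold the result back with `prune_pair(u, s_new, v, d)`, which leaves the two projections at their original `(D, d)` shapes.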
