Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs
Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Angelica I Aviles-Rivero, Chuanlong Xie, Yao Zhu
TL;DR
Explores depth-wise pruning in large language models and reveals patch-like redundancy between consecutive Transformer layers using CKA in a reproducing kernel Hilbert space (RKHS). Proposes Sliding-Window Merging (SWM), a dynamic layer-merging method guided by representational similarity and parameter consolidation, followed by a quick recovery via LoRA. Empirically, SWM outperforms state-of-the-art pruning baselines on zero-shot tasks and in inference throughput across multiple models, and can be effectively combined with width pruning for further gains. The work provides a practical framework for efficient deployment of large language models in resource-constrained environments.
Abstract
Depth-wise pruning accelerates LLM inference in resource-constrained scenarios but suffers from performance degradation because it removes entire Transformer layers directly. This paper reveals "patch-like" redundancy across layers via correlation analysis of the outputs of different layers in a reproducing kernel Hilbert space, demonstrating that consecutive layers exhibit high functional similarity. Building on this observation, this paper proposes Sliding-Window Merging (SWM), a dynamic compression method that selects consecutive layers from top to bottom using a pre-defined similarity threshold and compacts patch-redundant layers through parameter consolidation, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, with 35% pruning on the Vicuna-7B model, our method achieves a 1.654% improvement in average zero-shot task performance over the existing method. Moreover, we reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our code is available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
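To make the top-to-bottom selection concrete, the following is a minimal sketch and not the authors' implementation (which is in the repository linked above). It rests on several assumptions that the abstract does not specify: similarity is measured with linear-kernel CKA rather than a general RKHS kernel, a window keeps growing downward while the candidate layer's output stays above the threshold relative to the window's top layer, and consolidation is plain element-wise averaging, swapped in as a placeholder. The names `linear_cka`, `sliding_window_groups`, and `consolidate` are hypothetical.

```python
import math


def _centered(rows):
    """Subtract each feature's mean across samples."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def _cross_norm_sq(x, y):
    """Squared Frobenius norm of x^T y for (samples x features) nested lists."""
    total = 0.0
    for x_col in zip(*x):
        for y_col in zip(*y):
            total += sum(a * b for a, b in zip(x_col, y_col)) ** 2
    return total


def linear_cka(x, y):
    """Linear-kernel CKA between two activation matrices (assumed kernel choice)."""
    x, y = _centered(x), _centered(y)
    return _cross_norm_sq(x, y) / math.sqrt(_cross_norm_sq(x, x) * _cross_norm_sq(y, y))


def sliding_window_groups(layer_outputs, threshold):
    """Group consecutive layers, scanning from the top (last) layer downward.

    Returns inclusive (lowest, highest) layer-index pairs, top group first.
    The growth rule (compare against the window's top layer) is an assumption.
    """
    groups = []
    high = len(layer_outputs) - 1
    while high >= 0:
        low = high
        while low > 0 and linear_cka(layer_outputs[low - 1], layer_outputs[high]) >= threshold:
            low -= 1
        groups.append((low, high))
        high = low - 1
    return groups


def consolidate(layer_weights, low, high):
    """Placeholder consolidation: element-wise mean of the window's flat weight lists."""
    window = layer_weights[low:high + 1]
    return [sum(vals) / len(window) for vals in zip(*window)]
```

In this sketch, `layer_outputs[i]` holds the output of layer `i` on a shared calibration batch as a samples-by-features nested list, and each group returned by `sliding_window_groups` would be collapsed into one layer by `consolidate`. A higher threshold yields smaller windows and therefore less compression.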
