Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs

Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Angelica I Aviles-Rivero, Chuanlong Xie, Yao Zhu

TL;DR

Explores depth-wise pruning in large language models and reveals patch-like redundancy between consecutive Transformer layers, measured with Centered Kernel Alignment (CKA) in a reproducing kernel Hilbert space (RKHS). Proposes Sliding-Window Merging (SWM), a dynamic layer-merging method guided by representational similarity and parameter consolidation, followed by a quick recovery via LoRA. Empirically, SWM outperforms state-of-the-art pruning baselines in zero-shot tasks and inference throughput across multiple models, and can be effectively combined with width pruning for further gains. The work provides a practical framework for efficient deployment of large language models in resource-constrained environments.

Abstract

Depth-wise pruning accelerates LLM inference in resource-constrained scenarios but suffers from performance degradation due to the direct removal of entire Transformer layers. This paper reveals "patch-like" redundancy across layers via correlation analysis of the outputs of different layers in a reproducing kernel Hilbert space, demonstrating that consecutive layers exhibit high functional similarity. Building on this observation, the paper proposes Sliding-Window Merging (SWM), a dynamic compression method that selects consecutive layers from top to bottom using a pre-defined similarity threshold and compacts patch-redundant layers through parameter consolidation, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that the method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, in the experiment with 35% pruning on the Vicuna-7B model, the method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, the paper further reveals the potential of combining depth pruning with width pruning to enhance the pruning effect. The code is available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
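
The layer-similarity analysis mentioned in the abstract can be illustrated with a small CKA computation. This is a minimal sketch, not the authors' implementation: it assumes each layer's output has been collected as an `(n_samples, hidden_dim)` NumPy array, and the helper names (`center_gram`, `rbf_gram`, `cka`) and the median-heuristic bandwidth are illustrative choices that do not come from the paper.

```python
import numpy as np

def center_gram(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def rbf_gram(X, sigma=None):
    """RBF-kernel Gram matrix of the rows of X (an assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if sigma is None:
        # Median heuristic for the bandwidth; an assumption, not from the paper.
        sigma = np.sqrt(np.median(d2[d2 > 0]))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cka(K, L):
    """CKA between two Gram matrices: normalized HSIC of the centered kernels."""
    Kc, Lc = center_gram(K), center_gram(L)
    hsic = np.sum(Kc * Lc)
    return hsic / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))

# Hypothetical activations standing in for the outputs of two layers.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(50, 8))
layer_b = layer_a + 0.1 * rng.normal(size=(50, 8))  # a nearly redundant layer
similarity = cka(rbf_gram(layer_a), rbf_gram(layer_b))
```

Computing `cka` for every pair of layers gives a similarity matrix; blocks of high values along its diagonal are what the paper describes as patch-like redundancy.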

Paper Structure

This paper contains 20 sections, 1 equation, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: CKA (Centered Kernel Alignment) metric between pairs of Transformer layers in LLMs.
  • Figure 2: The framework of our sliding-window merging (SWM) method. (a) Model initialization establishes upper / lower bounds for the sliding window. (b) Layer merging via parameter consolidation: $\theta^*_l$ denotes merged parameters combining adjacent layers within the window. (c) Similarity validation computes cosine similarity between original and compressed models' outputs using few-shot evaluation. (d) Adaptive window adjustment: the window slides downward if similarity meets thresholds (Status 1), otherwise the compressed model updates and window resets (Status 2). Color coding: gray blocks (original LLM layers), red-orange window (active merging region), green / blue markers (current upper / lower bounds).
  • Figure 3: Performance of the integrated method on LLaMA2-7B. The horizontal axis shows the depth pruning ratio, and the vertical axis indicates zero-shot task performance. Dotted lines denote individual task metrics (Winogrande, HellaSwag, ARC-easy), while the solid line shows the average across all seven tasks.
  • Figure 4: Performance with/without LoRA retraining. The blue column shows performance before LoRA fine-tuning, and the orange column after.
  • Figure 5: The impact of layer merging methods on the LLaMA2-7B model. (a) Layer counts of the pruned model under different similarity thresholds for three layer merging methods: Delete (blue circles), Average (green squares), Ours (orange diamonds). (b) Zero-shot performance of the pruned model on the HellaSwag dataset for the same three methods: Delete (blue bars), Average (green bars), Ours (orange bars).
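
The control flow described in the Figure 2 caption (initialize bounds, merge the window, validate by cosine similarity, then slide or reset) can be sketched on a toy model. This is only an illustration under stated assumptions: layers are small residual maps `h + tanh(h @ W)` rather than Transformer blocks, `sum_merge` is a stand-in merge rule and not the paper's parameter-consolidation formula for $\theta^*_l$, and all function names are hypothetical.

```python
import numpy as np

def forward(layers, x):
    """Run a toy residual stack: each layer maps h to h + tanh(h @ W)."""
    h = x
    for W in layers:
        h = h + np.tanh(h @ W)
    return h

def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two output matrices."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

def sum_merge(window):
    """Placeholder merge rule (assumption): add the residual weights."""
    return np.sum(window, axis=0)

def replace_window(layers, low, high, merge_fn):
    """Return a copy of layers with layers[low..high] merged into one layer."""
    if high <= low:
        return list(layers)  # window holds at most one layer: nothing to merge
    return layers[:low] + [merge_fn(layers[low:high + 1])] + layers[high + 1:]

def sliding_window_merge(layers, probe, threshold, merge_fn=sum_merge):
    reference = forward(layers, probe)   # outputs of the original model
    current = list(layers)
    high = len(current) - 1              # (a) upper bound starts at the top layer
    low = high - 1                       # (a) lower bound is the layer beneath it
    while low >= 0:
        candidate = replace_window(current, low, high, merge_fn)       # (b)
        sim = mean_cosine(reference, forward(candidate, probe))        # (c)
        if sim >= threshold:
            low -= 1                     # (d) Status 1: extend window downward
        else:
            # (d) Status 2: commit the last accepted window, then reset bounds.
            current = replace_window(current, low + 1, high, merge_fn)
            high = low
            low = high - 1
    return replace_window(current, low + 1, high, merge_fn)

# Hypothetical toy data: six small residual layers and a probe batch.
rng = np.random.default_rng(0)
toy_layers = [0.05 * rng.normal(size=(4, 4)) for _ in range(6)]
toy_probe = rng.normal(size=(16, 4))
compressed = sliding_window_merge(toy_layers, toy_probe, threshold=0.99)
```

Lowering `threshold` lets more layers collapse into each window and raising it preserves more of the original depth, which mirrors the layer-count versus threshold trade-off plotted in Figure 5(a).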