
Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Xi Chen, Cunhang Fan, Zhao Lv, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui

TL;DR

Large language models pose deployment challenges due to scale. We propose Manifold-Based Knowledge Alignment and Layer Merging (MKA), which first maps per-layer activations into low-dimensional manifolds via diffusion maps and then merges similar layers by optimizing a mutual-information–driven similarity, approximated with $\alpha \approx S_{lm}$ and merged as $\tilde{\boldsymbol{\theta}}_c=\alpha\boldsymbol{\theta}_l+(1-\alpha)\boldsymbol{\theta}_m$. The method leverages the Information Bottleneck objective to preserve relevant information while compressing, yielding substantial compression with minimal accuracy loss; e.g., on MMLU with Llama3-8B, MKA achieves 43.75% compression with a 2.82% drop, and benefits further when combined with quantization (e.g., SmoothQuant, GPTQ, AWQ). Across multiple benchmarks and models, MKA outperforms traditional pruning in both compression rate and accuracy retention, offering a scalable, hardware-friendly path for deploying efficient LLMs.
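The merging rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `merge_layers`, the toy weight vectors, and the use of the raw similarity score as $\alpha$ directly are all assumptions made for clarity, following the stated approximation $\alpha \approx S_{lm}$.

```python
import numpy as np

def merge_layers(theta_l: np.ndarray, theta_m: np.ndarray,
                 similarity: float) -> np.ndarray:
    """Merge two layers' weights by similarity-weighted interpolation.

    Per the paper's approximation, the interpolation coefficient alpha
    is taken to be the pairwise similarity S_lm between the two layers:
        theta_c = alpha * theta_l + (1 - alpha) * theta_m
    """
    alpha = similarity
    return alpha * theta_l + (1.0 - alpha) * theta_m

# Toy example: two "layers" represented as small weight vectors.
theta_l = np.array([1.0, 2.0])
theta_m = np.array([3.0, 4.0])
merged = merge_layers(theta_l, theta_m, similarity=0.75)
# With alpha = 0.75: 0.75*[1, 2] + 0.25*[3, 4] = [1.5, 2.5]
```

In practice the similarity would come from the mutual-information-based measure computed on the diffusion-map embeddings of each layer's activations; here it is just a scalar passed in by hand.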

Abstract

While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs.

Paper Structure

This paper contains 35 sections, 36 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Manifold-Based Knowledge Alignment and Layer Merging (MKA) framework consists of two main components: (1) The left side illustrates manifold learning for LLM knowledge extraction, where layer activations are transformed into low-dimensional manifolds using the Diffusion Kernel algorithm. (2) The right side depicts the similarity-based layer merging process, employing the IB metric to identify layers with aligned knowledge.
  • Figure 2: Performance (Accuracy) of LLMs (Llama2-7B, Llama2-13B, Llama3-8B, Llama3.2-3B, and Mistral-7B) on the MMLU dataset as the pruning ratio of various pruning methods increases.
  • Figure 3: Similarity matrices for Llama2-7B, Llama2-13B, Llama3-8B, Llama3.2-3B, and Mistral-7B before MKA. Later layers show high similarity, supporting layer merging.
  • Figure 4: The similarity matrices of the Mixtral-8x7B and Jamba models.
  • Figure 5: Similarity matrices for various measures in the Llama3-8B model, showing different patterns and effectiveness in capturing layer relationships, with none fully matching the expected merging patterns.
  • ...and 1 more figure