PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

Bowei He, Lihao Yin, Hui-Ling Zhen, Xiaokun Zhang, Mingxuan Yuan, Chen Ma

TL;DR

PASER addresses the uneven deterioration of capabilities in pruned LLMs by coupling semantic-structural instruction clustering with capability degradation-aware data selection and a graph-based mechanism to mitigate negative tuning effects. By adaptively allocating the data budget across clusters and prioritizing samples that most impact degraded capabilities, PASER achieves near-unpruned performance using only a fraction of post-training data (4%-20%) across diverse models and pruning schemes. The framework is backed by theoretical analysis and extensive experiments spanning language modeling, reasoning, math, and code tasks, demonstrating both accuracy gains and substantial efficiency improvements. This approach offers a practical, scalable route to robust post-pruning recovery and broad applicability to future compression techniques.

Abstract

Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may also introduce negative effects to model capacity recovery. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify instructions to recover the most compromised model capacities with a certain data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. Then, the data budget is adaptively allocated across clusters by the degree of corresponding model capability degradation. In each cluster, we prioritize data samples that lead to the most decline of model performance. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4%-20% of the original post-training data. We provide the anonymous code repository at https://anonymous.4open.science/r/PASER-E606.
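
The budget-allocation step described in the abstract can be illustrated with a minimal sketch. The exact allocation rule is not given in this summary, so the code below assumes a simple proportional split with largest-remainder rounding; the function name `allocate_budget`, the cluster names, and the degradation scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def allocate_budget(degradation: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` samples across clusters in proportion to degradation.

    Assumption (not from the paper): allocation is linear in the score,
    negative scores count as zero, and rounding uses largest remainders.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not degradation:
        raise ValueError("at least one cluster is required")

    # Treat "no degradation" and "improved after pruning" the same way.
    scores = {name: max(score, 0.0) for name, score in degradation.items()}
    total = sum(scores.values())
    if total == 0.0:
        # Nothing degraded: fall back to a uniform split.
        scores = {name: 1.0 for name in scores}
        total = float(len(scores))

    # Ideal (fractional) share per cluster, then round down.
    raw = {name: budget * score / total for name, score in scores.items()}
    alloc = {name: int(share) for name, share in raw.items()}

    # Hand the leftover samples to the clusters with the largest remainders.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical per-cluster degradation scores.
    print(allocate_budget({"math": 4.0, "code": 3.0, "reasoning": 1.0}, budget=16))
```

Under this assumed rule, clusters whose capability degraded most receive the largest share of the budget, and the shares always sum to the budget exactly.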

Paper Structure

This paper contains 52 sections, 2 theorems, 29 equations, 6 figures, 18 tables, 1 algorithm.

Key Result

Theorem 1

The overall time complexity of PASER is $O(N\log N + NC^2)$, where $N$ is the number of instructions in $D$, and $C$ is the maximum number of concepts in any instruction tuning sample.
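
As a rough reading of this bound, the two terms can be compared directly. The comparison below ignores constant factors and assumes a base-2 logarithm; big-O notation fixes neither, and $N = 52{,}000$ is a hypothetical dataset size chosen only for illustration.

```latex
\[
  \frac{N C^{2}}{N \log_{2} N} = \frac{C^{2}}{\log_{2} N},
  \qquad
  N = 52{,}000 \;\Rightarrow\; \log_{2} N \approx 15.7,
  \quad\text{so } C \ge 4 \;(C^{2} = 16) \text{ gives } N C^{2} > N \log_{2} N .
\]
```

Under these assumptions, the $NC^2$ term dominates once instructions carry more than a few concepts each, while the $N\log N$ term matters mainly when $C$ is very small.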

Figures (6)

  • Figure 1: Visualization for our proposed PASER recovery post-training data selection framework.
  • Figure 2: Average reasoning performance and recovery post-training time consumption curves for different instruction tuning data selection methods. The two left subfigures are for Alpaca and the two right subfigures are for LaMini.
  • Figure 3: Average performance on seven common LLM reasoning evaluation tasks after recovery post-training with different data. The numbers in brackets give the group index of the data subset within the full dataset. "Unpruned" denotes the original model and "w/o Training" denotes the pruned model (using LLM-Pruner, Ma et al., 2023) without recovery post-training.
  • Figure 4: Normalized performance degradation (%) on four capabilities under four LLM pruning settings.
  • Figure 5: (a) Sensitivity to the embedding dimension $d$ after manifold learning, with the clustering results under $d=16$ taken as the reference; (b) Robustness to the temperature parameter $\tau$ in the probability equation (label equ:probability).
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1: Concept Consistency Graph
  • Theorem 1
  • Theorem 2: PASER Error Bound