Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training

Rui Pan, Shivanshu Shekhar, Boyao Wang, Shizhe Diao, Jipeng Zhang, Xingyuan Pan, Renjie Pi, Tong Zhang

TL;DR

Adapt-Pruner tackles the high cost of training small language models by introducing a layer-wise, mapping-preserving structured pruning strategy that assigns per-layer sparsity based on input–output mapping importance. Built atop this, Adapt-Accel interleaves pruning with recovery training to achieve targeted sparsity with minimal performance loss, enabling rapid adaptation of large models into compact, capable Adapt-LLMs. Empirical results on LLaMA-3.1-8B show improved commonsense accuracy over existing pruning methods, while Adapt-Accel recovers MobileLLMs from larger counterparts at ≈200× fewer training tokens and discovers 1B variants surpassing certain baselines. The combination of pruning-driven efficiency and interleaved recovery enables flexible, cost-effective model customization with practical deployment benefits, and the released code facilitates reproducibility and broader adoption.

Abstract

Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We find that 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-trained from scratch, and 3) incremental pruning brings a non-trivial performance gain by interleaving pruning with training and removing only a small portion of neurons ($\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200$\times$ fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks. The official code is released at https://github.com/research4pan/AdaptPruner.

Paper Structure

This paper contains 37 sections, 7 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Layer sensitivity and pruned models. The first row shows the increase in perplexity when a single decoder layer is pruned at 50% sparsity, compared to the dense LLaMA-3.1-8B model, as well as to models uniformly pruned across all layers at 10% and 20% sparsity. The second row illustrates the architecture of the pruned models, with each decoder layer represented by its corresponding number of parameters.
  • Figure 2: Adapt-Pruner: measuring the distance between each decoder layer's input and output tensors to assess its importance and assigning a corresponding sparsity. Based on this assigned sparsity, the coupled weights in each decoder layer are pruned accordingly.
  • Figure 3: Adapt-Accel: Incremental pruning with interleaved training. The whole process uses $N_P$ interleaved pruning-and-training phases. Given a model with size $|\mathcal{L}_{\text{large}}|$ and target size $|\mathcal{L}_{\text{small}}|$, this leads to an incremental pruning ratio of $P = (|\mathcal{L}_{\text{small}}| / |\mathcal{L}_{\text{large}}|)^{1/N_P}$ each time (the fraction of the model retained per phase), where the training set is randomly split into $N_P$ subsets, one for each of the $N_P$ interleaved training phases. Note that the number of training samples gradually increases according to Algorithm 1, as more important weights are expected to be pruned in later phases.
  • Figure 4: Ablation study over the pruning ratio per interleaved training phase. The optimal value is $\sim 95\%$, meaning it is best to interleave SLM training after every $\sim 5\%$ of weights/neurons is removed.
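
The incremental ratio in the Figure 3 caption, $P = (|\mathcal{L}_{\text{small}}| / |\mathcal{L}_{\text{large}}|)^{1/N_P}$, can be worked through numerically. The following is a minimal sketch, not the authors' implementation: the function names and the 8B-to-1B sizes are hypothetical and chosen only for illustration, and `phases_for_ratio` simply inverts the caption's formula to find how many phases keep the per-phase ratio at or above the $\sim 95\%$ value reported in Figure 4.

```python
import math


def incremental_ratio(large_size: float, small_size: float, n_phases: int) -> float:
    """Per-phase retained fraction P = (small / large) ** (1 / N_P)."""
    if not (0 < small_size <= large_size):
        raise ValueError("require 0 < small_size <= large_size")
    if n_phases < 1:
        raise ValueError("require n_phases >= 1")
    return (small_size / large_size) ** (1.0 / n_phases)


def size_schedule(large_size: float, small_size: float, n_phases: int) -> list[float]:
    """Model size after each phase: large * P**k for k = 0..N_P."""
    p = incremental_ratio(large_size, small_size, n_phases)
    return [large_size * p**k for k in range(n_phases + 1)]


def phases_for_ratio(large_size: float, small_size: float, target_ratio: float = 0.95) -> int:
    """Smallest N_P whose per-phase ratio P is at least target_ratio.

    Derived by solving (small / large) ** (1 / N_P) >= target_ratio for N_P.
    """
    if not (0 < small_size <= large_size):
        raise ValueError("require 0 < small_size <= large_size")
    if not (0 < target_ratio < 1):
        raise ValueError("require 0 < target_ratio < 1")
    n = math.ceil(math.log(small_size / large_size) / math.log(target_ratio))
    return max(1, n)


if __name__ == "__main__":
    # Hypothetical sizes (parameter counts), for illustration only.
    large, small = 8e9, 1e9
    n_p = phases_for_ratio(large, small, target_ratio=0.95)
    p = incremental_ratio(large, small, n_p)
    print(f"N_P = {n_p}, per-phase ratio P = {p:.4f}")
    print(f"final size = {size_schedule(large, small, n_p)[-1]:.3e}")
```

Because the ratio is applied multiplicatively, each phase removes the same fraction of the current model rather than the same absolute number of parameters, so later phases remove fewer weights in absolute terms.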