All-in-One Tuning and Structural Pruning for Domain-Specific LLMs
Lei Lu, Zhepeng Wang, Runxue Bao, Mengbing Wang, Fangyi Li, Yawen Wu, Weiwen Jiang, Jie Xu, Yanzhi Wang, Shangqian Gao
TL;DR
The paper addresses a weakness of two-stage pruning for domain-specific LLMs: pruning decisions are fixed before fine-tuning and are never revisited as the weights change. It introduces ATP, a one-stage framework that searches for pruning patterns with a trainable pruning-decision generator while performing LoRA-based tuning, using a LoRA-aware forward pass and sparsity regularization so that pruned structures can be removed directly after training. On healthcare and legal tasks, ATP outperforms state-of-the-art two-stage pruning methods, recovering up to 88% and 91% of dense-model performance when pruning 40% of the parameters of LLaMA2-7B and LLaMA3-8B, respectively. The approach offers a practical path to deploying domain-specific LLMs with substantially fewer parameters when domain data is limited.
Abstract
Existing pruning techniques for large language models (LLMs) targeting domain-specific applications typically follow a two-stage process: pruning a pretrained general-purpose LLM and then fine-tuning the pruned model on the target domain. However, the pruning decisions, which are derived from the pretrained weights, remain unchanged during fine-tuning even though the weights are updated. The resulting combination of pruning decisions and fine-tuned weights may therefore be suboptimal, leading to non-negligible performance degradation. To address these limitations, we propose ATP: All-in-One Tuning and Structural Pruning, a unified one-stage structural pruning and fine-tuning approach that dynamically identifies the current optimal substructure throughout the fine-tuning phase via a trainable pruning decision generator. Moreover, because data for domain-specific applications is limited, Low-Rank Adaptation (LoRA) has become a common technique for fine-tuning LLMs. In ATP, we introduce a LoRA-aware forward pass and sparsity regularization to ensure that the substructures corresponding to the learned pruning decisions can be directly removed after the ATP process. ATP outperforms state-of-the-art two-stage pruning methods on tasks in the legal and healthcare domains. More specifically, ATP recovers up to 88% and 91% of the dense model's performance when pruning 40% of the parameters of LLaMA2-7B and LLaMA3-8B, respectively.
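The abstract does not spell out the LoRA-aware forward pass or the sparsity regularizer, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes a single linear layer, a fixed binary per-channel `decision` vector standing in for the output of the trainable generator, and a group-lasso-style penalty on the LoRA columns marked for pruning. All names (`lora_aware_forward`, `sparsity_penalty`, `decision`) are hypothetical. The point it illustrates is why masking the merged weight matters: the masked layer matches a physically smaller layer on the kept channels.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 8, 6, 2
W = rng.normal(size=(d_in, d_out))         # frozen pretrained weight
A = rng.normal(size=(d_in, rank)) * 0.1    # LoRA down-projection (trainable)
B = rng.normal(size=(rank, d_out)) * 0.1   # LoRA up-projection (trainable)

# Assumption: a fixed binary decision per output channel (1 = keep, 0 = prune).
# In ATP this would come from the trainable pruning decision generator.
decision = np.array([1, 0, 1, 1, 0, 1], dtype=float)


def lora_aware_forward(x, W, A, B, decision):
    """Mask the merged weight W + A @ B so the base and LoRA paths are pruned together."""
    return x @ ((W + A @ B) * decision)


def sparsity_penalty(B, decision):
    """Group-lasso-style penalty: sum of L2 norms of the LoRA columns marked for pruning."""
    pruned_cols = B[:, decision == 0]
    return np.linalg.norm(pruned_cols, axis=0).sum()


x = rng.normal(size=(4, d_in))
masked_out = lora_aware_forward(x, W, A, B, decision)

# Physically remove the pruned channels and recompute with the smaller layer.
keep = decision.astype(bool)
W_small, B_small = W[:, keep], B[:, keep]
compact_out = x @ (W_small + A @ B_small)

# Kept channels agree; pruned channels contribute nothing.
assert np.allclose(masked_out[:, keep], compact_out)
assert np.allclose(masked_out[:, ~keep], 0.0)

penalty = sparsity_penalty(B, decision)
```

In this sketch the mask is applied to the merged weight rather than to `W` alone. Masking only `W` would let the LoRA update keep writing into channels that are supposed to be pruned, and the compact layer would then no longer reproduce the trained one.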
