EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning

Songlin Zhao, Michael Pitts, Zhuwei Qin

TL;DR

EfficientXpert addresses domain-adaptive compression of large language models by combining a propagation-aware pruning criterion (ForeSight Mask) with an adapter realignment step (Partial Brain Surgeon) inside LoRA fine-tuning. The framework accounts for forward error propagation across layers and realigns the low-rank adapters to the surviving subnetwork after pruning, enabling a one-shot transformation from a dense pretrained model to a sparse, domain-specialized expert. Across health and legal domains, EfficientXpert consistently outperforms existing domain-pruning baselines, achieving near-dense performance at substantial sparsity (e.g., 40%) and revealing that domain shifts, not tasks, largely drive pruning sensitivity. These findings highlight the need for domain-adaptive pruning strategies to make resource-efficient LLM deployment practical in specialized domains.

Abstract

The rapid advancement of large language models (LLMs) has increased the demand for domain-specialized variants in areas such as law, healthcare, and finance. However, their large size remains a barrier to deployment in resource-constrained environments, and existing compression methods either generalize poorly across domains or incur high overhead. In this work, we propose EfficientXpert, a lightweight domain-pruning framework that combines a propagation-aware pruning criterion (ForeSight Mask) with an efficient adapter-update algorithm (Partial Brain Surgeon). Integrated into the LoRA fine-tuning process, EfficientXpert enables a one-step transformation of general pretrained models into sparse, domain-adapted experts. Across health and legal tasks, it retains up to 98% of dense-model performance at 40% sparsity, outperforming state-of-the-art methods. Further analysis reveals substantial domain-dependent structural shifts that degrade the effectiveness of general pruning masks, underscoring the need for adaptive, domain-aware pruning strategies.
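
To make "40% sparsity on a LoRA-adapted model" concrete, the following is a minimal sketch that folds a low-rank update into a dense weight matrix and zeroes the lowest-scoring 40% of entries. It is not the ForeSight Mask: the importance score here is plain weight magnitude, used only as a stand-in because the abstract does not give the propagation-aware criterion. The names `merged_weight`, `magnitude_mask`, and `scale`, and the toy matrix sizes, are all assumptions made for illustration.

```python
import numpy as np

def merged_weight(W, A, B, scale=1.0):
    """Fold a LoRA update into the dense weight: W + scale * (B @ A)."""
    return W + scale * (B @ A)

def magnitude_mask(M, sparsity):
    """Boolean keep-mask that prunes the `sparsity` fraction of entries
    with the smallest absolute value (a stand-in importance score)."""
    k = int(round(sparsity * M.size))  # number of entries to prune
    if k == 0:
        return np.ones(M.shape, dtype=bool)
    flat = np.abs(M).ravel()
    # argpartition places the k smallest magnitudes in the first k slots
    prune_idx = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(M.size, dtype=bool)
    mask[prune_idx] = False
    return mask.reshape(M.shape)

# Toy example: an 8x10 weight with a rank-2 adapter (sizes are arbitrary).
rng = np.random.default_rng(0)
d_out, d_in, r = 8, 10, 2
W = rng.standard_normal((d_out, d_in))
B = rng.standard_normal((d_out, r)) * 0.1
A = rng.standard_normal((r, d_in)) * 0.1

M = merged_weight(W, A, B)
mask = magnitude_mask(M, sparsity=0.4)
W_sparse = M * mask              # pruned entries become exactly zero
achieved = 1.0 - mask.mean()     # fraction of entries removed
```

Scoring the merged matrix rather than `W` alone lets the mask reflect the domain-specific update, which is the general motivation the abstract gives for domain-aware masks. A propagation-aware criterion would replace the magnitude score with one that also accounts for how pruning error affects later layers.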

Paper Structure

This paper contains 44 sections, 17 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Comparison of EfficientXpert with existing domain pruning methods. (b) Overview of EfficientXpert framework, including ForeSight Mask and Partial Brain Surgeon. EfficientXpert iteratively updates adapters, smooths importance scores, applies corrections to surviving weights, and merges the mask into dense weights to create a sparse, domain-specialized expert.
  • Figure 2: (a) LLaMA2-7B Health Relative Performance vs. Sparsity for different methods. (b) LLaMA2-7B Legal Relative Performance vs. Sparsity for different methods. (c) One-shot Post-Pruning Performance on QA tasks for two domains. (d) Grassmann Distances Analyzing Task Differences vs. Domain Differences.
  • Figure 3: Projection energy across layers.
  • Figure 4: Projection energy by full name across domains.
  • Figure 5: Grassmann distance between LoRA adapters and weight matrices across layers.
  • ...and 2 more figures
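
Several of the figures above report Grassmann distances between subspaces. As a hedged illustration of that quantity, the sketch below computes the geodesic Grassmann distance between the column spaces of two matrices from their principal angles. It assumes the geodesic definition (the square root of the sum of squared principal angles); the paper may use a different variant, and this page does not state which subspaces of the adapters and weight matrices are compared. The function name `grassmann_distance` is hypothetical.

```python
import numpy as np

def grassmann_distance(X, Y):
    """Geodesic distance between the column spaces of X and Y.

    X and Y are (n, k) matrices with full column rank and the same k.
    The distance is the 2-norm of the vector of principal angles.
    """
    # Orthonormal bases for the two column spaces
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    cosines = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(angles))

# Two orthogonal 2-D subspaces of R^4: both principal angles equal pi/2.
E = np.eye(4)
d_orth = grassmann_distance(E[:, :2], E[:, 2:])

# The distance depends only on the subspace, not on the chosen basis.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))
Y = rng.standard_normal((6, 2))
R = np.array([[2.0, 1.0], [0.0, 3.0]])  # invertible change of basis
d_xy = grassmann_distance(X, Y)
d_xy_rebased = grassmann_distance(X @ R, Y)
```

A distance near zero means the two matrices span almost the same subspace, while larger values indicate that they occupy increasingly different directions.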