APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

TL;DR

Fine-tuning large language models is costly in both training and inference. The paper proposes APT, a framework that adaptively prunes parameter blocks and adjusts tuning capacity through dynamic APT adapters, improving training and inference efficiency together. It introduces an outlier-aware salience scoring mechanism and uses self-distillation to recover accuracy under aggressive pruning. Experiments on RoBERTa, T5, and LLaMA 2 models report up to 8x faster fine-tuning and up to 70% lower training memory while retaining most task performance, which makes large LMs more practical to deploy in resource-constrained settings.

Abstract

Fine-tuning and inference with large language models (LMs) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet often increases training memory and time. To improve both training and inference efficiency, we introduce APT, which adaptively prunes and tunes parameters for the LMs. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% of task performance when pruning RoBERTa and T5 models with 40% of parameters left, while keeping 86.4% of LLaMA models' performance with 70% of parameters remaining. Furthermore, APT speeds up LM fine-tuning by up to 8x and reduces the training memory footprint of large LMs by up to 70%.

Paper Structure

This paper contains 32 sections, 11 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: APT provides both training and inference efficiency benefits by pruning and tuning pretrained LM parameters adaptively via the APT adapter. We dynamically adjust (add/reduce) the APT adapter's input/output dimensions and its rank ($r_{\text{apt}}$). Reducing adapter dimensions prunes frozen parameters, making training and inference faster and more memory-efficient. Adding adapter ranks helps recover the pruned LM's task performance. In contrast, existing adapters like LoRA allow efficient training but do not provide inference efficiency since the model size is not reduced.
  • Figure 2: APT adaptively identifies pruning and tuning parameters via APT adapters during fine-tuning at little cost. For training and inference efficiency, APT gradually prunes LM parameters with binary pruning masks learned from our lightweight outlier-aware salience scoring function. For performance recovery, APT also adds tuning parameters in salient layers during LM fine-tuning by increasing the dynamic ranks in APT adapters.
  • Figure 3: Task performance vs. relative inference efficiency on RoBERTa, T5, and LLaMA-2 7B models with APT and baselines.
  • Figure 4: The performance-efficiency tradeoff of APT compared to baseline methods. All metrics are normalized using LoRA tuning w/o pruning as the baseline. The circular dots, read against the vertical axes on the left, indicate training speed vs. performance, with their sizes denoting peak training memory usage. The square dots, read against the axes on the right, indicate inference speedup vs. performance, with their sizes denoting inference memory usage.
  • Figure 5: Detailed analysis of APT with different initial sparsities, target sparsities, and adaptive tuning schedules.
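
The adapter behaviour described in the captions for Figures 1 and 2 (shrinking input/output dimensions with binary masks, growing the rank to add tuning capacity) can be illustrated with a small toy sketch. This is not the paper's implementation. The class `ToyAdapter` and its methods are hypothetical names, the masks are supplied by hand rather than learned from the outlier-aware salience score, and the zero initialization of the new `B` columns is an assumption borrowed from the usual LoRA convention so that growing the rank leaves the output unchanged.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class ToyAdapter:
    """Frozen weight W plus a low-rank update B @ A, with resizable shapes.

    Hypothetical illustration only; not the APT authors' code.
    """

    def __init__(self, W, rank, scale=1.0, seed=0):
        self.rng = random.Random(seed)
        self.W = [row[:] for row in W]  # frozen, shape (d_out, d_in)
        d_in = len(W[0])
        # A is small-random and B is zero, so the update starts as a no-op.
        self.A = [[self.rng.gauss(0, 0.02) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in W]
        self.scale = scale

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def prune_outputs(self, keep):
        """Apply a binary mask over output dims: drop rows of W and B."""
        self.W = [r for r, k in zip(self.W, keep) if k]
        self.B = [r for r, k in zip(self.B, keep) if k]

    def prune_inputs(self, keep):
        """Apply a binary mask over input dims: drop columns of W and A."""
        self.W = [[v for v, k in zip(r, keep) if k] for r in self.W]
        self.A = [[v for v, k in zip(r, keep) if k] for r in self.A]

    def grow_rank(self, extra):
        """Add tuning capacity: new random rows in A, new zero columns in B."""
        d_in = len(self.W[0])
        self.A += [[self.rng.gauss(0, 0.02) for _ in range(d_in)]
                   for _ in range(extra)]
        self.B = [row + [0.0] * extra for row in self.B]

    def shape(self):
        """Return (d_out, d_in, rank)."""
        return (len(self.W), len(self.W[0]), len(self.A))


W = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0],
     [2.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]
layer = ToyAdapter(W, rank=2)
x = [1.0, 0.5, -1.0]

before = layer.forward(x)
layer.grow_rank(2)                 # more tunable parameters, same output
after = layer.forward(x)

layer.prune_outputs([1, 0, 1, 1])  # hand-written mask: drop output dim 1
pruned = layer.forward(x)
print(layer.shape(), before == after, pruned)
```

In this sketch, pruning physically removes rows or columns of the frozen weight, which is why the layer gets smaller at inference time, whereas a fixed-shape adapter such as LoRA only changes how many parameters are trained.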