NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

Shengrui Li, Junzhe Chen, Xueting Han, Jing Bai

TL;DR

NutePrune loads only one intact model and pairs it with multiple masks and LoRA modules, so the same model can switch between teacher and student roles. This keeps memory costs low while letting numerous teachers of varying capacity progressively guide the pruned model, which improves overall performance.

Abstract

The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means to compress LLMs, thereby reducing storage costs and enhancing inference speed for more efficient utilization. In this work, we study data-efficient and resource-efficient structured pruning methods to obtain smaller yet still powerful models. Knowledge Distillation is well-suited for pruning, as the intact model can serve as an excellent teacher for pruned students. However, it becomes challenging in the context of LLMs due to memory constraints. To address this, we propose an efficient progressive Numerous-teacher pruning method (NutePrune). NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules, enabling it to seamlessly switch between teacher and student roles. This approach allows us to leverage numerous teachers with varying capacities to progressively guide the pruned model, enhancing overall performance. Extensive experiments across various tasks demonstrate the effectiveness of NutePrune. In LLaMA-7B zero-shot experiments, NutePrune retains 97.17% of the performance of the original model at 20% sparsity and 95.07% at 25% sparsity. Our code is available at https://github.com/Lucius-lsr/NutePrune.
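
The role-switching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Role` and `SharedLayer` names, the per-output-unit mask, and where the LoRA update is applied are all assumptions made for illustration. The point is only that one frozen weight matrix is stored once, while each teacher or student is represented by a small mask plus low-rank factors.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class Role:
    """Hypothetical lightweight state for one teacher or student."""
    mask: List[float]  # assumed: one 0/1 entry per output unit
    lora_a: Matrix     # rank x in_dim
    lora_b: Matrix     # out_dim x rank


class SharedLayer:
    """One frozen weight matrix, stored once and reused by every role."""

    def __init__(self, weight: Matrix):
        self.weight = weight

    def forward(self, x: List[float], role: Role) -> List[float]:
        base = matvec(self.weight, x)                         # frozen path
        delta = matvec(role.lora_b, matvec(role.lora_a, x))   # low-rank update
        # The mask zeroes out the output units that this role has pruned.
        return [m * (b + d) for m, b, d in zip(role.mask, base, delta)]


layer = SharedLayer([[1.0, 0.0], [0.0, 1.0]])
x = [2.0, 3.0]

# Teacher: nothing pruned, LoRA factors are zero, so the intact layer is used.
teacher = Role(mask=[1.0, 1.0], lora_a=[[0.0, 0.0]], lora_b=[[0.0], [0.0]])
# Student: second output unit pruned, with a small low-rank correction.
student = Role(mask=[1.0, 0.0], lora_a=[[1.0, 0.0]], lora_b=[[0.5], [0.0]])

teacher_out = layer.forward(x, teacher)
student_out = layer.forward(x, student)
```

Switching roles here amounts to passing a different `Role` object, so the extra memory per teacher is the size of its mask and low-rank factors rather than a second copy of the weights.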

Paper Structure

This paper contains 35 sections, 11 equations, 3 figures, and 18 tables.

Figures (3)

  • Figure 1: The advantage of NutePrune. Left: Progressive distillation guides the student with teachers ordered from easy to hard, so that a large capacity gap does not harm learning, but it incurs the multiplied cost of loading numerous teachers. Right: NutePrune uses models with varying sparsity as teachers, enabling progressive distillation at negligible additional cost.
  • Figure 2: The overall framework of NutePrune. The model being pruned is frozen and augmented with learnable masks and LoRA modules. During pruning, it is guided by numerous teachers. Before it reaches the target sparsity (e.g., 30%), it learns from teachers at a fixed capacity gap. Once the target sparsity is reached, it continues to learn from all previous teachers, from weak to strong. All of these teachers are derived from snapshots of the student model itself. Because only the mask and LoRA modules are snapshotted, the additional memory cost is negligible.
  • Figure 3: Illustration of the sparsity of the teacher and student models during pruning, for an example with target sparsity $t=50\%$ and sparsity gap $g=10\%$.
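
The two phases described in the captions of Figures 2 and 3 can be sketched as a schedule of (student sparsity, teacher sparsity) pairs. This is a hedged reading of the captions, not the paper's exact schedule: the function name `progressive_schedule`, the discrete `step` size, the rule that the teacher trails the student by exactly the gap, and the interpretation of "weak to strong" as "sparsest teacher first" are all assumptions. Sparsities are given in integer percent.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (student sparsity %, teacher sparsity %)


def progressive_schedule(target: int, gap: int, step: int) -> Tuple[List[Pair], List[Pair]]:
    """Return (phase1, phase2) under the assumptions stated above."""
    if step <= 0 or gap <= 0 or gap > target or target % step != 0:
        raise ValueError("expected 0 < gap <= target and target divisible by step")

    # Phase 1: the student's sparsity grows toward the target while the
    # teacher is a snapshot that is `gap` percentage points less sparse.
    phase1: List[Pair] = []
    for student in range(step, target + 1, step):
        teacher = max(student - gap, 0)
        phase1.append((student, teacher))

    # Phase 2: the student stays at the target sparsity and revisits every
    # earlier teacher, from the sparsest (weakest) to the intact model.
    snapshots = sorted({teacher for _, teacher in phase1}, reverse=True)
    phase2: List[Pair] = [(target, teacher) for teacher in snapshots]
    return phase1, phase2


# The setting named in the Figure 3 caption: t = 50%, g = 10%.
phase1, phase2 = progressive_schedule(target=50, gap=10, step=10)
```

With these assumptions, the first phase pairs a student at 10% to 50% sparsity with teachers at 0% to 40%, and the second phase keeps the student at 50% while the teacher sparsity falls from 40% back to 0%.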